Using Time in Software Defined Networks


Using Time in Software Defined Networks

Research Thesis

In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Tal Mizrahi

Submitted to the Senate of the Technion - Israel Institute of Technology

Tamuz, 5776

Haifa

July, 2016

This research thesis was done under the supervision of Prof. Yoram Moses in the Department of Electrical Engineering.

Acknowledgement

I gratefully thank my advisor, Prof. Yoram Moses, for his guidance, support, and encouragement throughout my graduate studies. I have been very fortunate to have an advisor who allowed me the freedom to explore on my own, and at the same time provided inspiring guidance that kept me on the right track.

The generous financial assistance of the Technion is gratefully acknowledged. I would like to gratefully acknowledge the help and support of Marvell. Many thanks to David Melman for believing in me and giving me the opportunity to pursue my passion.

Many thanks to my parents, Tsipi and Joe, for their encouragement and support. Finally, my warm thanks to my lovely wife, Hagit, and my two dear children, Edan and Neta, for supporting me and putting up with me throughout this long journey.

List of Publications

The results presented in this dissertation have been previously published in:

[1] T. Mizrahi and Y. Moses, "Software defined networks: It's about time," in IEEE INFOCOM, 2016.
[2] T. Mizrahi and Y. Moses, "Time4: Time for SDN," IEEE Transactions on Network and Service Management (TNSM), under major revision, 2016.
[3] T. Mizrahi and Y. Moses, "OneClock to rule them all: Using time in networked applications," in IEEE/IFIP Network Operations and Management Symposium (NOMS) mini-conference, 2016.
[4] T. Mizrahi and Y. Moses, "Time capability in NETCONF," RFC 7758, IETF, 2016.
[5] T. Mizrahi, E. Saat and Y. Moses, "ReversePTP: A clock synchronization scheme for software defined networks," International Journal of Network Management (IJNM), accepted, 2016.
[6] T. Mizrahi and Y. Moses, "The Case for Data Plane Timestamping in SDN," in IEEE INFOCOM Workshop on Software-Driven Flexible and Agile Networking (SWFAN), 2016.
[7] T. Mizrahi, E. Saat and Y. Moses, "Timed consistent network updates in software defined networks," IEEE/ACM Transactions on Networking (ToN), 2016.
[8] T. Mizrahi, E. Saat and Y. Moses, "Timed consistent network updates," in ACM SIGCOMM Symposium on SDN Research (SOSR), 2015.
[9] T. Mizrahi, O. Rottenstreich and Y. Moses, "TimeFlip: Scheduling network updates with timestamp-based TCAM ranges," in IEEE INFOCOM, 2015.
[10] T. Mizrahi and Y. Moses, "Using ReversePTP to distribute time in software defined networks," in International IEEE Symposium on Precision Clock Synchronization for Measurement, Control and Communication (ISPCS), 2014.
[11] T. Mizrahi and Y. Moses, "ReversePTP: A software defined networking approach to clock synchronization," in ACM SIGCOMM Workshop on Hot Topics in Software Defined Networks (HotSDN), 2014.
[12] T. Mizrahi and Y. Moses, "On the necessity of time-based updates in SDN," in Open Networking Summit (ONS), 2014.
[13] T. Mizrahi and Y. Moses, "Time-based updates in software defined networks," in ACM SIGCOMM Workshop on Hot Topics in Software Defined Networks (HotSDN), 2013.

Contents

Abstract
List of Abbreviations

1 Introduction
  1.1 Background
  1.2 Research Goals
  1.3 Research Methods
  1.4 Related Work

2 Time4: Time for SDN
  2.1 Abstract
  2.2 Introduction
    2.2.1 It's About Time
    2.2.2 The Challenge of Dynamic Traffic Engineering in SDN
    2.2.3 Timed Network Updates
    2.2.4 Related Work
    2.2.5 Contributions
  2.3 The Lossless Flow Allocation (LFA) Problem
    2.3.1 Inevitable Flow Swaps
    2.3.2 Model and Definitions
    2.3.3 The LFA Game
    2.3.4 The Impact of Flow Swaps
    2.3.5 Network Utilization
    2.3.6 n-Swaps
  2.4 Design and Implementation
    2.4.1 Protocol Design
    2.4.2 Prototype Design and Implementation
  2.5 Evaluation
    2.5.1 Evaluation Method
    2.5.2 Performance Attribute Measurement
    2.5.3 Microbenchmark: Video Swapping
    2.5.4 Flow Swap Evaluation
  2.6 Discussion
  2.7 Conclusion
  2.8 Acknowledgments

3 Timed Consistent Network Updates in SDN
  3.1 Abstract
  3.2 Introduction
    3.2.1 Background
    3.2.2 Time for Consistent Updates
    3.2.3 Related Work
    3.2.4 Contributions
  3.3 Time-based Consistent Updates
    3.3.1 Ordered Updates
    3.3.2 Two-phase Updates
    3.3.3 k-Phase Consistent Updates
    3.3.4 The Overhead of Network Updates
  3.4 Terminology and Notations
    3.4.1 The Network Model
    3.4.2 Network Updates
    3.4.3 Delay-related Notations
  3.5 Upper and Lower Bounds
    3.5.1 Delay Upper Bounds
    3.5.2 Explicit Acknowledgment
    3.5.3 Delay Lower Bounds
    3.5.4 Scheduling Accuracy Bound
  3.6 Worst-case Analysis
    3.6.1 Worst-case Update Duration
    3.6.2 Worst-case Analysis of Untimed Updates
    3.6.3 Worst-case Analysis of Timed Updates
    3.6.4 Timed vs. Untimed Updates
    3.6.5 Using Acknowledgments
  3.7 Time as a Consistency Knob
    3.7.1 An Inconsistency Metric
    3.7.2 Fine Tuning Consistency
  3.8 Evaluation
    3.8.1 Experiment 1: Timed vs. Untimed Updates
    3.8.2 Experiment 2: Fine Tuning Consistency
    3.8.3 Simulation: Using ACKs
  3.9 Discussion
  3.10 Conclusion

4 TimeFlip: Scheduling Updates with Timestamp-based TCAM Ranges
  4.1 Abstract
  4.2 Introduction
    4.2.1 Background
    4.2.2 Introducing TimeFlips
    4.2.3 Contributions
    4.2.4 Related Work
  4.3 Understanding TimeFlip via a Simple Example
    4.3.1 Timestamp Format
    4.3.2 A Path Reroute Scenario
    4.3.3 The Intuition Behind the Example
  4.4 Model and Notations
    4.4.1 TCAM Entries
    4.4.2 TimeFlip: Theory of Operation
    4.4.3 Timed Installation: Formal Definition
  4.5 Optimal Time-based Rule Installation
    4.5.1 Optimal Scheduling
    4.5.2 Average Expansion
    4.5.3 Installation Bounds and Periodic Ranges
    4.5.4 Timestamp Field Size in Bits
  4.6 Optimal Time-based Action Updates
  4.7 Experimental Evaluation
    4.7.1 Simulation-based Evaluation
    4.7.2 Microbenchmark
  4.8 Discussion
    4.8.1 Scheduling Accuracy
    4.8.2 Timestamp Size in Real-Life
    4.8.3 TCAM Update Performance
    4.8.4 Timed Updates of Non-TCAM Memories
    4.8.5 On the TCAM Encoding Scheme
  4.9 Conclusion

5 OneClock to Rule Them All: Using Time in Networked Applications
  5.1 Abstract
  5.2 Introduction
    5.2.1 Background
    5.2.2 The OneClock Protocol
    5.2.3 OneClock: Accurate Scheduling
    5.2.4 Related Work
    5.2.5 Contributions
  5.3 Using OneClock in Practice
    5.3.1 Coordinated Operation
    5.3.2 Coordinated Snapshot
    5.3.3 Network-wide Atomic Commit
  5.4 NETCONF Time Extension
    5.4.1 Overview
    5.4.2 Applying the Time Primitives to Various Applications
    5.4.3 Notifications and Cancellation Messages
    5.4.4 Clock Synchronization
    5.4.5 Acceptable Scheduling Range
  5.5 Prediction-based Scheduling
    5.5.1 ETE Measurements
    5.5.2 ETE Prediction Algorithms
  5.6 Evaluation
    5.6.1 Background
    5.6.2 Experiment I: Performance on different platforms
    5.6.3 Experiment II: Periodic vs. bursty measurement
    5.6.4 Experiment III: Performance under synthetic workload
  5.7 Discussion
  5.8 Conclusion
  5.9 Acknowledgments

6 ReversePTP: A Clock Synchronization Scheme for Software Defined Networks
  6.1 Abstract
  6.2 Introduction
    6.2.1 Background
    6.2.2 ReversePTP in a Nutshell
    6.2.3 Related Work
    6.2.4 Contributions
  6.3 Preliminaries
    6.3.1 A Brief Overview of PTP
    6.3.2 A Model for Using Time in SDN
  6.4 ReversePTP: Theory of Operation
  6.5 The ReversePTP Profile
  6.6 Using ReversePTP in SDNs
    6.6.1 The ReversePTP Architecture in SDN
    6.6.2 Time-based Updates Using ReversePTP
    6.6.3 Time Distribution over SDNs Using ReversePTP
  6.7 Evaluation
    6.7.1 Time-triggered Events
    6.7.2 Scalability
  6.8 Discussion
    6.8.1 Accuracy
    6.8.2 Scalability-Programmability Tradeoff
    6.8.3 Synchronizing Clocks Using ReversePTP
    6.8.4 ReversePTP in an SDN with Multiple Controllers
    6.8.5 Security Aspects
  6.9 Conclusion

7 Conclusion
  7.1 Summary of Results
  7.2 Future Work

List of Figures

2.1 Flow Swapping—Flows need to convert from the "before" configuration to the "after".
2.2 Modeling a Clos topology as an unsplittable flow graph.
2.3 The LFA game: the source's procedure.
2.4 The LFA game: the controller's procedure.
2.5 A Scheduled Bundle: the Bundle Commit message may include Ts, the scheduled time of execution. The controller can use a Bundle Discard message to cancel the Scheduled Bundle before time Ts.
2.6 ReversePTP in SDN: switches distribute their time to the controller. Switches' clocks are not synchronized. For every switch i, the controller knows offset_i between switch i's clock and its local clock.
2.7 Time4 prototype design: the black blocks are the components implemented in the context of this work.
2.8 Measurement of the three performance attributes: (a) ∆, (b) IR, and (c) δ.
2.9 Microbenchmark: video swapping.
2.10 Experimental evaluation: every host and switch was emulated by a Linux machine in the DeterLab testbed. All links have a capacity of 10 Mbps. The controller is connected to the switches by an out-of-band network.
2.11 Flow swap performance: in large networks (a) Time4 allows significantly less packet loss than untimed approaches. The packet loss of Time4 is slightly higher than SWAN and B4 (b), while the latter two methods incur higher overhead. Combining Time4 with SWAN or B4 provides the best of both worlds; low packet loss (b) and low overhead (c and d).
2.12 The number of packets lost in a flow swap vs. ∆. The packet loss in Time4 is not affected by the controller's performance (∆).
2.13 Performance as a function of IR and δ. Untimed updates are affected by the installation latency variation (IR), whereas Time4 is affected by the scheduling error (δ). Time4 is advantageous since typically δ < IR.
3.1 Update procedure examples.
3.2 Ordered update procedure for the scenario of Fig. 3.1a.
3.3 Timed ordered update procedure for the scenario of Fig. 3.1a.
3.4 Two-phase update procedure for the scenario of Fig. 3.1b.
3.5 Timed two-phase update procedure for the scenario of Fig. 3.1b.
3.6 Long-tail latency.
3.7 A PERT chart of a k-phase update.
3.8 A PERT chart of a two-phase update with garbage collection, performed after phase 2 is completed. Garbage collection removes the 'before' configuration (see Fig. 3.1) from the switches that took part in phase 1.
3.9 A PERT chart of a timed two-phase update with garbage collection.
3.10 PERT charts of the garbage collection phase of an ACK-based update.
3.11 Example 3.16: PERT chart of a timed two-phase update. The delay d (red in the figure) is a knob for consistency.
3.12 Leaf-spine topology.
3.13 Timed updates vs. untimed updates. Each figure shows the experimental values, and the theoretical worst-case values, based on Lemmas 3.5 and 3.8.
3.14 Publicly available topologies [14] used in our experiments. Each path of the test flows in our experiment is depicted by a different color. Black nodes are OpenFlow switches. White nodes represent the external source and destination of the test flows in the experiment.
3.15 Inconsistency as a function of the update duration. Modifying the update duration controls the degree of inconsistency. Two graphs are shown for each of the three topologies: exponential delay, constant delay.
3.16 Update duration of the garbage collection phase.
4.1 TCAM lookup: conventional vs. TimeFlip. TimeFlip uses a timestamp field, representing the time range T ≥ T0.
4.2 Scheduling tolerance: T0 ∈ [Tmin, Tmax].
4.3 Time range examples.
4.4 Flows need to convert from the 'before' configuration to the 'after'.
4.5 Scheduling timelines.
4.6 A timed TCAM update. Every line in the figure is a time range rule, represented by one or more TCAM entries. (i) Time-oblivious entry. (ii) Installation. (iii) Removal. (iv) Rule update. (v) Action update. (vi) Action update using a complementary timestamp range.
4.7 Optimal scheduling algorithm; no other scheduling algorithm produces an extremal range with a lower expansion.
4.8 Installation bounds.
4.9 Periodic ranges: the 2^V-periodic continuation of [T0, T1]. (i) For T1^V > T0^V. (ii) For T1^V < T0^V. In BoundedRange, T1 = T0 + 2^(V−1) − 1.
4.10 Determining a range with installation bounds ∆.
4.11 Example of 1-bit timestamp, per Theorem 4.13.
4.12 Algorithm for finding reduced range with installation bounds.
4.13 ReducedRange: proof of Lemma 4.17.
4.14 Expansion as a function of TOL.
4.15 Expansion as a function of ∆ with BoundedRange in a timed installation.
4.16 The number of bits as a function of ∆ for various values of TOL, using BoundedRange in a timed installation. The star-shaped markers indicate the points where TOL = 2^⌈log2(∆)⌉.
4.17 Timed action updates: ReducedRange vs. BoundedRange.
4.18 Microbenchmark.
4.19 Timed updates in non-TCAM lookups.
5.1 Elapsed Time of Execution (ETE): ETE = Te − Ts.
5.2 Prediction-based scheduling: by predicting the ETE, a client can control when the RPC will be completed.
5.3 Coordinated operations and coordinated snapshots.
5.4 The time capability in NETCONF.
5.5 Atomic commit: (a) NETCONF confirmed commit, without using time. (b) Time-triggered commit.
5.6 Cancellation message.
5.7 Acceptable scheduling range: defined by two configurable parameters, sched-max-future and sched-max-past.
5.8 Prediction-based scheduling approach.
5.9 Performance on various machine types (a). Type V machines were used in (b) and (c).
5.10 Instantaneous prediction error viewed over a 150 second period. The behavior shows peaks under synthetic workload. (a) was measured on Azure, and (b), (c) on Type V machines.
6.1 Time distribution in PTP and ReversePTP.
6.2 The Precision Time Protocol (PTP).
6.3 A protocol for coordinated network updates.
6.4 ReversePTP: each master determines a separate domain.
6.5 The ReversePTP architecture in SDNs: every switch runs a ReversePTP master, and the controller runs multiple ReversePTP slave instances. In an SDN that runs conventional PTP, a typical approach would be for the controller to run a PTP master, and for each switch to be a PTP slave.
6.6 Coordinated updates using ReversePTP.
6.7 SDN as a Boundary Clock.
6.8 Network setup.
6.9 Accuracy measurements of a coordinated Ping. The timestamped event experiment (b) provides a rough estimate of the clock accuracy.
6.10 ReversePTP vs. PTP: rate of PTP messages sent or received by each node.
6.11 CPU utilization in ReversePTP and in PTP as a function of the number of nodes. The figures are presented for two machine types: Type I is a low performance machine, and Type II is high performance.
7.1 Summary of results.

Abstract

This dissertation analyzes the use of accurate time to coordinate network configuration updates. Specifically, this work focuses on centralized network architectures, such as Software Defined Networks (SDN).

Time can be beneficial in a wide variety of network update scenarios. The current work focuses on two key scenarios in which using time has a significant advantage over state-of-the-art approaches. First, we characterize a set of update scenarios called flow swaps, for which timed updates are the optimal update approach, yielding less packet loss than existing update approaches. Second, we analyze the use of accurate time to schedule multi-phase update procedures, allowing updates to be performed consistently, while requiring less resource overhead than existing network update methods.

The current work also introduces a clock synchronization scheme that is adapted to the centralized SDN environment. However, even if network devices have perfectly synchronized clocks, how can we guarantee that events are executed at the exact time for which they were scheduled? In this work we present and analyze two accurate scheduling methods. The first uses Ternary Content Addressable Memory (TCAM) ranges in hardware switches. The second is a prediction-based scheduling approach that uses timing information collected at runtime to accurately schedule future operations. Both methods are shown to be practical and efficient.

Finally, this thesis defines extensions to standard network protocols, enabling practical implementations of our concepts. We define a new feature in OpenFlow called Scheduled Bundles, which has been incorporated into the OpenFlow 1.5 protocol. A similar capability was defined for the NETCONF protocol, and has been published as an RFC.


List of Abbreviations

ACK        Acknowledgment
ETE        Elapsed Time of Execution
Gbps       Gigabits per second
IETF       Internet Engineering Task Force
IoT        Internet of Things
LFA        Lossless Flow Allocation
Mbps       Megabits per second
MEF        Metro Ethernet Forum
NETCONF    Network Configuration Protocol
NFV        Network Function Virtualization
NTP        Network Time Protocol
ONF        Open Networking Foundation
PERT       Program Evaluation and Review Technique
PTP        Precision Time Protocol
RPC        Remote Procedure Call
SDN        Software Defined Networking
SNMP       Simple Network Management Protocol
TCAM       Ternary Content Addressable Memory
TOL        Scheduling Tolerance
VM         Virtual Machine
VNF        Virtual Network Function
WAN        Wide Area Network

Chapter 1

Introduction

1.1 Background

The use of synchronized clocks was first introduced in the 19th century by the Great Western Railway company in Great Britain. Clock synchronization has significantly evolved since then, and is now a mature technology used by a variety of applications, from mobile backhaul networks [15] to distributed databases [16].

Network configuration updates are a routine necessity, and must be performed in a way that minimizes transient effects caused by intermediate states of the network. This challenge is especially critical in the context of Software Defined Networks, where the control plane is managed by a logically centralized controller, which frequently sends configuration updates to the network switches. These updates modify the switches' forwarding rules, and thus directly affect how packets are forwarded through the network. The controller must take care to minimize network anomalies during update procedures, such as packet drops or misroutes caused by temporary inconsistencies. Updates must also be planned with performance in mind; update procedures must scale with the size of the network, and thus cannot be too complex.

The current work analyzes the use of clocks and time as a tool in network reconfiguration. While the notion of using time to trigger events in distributed systems is certainly not new (e.g., [17]), time-based triggers were typically considered impractical in the context of network management due to the inaccuracy of network time synchronization. Prior to the current work, neither the OpenFlow protocol [18] nor common management and configuration protocols, such as SNMP [19] and NETCONF [20], made use of accurate time for scheduling or coordinating configuration updates.

However, network clock synchronization has evolved over the last few years. The Precision Time Protocol (PTP), defined in the IEEE 1588 standard [21], can synchronize clocks in a network to a very high degree of accuracy, typically on the order of microseconds. Moreover, in the last few years PTP has become a common feature in commodity network devices. Thus, accurate time appears to be an accessible and useful tool for coordinating configuration changes.

1.2 Research Goals

In this thesis we study the use of time in centralized network environments, with an emphasis on SDN. Specifically, four aspects of this problem are analyzed.

Use cases that benefit from using time. A key goal of our research is to identify and analyze network scenarios in which a time-based approach is useful and beneficial. We start by analyzing Time4 (Chapter 2), which is an update approach that performs multiple changes at different switches at the same time. We then study timed multi-phase updates, where each phase is scheduled to be performed at a different execution time (Chapter 3). We then consider the use of time as a generic approach that can be applied not only to switches and routers, but to any managed device: sensors, actuators, Internet of Things (IoT) devices, routers, or toasters (Chapter 5).

Network protocols. One of the goals of this work is to extend standard network configuration protocols with the ability to use time-triggered operations. These extensions are defined for OpenFlow (Chapter 2), and for NETCONF (Chapter 5).

Accurate scheduling methods. Even if network devices have perfectly synchronized clocks, it is potentially challenging to guarantee that updates are performed at the exact time for which they were scheduled; a scheduling mechanism that relies on the switch's software may be affected by the switch's operating system and by other running tasks. The current work analyzes scheduling methods that allow a high degree of accuracy. Specifically, two methods are studied: (i) a method that uses Ternary Content Addressable Memory (TCAM) ranges in hardware switches (Chapter 4), and (ii) a prediction-based scheduling approach that uses timing information collected at runtime to accurately schedule future operations (Chapter 5).

Clock synchronization in SDN. Accurate timekeeping requires a clock synchronization method, such as the Precision Time Protocol (PTP) [21]. Contrary to the centralized SDN paradigm, PTP is by nature a decentralized protocol, in which every node is required to run complex algorithmic logic. In this work we explore a clock synchronization scheme that is adapted to the centralized SDN environment (Chapter 6).
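To give a flavor of the first of the two scheduling methods mentioned above, the following is a minimal sketch of a standard range-to-prefix expansion for an "extremal" timestamp range [T0, 2^V − 1], the kind of range a TCAM can match on. It is a generic textbook construction shown here only for illustration, not necessarily the exact encoding analyzed in Chapter 4.

    def extremal_range_to_prefixes(t0, width):
        """Cover the timestamp range [t0, 2**width - 1] with ternary prefixes.

        Each prefix is a string over {'0', '1', '*'}, most significant bit
        first. Generic greedy expansion, for illustration only.
        """
        prefixes = []
        t, end = t0, 1 << width
        while t < end:
            # Grow the largest aligned block [t, t + 2**k) that still fits;
            # k is bounded by the number of trailing zero bits of t.
            k = 0
            while k < width and not (t >> k) & 1 and t + (1 << (k + 1)) <= end:
                k += 1
            bits = format(t >> k, '0%db' % (width - k)) if k < width else ''
            prefixes.append(bits + '*' * k)
            t += 1 << k
        return prefixes

    # Example: with a 4-bit timestamp field, the range [6, 15] ("T >= 6")
    # expands to two TCAM entries: ['011*', '1***'].
    print(extremal_range_to_prefixes(6, 4))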

1.3 Research Methods

The research described in this dissertation involves several methods and disciplines, both theoretical and experimental.

Theoretical tools. The study of flow swapping (Chapter 2) uses a game theoretic analysis in the context of network flow problems. The analysis of timed consistent updates (Chapter 3) uses the network update abstraction of [22], and analyzes the worst-case duration of network updates, including the use of Program Evaluation and Review Technique (PERT) graphs [23]. Timestamp-based TCAM ranges (Chapter 4) are studied using a combinatorial approach; we present algorithms that minimize the number of TCAM entries and the number of bits used to represent a timestamp range in a TCAM. The work on OneClock (Chapter 5) uses a time-series analysis; periodic measurements of the execution time of a Remote Procedure Call (RPC) are used to predict the next execution time.

Network protocols. This research work defines extensions to standard network protocols; a time extension to OpenFlow (Chapter 2), and a similar extension to NETCONF (Chapter 5). In order to pursue these extensions we actively participated in the two standard organizations that define these protocols, the Open Networking Foundation (ONF), and the Internet Engineering Task Force (IETF). Open source prototypes were implemented for the two extensions, and these prototypes were used in our experimental evaluation.

Evaluation methods. This dissertation includes various evaluation methods, both experimental and simulation-based. The lion's share of the experiments presented in this work were performed using two academic testbeds, Emulab [24] and DeterLab [25]. The experiments were run over a large number of nodes, up to 70 in some of the experiments. The experiments allowed an emulated network environment, where each node ran one of our time-enabled prototypes. Various network topologies were used, including well-known publicly available topologies [14]. Public cloud networks were also used for some of the experiments (Chapter 5), namely Amazon's AWS and Microsoft's Azure. Some of the analysis was assisted by simulation-based evaluation (Chapters 3 and 4). In the context of timestamp-based TCAM ranges (Chapter 4), the evaluation also included an experiment on a real-life network switch. Publicly available real-life measurements [26, 27] were also used in some of the analysis (Chapter 3).
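As a small illustration of the kind of time-series prediction mentioned above in the context of OneClock, the following sketch maintains an exponentially weighted moving average of past RPC execution times. It is illustrative only; the smoothing factor and the sample values are arbitrary assumptions, and the actual prediction algorithms are defined and evaluated in Chapter 5.

    class EwmaEtePredictor:
        """Predict the next Elapsed Time of Execution (ETE) of an RPC from
        past measurements, using an exponentially weighted moving average.
        Illustrative only; not the predictor defined in Chapter 5."""

        def __init__(self, alpha=0.2):
            self.alpha = alpha      # smoothing factor (assumed value)
            self.estimate = None    # current ETE estimate, in seconds

        def update(self, measured_ete):
            # Blend the new measurement into the running estimate.
            if self.estimate is None:
                self.estimate = measured_ete
            else:
                self.estimate = (self.alpha * measured_ete
                                 + (1 - self.alpha) * self.estimate)
            return self.estimate

    # Usage: feed periodic measurements, then schedule the RPC so that the
    # send time is (desired execution time) - (predicted ETE).
    predictor = EwmaEtePredictor()
    for sample in [0.012, 0.015, 0.011, 0.013]:   # example ETE samples (s)
        predictor.update(sample)
    print(predictor.estimate)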

1.4 Related Work

Consistent network updates. A network configuration update is per-packet consistent [22] if it guarantees that every packet sent through the network is processed according to a single configuration version, either the previous or the current one. A common approach to avoiding inconsistencies that may result from configuration updates is to use a sequence of configuration commands (e.g., [28, 29, 30, 31, 32]), whereby the order of execution guarantees that no anomalies are caused in intermediate states of the procedure. This sequential approach is fairly complex, as it requires the SDN programmer to carefully consider all intermediate states, and to find a successful sequence of commands. Moreover, since this approach requires updates to take place in a specific order, the controller uses a series of request-acknowledge handshakes, yielding a long execution time. Moreover, the efficiency of the update process is very sensitive to the load and specific conditions at runtime. Thus, while this sequential approach guarantees consistency, it is costly in terms of performance.

Another approach for consistent updates [22] uses configuration version tags to guarantee consistency; all packets are stamped with a configuration version tag that indicates whether they should be processed by the new configuration or the old one, in order to guarantee that each packet is processed by a single configuration at all switches along its path. This version-based approach is considerably simpler than the sequential approach from the programmer's perspective, but is still complex in terms of the number of messages exchanged between the controller and switches. Moreover, this approach implies that during intermediate states of the update switches must maintain the configurations of both the previous and the current configuration, thus consuming costly memory space in the switch, as discussed in [33]. A minimal sketch of this version-based approach appears at the end of this section.

Using time in distributed applications. The use of time in distributed applications has been widely analyzed, both in theory and in practice. Analysis of the usage of time and synchronized clocks, e.g., Lamport [34, 17], dates back to the late 1970s and early 1980s. In recent years, as accurate time has become an accessible and affordable tool, it is used in a variety of applications; Google's Spanner [16] uses synchronized clocks as a tool for synchronizing a distributed database. Industrial automation systems [35] use synchronized clocks to allow deterministic response times of machines to external events, and to enforce coordinated orchestration in a factory product line. The Time Sensitive Networking (TSN) technology [36] is used in automotive networks and in audio/video streaming applications. The well-known particle accelerators at CERN use state-of-the-art clock synchronization [37], allowing sub-nanosecond accuracy in response to management messages that control the accelerator experiments. While the usage of accurate time in distributed systems has been widely discussed in the literature, we are not aware of similar analyses of the usage of accurate time as a means for performing accurately scheduled configuration updates in computer networks.

Using time in computer networks. Time is used in networks to schedule events at a coarse resolution. Periodic backups and power-save policies are often invoked at a scheduled time-of-day. Time-of-day routing [38, 39] routes traffic to different destinations based on the time-of-day. Such updates are typically performed at a low rate and do not place demanding requirements on consistency or performance. Hence, time-of-day routing does not require the usage of accurate time; a time accuracy on the order of seconds is typically more than enough for this purpose. In [40] the authors briefly mentioned that it would be interesting to explore the use of time synchronization to instruct routers or switches to change from one configuration to another at a specific time, but did not pursue the idea beyond this observation.

Using time in software defined networks. Information about time is used in SDNs for various purposes; in OpenFlow [41, 18], which is the best-known SDN control protocol, the lifetime of traffic flows is monitored by measuring and logging the start-time and end-time of flows in the network. OpenFlow also uses timeouts for expiring old forwarding rules; the controller can define a timeout for a flow rule, causing it to be removed when the timeout expires. However, timeouts are defined in OpenFlow as a means to age out unused rules, and not as a means to schedule configuration updates, and are therefore defined with a coarse granularity of 1 second. Prior to the current work, neither the OpenFlow protocol [18, 42] nor common management and configuration protocols, such as SNMP [19] and NETCONF [20], used accurate time for scheduling or coordinating configuration updates.
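Returning to the version-tag (two-phase) approach described at the beginning of this section, the following is a minimal, self-contained sketch of the idea. The toy Switch class, the rule format, and the switch names are simplified stand-ins introduced only for illustration; this is not an OpenFlow implementation.

    class Switch:
        """A toy switch model: rules are keyed by (version, match)."""
        def __init__(self, name):
            self.name = name
            self.rules = {}            # (version, match) -> action
            self.stamp_version = 0     # tag stamped on packets at ingress

    def two_phase_update(ingress, internal, new_rules, new_version):
        """Abstract two-phase consistent update in the spirit of [22]."""
        # Phase 1: internal switches temporarily hold both configurations,
        # distinguished by the version tag carried in each packet.
        for sw in internal:
            for match, action in new_rules.get(sw.name, []):
                sw.rules[(new_version, match)] = action
        # Phase 2: ingress switches stamp packets with the new tag, so every
        # packet is processed by exactly one configuration along its path.
        for sw in ingress:
            sw.stamp_version = new_version
        # Garbage collection (after in-flight packets with the old tag have
        # drained): remove the old rules to free switch memory.
        for sw in internal:
            sw.rules = {k: v for k, v in sw.rules.items()
                        if k[0] == new_version}

    # Example usage with hypothetical switch names and rules:
    s1, s2, s5 = Switch("S1"), Switch("S2"), Switch("S5")
    two_phase_update([s1], [s2, s5],
                     {"S2": [("dst=10.0.0.0/8", "fwd:port2")]}, 1)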

Chapter 2

Time4: Time for SDN

This chapter is a preprint version of the paper:

[2] T. Mizrahi and Y. Moses, "Time4: Time for SDN," IEEE Transactions on Network and Service Management (TNSM), under major revision, 2016.

An early version of this paper was published in IEEE INFOCOM 2016 [1]. Preliminary versions of this work were published as short papers, one in HotSDN 2013 [13], and the other in the Open Networking Summit (ONS) 2014 [12].

2.1 Abstract

With the rise of Software Defined Networks (SDN), there is growing interest in dynamic and centralized traffic engineering, where decisions about forwarding paths are taken dynamically from a network-wide perspective. Frequent path reconfiguration can significantly improve the network performance, but should be handled with care, so as to minimize disruptions that may occur during network updates. Network updates are especially challenging when the network is heavily utilized; some of the existing approaches suggest that spare capacity should be reserved in the network in order to allow updates in such scenarios, or that the network load should be temporarily reduced prior to a network update.


In this paper we introduce Time4, an approach that uses accurate time to coordinate network updates. Time4 is a powerful tool in softwarized environments that can be used for various network update scenarios, including in heavily utilized networks. Specifically, we characterize a set of update scenarios called flow swaps, for which Time4 is the optimal update approach, yielding less packet loss than existing update approaches without requiring spare capacity, and without temporarily reducing the network's bandwidth. We define the lossless flow allocation problem, and formally show that in environments with frequent path allocation, scenarios that require simultaneous changes at multiple network devices are inevitable.

We present the design, implementation, and evaluation of a Time4-enabled OpenFlow prototype. The prototype is publicly available as open source. Our work includes an extension to the OpenFlow protocol that has been adopted by the Open Networking Foundation (ONF), and is now included in OpenFlow 1.5. Our experimental results show the significant advantages of Time4 compared to other network update approaches, and demonstrate an SDN use case that is infeasible without Time4.

2.2 Introduction

2.2.1 It's About Time

The use of synchronized clocks was first introduced in the 19th century by the Great Western Railway company in Great Britain. Clock synchronization has significantly evolved since then, and is now a mature technology used by a variety of applications, including MBH (Mobile Backhaul) networks [15], industrial automation systems [35], power grid networks [43] and distributed databases [16].

The Precision Time Protocol (PTP), defined in the IEEE 1588 standard [21], can synchronize clocks to a very high degree of accuracy, typically on the order of 1 microsecond [44, 15, 45]. PTP is a common and affordable feature in commodity switches. Notably, 9 out of the 13 SDN-capable switch silicons listed in the Open Networking Foundation (ONF) SDN Product Directory [46] have native IEEE 1588 support [47, 48, 49, 50, 51, 52, 53, 54, 55].

We argue that since SDN products already have built-in capabilities for accurate timekeeping and clock synchronization, it is only natural to harness this powerful technology to coordinate events in SDNs.

2.2.2 The Challenge of Dynamic Traffic Engineering in SDN

Defining network routes dynamically, based on a complete view of the network, can significantly improve the network performance compared to the use of distributed routing protocols. SDN and OpenFlow [41, 18] have been leading trends in this context, but several other ongoing efforts offer similar concepts. The Interface to the Routing System (I2RS) working group [56] and the Forwarding and Control Element Separation (ForCES) working group [57] are two examples of such ongoing efforts in the Internet Engineering Task Force (IETF).

Centralized network updates, whether they are related to network topology, security policy, or other configuration attributes, often involve multiple network devices. Hence, updates must be performed in a way that strives to minimize temporary anomalies such as traffic loops, congestion, or disruptions, which may occur during transient states where the network has been partially updated.

While SDN was originally considered in the context of campus networks [41] and data centers [58], it is now also being considered for Wide Area Networks (WANs) [33, 59], carrier networks, and MBH (Mobile Backhaul) networks [60]. WAN and carrier-grade networks require a very low packet loss rate. Carrier-grade performance is often associated with the term five nines, representing an availability of 99.999%. MBH networks require a Frame Loss Ratio (FLR) of no more than 10⁻⁴ for voice and video traffic, and no more than 10⁻³ for lower priority traffic [61]. Other types of carrier network applications, such as storage and financial trading, require even lower loss rates [62], on the order of 10⁻⁵.

Several recent works have explored the realm of dynamic path reconfiguration, with frequent updates on the order of minutes [33, 59, 30], enabled by SDN. Interestingly, for voice and video traffic, a frame loss ratio of up to 10⁻⁴ implies that service must not be disrupted for more than 6 milliseconds per minute. Hence, if path updates occur on a per-minute basis, then transient disruptions must be limited to a short period of no more than a few milliseconds.
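To spell out the arithmetic behind the last observation: a frame loss ratio bound translates directly into a disruption budget per update interval, as in the following minimal sketch (the 10⁻⁴ target and the per-minute update rate are taken from the text above).

    # Disruption budget implied by a Frame Loss Ratio (FLR) target, assuming
    # traffic flows at a constant rate and is fully dropped during a disruption.
    flr_target = 1e-4         # maximum acceptable frame loss ratio (voice/video)
    update_interval_s = 60.0  # one path update per minute

    # If service is disrupted for d seconds per interval, the resulting loss
    # ratio is d / update_interval_s; the largest acceptable d is therefore:
    max_disruption_s = flr_target * update_interval_s
    print(max_disruption_s * 1e3, "ms per update")   # -> 6.0 ms per update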

2.2.3 Timed Network Updates

We explore the use of accurate time as a tool for performing coordinated network updates in a way that minimizes packet loss. Softwarized management can significantly benefit from using time for coordinating network-wide orchestration, and for enforcing a given order of events. We introduce Time4, which is an update approach that performs multiple changes at different switches at the same time.

Example 2.1. Fig. 2.1 illustrates a flow swapping scenario. In this scenario, the forwarding paths of two flows, f1 and f2, need to be reconfigured, as illustrated in the figure. It is assumed that all links in the network have an identical capacity of 1 unit, and that both f1 and f2 require a bandwidth of 1 unit. In the presence of accurate clocks, by scheduling S1 and S3 to update their paths at the same time, there is no congestion during the update procedure, and the reconfiguration is smooth. As clocks will typically be reasonably well synchronized, albeit not perfectly synchronized, such a scheme will result in a very short period of congestion.

[Figure 2.1: Flow Swapping—Flows need to convert from the "before" configuration to the "after".]

In this paper we show that in a dynamic environment, where flows are frequently added, removed or rerouted, flow swaps are inevitable. One of our key results is that simultaneous updates are the optimal approach in scenarios such as Example 2.1, whereas other update approaches may yield considerable packet loss, or incur higher resource overhead. Note that such packet loss can be reduced either by increasing the capacity of the communication links, or by increasing the buffer memories in the switches. We show that for a given amount of resources, Time4 yields lower packet loss than other approaches.
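The effect of the update order in Example 2.1 can be seen with a minimal sketch; the two unit-capacity links and unit-rate flows below are an abstraction of Fig. 2.1, and the link names are illustrative assumptions.

    # Two unit-capacity links, A and B, and two unit-rate flows that must swap
    # between them (as in Fig. 2.1). Illustrative model only.
    capacity = {"A": 1.0, "B": 1.0}

    def load(assignment):
        """Total offered load per link for a {flow: link} assignment."""
        totals = {link: 0.0 for link in capacity}
        for link in assignment.values():
            totals[link] += 1.0
        return totals

    before = {"f1": "A", "f2": "B"}
    after  = {"f1": "B", "f2": "A"}

    # Untimed (sequential) update: f1 is moved first, and until f2 is also
    # moved, link B carries both flows -> transient overload and packet loss.
    intermediate = dict(before, f1="B")
    print(load(intermediate))   # {'A': 0.0, 'B': 2.0}  (B exceeds capacity)

    # Timed update: both switches change at (approximately) the same instant,
    # so the network goes directly from 'before' to 'after' with no overload.
    print(load(after))          # {'A': 1.0, 'B': 1.0}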


The importance of flow swaps. The necessity of flow swapping is not confined to the specific example of Fig. 2.1. More generally, in some update scenarios, known as deadlocks [30], it has been shown that it is not possible to complete the update without incurring congestion. Simultaneous flow swapping is generally applicable to all deadlock scenarios. The need to rearrange flows in heavily utilized networks was discussed both in SWAN [33] and in B4 [59]. As we show in this paper, timed flow swapping can address the scenarios of SWAN and B4 without requiring extra network capacity, and without temporarily reducing the traffic bandwidth. Another notable example of the importance of flow swaps is a recently published work by Fox Networks [63], in which accurately timed flow swaps are essential in the context of video switching.

Accuracy is a key requirement in Time4; since updates cannot be applied at the exact same instant at all switches, they are performed within a short time interval called the scheduling error. This error is affected by two factors: (i) the clock accuracy, and (ii) the switch's ability to execute the update as close as possible to its scheduled time. The switches' clocks can be synchronized in typical systems with a sub-microsecond accuracy (e.g., [44]). As for factor (ii), the latency of rule installations has been shown to range from milliseconds to seconds [64, 30]. In contrast, timed updates, using the TCAM-based hardware solution of TimeFlip [9], have been shown to allow a sub-microsecond scheduling error. The experiments we present in Section 2.5 show that the scheduling error in software switches is on the order of 1 millisecond. Accurate hardware-based solutions such as TimeFlip can execute scheduled events in existing switches with an accuracy on the order of 1 microsecond.

Accurate time is a powerful abstraction for SDN programmers, not only for flow swaps, but also for timed consistent updates, as discussed by [8].
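As a back-of-the-envelope illustration of why the scheduling error matters (an illustrative model, not the evaluation of Section 2.5): during a flow swap, the affected link is overloaded for roughly the duration of the scheduling error, and any excess traffic that does not fit into the switch buffer is dropped. The link rate below matches the 10 Mbps links used in our testbed, while the overload factor and buffer size are assumed values.

    def swap_loss_bytes(link_rate_bps, overload_factor, sched_error_s, buffer_bytes):
        """Rough estimate of bytes lost during one flow swap.

        link_rate_bps   : link capacity in bits per second
        overload_factor : offered load / capacity during the transient (e.g., 2.0)
        sched_error_s   : time window during which the two updates disagree
        buffer_bytes    : switch buffer that can absorb part of the excess
        Illustrative assumption: constant-rate traffic, single bottleneck link.
        """
        excess_bps = link_rate_bps * (overload_factor - 1.0)
        excess_bytes = excess_bps * sched_error_s / 8.0
        return max(0.0, excess_bytes - buffer_bytes)

    # A 1 microsecond scheduling error (hardware-based scheduling) vs. a
    # 1 millisecond error (software switches), on a 10 Mbps link:
    print(swap_loss_bytes(10e6, 2.0, 1e-6, 0))   # ~1.25 bytes
    print(swap_loss_bytes(10e6, 2.0, 1e-3, 0))   # ~1250 bytes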

2.2.4 Related Work

Time and synchronized clocks have been used in various distributed applications, e.g., [15, 35, 43, 16]. Time-of-day routing [38] routes traffic to different destinations based on the time-of-day. Path calendaring [65] can be used to configure network paths based on scheduled or foreseen traffic changes. The two latter examples are typically performed at a low rate and do not place demanding requirements on accuracy.

Various network update approaches have been analyzed in the literature. A common approach is to use a sequence of configuration commands [28, 31, 32, 30], whereby the order of execution guarantees that no anomalies are caused in intermediate states of the procedure. However, as observed by [30], in some update scenarios, known as deadlocks, there is no order that guarantees a consistent transition. Two-phase updates [22] use configuration version tags to guarantee consistency during updates. However, as per [22], two-phase updates cannot guarantee congestion freedom, and are therefore not effective in flow swap scenarios, such as Fig. 2.1. Hence, in flow swap scenarios the order approach and the two-phase approach produce the same result as the simple-minded approach, in which the controller sends the update commands as close as possible to instantaneously, and hopes for the best. In this paper we present Time4, an update approach that is most effective in flow swaps and other deadlock [30] scenarios, such as Fig. 2.1. We refer to update approaches that do not use time as untimed update approaches.

In SWAN [33], the authors suggest that reserving unused scratch capacity of 10-30% on every link can allow congestion-free updates in most scenarios. The B4 [59] approach prevents packet loss during path updates by temporarily reducing the bandwidth of some or all of the flows. Our approach does not require scratch capacity, and does not reduce the bandwidth of flows during network updates. Furthermore, in this paper we show that variants of SWAN and B4 that make use of Time4 can perform better than the original versions.

A recently published work by Fox Networks [63] shows that accurately timed path updates are essential for video swapping. We analyze this use case further in Section 2.5.

Rearrangeably non-blocking topologies (e.g., [66]) allow new traffic flows to be added to the network by rearranging existing flows. The analysis of flow swaps presented in this paper emphasizes the requirement to perform simultaneous reroutes during the rearrangement procedure, an aspect which has not been previously studied.

Preliminary work-in-progress versions of the current paper introduced the concept of using time in SDN [13] and the flow swapping scenario [12]. The use of time for consistent updates was discussed in [8]. TimeFlip [9] presented a practical method of implementing timed updates. The current work is the first to present a generic protocol for performing timed updates in SDN, and the first to analyze flow swaps, a natural application in which timed updates are the optimal update approach.

2.2.5 Contributions

The main contributions of this paper are as follows:

• We consider a class of network update scenarios called flow swaps, and show that simultaneous updates using synchronized clocks are provably the optimal approach to implementing them. In contrast, existing approaches for consistent updates (e.g., [22, 30]) are not applicable to flow swaps, and other update approaches such as SWAN [33] and B4 [59] can perform flow swaps, but at the expense of increased resource overhead.

• We use game-theoretic analysis to show that flow swaps are inevitable in the dynamic nature of SDN.

• We present the design, implementation and evaluation of a prototype that performs timed updates in OpenFlow.

• Our work includes an extension to the OpenFlow protocol that has been approved by the ONF and integrated into OpenFlow 1.5 [67], and into the OpenFlow 1.3.x extension package [68]. The source code of our prototype is publicly available [69].

• We present experimental results that demonstrate the advantage of timed updates over existing approaches. Moreover, we show that existing update approaches (SWAN and B4) can be improved by using accurate time.

• Our experiments include an emulation of an SDN-controlled video swapping scenario, a real-life use case that has been shown [63] to be infeasible with previous versions of OpenFlow, which did not include our time extension.


[Figure 2.2: Modeling a Clos topology as an unsplittable flow graph. (a) Clos network. (b) Unsplittable flow graph.]

2.3 The Lossless Flow Allocation (LFA) Problem

2.3.1 Inevitable Flow Swaps

Fig. 2.1 presents a scenario in which it is necessary to swap two flows, i.e., to update two switches at the same time. In this section we discuss the inevitability of flow swaps; we show that there does not exist a controller routing strategy that avoids the need for flow swaps. Our analysis is based on representing the flow-swap problem as an instance of an unsplittable flow problem, as illustrated in Fig. 2.2b. The topology of the graph in Fig. 2.2b models the traffic behavior to a given destination in common multi-rooted network topologies such as fat-tree and Clos (Fig. 2.2a). The unsplittable flow problem [70] has been thoroughly discussed in the literature; given a directed graph, a source node s, a destination node d, and a set of flow demands (commodities) between s and d, the goal is to maximize the traffic rate from the source to the destination. In
this paper we define a game between two players: a source¹ that generates traffic flows (commodities) and a controller that reconfigures the network forwarding rules in a way that allows the network to forward all traffic generated by the source without packet losses. Our main argument, phrased in Theorem 2.2, is that the source has a strategy that forces the controller to perform a flow swap, i.e., to reconfigure the path of two or more flows at the same time. Thus, a scenario in which multiple flows must be updated at the same time is inevitable, implying the importance of timed updates. Moreover, we show that the controller can be forced to invoke n individual commands that should optimally be performed at the same time. Update approaches that do not use time, also known as untimed approaches, cause the updates to be performed over a long period of time, potentially resulting in slow and possibly erratic response times and significant packet loss. Timed coordination allows us to perform the n updates within a short time interval that depends on the scheduling error. Although our analysis focuses on the topology of Fig. 2.2b, it can be shown that the results are applicable to other topologies as well, where the source can force the controller to perform a swap over the edges of the min-cut of the graph.

¹ The source player does not represent a malicious attacker; it is an 'adversary', representing the worst-case scenario.

2.3.2 Model and Definitions

We now introduce the lossless flow allocation (LFA) problem; it is not presented as an optimization problem, but rather as a game between two players: a source and a controller. As the source adds or removes flows (commodities), the controller reconfigures the forwarding rules so as to guarantee that all flows are forwarded without packet loss. The controller's goal is to find a forwarding path for all the flows in the system without exceeding the capacity of any of the edges, i.e., to completely avoid loss of packets from the given flows. The source's goal is to progressively add flows, without exceeding the network's capacity, forcing the controller to perform a flow swap. We shall show that the source has a strategy that forces the controller to
swap traffic flows simultaneously in order to avoid packet loss. Our model makes three basic assumptions: (i) each flow has a fixed bandwidth, (ii) the controller strives to avoid packet loss, and (iii) flows are unsplittable. We discuss these assumptions further in Sec. 2.6. The term flow in classic flow problems typically refers to the amount of traffic that is forwarded through each edge of the graph. Since our analysis focuses on SDN, we slightly divert from the common flow problem terminology, and use the term flow in its OpenFlow sense, i.e., a set of packets that share common properties, such as source and destination network addresses. A flow in our context, can be seen as a session between the source and destination that runs traffic at a fixed rate. The network is represented by a directed weighted acyclic graph (Fig. 2.2b), G = (V, E, c), with a source s, a destination d, and a set of intermediate nodes, Vin . Thus, V = Vin ∪ {s, d}. The nodes directly connected to s are denoted by O = {o1 , o2 , . . . , on }. Each of the outgoing edges from the source s has an infinite capacity, whereas the rest of the edges have a capacity c. For the sake of simplicity, and without loss of generality, throughout this section we assume that c = 1. Such a graph G is referred to as an LFA graph. The source node progressively transmits traffic flows towards the destination node. Each flow represents a session between s and d; every flow has a constant bandwidth, and cannot be split between two paths. A centralized controller configures the forwarding policy of the intermediate nodes, determining the path of each flow. Given a set of flows from s to d, the controller’s goal is to configure the forwarding policy of the nodes in a way that allows all flows to be forwarded to d without exceeding the capacity of any of the edges. The set of flows that are generated by s is denoted by F ::= {F1 , F2 , . . . , Fk }. Each flow Fi is defined as Fi ::= (i, fi , ri ), where i is a unique flow index, fi is the bandwidth satisfying 0 < fi ≤ c, and ri denotes the node that the controller forwards the flow to, i.e., ri ∈ {o1 , o2 , . . . , on }. It is assumed that the controller monitors the network, and thus it is aware of the flow set F. The controller maintains a forwarding function, Rcon : F × Vin −→ Vin ∪ {d}. Every node (switch) has a flow table, consisting of a set of entries; an element w ∈ F × Vin is referred to as an entry for short. An update of Rcon is defined to be a partial function u : F × Vin * Vin ∪ {d}.
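To make the definitions above concrete, the following short Python sketch is our own illustration (the class and function names are hypothetical, not part of the TIME4 prototype): it represents flows, a forwarding function restricted to the topology of Fig. 2.2b, and the lossless-forwarding check that the controller must satisfy.

from dataclasses import dataclass
from typing import Dict

CAPACITY = 1.0  # capacity c of each edge e_1..e_m leading to d (edges out of s are unbounded)

@dataclass(frozen=True)
class Flow:
    index: int        # unique flow index i
    bandwidth: float  # f_i, with 0 < f_i <= c
    first_hop: str    # r_i: the node in O = {o_1, ..., o_n} the flow is sent through

def is_lossless(flows: Dict[int, Flow], rcon: Dict[int, str]) -> bool:
    """rcon models the forwarding function restricted to this topology: it maps each
    flow index to the edge (e_1..e_m) through which the flow reaches d. The check
    succeeds iff no edge exceeds its capacity, i.e., no packets of the given flows are lost."""
    load: Dict[str, float] = {}
    for i, flow in flows.items():
        load[rcon[i]] = load.get(rcon[i], 0.0) + flow.bandwidth
    return all(total <= CAPACITY + 1e-9 for total in load.values())

# The four flows from the proof of Theorem 2.2, routed symmetrically over e_1 and e_2.
flows = {1: Flow(1, 0.35, "o1"), 2: Flow(2, 0.35, "o1"),
         3: Flow(3, 0.45, "o2"), 4: Flow(4, 0.45, "o2")}
assert is_lossless(flows, {1: "e1", 2: "e2", 3: "e1", 4: "e2"})  # 0.8 per edge: feasible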


We define a reroute as an update u that has a single entry in its domain. We call an update that has more than one entry in its domain a swap, and it is assumed that all updates in a swap are performed at the same time. We define a k-swap for k ≥ 2 as a swap that updates entries in at least k different nodes. Note that a k-swap is possible only if n ≥ k, where n is the number of nodes in O. We focus our analysis on 2-swaps, and throughout the section we assume that n ≥ 2. In Section 2.3.6 we discuss k-swaps for values of k > 2.

2.3.3 The LFA Game

The lossless flow allocation problem can be viewed as a game between two players, the source and the controller. The game proceeds by a sequence of steps; in each step the source either adds or removes a single flow (Fig. 2.3), and then waits for the controller to perform a sequence of updates (Fig. 2.4). The source's strategy Ss(F, Rcon) = (a, F) is a function that defines for each flow set F and forwarding function Rcon for F, a pair (a, F) representing the source's next step, where a ∈ {Add, Remove} is the action to be taken by the source, and F = (j, fj, rj) is a single flow to be added or removed. The controller's strategy is defined by Scon(Rcon, a, F) = U, where U = {u1, . . . , uℓ} is a sequence of updates, such that (i) at the end of each update no edge exceeds its capacity, and (ii) at the end of the last update, uℓ, the forwarding function Rcon defines a forwarding path for all flows in F. Notice that when a flow is to be removed, the controller's update is trivial; it simply removes all the relevant entries from the domain of Rcon. Hence our analysis focuses on adding new flows.

The following theorem, which is the crux of this section, argues that the source has a strategy that forces the controller to perform a swap, and thus that flow swaps are inevitable from the controller's perspective.

Theorem 2.2. Let G be an LFA graph. In the LFA game over G, there exists a strategy, Ss, for the source that forces every controller strategy, Scon, to perform a 2-swap.

Proof. Let m be the number of incoming edges to the destination node d in the LFA graph (see Fig. 2.2b). For m = 1 the claim is trivial. Hence, we start by proving the claim for m = 2, i.e., there are two edges connected to node d, edges e1 and e2. We show that the source has a
SOURCE PROCEDURE
1  F ← ∅
2  repeat at every step
3      (a, F) ← Ss(F, Rcon)
4      if a = Add
5          F ← F ∪ {F}
6          Wait for the controller to complete updates
7      else    // a = Remove
8          F ← F \ {F}

Figure 2.3: The LFA game: the source's procedure.

CONTROLLER PROCEDURE
1  repeat at every step
2      {u1, . . . , uℓ} ← Scon(Rcon, a, F)
3      for j ∈ [1, ℓ]
4          Update Rcon according to uj

Figure 2.4: The LFA game: the controller's procedure.

strategy that, regardless of the controller’s strategy, forces the controller to use a swap. In the first four steps of the game, the source generates four flows, F1 = (1, 0.35, o1 ), F2 = (2, 0.35, o1 ), F3 = (3, 0.45, o2 ), and F4 = (4, 0.45, o2 ), respectively. According to the Source Procedure of Fig. 2.3, after each flow is added, the source waits for the controller to update Rcon before adding the next flow. After the flows are added, there are two possible cases: (a) The controller routes symmetrically through e1 and e2 , i.e. a flow of 0.35 and a flow of 0.45 through each of the edges. In this case the source’s strategy at this point is to generate a new flow F5 = (5, 0.3, o1 ) with a bandwidth of 0.3. The only way the controller can accommodate F5 is by routing F1 and F2 through the same edge, allowing the new 0.3 flow to be forwarded through that edge. Since there is no sequence of reroute updates that allows the controller to reach the desired Rcon , the only way to reach a state where F1 and
F2 are routed through the same edge is to swap a 0.35 flow with a 0.45 flow. Thus, by issuing F5 the source forces a flow swap as claimed.

(b) The controller routes F1 and F2 through one edge, and F3 and F4 through the other edge. In this case the source's strategy is to generate two flows, F6 and F7, with a bandwidth of 0.2 each. The controller must route F6 through the edge with F1 and F2. Now each path sustains a bandwidth of 0.9 units. Thus, when F7 is added by the source, the controller is forced to perform a swap between one of the 0.35 flows and one of the 0.45 flows.

In both cases the controller is forced to perform a 2-swap, swapping a flow from o1 with a flow from o2. This proves the claim for m = 2. The case of m > 2 is obtained by reduction to m = 2: the source first generates m − 2 flows with a bandwidth of 1 each, causing the controller to saturate m − 2 edges connected to node d (without loss of generality e3, . . . , em). At this point there are only two available edges, e1 and e2. From this point, the proof is identical to the case of m = 2.

The proof of Theorem 2.2 showed that the controller can be forced to perform a flow swap that involves m = 2 paths. For m > 2, we assumed that the source saturates m − 2 paths, reducing the analysis to the case of m = 2. In the following theorem we show that for m > 2 the controller can be forced to perform ⌊m/2⌋ swaps.

Theorem 2.3. Let G be an LFA graph. In the LFA game over G, if m > 2 then there exists a strategy, Ss, for the source that forces every controller strategy, Scon, to perform ⌊m/2⌋ 2-swaps.

Proof. Assume that m is even. The source generates m flows with a bandwidth of 0.35, m flows with a bandwidth of 0.45, and m flows with a bandwidth of 0.2. The only way the controller can route these flows without packet loss is as follows: each path sustains three flows with three different bandwidths, 0.2, 0.35, and 0.45. Now the source removes the m flows of 0.2, and adds m/2 flows of 0.3. As in case (a) of the proof of Theorem 2.2, adding each flow of 0.3 causes a 2-swap. The controller is thus forced to perform m/2 = ⌊m/2⌋ swaps.

If m is odd, then the source can saturate one of the edges by generating a flow with a bandwidth of 1, and then repeat the procedure above for the remaining m − 1 edges, yielding (m − 1)/2 = ⌊m/2⌋ swaps.

For simplicity, throughout the rest of this section we assume that m = 2. However, as in Theorem 2.3, the analysis can be extended to the case of m > 2.

2.3.4 The Impact of Flow Swaps

We define a metric for flow swaps by considering the oversubscription that is caused if the flows are not swapped simultaneously, but updated using an untimed approach. We define the oversubscription of an edge, e, with respect to a forwarding function, Rcon, to be the difference between the total bandwidth of the flows forwarded through e according to Rcon and the capacity of e. If the total bandwidth of the flows through e is less than the capacity of e, the oversubscription is defined to be zero.

Definition 2.4 (Flow swap impact). Let F be a flow set, and Rcon be the corresponding forwarding function. Consider a 2-swap u : F × V ⇀ V ∪ {d}, such that u = u1 ∪ u2, where ui = (wi, vi), for wi ∈ F × V, vi ∈ V ∪ {d}, and i ∈ {1, 2}. The impact of u is defined to be the minimum of: (i) the oversubscription caused by applying u1 to Rcon, or (ii) the oversubscription caused by applying u2 to Rcon.

Example 2.5. We observe the scenario described in the proof of Theorem 2.2, and consider what would happen if the two flows had not been swapped simultaneously. The scenario had two cases; in the first case, the bandwidth through each edge is 0.8 before the controller swaps a 0.35 flow with a 0.45 flow. Thus, if the 0.35 flow is rerouted and then the 0.45 flow, the total bandwidth through the congested edge is 0.8 + 0.35 = 1.15, creating a temporary oversubscription of 0.15. Thus, the flow swap impact in the first case is 0.15. In the second case, one edge sustains a bandwidth of 0.7, and the other a bandwidth of 0.9. The controller needs to swap a 0.35 flow with a 0.45 flow. If the controller first reroutes the 0.45 flow, then during the intermediate transition period, the congested edge sustains a bandwidth of 0.7 + 0.45 = 1.15, and thus it is oversubscribed by 0.15. Hence, the impact in the second case is also 0.15.
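As an illustration, the following Python sketch (our own illustrative code, not part of the prototype) computes the impact of a 2-swap per Definition 2.4, by evaluating the oversubscription each single-entry update would cause if applied alone; it reproduces the 0.15 figure of Example 2.5.

def oversubscription(load: float, capacity: float = 1.0) -> float:
    """Oversubscription of an edge: bandwidth in excess of capacity, or zero."""
    return max(0.0, load - capacity)

def two_swap_impact(u1, u2, capacity: float = 1.0) -> float:
    """Impact of a 2-swap u = u1 ∪ u2 (Definition 2.4): the minimum, over the two
    single-entry updates, of the oversubscription each would cause if applied alone.
    Each argument is a pair (target_edge_load_before, rerouted_flow_bandwidth)."""
    return min(oversubscription(load + moved, capacity) for load, moved in (u1, u2))

# Example 2.5, first case: both edges carry 0.8 before the swap.
print(two_swap_impact((0.8, 0.35), (0.8, 0.45)))   # ~0.15
# Second case: one edge carries 0.7, the other 0.9; min(0.15, 0.25) is again ~0.15.
print(two_swap_impact((0.7, 0.45), (0.9, 0.35)))   # ~0.15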


The following theorem shows that in the LFA game, the source can force the controller to perform a flow swap with a swap impact of roughly 0.5.

Theorem 2.6. Let G be an LFA graph, and let 0 < α < 0.5. In the LFA game over G, there exists a strategy, Ss, for the source that forces every controller strategy, Scon, to perform a swap with an impact of α.

Proof. Let ε = 0.1 − 0.2 · α. We use the source's strategy from the proof of Theorem 2.2, with the exception that the bandwidths f1, . . . , f7 of flows F1, . . . , F7 are: f1 = f2 = 0.5 − 2ε, f3 = f4 = 0.5 − ε, f5 = 4ε, and f6 = f7 = 3ε. As in the proof of Theorem 2.2, there are two possible cases. In case (a), the controller routes symmetrically through the two paths, utilizing 1 − 3ε of the bandwidth of each path. The source adds F5 in response. To accommodate F5 the controller swaps F1 and F3. We determine the impact of this swap by considering the oversubscription of performing an untimed update; the controller first reroutes F1, and only then reroutes F3. Hence, the temporary oversubscription is (1 − 3ε) + (0.5 − 2ε) − 1 = 0.5 − 5ε. Thus, the impact is 0.5 − 5ε = α. In case (b), the controller forwards F1 through the same path as F2, and F3 through the same path as F4. The source responds by generating F6 and F7. Again, the controller is forced to swap between F1 and F3. We compute the impact by considering an untimed update, where the controller reroutes F3 first, causing an oversubscription of (1 − 4ε) + (0.5 − ε) − 1 = 0.5 − 5ε = α. In both cases the source inflicts a flow swap with an impact of α.

Intuitively, Theorem 2.6 shows that not only are flow swaps inevitable, but they have a high impact on the network, as they can cause links to be congested by roughly 50% beyond their capacity.
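For readability, the arithmetic behind the choice of ε in the proof above can be restated as follows (a worked restatement only; it re-derives the substitutions already used in both cases):

\varepsilon = 0.1 - 0.2\,\alpha \;\Longleftrightarrow\; 5\varepsilon = 0.5 - \alpha, \qquad
(1 - 3\varepsilon) + (0.5 - 2\varepsilon) - 1 \;=\; (1 - 4\varepsilon) + (0.5 - \varepsilon) - 1 \;=\; 0.5 - 5\varepsilon \;=\; \alpha .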

2.3.5 Network Utilization

Theorem 2.2 demonstrates that regardless of the controller’s policy, flow swaps cannot be prevented. However, the proof of Theorem 2.2 uses a scenario in which the edges leading to node d are almost fully utilized, suggesting that perhaps flow swaps are inevitable only when the traffic
bandwidth is nearly equal to the max-flow of the graph. Arguably, as suggested in [33], by reserving some scratch capacity ν · c through each of the edges, for 0 < ν < 1, it may be possible to avoid flow swaps. In the next theorem we show that if ν < 1/3, then flow swaps are inevitable.

Theorem 2.7. Let G be an LFA graph, in which a scratch capacity of ν is reserved on each of the edges e1, . . . , em, and let ν < 1/3. In the LFA game over G, there exists a strategy for the source, Ss, that forces every controller strategy, Scon, to perform a swap.

Proof. We consider a graph G′, in which the capacity of each of the edges e1, . . . , em is 1 − ν. By Theorem 2.6, for every 0 < α < 0.5, there exists a strategy for the source that forces a flow swap with an impact of α. Thus, there exists a strategy that forces at least one of the edges to sustain a bandwidth of (1 + α) · (1 − ν). Since ν < 1/3, we have (1 − ν) > 2/3, and thus there exists an α < 0.5 such that (1 + α) · (1 − ν) > 1. It follows that in the original graph G, with scratch capacity ν, there exists a strategy for the source that forces the controller to perform a flow swap in order to avoid the oversubscribed bandwidth of (1 + α) · (1 − ν) > 1.

The analysis of [33] showed that a scratch capacity of 10% is enough to address the reconfiguration scenarios that were considered in that work. Theorem 2.7 shows that even a scratch capacity of 33⅓% does not suffice to prevent flow swap scenarios. It follows that the 10% reserve that [33] suggests may not be sufficient in general for lossless reconfiguration.
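The role of the 1/3 threshold can be spelled out by the following short derivation (a worked restatement added for readability, under the reading that a swap of impact α momentarily loads an edge of capacity 1 − ν with (1 + α)(1 − ν) bandwidth):

(1+\alpha)(1-\nu) > 1 \;\Longleftrightarrow\; \alpha > \frac{\nu}{1-\nu}, \qquad
\frac{\nu}{1-\nu} < \tfrac12 \;\Longleftrightarrow\; \nu < \tfrac13 ,

so whenever ν < 1/3 there is an admissible α < 0.5 for which the temporary load exceeds the full link capacity.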

2.3.6 n-Swaps

As defined above, a k-swap is a swap that involves k or more nodes. In the previous subsections we discussed 2-swaps. The following theorem generalizes Theorem 2.2 to n-swaps, where n is the number of nodes in O.

Theorem 2.8. Let G be an LFA graph. In the LFA game over G, there exists a strategy, Ss, for the source that forces every controller strategy, Scon, to perform an n-swap.

Proof. For n = 1, the claim is trivial. For n = 2, the claim was proven in Theorem 2.2. Thus, we assume n ≥ 3.


If m > 2, the source first generates m − 2 flows with a rate c each, and we assume without loss of generality that after the controller allocates these flows only e1 and e2 remain unused. Thus, we focus on the case where m = 2. We describe a strategy, Ss, as required; s generates three types of flows:
• Type A: two flows F1, F2, at a rate of h each: F1 = (1, h, o1), and F2 = (2, h, o1).
• Type B: n flows, F3, . . . , Fn+2, with a total rate g, i.e., at a rate of g/n each. The source sends each of the n flows through a different node of O.
• Type C: n − 1 flows, Fn+3, . . . , F2n+1, with a total rate g, i.e., g/(n − 1) each. The source sends each of the n − 1 flows through a different node of o2, . . . , on.
We define h and g so as to satisfy two conditions, (2.1) and (2.2). Since n ≥ 3, we have n² − n ≥ 6, and thus g/(2(n² − n)) > 0. Every g and h in the range (11/24, 1/2) that satisfy h < g also satisfy (2.1) and (2.2); intuitively, for h and g sufficiently close to 1/2 (but less than 1/2), conditions (2.1) and (2.2) are satisfied.

We now prove that after generating the flows F1 , . . . , F2n+1 , the function Rcon forwards all type B flows through the same path, and all type C flows through the same path. Assume by way of contradiction that there is a forwarding function Rcon that forwards flows F1 , . . . , F2n+1 without
loss, but does not comply with the latter claim. We consider two distinct cases: either the two type A flows are forwarded through the same edge, or they are forwarded through two different edges.
• If the two type A flows are forwarded through two different paths, then we assume that F1 and the n type B flows are forwarded through e1, and that F2 and the n − 1 type C flows are forwarded through e2. Thus, at this point each of the two edges sustains traffic at a rate of g + h. By the assumption, there exists an update that swaps i < n flows of type B with j < n − 1 flows of type C, such that after the swap none of the edges exceeds its capacity. Thus, the update adds the bandwidth |j · g/(n − 1) − i · g/n| to one of the edges, and this additional bandwidth must fit into the available bandwidth before the update, 1 − g − h. Hence, |j · g/(n − 1) − i · g/n| < c − g − h. Note that 1 − g − h < 1 − 2h, and by (2.1) and (2.2) no such i and j exist, a contradiction.
• If the two type A flows are forwarded through the same edge, the available bandwidth on that path is 1 − 2h. We note that g/n > 1 − 2h and that g/(n − 1) > g/n. It follows that the rate of every type B and type C flow is greater than 1 − 2h, and thus none of the type B or type C flows fits on the same path with F1 and F2. Thus, all the type B and type C flows are on the same path, contradicting the assumption.

We have shown that all flows of type B, denoted by FB, must be forwarded through the same path, and that all flows of type C, denoted by FC, are forwarded through the same path. Thus, after the source generates the 2 · n + 1 flows, there are two possible scenarios:
• The two type A flows are forwarded through the same path, and the type B and type C flows are forwarded through the other path. In this case s generates two flows at a rate of 1 − h − g each. To accommodate both flows the controller must swap the flows of FB
with F1 or the flows of FC with F2. Both possible swaps involve n entries, and thus the controller is forced to perform an n-swap.
• One path is used for F1 and the flows of FC, and the other path is used for F2 and the flows of FB. In this case the source generates a flow with a bandwidth of 1 − 2h, again forcing the controller to swap the flows of FB with F1 or the flows of FC with F2.
In both cases the controller is forced to perform a swap that involves the n nodes, i.e., an n-swap.

2.4 Design and Implementation

2.4.1 Protocol Design

1) Overview

A TIME4-enabled system is comprised of two main components:
• OpenFlow time extension. TIME4 is built upon the OpenFlow protocol. We define an extension to the OpenFlow protocol that enables timed updates; the controller can attach an execution time to every OpenFlow command it sends to a switch, defining when the switch should perform the required command. It should be noted that the TIME4 approach is not limited to OpenFlow; we have defined a similar time extension to the NETCONF protocol [3, 71], but in this paper we focus on TIME4 in the context of OpenFlow, as described in the next subsection.
• Clock synchronization. TIME4 requires the switches and controller to maintain a local clock, allowing time-triggered events. Hence, the local clocks should be synchronized. The OpenFlow time extension we defined does not mandate a specific synchronization method. Various mechanisms may be used, e.g., the Network Time Protocol (NTP), the Precision Time Protocol (PTP) [21], or GPS-based synchronization. The prototype we
designed and implemented uses REVERSEPTP [10], as described below.²

2) OpenFlow Time Extension

We present an extension that allows OpenFlow controllers to signal the time of execution of a command to the switches. This extension is described in full in [72].³ It should be noted that the TIME4 approach is not limited to OpenFlow; we have defined a similar time extension to the NETCONF protocol [3, 71], but in this paper we focus on TIME4 in the context of OpenFlow. Our extension makes use of the OpenFlow [18] Bundle feature; a Bundle is a sequence of OpenFlow messages from the controller that is applied as a single operation. Our time extension defines Scheduled Bundles, allowing all commands of a Bundle to come into effect at a predetermined time. This is a generic means to extend all OpenFlow commands with the scheduling feature. Using Bundle messages for implementing TIME4 has two significant advantages: (i) it is a generic method to add the time extension to all OpenFlow commands without changing the format of all OpenFlow messages; only the format of Bundle messages is modified relative to the Bundle message format in [18], optionally incorporating an execution time. (ii) The Scheduled Bundle allows a relatively straightforward way to cancel scheduled commands, as described below.

Fig. 2.5 illustrates the Scheduled Bundle message procedure. In step 1, the controller sends a Bundle Open message to the switch, followed by one or more Add messages (step 2). Every Add message encapsulates an OpenFlow message, e.g., a FLOW_MOD message. A Bundle Close is sent in step 3, followed by the Bundle Commit (step 4), which optionally includes the scheduled time of execution, Ts. The switch then executes the desired command(s) at time Ts. The Bundle Discard message (step 5') allows the controller to enforce an all-or-none scheduled update; after the Bundle Commit is sent, if one of the switches sends an error message, indicating that it is unable to schedule the current Bundle, the controller can send a Discard message to all switches, canceling the scheduled operation.

² We chose to use a variant of PTP, as this is a mature technology that has several open source implementations, provides a very high degree of accuracy compared to NTP, and is affordable compared to using a GPS receiver.
³ A preliminary version of this extension was presented in [73].

Figure 2.5: A Scheduled Bundle: the Bundle Commit message may include Ts, the scheduled time of execution. The controller can use a Bundle Discard message to cancel the Scheduled Bundle before time Ts.
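The message exchange of Fig. 2.5 can be summarized by the following Python sketch, which models the controller side abstractly (the message names and helper types here are illustrative only; they are not the OpenFlow wire format or the API of any particular controller):

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class BundleMsg:
    kind: str                          # "OPEN", "ADD", "CLOSE", "COMMIT", "DISCARD"
    payload: Optional[str] = None      # e.g., an encapsulated FLOW_MOD, for ADD messages
    exec_time: Optional[float] = None  # Ts, only meaningful for a scheduled COMMIT

def scheduled_bundle(commands: List[str], ts: float) -> List[BundleMsg]:
    """Build the Scheduled Bundle sequence of Fig. 2.5: Open, Add*, Close, Commit(Ts)."""
    msgs = [BundleMsg("OPEN")]
    msgs += [BundleMsg("ADD", payload=cmd) for cmd in commands]  # step 2: one Add per command
    msgs.append(BundleMsg("CLOSE"))                              # step 3
    msgs.append(BundleMsg("COMMIT", exec_time=ts))               # step 4: carries Ts
    return msgs

def discard() -> BundleMsg:
    """Step 5': cancel a scheduled bundle, e.g., after an error reply from some switch."""
    return BundleMsg("DISCARD")

# Usage: schedule two flow modifications to take effect at the same instant Ts.
seq = scheduled_bundle(["FLOW_MOD: swap flow A", "FLOW_MOD: swap flow B"], ts=1700000000.0)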

Hence, when a switch receives a scheduled commit, to be executed at time Ts, the switch can verify that it can dedicate the required resources to execute the command as close as possible to Ts. If the switch's resources are not available, for example due to another command that is scheduled to Ts, then the switch replies with an error message, aborting the scheduled commit. Significantly, this mechanism allows switches to execute the command with a guaranteed scheduling accuracy, avoiding the high variation that occurs when untimed updates are used.

The OpenFlow time extension also defines Bundle Feature Request messages, which allow the controller to query switches about whether they support Scheduled Bundles, and to configure some of the switch parameters related to Scheduled Bundles.

3) Clock Synchronization: REVERSEPTP

In the last decade PTP, based on the IEEE 1588 [21] standard, has become a common feature in commodity switches, typically providing a clock accuracy on the order of 1 microsecond. In [10, 11] we introduced REVERSEPTP, a PTP variant for SDNs. REVERSEPTP is based on PTP, but is conceptually reversed. In PTP a single node periodically distributes its time to the other nodes in the network. In REVERSEPTP all nodes in the network (the switches) periodically distribute their time to a single node (the controller). The controller keeps track of the offsets,
denoted by offset_i for switch i, between its clock and each of the switches' clocks, and uses them to send each switch individualized timed commands. REVERSEPTP allows the complex clock algorithms to be implemented by the controller, whereas the 'dumb' switches only need to distribute their time to the controller. Following the SDN paradigm, the REVERSEPTP algorithmic logic can be programmed and dynamically tuned at the controller without affecting the switches. Another advantage of REVERSEPTP, which played an important role in our experiments, is that it allows the controller to keep track of the synchronization status of each clock; a clock synchronization protocol requires a long setup time, typically tens of minutes. REVERSEPTP provides an indication of when the setup process has completed.

Figure 2.6: REVERSEPTP in SDN: switches distribute their time to the controller. Switches' clocks are not synchronized. For every switch i, the controller knows offset_i between switch i's clock and its local clock.

As shown in [10], REVERSEPTP can be effectively used to perform timed updates; in order to have switch i perform a command at time Ts, the controller instructs i to perform the command at time Ts_i, where Ts_i = Ts + offset_i takes the offset between the controller and switch i into account,⁴ causing i to perform the action at time Ts according to the controller's clock.

⁴ Ts_i, as described above, is a first order approximation of the desired execution time. The controller can compute a more accurate execution time by also considering the clock skew and drift, as discussed in [10].
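In code, the controller-side computation is a one-liner per switch. The sketch below is our own illustration with hypothetical names; it implements only the first-order translation, and the comment notes where skew and drift would refine it, per the remark above.

from typing import Dict

def per_switch_exec_times(ts: float, offsets: Dict[str, float]) -> Dict[str, float]:
    """First-order translation of a controller-clock execution time Ts to each switch.

    offsets[i] is offset_i = (switch i's clock) - (controller's clock), as tracked by
    the controller via REVERSEPTP; switch i is instructed to act at Ts_i = Ts + offset_i.
    (A more accurate value would also account for clock skew and drift.)"""
    return {switch: ts + offset for switch, offset in offsets.items()}

# Example: switch "s1" runs 2.5 ms ahead of the controller, "s2" runs 1.0 ms behind it.
print(per_switch_exec_times(100.0, {"s1": 0.0025, "s2": -0.0010}))
# {'s1': 100.0025, 's2': 99.999}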


Figure 2.7: TIME4 prototype design: the black blocks are the components implemented in the context of this work.

2.4.2 Prototype Design and Implementation

We have designed and implemented a software-based prototype of TIME4, as illustrated in Fig. 2.7. The components we implemented are marked in black. These components run on Linux, and are publicly available as open source [69]. Our TIME4-enabled OFSoftswitch prototype was adopted by the ONF as the official prototype of Scheduled Bundles.⁵

Switches. Every switch i runs an OpenFlow switch software module. Our prototype is based on the open source CPqD OFSoftswitch [74],⁶ incorporating the switch scheduling module (see Fig. 2.7) that we implemented.

⁵ The ONF process for adding new features to OpenFlow requires every new feature to be prototyped.
⁶ OFSoftswitch is one of the two software switches used by the Open Networking Foundation (ONF) for prototyping new OpenFlow features. We chose this switch since it was the first open source OpenFlow switch to include the Bundle feature.


When the switch receives a Scheduled Bundle from the controller, the switch scheduling module schedules the respective OpenFlow command to the desired time of execution. The switch scheduling module also handles Bundle Feature Request messages received from the controller. Each switch runs a REVERSEPTP master, which distributes the switch's time to the controller. Our REVERSEPTP prototype is a lightweight set of Bash scripts that is used as an abstraction layer over the well-known open source PTPd [75] module. Our software-based implementation uses the Linux clock as the reference for PTPd, and for the switch's scheduling module. To the best of our knowledge, ours is the first open source implementation of REVERSEPTP.

Controller. The controller runs an OpenFlow agent, which communicates with the switches using the OpenFlow protocol. Our prototype uses the CPqD Dpctl (Datapath Controller), which is a simple command line tool for sending OpenFlow messages to switches. We have extended Dpctl by adding the time extension; the Dpctl command-line interface allows the user to define the execution time of a Bundle Commit. Dpctl also allows a user to send a Bundle Feature Request to switches. The controller runs REVERSEPTP with n instances of PTPd in slave mode, where n is the number of switches in the network. One or more SDN applications can run on the controller and perform timed updates. The application can extract the offset, offset_i, of every switch i from REVERSEPTP, and use it to compute the scheduled execution time of switch i in every timed update. The Linux clock is used as a reference for PTPd, and for the SDN application(s).

2.5 Evaluation

2.5.1 Evaluation Method

Environment. We evaluated our prototype on a 71-node testbed in the DeterLab [25] environment. Each machine (PC) in the testbed either played the role of an OpenFlow switch, running our TIME4-enabled prototype, or the role of a host, sending and receiving traffic. A separate machine was used as a controller, which was connected to the switches using an out-of-band
network. We remark that we did not use Mininet [76] in our evaluation, as Mininet is an emulation environment that runs on a single machine, making it impractical for emulating simultaneous or time-triggered events. We did, however, run our prototype over Mininet in some of our preliminary testing and verification.

Performance attributes. Three performance attributes play a key role in our evaluation, as shown in Table 2.1.

Table 2.1: Performance Attributes.
  ∆    The average time elapsed between two consecutive messages sent by the controller.
  IR   Installation latency range: the difference between the maximal rule installation latency and the minimal installation latency.
  δ    Scheduling error: the maximal difference between the actual update time and the scheduled update time.

Intuitively, ∆ and IR determine the performance of untimed updates. ∆ indicates the controller's performance; an OpenFlow controller can handle as many as tens of thousands [77] to millions [78] of packets per second, depending on the type of controller and the machine's processing power. Hence, ∆ can vary from 1 microsecond to several milliseconds. IR indicates the installation latency variation. The installation latency is the time elapsed from the instant the controller sends a rule modification message until the rule has been installed. The installation latency of an OpenFlow rule modification (FLOW_MOD) has been shown to range from 1 millisecond to seconds [64, 30], and grows dramatically with the number of installations per second. The attribute that affects the performance of timed updates is the switches' scheduling error, δ. When an update is scheduled to be performed at time T0, it is performed in practice at some time t ∈ [T0, T0 + δ].⁷ The scheduling error, δ, is affected by two factors: the device's clock accuracy, which is the maximal offset between the clock value and the value of an accurate time reference, and the execution accuracy, which is a measure of how accurately the device can perform a timed update, given run-time parameters such as the concurrently executing tasks and the load on the device.

⁷ An alternative representation of the accuracy, δ, assumes a symmetric error, T0 ± δ. The two approaches are equivalent.


The achievable clock accuracy strongly depends on the network size and topology, and on the clock synchronization method. For example, the clock accuracy using the Precision Time Protocol [21] is typically on the order of 1 microsecond (e.g., [44]).

Software-based evaluation. Our experiments measure the three performance attributes in a setting that uses software switches. The software-based experiments provide a qualitative evaluation of the scalability of TIME4, and of how it compares to untimed approaches. While the values we measured do not necessarily reflect the performance of systems that use hardware-based switches, the merit of our evaluation is that we vary these parameters and analyze how they affect the network update performance with untimed approaches and with TIME4.

Figure 2.8: Measurement of the three performance attributes: (a) the empirical Cumulative Distribution Function (CDF) of ∆, the time elapsed between two consecutive controller messages; (b) the empirical CDF of the flow installation latency, where IR is the difference between the maximal and minimal values; (c) the empirical CDF of the scheduling error, i.e., the difference between the actual execution time and the scheduled execution time, where δ is the maximal error value. The values of ∆, IR, and δ are marked in the figure for the Type I machines.

Table 2.2: Measured attributes in milliseconds.
Machine Type                                      ∆       IR      δ
Type I   (Intel Xeon E3 LP, 2.4 GHz, 16 GB RAM)   9.64    1.3     1.23
Type II  (Intel Xeon, 2.1 GHz, 4 GB RAM)          9.6     1.47    1.18
Type III (Intel Dual Xeon, 3 GHz, 2 GB RAM)       14.27   2.72    1.19

2.5.2 Performance Attribute Measurement

Our experiments measured the three attributes, ∆, IR, and δ, illustrating how accurately updates can be applied in software-based OpenFlow implementations. It should be noted that these three values depend on the processing power of the testbed machine; we measured the parameters for three types of DeterLab machines, Type I, II, and III, listed in Table 2.2. Each attribute was measured 100 times on each machine type, and Fig. 2.8 illustrates our results. The figure graphically depicts the values ∆, IR, and δ of machine Type I as an example. The measured scheduling error, δ, was slightly more than 1 millisecond in all the machines we tested. Our experiments showed that the clock accuracy using REVERSEPTP over the DeterLab testbed is on the order of 100 microseconds. The measured value of δ in Table 2.2 shows the execution accuracy, which is an order of magnitude higher. The installation latency range, IR, was slightly higher than δ, around 1 to 3 milliseconds. The measured value of ∆ was high, on the order of 10 milliseconds, as Dpctl is not optimized for performance. In software-based switches, the CPU handles both the data-plane traffic and the communication with the controller, and thus IR and δ can be affected by the rate of data-plane traffic through the switch. Hence, in our experiments we fixed the rate of traffic through each switch to 10 Mbps, allowing an 'apples-to-apples' comparison between experiments.

2.5.3 Microbenchmark: Video Swapping

To demonstrate how TIME4 is used in a real-life scenario, we reconstructed the video swapping topology of [63], as illustrated in Fig. 2.9a. Two video cameras, A and B, transmit an uncompressed
video stream to targets A and B, respectively. At a given point in time, the two video streams are swapped, so that the stream from source A is transmitted to target B, and the stream from B is sent to target A. As described in [63], the swap must be performed at a specific time instant, in which the video sources transmit data that is not visible to the viewer, making the swap unnoticeable.

Figure 2.9: Microbenchmark: video swapping. (a) Topology. (b) Video swapping accuracy: the empirical PDF of the scheduling error in milliseconds.

The authors of [63] noted that the precisely-timed swap cannot be performed by an OpenFlow switch, as currently OpenFlow does not provide abstractions for performing accurately timed changes. Instead, [63] uses source timing, where sources A and B are time-synchronized, and determine the swap time by using a swap indication in the packet header. The OpenFlow switch acts upon the swap indication to determine the correct path for each stream. We note that the main drawback of this source-timed approach is that the SMPTE 2022-6 video streaming standard [79], which was used in [63], does not currently define an indication of where in the video stream a packet comes from, and specifically does not include an indication of the correct swapping time. Hence, off-the-shelf streaming equipment does not provide this indication. In [63], the authors used a dedicated Linux server to integrate the non-standard swap indication. In this experiment we studied how TIME4 can tackle the video swapping scenario, avoiding the above drawback. Each node in the topology of Fig. 2.9a was emulated by a DeterLab machine. We used two 10 Mbps flows, generated by Iperf [80], to simulate the video streams. Each swap was initiated by the controller 100 milliseconds in advance (as in [63]): the controller
sent a Scheduled Bundle, incorporating two updates, one for each of the flows. We repeated the experiment 100 times, and measured the scheduling error. The measurement was performed by analyzing capture files taken at the sources and at the switch’s egress ports. A swap that was scheduled to be performed at time T , was considered accurate if every packet that was transmitted by each of the source before time T was forwarded according to the old configuration, and every packet that was transmitted after T was forwarded according to the new configuration. The scheduling error of each swap (measured in milliseconds) was computed as the number of misrouted packets, divided by the bandwidth of the traffic flow. The sign of the scheduling error indicates whether the swap was performed before the scheduled time (negative error) or after it (positive error). Fig. 2.9b illustrates the empirical Probability Density Function (PDF) of the scheduling error of the swap, i.e., the difference between the actual swapping time and the scheduled swapping time. As shown in the figure, the swap is performed within ±0.6 milliseconds of the scheduled swap time. We note that this is the achievable accuracy in a software-based OpenFlow switch, and that a much higher degree of accuracy, on the order of microseconds, can be achieved if two conditions are met: (i) A hardware switch is used, supporting timed updates with a microsecond accuracy, as shown in [9], and (ii) The cameras are connected to the switch over a single hop, allowing low latency variation, on the order of microseconds.
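For concreteness, the per-swap error computation described above can be written as the following sketch (our own illustration; it assumes the flow's constant packet rate is derived from its bandwidth and packet size, which is how a packet count converts to a time interval):

def scheduling_error_ms(misrouted_packets: int, flow_bandwidth_bps: float,
                        packet_size_bits: float, before_scheduled_time: bool) -> float:
    """Estimate the scheduling error of one swap from capture files.

    A flow of constant bandwidth carries flow_bandwidth_bps / packet_size_bits packets
    per second, so each misrouted packet accounts for the inverse of that rate. The sign
    is negative if the swap occurred before the scheduled time, positive otherwise."""
    packets_per_second = flow_bandwidth_bps / packet_size_bits
    error_seconds = misrouted_packets / packets_per_second
    return -1000.0 * error_seconds if before_scheduled_time else 1000.0 * error_seconds

# Example: 5 misrouted packets on a 10 Mbps flow of 1250-byte (10,000-bit) packets
# correspond to a 5 ms error after the scheduled time.
print(scheduling_error_ms(5, 10e6, 10_000, before_scheduled_time=False))  # 5.0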

2.5.4 Flow Swap Evaluation

1) Experiment Setting

We evaluated our prototype on a 71-node testbed. We used the testbed to emulate an OpenFlow network with 32 hosts and 32 leaf switches, as depicted in Fig. 2.10, with n = 32.

Metric. A flow swap that is not performed in a coordinated way may bear a high cost: either packet loss, deep buffering, or a combination of the two. We use packet loss as a metric for the cost of flow swaps, assuming that deep buffering is not used. We used Iperf to generate flows from the sources to the destination, and to measure the number of packets lost between the source and the destination.


Figure 2.10: Experimental evaluation: every host and switch was emulated by a Linux machine in the DeterLab testbed. All links have a capacity of 10 Mbps. The controller is connected to the switches by an out-of-band network.

The flow swap scenario. All experiments were flow swaps with a swap impact of 0.5.⁸ We used two static flows, which were not reconfigured in the experiment: H1 generates a 5 Mbps flow that is forwarded through q1, and H2 generates a 5 Mbps flow that is forwarded through q2. We generated n additional flows (where n is the number of switches at the bottom layer of the graph): (i) a 5 Mbps flow from H1 to the destination; (ii) n − 1 flows, each having a bandwidth of 5/(n − 1) Mbps. Every flow swap in our experiment required the flow of (i) to be swapped with the n − 1 flows of (ii). Note that this swap has an impact of 0.5.

2) Experimental Results

TIME4 vs. other update approaches. In this experiment we compared the packet loss of TIME4 to other update approaches described in Sec. 2.2.4. As discussed in Sec. 2.2.4, applying the order approach or the two-phase approach to flow swaps produces similar results. This observation is illustrated in Fig. 2.11b. In the rest of this section we refer to these two approaches collectively as the untimed approaches.

⁸ By Theorem 2.6, the source can force the controller to perform a flow swap with an impact as high as roughly 0.5.

Figure 2.11: Flow swap performance: (a) the number of packets lost in a flow swap vs. the number of switches involved in the update; (b) the number of packets lost in a flow swap in different update approaches (with n = 32); (c) the number of packets lost in a flow swap using SWAN and TIME4+SWAN (with n = 32); (d) the number of packets lost in a flow swap using B4 and TIME4+B4 (with n = 32). In large networks (a), TIME4 allows significantly less packet loss than untimed approaches. The packet loss of TIME4 is slightly higher than SWAN and B4 (b), while the latter two methods incur higher overhead. Combining TIME4 with SWAN or B4 provides the best of both worlds: low packet loss (b) and low overhead (c and d).

In our experiments we also implemented a SWAN-based [33] update and a B4-based [59] update. In SWAN, we used a 10% scratch capacity on each of the links, and in B4 updates we temporarily reduced the bandwidth of each flow by 10% to avoid packet loss. As depicted in Fig. 2.11b, SWAN and B4 yield a slightly lower packet loss rate than TIME4; the average number of packets lost in each TIME4 flow swap is 0.2, while with SWAN and B4 only 0.1 packets are lost on average.


To study the effect of using time in SWAN and in B4, we also performed hybrid updates, illustrated in Fig. 2.11c and 2.11d, and in the two right-most bars of Fig. 2.11b. We combined SWAN and TIME4 by performing a timed update on a network with scratch capacity, and compared the packet loss to the conventional SWAN-based update. We repeated the experiment for various values of scratch capacity, from 0% to 10%. As illustrated in Fig. 2.11c, the TIME4+SWAN approach can achieve the same level of packet loss as SWAN with less scratch capacity. We performed a similar experiment with a timed B4 update, varying the bandwidth reduction rate between 0% and 10%, and observed similar results.

Number of switches. We evaluated the effect of n, the number of switches involved in the flow swap, on the packet loss. We performed an n-swap with n = 2, 4, 8, 16, 32. As illustrated in Fig. 2.11a, the number of packets lost during an untimed update grows linearly with the number of switches n, while the number of packets lost in a TIME4 update is less than one on average, and is not affected by the number of switches. As n increases, the update duration⁹ is longer, and hence more packets are lost during the update procedure.

Controller performance. In this experiment we explored how the controller's performance, represented by ∆, affects the packet loss rate in an untimed update. As ∆ increases, the update procedure requires a longer period of time, and hence more packets are lost (Fig. 2.12) during the process. We note that although previous work has shown that ∆ can be on the order of microseconds in some cases [78], Dpctl is not optimized for performance, and hence ∆ in our experiments was on the order of milliseconds. As shown in Fig. 2.12, we synthetically increased ∆, and observed its effect on the packet loss during flow swaps.

Installation latency variation. Our next experiment (Fig. 2.13a) examined how the installation latency variation, denoted by IR, affects the packet loss during an untimed update. We analyzed different values of IR: in each update we synthetically determined a uniformly distributed installation latency, I ∼ U[0, IR]. As shown in Fig. 2.13a, the switch's installation latency range, IR, dramatically affects the packet loss rate during an untimed update. Notably, when IR is on the order of 1 second, as in the extreme scenarios of [64, 30], TIME4 has a significant advantage over the untimed approach.

update duration is the time elapsed from the instant the first switch is updated until the instant the last

switch is updated. In our setting the update duration is roughly (n − 1)∆.

Figure 2.12: The number of packets lost in a flow swap vs. ∆. The packet loss in TIME4 is not affected by the controller's performance (∆).

Figure 2.13: Performance as a function of IR and δ: (a) the number of packets lost in a flow swap vs. the installation latency range, IR; (b) the number of packets lost in a flow swap vs. the scheduling error, δ. Untimed updates are affected by the installation latency variation (IR), whereas TIME4 is affected by the scheduling error (δ). TIME4 is advantageous since typically δ < IR.

Scheduling error. Figure 2.13b depicts the packet loss as a function of the scheduling error of TIME4. By Fig. 2.11a, 2.13a and 2.13b, we observe that if δ is sufficiently low compared to IR and (n − 1)∆, then TIME4 outperforms the untimed approaches. Note that even if switches are not implemented with extremely low scheduling error δ, we expect TIME4 to outperform the untimed approach, as typically δ < IR, as further discussed in Section 2.6.


Summary. The experiments presented in this section demonstrate that TIME4 performs significantly better than untimed approaches, especially when the update involves multiple switches, or when there is a non-deterministic installation latency. Interestingly, TIME4 can be used in conjunction with existing approaches, such as SWAN and B4, allowing the same level of packet loss with less overhead than the untimed variants.

2.6 Discussion

1) Scheduling accuracy

The advantage of timed updates greatly depends on the scheduling accuracy, i.e., on the switches' ability to accurately perform an update at its scheduled time. Clocks can typically be synchronized on the order of 1 microsecond (e.g., [44]) using PTP [21]. However, a switch's ability to accurately perform a scheduled action depends on its implementation.
• Software switches: Our experimental evaluation showed that the scheduling error in the software switches we tested was on the order of 1 millisecond.
• Hardware-based scheduling: The work of [9] has shown a method that allows the scheduling error of timed events in hardware switches to be as low as 1 microsecond.
• Software-based scheduling in hardware switches: A scheduling mechanism that relies on the switch's software may be affected by the switch's operating system and by other running tasks. Measures can be taken to implement accurate software-based scheduling in TIME4: when a switch is aware of an update that is scheduled to take place at time Ts, it can avoid performing heavy maintenance tasks at this time, such as TCAM entry rearrangement. Update messages received slightly before time Ts can be queued and processed after the scheduled update is executed. Moreover, if a switch receives a timed command that is scheduled to take place at the same time as a previously received command, it can send an error message to the controller, indicating that the last received command cannot be executed.
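A minimal sketch of the switch-side admission logic just described (our own illustration, not the prototype's code): a new scheduled command is rejected with an error if another command is already scheduled for the same time, and otherwise queued for execution at Ts.

import heapq

class SwitchScheduler:
    """Toy model of software-based scheduling in a switch: one command per time slot."""
    def __init__(self):
        self._queue = []          # min-heap of (exec_time, command)
        self._busy_times = set()  # times already reserved by scheduled commands

    def schedule(self, exec_time: float, command: str) -> bool:
        """Accept a timed command, or return False (i.e., an error reply) on a conflict."""
        if exec_time in self._busy_times:
            return False          # the controller may then Discard the whole bundle
        self._busy_times.add(exec_time)
        heapq.heappush(self._queue, (exec_time, command))
        return True

    def due(self, now: float):
        """Pop and return all commands whose scheduled time has arrived."""
        ready = []
        while self._queue and self._queue[0][0] <= now:
            t, cmd = heapq.heappop(self._queue)
            self._busy_times.discard(t)
            ready.append(cmd)
        return ready

sched = SwitchScheduler()
assert sched.schedule(10.0, "FLOW_MOD A")
assert not sched.schedule(10.0, "FLOW_MOD B")   # conflict: reply with an error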


It is an important observation that in a typical system we expect the scheduling error to be lower than the installation latency variation, i.e., δ < IR. Untimed updates have a nondeterministic installation latency. On the other hand, timed updates are predictable, and can be scheduled in a way that avoids conflicts between multiple updates, allowing δ to be typically lower than IR.

2) Model assumptions

Our model assumes a lossless network with unsplittable, fixed-bandwidth flows. A notable example of a setting in which these assumptions are often valid is a WAN or a carrier network. In carrier networks the maximal bandwidth of a service is defined by its bandwidth profile [62]. Thus, the controller cannot dynamically change the bandwidth of the flows, as they are determined by the SLA. The Frame Loss Ratio (FLR) is one of the key performance attributes [62] that a service provider must comply with, and cannot be compromised. Splitting a flow between two or more paths may result in packets being received out-of-order. Packet reordering is a key performance parameter in carrier-grade performance and availability measurement, as it affects various applications such as real-time media streaming [81]. Thus, all packets of a flow are forwarded through the same path.

The game-theoretic model we analyzed uses two players, the controller and the source. In real-life networks, it is often the case that the network operator or service provider controls the network configuration (using the controller), but has no control over the traffic (generated by the source). In contrast, some of the previous work by Microsoft [33] and Google [59] presented scenarios in which the network operators have access to the network configuration, as well as the endpoints (sources of traffic). However, it should be noted that our two-player setting still applies to these scenarios; the source player should be viewed as 'mother nature', representing the occurrence of applications that require network traffic that the network operator must take care to accommodate.

3) Short term vs. long term scheduling

The OpenFlow time extension we presented in Section 2.4 is intended for short term scheduling; a controller should schedule an action to a near-future time, on the order of seconds in the
future. The challenge in long term scheduling is that during the long period between the time at which the Scheduled Bundle was sent and the time at which it is meant to be executed, various external events may occur: the controller may fail or reboot, or a second controller (in an SDN with a distributed control plane, where more than one controller is used) may try to perform a conflicting update. Near-future scheduling guarantees that external events that may affect the scheduled operation, such as a switch reboot, have a low probability of occurring. Since near-future scheduling is on the order of seconds, this short potentially hazardous period is no worse than in conventional updates, where an OpenFlow command may be executed a few seconds after it was sent by the controller.

4) Network latency

In Fig. 2.1, the switches S1 and S3 are updated at the same time, as it is implicitly assumed that all the links have the same latency. In the general case each link has a different latency, and thus S1 and S3 should not be updated at the same time, but at two different times, T1 and T3, that account for the different latencies.

5) Failures

A timed update may fail to be performed in a coordinated way at multiple switches if some of the switches have failed, or if some of the controller commands have failed to reach some of the switches. Therefore, the controller uses a reliable transport protocol (TCP), in which dropped packets are retransmitted. If the controller detects that a switch has failed, or has failed to receive some of the Bundle messages, the controller can use the Bundle Discard to cancel the coordinated update. Note that the controller should send timed update messages sufficiently ahead of the scheduled time of execution, allowing enough time for possible retransmission and Discard message transmission.

6) Controller performance overhead

The prototype design we presented (Fig. 2.7) uses REVERSEPTP [10] to synchronize the switches and the controller. A synchronization protocol may yield some performance overhead on the controller and switches, and some overhead on the network bandwidth. In our experiments
we observed that the CPU utilization of the PTP processes in the controller in an experiment with 32 switches was 5% on the weakest machine we tested, and significantly less than 1% on the stronger machines. As for the network bandwidth overhead, accurate synchronization using PTP typically requires the controller to exchange ∼ 5 packets per second per switch [82], a negligible overhead in high-speed networks.

2.7 Conclusion

Time and clocks are valuable tools for coordinating updates in a network. We have shown that dynamic traffic steering by SDN controllers requires flow swaps, which are best performed as close to instantaneously as possible. Time-based operation can help to achieve a carrier-grade packet loss rate in environments that require rapid path reconfiguration. Our OpenFlow time extension can be used for implementing flow swaps and TIME4. It can also be used for a variety of additional timed update scenarios that can help improve network performance during path and policy updates.

2.8 Acknowledgments

We gratefully acknowledge Oron Anschel and Nadav Shiloach, who implemented the TIME4-enabled OFSoftswitch prototype. We thank Jean Tourrilhes and the members of the Extensibility working group of the ONF for many helpful comments that contributed to the OpenFlow time extension. We also thank Nate Foster, Laurent Vanbever, Joshua Reich and Isaac Keslassy for helpful discussions. We gratefully acknowledge the DeterLab project [25] for the opportunity to perform our experiments on the DeterLab testbed. This work was supported in part by the ISF grant 1520/11.


Chapter 3

Timed Consistent Network Updates in SDN

This chapter is a reprint of the paper:

[7] T. Mizrahi, E. Saat and Y. Moses, “Timed consistent network updates in software defined networks,” IEEE/ACM Transactions on Networking (ToN), 2016.

A preliminary version of this paper appeared in the ACM SIGCOMM Symposium on SDN Research (SOSR) 2015 [8].

3.1 Abstract

Network updates such as policy and routing changes occur frequently in Software Defined Networks (SDN). Updates should be performed consistently, preventing temporary disruptions, and should require as little overhead as possible. Scalability is increasingly becoming an essential requirement in SDN. In this paper we propose to use time-triggered network updates to achieve consistent updates. Our proposed solution requires lower overhead than existing update approaches, without compromising the consistency during the update. We demonstrate that accurate time enables far more scalable consistent updates in SDN than previously available. In addition, it provides the SDN programmer with fine-grained control over the tradeoff between consistency and scalability.

3.2 Introduction

3.2.1 Background

Traditional network management systems are in charge of initializing the network, monitoring it, and allowing the operator to apply occasional changes when needed. Software Defined Networking (SDN), on the other hand, requires a central controller to routinely perform frequent policy and configuration updates in the network. The centralized approach used in SDN introduces challenges in terms of consistency and scalability. The controller must take care to minimize network anomalies during update procedures, such as packet drops or misroutes caused by temporary inconsistencies. Updates must also be planned with scalability in mind; update procedures must scale with the size of the network, and cannot be too complex. In the face of rapid configuration changes, the update mechanism must allow a high update rate.

Two main methods for consistent network updates have been thoroughly studied in the last few years.

• Ordered updates. This approach uses a sequence of phases of configuration commands, whereby the order of execution guarantees that no anomalies are caused in intermediate states of the procedure [28, 30, 31, 32]; at each phase the controller waits until all the switches have completed their updates, and only then invokes the next phase in the sequence.

• Two-phase updates. In the two-phase approach [22, 83], configuration version tags are used to guarantee consistency; in the first phase the new configuration is installed in all the middle-stage switches of the network, and in the second phase the ingress switches are instructed to start using a version tag that represents the new configuration. During the update procedure every switch maintains two sets of entries: one for the old configuration version, and one for the new version. The version tag attached to the packet determines whether it is processed according to the old configuration or the new one. After the packets carrying the old version tag are drained from the network, garbage collection is performed on the switches, removing the duplicate entries and leaving only the new configuration.

In previous work [13] we argued that time is a powerful abstraction for coordinating network updates. We defined an extension [1] to the OpenFlow protocol [18] that allows time-triggered operations. This extension has been approved and integrated into OpenFlow 1.5 [67], and into the OpenFlow 1.3.x extension package [68].

3.2.2 Time for Consistent Updates

In this paper we study the use of accurate time to trigger consistent network updates. We define a time-based order approach, where each phase in the sequence is scheduled to a different execution time, and a time-based two-phase approach, where each of the two phases is invoked at a different time. We show how the order and two-phase approaches benefit from time-triggered phases. Contrary to the conventional order and two-phase approaches, timed updates do not require the controller to wait until a phase is completed before invoking the next phase, significantly simplifying the controller’s involvement in the update process, and reducing the update duration. The time-based method significantly reduces the time duration required by the switches to maintain duplicate policy rules for the same flow. In order to accommodate the duplicate policy rules, switch flow tables should have a set of spare flow entries [22, 83] that can be used for network updates. Timed updates use each spare entry for a shorter duration than untimed updates, allowing higher scalability. Accurate time synchronization has evolved over the last decade, as the Precision Time Protocol (PTP) [21] has become a common feature in commodity switches, allowing sub-microsecond accuracy in practical use cases (e.g., [44]). However, even if switches have perfectly synchronized clocks, it is not guaranteed that updates are executed at their scheduled times; a scheduling mechanism that relies on the switch’s software may be affected by the switch’s operating system and by other running tasks. We argue that a carefully designed switch can schedule updates with a high degree of accuracy. Moreover, we show that even if switches are not optimized for accurate scheduling, the timed approach outperforms conventional update approaches.


The use of time-triggered updates accentuates a tradeoff between update scalability and consistency. At one end of the scale, consistent updates come at the cost of a potentially long update duration, and expensive memory waste due to rule duplication.1 At the other end, a network-wide update can be invoked simultaneously (e.g., using TIME4 [1]), allowing a short update time, preventing the need for rule duplication, but yielding a brief period of inconsistency. In this paper we show that timed updates can be tuned to any intermediate point along this scale.

1 As shown in [83], the duration of an update can be traded for the update rate. The flow table will typically include a limited number of excess entries that can be used for duplicated rules. By reducing the update duration, the excess entries are used for a shorter period of time, allowing a higher number of updates per second.

3.2.3 Related Work

Various consistent network update approaches have been analyzed in the literature. Several solutions have been proposed [28, 31, 32, 30], in which consistency is guaranteed by applying updates in a specific order. Another approach is to guarantee consistency using tags that are attached to the packet headers [22, 83, 84]. None of these solutions use accurate time and synchronized clocks as a means to coordinate the updates. In this paper we show that time can be used to improve these methods, allowing better performance during update procedures.

Scalability of network updates is another topic that has been discussed in several works. Using multiple SDN controllers to perform network updates, e.g., [85, 86], can improve scalability when the controller's performance is a bottleneck. Incremental methods [83, 87] can improve the efficient use of flow table space in switches by breaking each update into multiple independent rounds, thereby reducing the total overhead consumed in each separate round. The timed approach we present in this paper can be used in conjunction with each of these approaches, in order to improve scalability and efficiency even further.

The use of time in distributed applications has been widely analyzed, both in theory and in practice. Analysis of the usage of time and synchronized clocks, e.g., Lamport [34, 17], dates back to the late 1970s and early 1980s. Accurate time has been used in various different applications, such as distributed databases [88], industrial automation systems [35], automotive networks [36], and accurate instrumentation and measurements [37]. While the usage of accurate time in distributed systems has been widely discussed in the literature, we are not aware of similar analyses of the usage of accurate time as a means for performing consistent updates in computer networks.

Time-of-day routing [38] routes traffic to different destinations based on the time-of-day. Path calendaring [65] can be used to configure network paths based on scheduled or foreseen traffic changes. The two latter examples are typically performed at a low rate and do not place demanding requirements on accuracy. In [40] the authors briefly mentioned that it would be interesting to explore using time synchronization to instruct routers or switches to change from one configuration to another at a specific time, but did not pursue the idea beyond this observation.

Our previous work [13, 12] introduced the concept of using time to coordinate updates in SDN. The OpenFlow protocol [67, 68] currently supports time-based network updates. In [9] we presented a practical method to implement accurately scheduled network updates in hardware switches using timestamp-based TCAM rules. In this paper we analyze the use of time in consistent updates, and show that time can improve the scalability of consistent updates.

3.2.4 Contributions

The main contributions of this paper are as follows.

• We propose to use time-triggered network updates in a way that requires a lower overhead than existing update approaches without compromising the consistency during the update.

• We show that timed consistent updates require a shorter duration than existing consistent update methods. We also discuss hybrid approaches that combine the advantages of timed updates with those of other update methods.

• We define an inconsistency metric, allowing us to quantify how consistent a network update is.

• We show that accurate time provides the SDN programmer with a knob for fine-tuning the tradeoff between consistency and scalability.

• We present experimental results that demonstrate the significant advantage of timed updates over other update methods. Our evaluation is based on experiments performed on a 50-node testbed, as well as simulation results.

3.3 Time-based Consistent Updates

We now describe the concept of time-triggered consistent updates. We assume that switches keep local clocks that are synchronized to a central reference clock by a synchronization protocol, such as the Precision Time Protocol (PTP) [21] or R EVERSE PTP [10, 11], or by an accurate time source such as GPS. The controller sends network update messages to switches using an SDN protocol such as OpenFlow [67]. An update message may specify when the corresponding update is scheduled to be performed.

Figure 3.1: Update procedure examples. (a) Ordered update of a path. (b) Two-phase update of a multicast distribution tree.

3.3.1 Ordered Updates

Fig. 3.1a illustrates an ordered network update.2 We would like to reconfigure the path of a traffic flow from the ‘before’ to the ‘after’ configuration. An ordered update proceeds as described in Fig. 3.2; the phases in the procedure correspond to the numbers in Fig. 3.1a.

2 An ordered update proceeds as a sequence of k phases, which must be performed according to a specific order. The update in the current example is performed in three phases.

UNTIMED ORDERED UPDATE
1. Controller sends the ‘after’ configuration to S1.
2. Controller sends the ‘after’ configuration to S2.
3. Controller updates S3 (garbage collection).

Figure 3.2: Ordered update procedure for the scenario of Fig. 3.1a.

In the ordered update procedure every phase is performed after the previous phase was completed, and this guarantees consistency. Note that packets may arrive out-of-order to S4 due to the update. However, the update is per-packet consistent [22], i.e., each packet is forwarded either according to the ‘before’ configuration or according to the ‘after’ configuration, and no packets are dropped. A time-based order update procedure is described in Fig. 3.3.

TIMED ORDERED UPDATE
0. Controller sends timed updates to all switches.
1. S1 enables the ‘after’ configuration at time T1.
2. S2 enables the ‘after’ configuration at time T2 > T1.
3. S3 performs garbage collection at time T3 > T2.

Figure 3.3: Timed ordered update procedure for the scenario of Fig. 3.1a.
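To make the scheduling step concrete, the following minimal Python sketch shows how a controller might derive the per-phase execution times T1 < T2 < T3 of Fig. 3.3 and dispatch the corresponding scheduled messages. The send_scheduled_bundle() helper and the per-switch rule payloads are hypothetical placeholders introduced only for this sketch; they are not part of the OpenFlow API or of our prototype.

```python
import time

def schedule_ordered_update(switch_rules, setup_time, phase_gap, send_scheduled_bundle):
    """Schedule one phase per switch, each phase 'phase_gap' seconds after the previous.

    switch_rules: ordered list of (switch_id, rule) pairs, one per phase.
    setup_time:   how far in the future (seconds) phase 1 is scheduled.
    send_scheduled_bundle: hypothetical helper that delivers a scheduled update message.
    """
    t1 = time.time() + setup_time          # execution time of phase 1
    schedule = []
    for phase, (switch_id, rule) in enumerate(switch_rules):
        t_phase = t1 + phase * phase_gap   # T1 < T2 < T3 < ...
        send_scheduled_bundle(switch_id, rule, execute_at=t_phase)
        schedule.append((switch_id, t_phase))
    return schedule

# Example use (rule payloads are hypothetical):
#   schedule_ordered_update([("S1", after_cfg_s1), ("S2", after_cfg_s2), ("S3", gc_s3)],
#                           setup_time=2.0, phase_gap=0.5,
#                           send_scheduled_bundle=my_sender)
```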

Notably, the ordered approach requires the controller to be involved in the entire update procedure, making the update process sensitive to the load on the controller, and to the communication delays at the time of execution. In contrast, in the time-based protocol, the controller is only involved in phase 0, and if T1 is chosen correctly, the update process is not influenced by these issues.

3.3.2 Two-phase Updates

An example of a two-phase update is illustrated in Fig. 3.1b; the figure depicts a multicast distribution tree through a network of three switches. Multicast packets are distributed along the paths of the ‘before’ tree. We would like to reconfigure the distribution tree to the ‘after’ state.

UNTIMED TWO-PHASE UPDATE
1. Controller sends the ‘after’ configuration to S1.
2. Controller instructs S2 to start using the ‘after’ configuration with the new version tag.
3. Controller updates S1 (garbage collection).

Figure 3.4: Two-phase update procedure for the scenario of Fig. 3.1b.

The two-phase procedure [22, 83] is described in Fig. 3.4. In the first phase, the new configuration is installed in S1 , instructing it to forward packets that have the new version tag according to the ‘after’ configuration. Thus, as defined in [22], the value of the version tag field is part of the flow match rule, and S1 has two rules in the flow table, one corresponding to the ‘before’ action, and the other corresponding to the ‘after’ action. In the second phase, S2 is instructed to forward packets according to the ‘after’ configuration using the new version tag. The ‘before’ configuration is removed in the third phase. As in the ordered approach, the two-phase procedure requires every phase to be invoked after the previous phase was completed.

TIMED TWO-PHASE UPDATE
0. Controller sends timed updates to all switches.
1. S1 enables the ‘after’ configuration at time T1.
2. S2 enables the ‘after’ configuration with the new version tag at time T2 > T1.
3. S1 performs garbage collection at time T3 > T2.

Figure 3.5: Timed two-phase update procedure for the scenario of Fig. 3.1b.

In the timed two-phase approach, specified in Fig. 3.5, phases 1, 2, and 3 are scheduled in advance by the controller. The switches then execute phases 1, 2, and 3 at times T1 , T2 , and T3 , respectively.
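The following small Python sketch illustrates which state the timed two-phase update of Fig. 3.5 creates at each scheduled time, in particular the period during which S1 holds duplicate rules keyed on the version tag. The flow-table representation and the operation names are hypothetical, chosen only to make the sequence of phases explicit.

```python
# Minimal sketch of the rule duplication used by the timed two-phase update.
# The flow-table encoding and the operation names are assumptions for illustration.

def plan_two_phase(t1, t2, t3):
    """Return the scheduled operations of Fig. 3.5 as (time, switch, operation) tuples."""
    return [
        # Phase 1 (T1): S1 installs the 'after' rule, matching the new version tag,
        # alongside the existing 'before' rule -- both coexist until garbage collection.
        (t1, "S1", ("install", {"version_tag": "new", "action": "after"})),
        # Phase 2 (T2 > T1): the ingress switch S2 starts tagging packets with the new tag.
        (t2, "S2", ("set_version_tag", "new")),
        # Phase 3 (T3 > T2): S1 removes the stale 'before' rule (garbage collection).
        (t3, "S1", ("remove", {"version_tag": "old", "action": "before"})),
    ]

for when, switch, op in plan_two_phase(t1=100.0, t2=100.5, t3=101.5):
    print(f"t={when}: {switch} -> {op}")
```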

3.3.3 k-Phase Consistent Updates

The order approach guarantees consistency if updates are performed according to a specific order. More generally, we can view an ordered update as a sequence of k phases, where in each phase j, a set of N j switches is updated. For each phase j, the updates of phase j must be completed before any update of phase j + 1 is invoked. The two-phase approach is a special case, where k = 2; in the first phase all the switches in the middle of the network are updated with the new policy, and in the second phase the ingress switches are updated to start using the new version tag.

3.3.4 The Overhead of Network Updates

Both the order method and the two-phase method require duplicate configurations to be present during the update procedure. In each of the protocols of Figures 3.2-3.5, both the ‘before’ and the ‘after’ configurations are stored in the switches’ expensive flow tables from phase 1 to phase 3. The unnecessary entries are removed only after garbage collection is performed in phase 3. In the timed protocols of Fig. 3.3 and 3.5 the switches receive the update messages in advance (phase 0), and can temporarily store the new configurations in an inexpensive memory. The switches install the new configuration in the expensive flow table memories only at the scheduled times, thereby limiting the period of duplication to the duration from phase 1 to phase 3. The overhead cost of the duplication depends on the time elapsed between phase 1 and phase 3. Hence, throughout the paper we use the update duration as a metric for quantifying the overhead of a consistent update that includes a garbage collection phase.

3.4 Terminology and Notations

3.4.1 The Network Model

We reuse some of the terminology and notations of [22]. Our system consists of N + 1 nodes: a controller c, and a set of N switches, S = {S1, . . . , SN}. A packet is a sequence of bits, denoted by pk ∈ Pk, where Pk is the set of possible packets in the system. Every switch Si ∈ S has a set Pri of ports. The sources and destinations of the packets are assumed to be external; packets are received from the ‘outside world’ through a subset of the switches’ ports, referred to as ingress ports. An ingress switch is a switch that has at least one ingress port. Every packet pk is forwarded through a sequence of switches (Si1, . . . , Sim), where the first switch Si1 is an ingress switch. The last switch in the sequence, Sim, forwards the packet through one of its ports to the outside world.

When a packet pk is received by a switch Si through port p ∈ Pri, the switch uses a forwarding function Fi : Pk × Pri −→ A, where A is the set of possible actions a switch can perform, e.g., ‘forward the packet through port q′’. The packet content and the port through which the packet was received determine the action that is applied to the packet.

It is assumed that every switch maintains a local clock. As is standard in the literature (e.g., [89]), we distinguish between real time, an assumed Newtonian time frame that is not directly observable, and local clock time, which is the time measured on one of the switches’ clocks. We denote values that refer to real time by lowercase letters, e.g., t, and values that refer to clock time by uppercase, e.g., T.

We define a packet instance to be a tuple (pk, Si, p, t), where pk ∈ Pk is a packet, Si ∈ S is the ingress switch through which the packet is received, p ∈ Pri is the ingress port at switch Si, and t is the time at which the packet instance is received by Si.

3.4.2 Network Updates

We define a singleton update u of switch Si to be a partial function, u : Pk × Pri ⇀ A. A switch applies a singleton update, u, by replacing its forwarding function, Fi, with a new forwarding function, F′i, that behaves like u in the domain of u, and like Fi otherwise. We assume that every singleton update is triggered by a set of one or more messages sent by the controller to one of the switches. We define an update to be a set of singleton updates U = {u1, . . . , um}.

We define an update procedure to be a set U = {(u1, t1, phase(u1)), . . . , (um, tm, phase(um))} of triples, such that for all 1 ≤ j ≤ m, we have that uj is a singleton update, phase(uj) is a positive integer specifying the phase number of uj, and tj is the time at which uj is performed. Moreover, it is required that for every 1 ≤ i, j ≤ m, if phase(ui) < phase(uj) then ti < tj. This definition implies that an update procedure is a sequence of one or more phases, where each phase is performed after the previous phase is completed, but there is no guarantee about the order of the singleton updates within each phase.

A k-phase update procedure is an update procedure U = {(u1, t1, phase(u1)), . . . , (um, tm, phase(um))} in which for all 1 ≤ j ≤ m we have 1 ≤ phase(uj) ≤ k, and for all 1 ≤ i ≤ k there exists an update uj such that (uj, tj, i) ∈ U.

We define a timed singleton update uT to be a pair (u, T), where u is a singleton update, and T is a clock value that represents the scheduled time of u. We assume that every switch maintains a local clock, and that when a switch receives a message indicating a timed singleton update uT it implements the update as close as possible to the instant when its local clock reaches the value T. Similar to the definition of an update procedure, we define a timed update procedure UT to be a set UT = {(uT1, t1, phase(uT1)), . . . , (uTm, tm, phase(uTm))}.

An update procedure, U = {(u1, t1, phase(u1)), . . . , (um, tm, phase(um))}, and a timed update procedure, UT = {(vT1, t1, phase(vT1)), . . . , (vTn, tn, phase(vTn))}, are said to be similar, denoted by UT ∼ U, if m = n and for every 1 ≤ j ≤ m we have uj = vj and phase(uj) = phase(vj).

We define consistent forwarding based on the per-packet consistency definition of [22]. Intuitively, given an untimed update U, a packet is consistently forwarded if it is processed by all switches either according to the new configuration, after the update U was applied, or according to the old one, but not according to a mixture of the two. Formally, let (pk, Si1, p1, t) be a packet instance that is forwarded through a sequence of switches Si1, Si2, . . . , Sim through ports p1, p2, . . . , pm, respectively, and is assigned the actions a1, a2, . . . , am. Let Fij be the forwarding function of Sij before the update is applied, and let F′ij be the forwarding function after the update. The packet instance (pk, Si1, p1, t) is said to be consistently forwarded if either of the following is satisfied:


(i) Fij(pk, pj) = aj for all 1 ≤ j ≤ m, or

(ii) F′ij(pk, pj) = aj for all 1 ≤ j ≤ m.

A packet instance that is not consistently forwarded is said to be inconsistently forwarded.
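The per-packet consistency check implied by conditions (i) and (ii) can be expressed compactly in code. The following Python sketch models each forwarding function as a dictionary mapping (packet, port) pairs to actions; this encoding is an assumption made only for the example, not the representation used in the thesis.

```python
# Illustrative check of per-packet consistency (conditions (i) and (ii) above).
# A forwarding function is modeled as a dict mapping (packet, port) -> action.

def consistently_forwarded(hops, before_funcs, after_funcs):
    """hops: list of (switch, packet, in_port, action) along the packet's path.
    before_funcs / after_funcs: per-switch forwarding functions (dicts)."""
    all_before = all(before_funcs[sw].get((pk, p)) == a for sw, pk, p, a in hops)
    all_after = all(after_funcs[sw].get((pk, p)) == a for sw, pk, p, a in hops)
    return all_before or all_after   # condition (i) or condition (ii)

# A packet forwarded by one switch according to the old configuration and by the
# next according to the new one fails both checks: it is inconsistently forwarded.
```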

Dc: An upper bound on the controller-to-switch delay, including the network latency, and the internal switch delay until completing the update.

Dn: An upper bound on the end-to-end network delay.

∆: An upper bound on the time interval between the transmission times of two consecutive update messages sent by the controller.

δ: An upper bound on the scheduling error; an update that is scheduled to be performed at T is performed in practice during the time interval [T, T + δ].

Tsu: The timed update setup time; in order to invoke a timed update that is scheduled to time T, the controller sends the update messages no later than at T − Tsu.

Table 3.1: Delay-related Notations

3.4.3 Delay-related Notations

Table 3.1 presents key notations related to delay and performance. The attributes that play a key role in our analysis are Dc , Dn , and δ. These attributes are discussed further in Section 3.5.

3.5 Upper and Lower Bounds

3.5.1 Delay Upper Bounds

Both the order [28, 30, 31, 32] and the two-phase [22, 83] approaches implicitly assume the existence of two upper bounds, Dc and Dn (see Table 3.1):

• Dc: both approaches require previous phases in the update procedure to be completed before invoking the current phase. Therefore, after sending an update message, the controller must wait for a period of Dc until it is guaranteed that the corresponding update has been performed; only then can it invoke the next phase in the procedure. Alternatively, explicit acknowledgments can be used to indicate update completions, as further discussed in Section 3.5.2.

• Dn: garbage collection can take place after the update procedure has completed, and all en-route packets have been drained from the network. Garbage collection can be invoked either after waiting for a period of Dn after completing the update, or by using soft timeouts.3 Both of these approaches assume there is an upper bound, Dn, on the end-to-end network latency.

3 Soft timeouts are defined in the OpenFlow protocol [67] as a means for garbage collection; a flow entry that is configured with a soft timeout, Dn, is cleared if it has not been used for a duration Dn.

Is it practical to assume that the upper bounds Dc and Dn exist? Network latency is often modeled using long-tailed distributions such as exponential or Gamma [90, 91], implying that network latency is often unbounded. We demonstrate the long-tailed behavior of network latency by analyzing measurements performed on production networks. We analyze 20 delay measurement datasets from [26, 27] taken at various sites over a one-year period, from November 2013 to November 2014. The measurements capture the round-trip time (RTT) using ICMP Echo requests. The measurements show (Fig. 3.6) that in some networks the 99.999th percentile is almost two orders of magnitude higher than the average RTT. Table 3.2 summarizes the ratio between tail latency values and average values in the 20 traces we analyzed.

99.9th percentile: 4.88    99.99th percentile: 10.49    99.999th percentile: 19.45

Table 3.2: The mean ratio between the tail latency and the average latency.

In typical networks we expect Dn to have long-tailed behavior. Similar long-tailed behavior has also been shown for Dc in [30, 64].

Figure 3.6: Long-tail latency: the 99.9th, 99.99th, and 99.999th percentile tail latency [ms] as a function of the average latency [ms].

At first glance, these results seem troubling: if network latency is indeed unbounded, neither the order nor the two-phase approaches can guarantee consistency, since the controller can never be sure that the previous phase was completed before invoking the next phase. In practice, typical approaches will not require a true upper bound, but rather a latency value that is exceeded with a sufficiently low probability. Service Level Agreements (SLAs) in carrier networks are a good example of this approach; per the MEF 10.3 specification [92], a Service Level Specification (SLS) defines not only the mean delay, but also the Frame Delay Range (FDR), and the percentile defining this range. Thus, service providers must guarantee that the rate of frames that exceed the delay range is limited to a known percentage. Throughout the paper we use Dc and Dn to denote the upper bounds of the delays. In practice, these may refer to a sufficiently high percentile delay. Our analysis in Section 3.7 revisits the upper bound assumption.
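In the spirit of the analysis above, a practical "upper bound" can simply be taken as a high percentile of measured RTTs. The short numpy sketch below illustrates this; the Gamma-distributed samples and their parameters are synthetic stand-ins for the measurement datasets of [26, 27], not the actual data.

```python
import numpy as np

# Synthetic long-tailed RTT samples (Gamma-distributed); parameters are illustrative only.
rng = np.random.default_rng(0)
rtt_ms = rng.gamma(shape=2.0, scale=10.0, size=1_000_000)

avg = rtt_ms.mean()
for pct in (99.9, 99.99, 99.999):
    bound = np.percentile(rtt_ms, pct)
    print(f"{pct}th percentile: {bound:.1f} ms  (x{bound / avg:.1f} of the average)")

# A practical bound Dn is the percentile matching the tolerated probability of
# exceeding it, rather than a true worst case.
```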

3.5.2 Explicit acknowledgment

If a switch can explicitly notify the controller when it completes an update operation, then the controller has a definitive indication of when the update is completed. Unfortunately, OpenFlow [67, 41] currently does not support such an acknowledgment mechanism. Hence, one can either use a different SDN protocol that supports explicit acknowledgment (as was assumed in [30]), or use an update procedure in which the controller waits for a fixed period (Dc) until the switch is guaranteed to complete the update.

In the absence of ACKs, update procedures are planned according to a worst-case analysis (Section 3.6), both in the timed and in the untimed approaches; the controller waits until it is guaranteed that the update was completed with a sufficiently high probability. In both the timed and untimed approaches, an update cannot be completed with 100% consistency in the absence of ACKs. Hence, in Section 3.7 we introduce a metric that quantifies the level of consistency during an update.

In the presence of ACKs (as assumed in [30]), update procedures can sometimes be completed earlier than without using ACKs. Furthermore, ACKs enable updates to be performed dynamically [30], whereby at the end of each phase the controller dynamically plans the next phase. Whether ACKs are available or not, garbage collection can only be performed after all packets that were en-route during the update have been drained from the network, requiring the controller to wait for a period of Dn units.

Fortunately, the timed and untimed approaches can be combined. For example, in the presence of an acknowledgment mechanism, update procedures can be performed in a dynamic, untimed, ACK-based manner, with a timed garbage collection phase at the end. Such a flexible mix-and-match approach allows the SDN programmer to enjoy the best of both worlds. This hybrid approach is further discussed in Section 3.6.5.

3.5.3 Delay Lower Bounds

Throughout the paper we assume that the lower bounds of the network delay and the controller-to-switch delay are zero. This assumption simplifies the presentation, although the model can be extended to include non-zero lower bounds on delays.

3.5.4 Scheduling Accuracy Bound

As defined in Table 3.1, δ is an upper bound on the scheduling error, indicating how accurately updates are scheduled; an update that is scheduled to take place at time T is performed in practice during the interval [T, T + δ].4

4 An alternative representation of δ assumes a symmetric error, T ± δ/2. The two approaches are equivalent.

A switch's scheduling accuracy depends on two factors: (i) how accurately its clock is synchronized to the system's reference clock, and (ii) its ability to perform real-time operations. Most high-performance switches are implemented as a combination of hardware and software components. A scheduling mechanism that relies on the switch's software may be affected by the switch's operating system and by other running tasks, consequently affecting the scheduling accuracy. Furthermore, previous work [30, 64] has shown high variability in rule installation latencies in Ternary Content Addressable Memories (TCAMs), resulting from the fact that a TCAM update might require the TCAM to be rearranged.

Nevertheless, existing switches and routers practice real-time behavior, with a predictable guaranteed response time to important external events. Traditional protection switching and fast reroute mechanisms require the network to react to a path failure in less than 50 milliseconds, implying that each individual switch or router reacts within a few milliseconds, or in some cases less than one millisecond (e.g., [93]). Operations, Administration, and Maintenance (OAM) protocols such as IEEE 802.1ag [94] require faults to be detected within a strict timing constraint of ±0.42 milliseconds.5

5 Faults are detected using Continuity Check Messages (CCM), transmitted every 3.33 ms. A fault is detected when no CCMs are received for a period of 11.25 ± 0.42 ms.

Measures can be taken to implement accurate scheduling of timed updates:

• Common real-time programming practices can be applied to ensure guaranteed performance for time-based updates, by assigning a constant fraction of time to timed updates.

• When a switch is aware of an update that is scheduled to take place at time Ts, it can avoid performing heavy maintenance tasks near this time, such as TCAM entry rearrangement.

• Untimed update messages received slightly before time Ts can be queued and processed after the scheduled update is executed.

• If a switch receives a time-based command that is scheduled to take place at the same time as a previously received command, it can send an error message to the controller, indicating that the last received command cannot be executed.

• It has been shown that timed updates can be scheduled with a very high degree of accuracy, on the order of 1 microsecond, using TIMEFLIP [9]. This approach provides a high scheduling accuracy, potentially at the cost of some overhead in the switch's flow tables.

Observation 3.1. In typical settings δ < Dc.

The intuition behind Observation 3.1 is that δ is only affected by the switch's performance, whereas Dc is affected by both the switch's performance and the network latency. We expect Observation 3.1 to hold even if switches are not designed for real-time performance. We argue that in switches that use some of the real-time techniques above, δ ≪ Dc, making the timed approach significantly more advantageous, as we shall see in the next section.

3.6 Worst-case Analysis

3.6.1 Worst-case Update Duration

We define the duration of an update procedure to be the time elapsed from the instant at which the first switch updates its forwarding function to the instant at which the last switch completes its update.

We use Program Evaluation and Review Technique (PERT) charts [23] to illustrate the worst-case update duration analysis. Fig. 3.7 illustrates a PERT chart of an untimed ordered k-phase update, where three switches are updated in each phase. Switches S1, S2, and S3 are updated in the first phase, S4, S5, and S6 are updated in the second phase, and so on. In this procedure, the controller waits until phase j is guaranteed to have been completed before starting phase j + 1.

Figure 3.7: A PERT chart of a k-phase update.

Each node in the PERT chart represents an event, and each edge represents an activity. A node labeled Cj,i represents the event ‘the controller starts transmitting a phase j update message to switch Si’. A node labeled Sj,i represents ‘switch Si has completed its phase j update’. The weight of each edge indicates the maximal delay to complete the transition from one event to another. Cstart and Cfin represent the start and finish times of the update procedure, respectively. The worst-case duration between two events is given by the longest path between the two corresponding nodes in the graph.

Throughout the section we focus on greedy update procedures. An update procedure is said to be greedy if the controller invokes each update message at the earliest possible time that guarantees that for every phase j all the singleton updates of phase j are completed before those of phase j + 1 are initiated.

3.6.2 Worst-case Analysis of Untimed Updates

Untimed Updates. We start by discussing untimed k-phase update procedures, focusing on a single phase, j, in which Nj switches are updated. In Lemma 3.2 and in the upcoming lemmas in this section we focus on greedy updates.

Lemma 3.2. If U is a multi-phase update procedure, then the worst-case duration of phase j of U is:

(Nj − 1) · ∆ + Dc        (3.1)

Proof. Assume that the controller transmits the first update message of phase j at time t. Since N j switches take part in phase j, and ∆ is the upper bound on the duration between two consecutive messages, the controller invokes the last update message of phase j no later than at t +(N j −1)·∆. Since Dc is the upper bound on the controller-to-switch delay, the update is completed at most Dc time units later. Hence, the worst-case update duration is (N j − 1) · ∆ + Dc .

The following lemma specifies the worst-case update duration of a k-phase update. The intuition is straightforward from Fig. 3.7.

Lemma 3.3. The worst-case update duration of a k-phase update procedure is:

∑_{j=1}^{k} (Nj − 1) · ∆ + (k − 1) · max(∆, Dc) + Dc        (3.2)

Proof. Each phase j delays the controller for (N j − 1) · ∆. Since the update is greedy, at the end of each of the first k − 1 phases the controller waits max(∆, Dc ) time units to guarantee that the phase has completed, and then immediately proceeds to the next phase. The update is completed, in the worst case, Dc time units after the controller sends the last update message of the kth phase. The claim follows.

Specifically, in two-phase updates k = 2, and thus:

Corollary 3.4. If U is a two-phase update procedure, then its worst-case update duration is:

(N1 + N2 − 2) · ∆ + max(∆, Dc) + Dc        (3.3)


Untimed Updates with Garbage Collection. In some cases, garbage collection is required for some of the phases in the update procedure. For example, in the two-phase approach, after phase 2 is completed and all en-route packets have been drained from the network, garbage collection is required for the N1 switches of the first phase. More generally, assume that at the end of every phase j the controller performs garbage collection for a set of NGj switches. Thus, after phase j is completed the controller waits Dn time units for the en-route packets to drain, and then invokes the garbage collection procedure for the NGj switches. After invoking the last message of phase j, the controller waits for max(∆, Dc + Dn) time units. Thus, the worst-case duration from the transmission of the last message of phase j until the garbage collection of phase j is completed is given by Eq. 3.4.

max(∆, Dc + Dn) + (NGj − 1) · ∆ + Dc        (3.4)

Figure 3.8: A PERT chart of a two-phase update with garbage collection, performed after phase 2 is completed. Garbage collection removes the ‘before’ configuration (see Fig. 3.1) from the switches that took part in phase 1.

Fig. 3.8 depicts a PERT chart of a two-phase update procedure that includes a garbage collection phase. No garbage collection is required at the end of phase 1, and thus NG1 = 0. At the end of the second phase, garbage collection is performed for the policy rules of phase 1, affecting NG2 = 3 switches: S1, S2, and S3. This is in fact a special case of a 3-phase update procedure, where the third phase takes place only after all the en-route packets are guaranteed to have been drained from the network. The main difference between this example and the general k-phase graph of Fig. 3.7 is that in Fig. 3.8 the controller waits at least max(∆, Dc + Dn) time units from the transmission of the last message of phase 2 until starting to invoke the garbage collection phase. Note that in a multi-phase update there may be several garbage collection phases, each performed at a different stage of the update.

Lemma 3.5. If U is a two-phase update procedure with a garbage collection phase, then its worst-case update duration is:

(N1 + N2 + NG2 − 3) · ∆ + max(∆, Dc) + max(∆, Dc + Dn) + Dc        (3.5)

Proof. In each of the three phases the controller waits at most ∆ time units between two consecutive update messages, summing up to (N1 + N2 + NG2 − 3) · ∆. The controller waits for max(∆, Dc) time units at the end of phase 1, guaranteeing that all the updates of phase 1 have been completed before invoking phase 2. At the end of phase 2 the controller waits for max(∆, Dc + Dn) time units, guaranteeing that phase 2 is completed, and that all the en-route packets have been drained before starting the garbage collection phase. Finally, Dc time units after the controller sends the last message of the garbage collection phase, the last update is guaranteed to be completed.

3.6.3 Worst-case Analysis of Timed Updates

Worst-case-based Scheduling. Based on a worst-case analysis, an SDN program can determine an update schedule, T = (T1, . . . , Tk, Tg1, . . . , Tgk). Every timed update uT is performed no later than at T + δ. Consequently, we can derive the worst-case scheduling constraints below.


Definition 3.6 (Worst-case scheduling). If U is a timed k-phase update procedure, then a schedule T = (T1, . . . , Tk, Tg1, . . . , Tgk) is said to be a worst-case schedule if it satisfies the following two equations:

Tj = Tj−1 + δ   for every phase 2 ≤ j ≤ k        (3.6)

Tgj = Tj + δ + Dn        (3.7)

for every phase j that requires garbage collection. Note that a greedy timed update procedure uses worst-case scheduling. Every schedule T that satisfies Eq. 3.6 and 3.7 guarantees consistency. For example, the timed two-phase update procedure of Fig. 3.9 satisfies the two scheduling constraints above.
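The worst-case schedule of Definition 3.6 is simple enough to compute directly. The following Python sketch does so; the variable names and the example parameter values are illustrative only.

```python
# Minimal sketch of a worst-case schedule per Definition 3.6 (Eq. 3.6 and 3.7).
# t1: execution time of phase 1; gc_phases: phases requiring garbage collection.

def worst_case_schedule(t1, k, delta, d_n, gc_phases=()):
    phase_times = [t1]
    for _ in range(2, k + 1):
        phase_times.append(phase_times[-1] + delta)       # Eq. 3.6: Tj = Tj-1 + delta
    gc_times = {j: phase_times[j - 1] + delta + d_n        # Eq. 3.7: Tgj = Tj + delta + Dn
                for j in gc_phases}
    return phase_times, gc_times

phases, gc = worst_case_schedule(t1=100.0, k=2, delta=0.002, d_n=0.01, gc_phases=(1,))
print(phases, gc)   # [100.0, 100.002] {1: 100.012}
```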


Figure 3.9: A PERT chart of a timed two-phase update with garbage collection.

Timed Updates. A timed update starts with the controller sending scheduled update messages to all the switches, requiring a setup time Tsu. Every phase is guaranteed to take no longer than δ. An example of a timed two-phase update is illustrated in Fig. 3.9.

Lemma 3.7. The worst-case update duration of a k-phase timed update procedure with a worst-case schedule is k · δ.


Proof. The lemma follows directly from the worst-case scheduling constraints of Eq. 3.6 and 3.7.

Based on the latter, we derive the following lemma. Lemma 3.8. If U is a two-phase timed update procedure with a garbage collection phase using a worst-case schedule, then its worst-case update duration is Dn + 3 · δ. Proof. By Lemma 3.7, the first two phases take 2 · δ time units. The garbage collection phase requires δ additional time units, and Dn time units to allow all en-route packets to drain from the network. Thus, the update duration is Dn + 3 · δ.

3.6.4 Timed vs. Untimed Updates

We now study the conditions under which the timed approach outperforms the untimed approach. Based on Lemmas 3.3 and 3.7, we observe that a timed k-phase update procedure has a shorter update duration than a similar untimed k-phase update procedure if:

k · δ < ∑_{j=1}^{k} (Nj − 1) · ∆ + (k − 1) · max(∆, Dc) + Dc
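The comparison can be evaluated numerically. The sketch below computes the untimed worst-case durations of Lemmas 3.3 and 3.5 and the timed durations of Lemmas 3.7 and 3.8, and checks when the timed approach wins; the numeric parameter values are illustrative, not measured.

```python
# Worst-case update durations from the lemmas above (all times in the same unit).
# delta_msg stands for the inter-message bound (Delta), sched_err for the
# scheduling-error bound (delta). Numeric values below are illustrative only.

def untimed_k_phase(n_per_phase, delta_msg, d_c):            # Lemma 3.3 (Eq. 3.2)
    k = len(n_per_phase)
    return sum((n - 1) * delta_msg for n in n_per_phase) + (k - 1) * max(delta_msg, d_c) + d_c

def untimed_two_phase_gc(n1, n2, ng2, delta_msg, d_c, d_n):   # Lemma 3.5 (Eq. 3.5)
    return (n1 + n2 + ng2 - 3) * delta_msg + max(delta_msg, d_c) + max(delta_msg, d_c + d_n) + d_c

def timed_k_phase(k, sched_err):                              # Lemma 3.7
    return k * sched_err

def timed_two_phase_gc(sched_err, d_n):                       # Lemma 3.8
    return d_n + 3 * sched_err

delta_msg, d_c, d_n, sched_err = 0.005, 0.005, 0.001, 0.0013
n_per_phase = [12, 8]   # N1, N2
print("untimed:", untimed_k_phase(n_per_phase, delta_msg, d_c),
      "timed:", timed_k_phase(len(n_per_phase), sched_err))
print("timed wins:", timed_k_phase(2, sched_err) < untimed_k_phase(n_per_phase, delta_msg, d_c))
```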
A test flow f with a packet arrival rate R is a set PI of identical6 packet instances that arrive at a constant rate; that is, if (pk1, S1, p1, t1) ∈ PI and (pk2, S2, p2, t2) ∈ PI with t2 > t1, and there is no packet instance (pk3, S3, p3, t3) ∈ PI such that t2 > t3 > t1, then t2 = t1 + 1/R.

6 For simplicity, we define that all packets of a test flow are identical. It is in fact sufficient to require that all packets of the flow are indistinguishable by the switch forwarding functions, for example, that all packets of a flow have the same source and destination addresses.


We assume a method that, for a given test flow f and an update u, allows us to measure the number of packets n(f, u) that are forwarded inconsistently.7

Definition 3.14 (Inconsistency metric). Let f be a test flow with a packet arrival rate R(f). Let U be an update, and let n(f, U) be the number of packet instances of f that are forwarded inconsistently due to update U. The inconsistency I(f, U) of a flow f with respect to U is defined to be:

I(f, U) = n(f, U) / R(f)        (3.12)

The inconsistency I( f ,U) is measured in time units. Intuitively, I( f ,U) quantifies the amount of time that flow f is disrupted by the update.

3.7.2 Fine Tuning Consistency

Timed updates provide a powerful mechanism that allows SDN programmers to tune the degree of consistency. By setting the update times T1, T2, . . . , Tk, Tg1, . . . , Tgk, the controller can play with the consistency-scalability tradeoff; the update overhead can be reduced at the expense of some inconsistency, or vice versa.8

Example 3.15. We consider a two-phase update with a garbage collection phase. We assume that δ = 0 and that all packet instances are subject to a constant network delay, Dn. By assigning T = T1 = T2 = Tg1, the controller schedules a simultaneous update. This approach is referred to as TIME4 in [1]. All switches are scheduled to perform the update at the same time T. Packets entering the network during the period [T − Dn, T] are forwarded inconsistently. The inconsistency metric in this example is I = Dn. The advantage of this approach is that it completely relieves the switches from the overhead of maintaining duplicate entries between the phases of the update procedure.

7 This measurement can be performed, for example, by per-flow match counters in the switches.

8 In some scenarios, such as security policy updates, even a small level of inconsistency cannot be tolerated. In other cases, such as path updates, a brief period of inconsistency comes at the cost of some packets being dropped, which can be a small price to pay for reducing the update duration.


Example 3.16. Again, we consider a two-phase update (Fig. 3.11), with δ = 0 and a constant network delay, Dn . We assign T2 = T1 + δ according to Eq. 3.6, and Tg 1 is assigned to be T2 + δ + d, where d < Dn . The update is illustrated in the PERT chart of Fig. 3.11. Hence, packets entering the network during the period [T2 − Dn + d, T2 ] are forwarded inconsistently. The inconsistency metric is equal to I = max(Dn − d,0). In a precise sense, the delay d is a knob for tuning the update inconsistency.
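The inconsistency values of Examples 3.15 and 3.16 can be tabulated directly, which makes the role of d as a tuning knob explicit. The short Python sketch below does this; the delay values are illustrative.

```python
# Inconsistency of the timed two-phase updates in Examples 3.15 and 3.16
# (delta = 0, constant network delay Dn). Values are illustrative.

def inconsistency(d_n, d):
    """I = max(Dn - d, 0), where d = Tg1 - (T2 + delta) is the extra wait before
    garbage collection. d = 0 reduces to the simultaneous TIME4 update (I = Dn)."""
    return max(d_n - d, 0.0)

d_n = 0.010                       # 10 ms end-to-end network delay
for d in (0.0, 0.004, 0.010, 0.020):
    print(f"d = {d*1e3:4.1f} ms  ->  I = {inconsistency(d_n, d)*1e3:4.1f} ms")
# Increasing d trades a longer duplicate-rule period for less inconsistency.
```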


Figure 3.11: Example 3.16: PERT chart of a timed two-phase update. The delay d (red in the figure) is a knob for consistency.

3.8 Evaluation

Our evaluation was performed on a 50-node testbed in the DeterLab [25, 95] environment. The nodes (servers) in the DeterLab testbed are interconnected by a user-configurable topology. Each testbed node in our experiments ran a software-based OpenFlow switch that supports time-based updates, also known as Scheduled Bundles [67]. A separate machine was used as a controller, which was connected to the switches using an out-of-band control network. The OpenFlow switches and controller we used are a version of OFSoftSwitch and Dpctl [74], respectively, that supports Scheduled Bundles [1]. We used ReversePTP [11, 10] to guarantee synchronized timing.

3.8.1 Experiment 1: Timed vs. Untimed Updates

We emulated a typical leaf-spine topology (e.g., [96]) of N switches, with 2N/3 leaf switches and N/3 spine switches (see Fig. 3.12). The experiments were run using various values of N, between 6 and 48 switches.

Figure 3.12: Leaf-spine topology, with N/3 spine switches and 2N/3 leaf switches.

We measured the delay upper bounds, Dn, Dc, δ, and ∆. Table 3.3 presents the 99.9th percentile delay values of each of these parameters. These are the parameters that were used in the controller's greedy updates.

Dn: 0.262    Dc: 4.865    δ: 1.297    ∆: 5.24

Table 3.3: The measured 99.9th percentile of each of the delay attributes in milliseconds.

We observed a low network delay Dn , as it was measured over two hops of a local area network. In Experiment 2 we analyze networks with a high network delay. Note that the values of δ and Dc were measured over software-based switches. Since hardware switches may yield different values, some of our experiments were performed with various synthesized values of δ and Dc , as discussed below. The measured value of ∆ was high, on the order of 5 milliseconds, as Dpctl is not optimized for performance. The experiments consisted of 3-phase updates of a policy rule: (i) a phase 1 update, involving all the switches, (ii) a phase 2 update, involving only the leaf (ingress) switches, and (iii) a garbage collection phase, involving all the switches.


Results. Fig. 3.13a compares the update duration of the timed and untimed approaches as a function of N. Untimed updates yield a significantly higher update duration, since they are affected by (N1 + N2 + NG2 − 3) · ∆, per Lemma 3.5.9 Hence, the advantage of the timed approach increases with the number of switches in the network, illustrating its scalability.

Figure 3.13: Timed updates vs. untimed updates. Each figure shows the experimental values, and the theoretical worst-case values, based on Lemmas 3.5 and 3.8. (a) The update duration as a function of the number of switches. (b) The update duration as a function of the scheduling error, for N = 12. (c) The update duration as a function of Dc − δ, for N = 12, δ = 100 ms, various values of Dc. (d) The update duration as a function of the end-to-end network delay Dn, for N = 12.

9 The slope of the untimed curve in Fig. 3.13a is ∆, by Lemma 3.5. The theoretical curve was computed based on the 99.9th percentile value, whereas the mean value in our experiment was about 20% lower, explaining the different slopes of the theoretical and experimental curves.


The impact of the scheduling error on the update duration in the timed approach is illustrated in Fig. 3.13b. As expected, the update duration grows linearly with δ; however, the update duration of the untimed approach is expected to be higher, as typically δ < Dc.

Fig. 3.13c shows the update duration of the two approaches as a function of Dc − δ, as we ran the experiment with synthesized values of δ and Dc. We fixed δ at 100 milliseconds, and tested various values of Dc. As expected (by Section 3.6.4), the results show that for Dc − δ > 0 the timed approach yields a lower update duration. Furthermore, only when the scheduling error, δ, is significantly higher than Dc does the untimed approach yield a shorter update duration. As discussed in Section 3.5.4, we typically expect Dc − δ to be positive, as δ is unaffected by high network delays, and thus we expect the timed approach to prevail. Interestingly, the results show that even when the scheduling is not accurate, e.g., if δ is 100 milliseconds worse than Dc, the timed approach has a lower update duration.

Fig. 3.13d illustrates the effect of the end-to-end network latency on the update duration. Both the timed and untimed approaches are linearly proportional to the network latency, following Lemmas 3.5 and 3.8. However, the timed approach allows a lower update duration, as it is not affected by N and ∆.

3.8.2 Experiment 2: Fine Tuning Consistency

The goal of this experiment was to study how time can be used to tune the level of inconsistency during updates. In order to experiment with real-life wide area network delay values, Dn, we performed the experiment using publicly available topologies.

Network topology. Our experiments ran over three publicly available service provider network topologies [14], as illustrated in Fig. 3.14. We defined each node in the figure to be an OpenFlow switch. OpenFlow messages were sent to the switches by a controller over an out-of-band network (not shown in the figures).

Figure 3.14: Publicly available topologies [14] used in our experiments: (a) Sprint topology, (b) NetRail topology, (c) Compuserve topology. Each path of the test flows in our experiment is depicted by a different color. Black nodes are OpenFlow switches. White nodes represent the external source and destination of the test flows in the experiment.

Network delays. The public information provided in [14] does not include the explicit delay of each path, but includes the coordinates of each node. Hence we derived the network delays from the beeline distance between each pair of nodes, assuming 5 microseconds per kilometer, as recommended in [97]. The DeterLab testbed allows a configurable delay value to be assigned to each link. We ran our experiments in two modes: (i) Constant delay — each link had a constant delay that was configured to the value we computed as described above. (ii) Exponential delay — each link had an exponentially distributed delay. The mean delay of each link in experiment (ii) was equal to the link delay of this link in experiment (i), allowing an ‘apples to apples’ comparison.

Test flows. In each topology we ran five test flows, and measured the inconsistency during a timed network update. Each test flow was injected by an external source (see Fig. 3.14) to one of the ingress switches, forwarded through the network, and transmitted from an egress switch to an external destination. Test flows were injected at a fixed rate of 40 Mbps using Iperf [80].

Network updates. We performed two-phase updates of a Multiprotocol Label Switching (MPLS) label; a flow is forwarded over an MPLS Label-Switched Path (LSP) with label A, and then reconfigured to use label B. A garbage collection phase was used to remove the entries of label A. Conveniently, the MPLS label was also used as the version tag in the two-phase updates.

Inconsistency measurement. For every test flow f and update U, we measure the number of inconsistent packets during the update, n(f, U). Inconsistent packets in our context are either packets with a ‘new’ label arriving at a switch without the ‘new’ rule, or packets with an ‘old’ label arriving at a switch without the ‘old’ rule. We used the switches' OpenFlow counters to count the number of inconsistent packets, n(f, U). We compute the inconsistency of each update using Eq. 3.12.
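The beeline-distance delay derivation described above can be reproduced with a few lines of code. The Python sketch below uses the great-circle (haversine) distance and the 5 microseconds per kilometer figure from [97]; the coordinate values in the example are arbitrary samples, not the coordinates of the nodes in [14].

```python
from math import radians, sin, cos, asin, sqrt

# Link delay from node coordinates, at 5 microseconds per km of beeline distance.
US_PER_KM = 5.0

def beeline_km(lat1, lon1, lat2, lon2, earth_radius_km=6371.0):
    """Great-circle ('beeline') distance between two nodes, in kilometers."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

def link_delay_ms(lat1, lon1, lat2, lon2):
    return beeline_km(lat1, lon1, lat2, lon2) * US_PER_KM / 1000.0

# e.g., two nodes roughly 4000 km apart yield a ~20 ms one-way link delay:
print(f"{link_delay_ms(37.77, -122.42, 40.71, -74.01):.1f} ms")
```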

Figure 3.15: Inconsistency as a function of the update duration. Modifying the update duration controls the degree of inconsistency. Two graphs are shown for each of the three topologies, one with constant link delays and one with exponentially distributed link delays: (a) Sprint - constant network delay, (b) NetRail - constant network delay, (c) Compuserve - constant network delay, (d) Sprint - exponential network delay, (e) NetRail - exponential network delay, (f) Compuserve - exponential network delay.

Results. We measured the inconsistency I during each update as a function of the update duration, Tg1 − T1 (see Fig. 3.9). We repeated the experiment for each of the topologies and each of the test flows of Fig. 3.14. The results are illustrated in Fig. 3.15. The figure depicts the tradeoff between the update duration and the inconsistency during the update. A long update duration bears a cost on the switches' expensive memory resources, whereas a high degree of inconsistency implies a large number of dropped or misrouted packets.

Using a timed update, it is possible to tune the difference Tg1 − T1, directly affecting the degree of inconsistency. An SDN programmer can tune Tg1 − T1 to the desired sweet spot based on the system constraints; if switch memory resources are scarce, one may reduce the update duration and allow some inconsistency. As illustrated in Fig. 3.15d, 3.15e, and 3.15f, this fine tuning is especially useful when the network latency has a long-tailed distribution. A truly consistent update, where I = 0, requires a very long and costly update duration. As shown in the figures, by slightly compromising I, the switch memory overhead during the update can be cut in half.

3.8.3 Simulation: Using ACKs

In this simulation we analyzed a two-phase hybrid approach that uses acknowledgments for the first two phases, and a timed garbage collection phase. As discussed in Section 3.5.2, explicit acknowledgment is currently not supported by OpenFlow, and thus we could not experiment with ACKs in the OpenFlow-based testbed that was used in Experiments 1 and 2. Therefore, we used a simulation to evaluate the usage of ACKs. We assumed the topology of Experiment 1 with N = 12, and used the measured values of Dc, δ, and ∆, as presented in Table 3.3. The simulation was implemented in Visual Basic.

Figure 3.16: Update duration of the garbage collection phase [seconds], as a function of Dn [seconds], for the timed and untimed approaches (simulation and theoretical values).

We simulated the garbage collection phase of a two-phase update, as depicted in Fig. 3.10, and compared the update duration in the timed and untimed cases for various values of Dn. We defined Tsu to be (NG2 − 1) · ∆ + Dc, allowing enough time for the timed update messages to propagate from the controller to the switches. Fig. 3.16 depicts the duration of the garbage
collection phase in the timed and in the untimed approaches. The theoretical values in the figure are based on Eq. 3.10 and 3.11. As illustrated in the graph, if Dn is higher than a few milliseconds, then the timed approach yields a significantly lower update duration, illustrating the advantage of a timed garbage collection phase.
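For illustration only, the following Python sketch mimics this comparison under a simplified model of our own; it does not reproduce Eqs. 3.10-3.11 or the Visual Basic simulation. We assume that in the untimed case the garbage collection phase ends when the last acknowledgment returns, and that in the timed case it is scheduled Tsu = (NG2 − 1) · ∆ + Dc after the commands are sent. The interpretation of Dc and ∆, and all numeric values, are assumptions, not the measured values of Table 3.3.

import random

def untimed_gc_duration(n_switches, d_n_mean, d_c):
    # Untimed garbage collection: the controller sends a removal command to every
    # switch and the phase ends when the last acknowledgment returns. Each switch
    # contributes one round trip (each direction exponentially distributed with
    # mean d_n_mean) plus an installation delay assumed uniform in [0, d_c].
    completions = []
    for _ in range(n_switches):
        to_switch = random.expovariate(1.0 / d_n_mean)
        install = random.uniform(0.0, d_c)
        ack_back = random.expovariate(1.0 / d_n_mean)
        completions.append(to_switch + install + ack_back)
    return max(completions)

def timed_gc_duration(n_switches, delta, d_c):
    # Timed garbage collection: all switches are scheduled to clean up at
    # Tsu = (n_switches - 1) * delta + d_c after the commands are sent, so the
    # duration does not depend on the network delay Dn.
    return (n_switches - 1) * delta + d_c

if __name__ == "__main__":
    random.seed(1)
    N, DELTA, D_C = 12, 0.001, 0.02   # arbitrary example values, in seconds
    for d_n in (0.001, 0.01, 0.05, 0.1):
        avg = sum(untimed_gc_duration(N, d_n, D_C) for _ in range(1000)) / 1000
        print("Dn=%.3fs  untimed~%.3fs  timed=%.3fs"
              % (d_n, avg, timed_gc_duration(N, DELTA, D_C)))

Under these assumptions the untimed duration grows with Dn while the timed duration is a constant, which is the qualitative behavior shown in Fig. 3.16.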

3.9 Discussion

Failures. Switch failures during an update procedure may compromise the consistency of the update. For example, a switch may silently fail to perform an update, thereby causing inconsistency. Both the timed and the untimed update approaches may be affected by such failure scenarios. The OpenFlow Scheduled Bundle [67] provides an elegant mechanism for mitigating failures in timed updates; if the controller detects a switch failure before an update is scheduled to take place, it can send a cancellation message to all the switches that take part in the scheduled update, thus guaranteeing an all-or-none behavior.

Security considerations. Security threats in SDNs and possible mitigations have been discussed in previous work (e.g., [98, 99, 100]). In the context of the current paper, additional security threats may arise from the use of timed updates; a man-in-the-middle attacker can selectively delay control plane messages between the controller and the switches, thereby preventing timed updates from being performed at their scheduled time. This attack can be mitigated by using an explicit acknowledgment mechanism, as discussed in Section 3.5.2. Since timed updates rely on a time synchronization protocol, an attack on the time protocol can compromise the timed update mechanism. Such attacks can be mitigated by securing the time protocol [101].

3.10 Conclusion

Accurate time synchronization has become a common feature in commodity switches and routers. We have shown that it can be used to implement consistent updates in a way that reduces the update duration and the expensive overhead of maintaining duplicate configurations. Moreover, we have shown that accurate time can be used to tune the fine tradeoff between consistency
and scalability during network updates. Our experimental evaluation demonstrates that timed updates allow scalability that would not be possible with conventional update methods.

Acknowledgments

This work was supported in part by the ISF grant 1520/11. We gratefully acknowledge the DeterLab project [25] for the opportunity to perform our experiments on the DeterLab testbed.

Chapter 4

TimeFlip: Scheduling Updates with Timestamp-based TCAM Ranges

This chapter is an extended version of the paper:

[9] T. Mizrahi, O. Rottenstreich and Y. Moses, "TimeFlip: Scheduling network updates with timestamp-based TCAM ranges," in IEEE INFOCOM, 2015.

4.1 Abstract

Network configuration and policy updates occur frequently, and must be performed in a way that minimizes transient effects caused by intermediate states of the network. It has been shown that accurate time can be used for coordinating network-wide updates, thereby reducing temporary inconsistencies. However, this approach presents a great challenge; even if network devices have perfectly synchronized clocks, how can we guarantee that updates are performed at the exact time for which they were scheduled? In this paper we present a practical method for implementing accurate time-based updates, using TimeFlips. A TimeFlip is a time-based update that is implemented using a timestamp field in a Ternary Content Addressable Memory (TCAM) entry. TimeFlips can be used to implement Atomic Bundle updates, and to coordinate network updates with high accuracy. We analyze the amount of TCAM resources required to encode a TimeFlip, and show that if there is enough flexibility in determining the scheduled time, a TimeFlip can be encoded by a single
TCAM entry, using a single bit to represent the timestamp, while allowing a very high degree of accuracy.

4.2 Introduction

4.2.1 Background

Network updates are a routine necessity; policy changes or traffic-engineered route changes may occur frequently, and often require network devices to be reconfigured. This challenge is especially critical in Software Defined Networks (SDN), where the control plane is managed by a logically centralized controller, and configuration updates occur frequently. Such configuration updates can involve multiple network devices, potentially resulting in temporary anomalies such as forwarding loops or packet loss. Network devices such as routers and switches use TCAMs for various purposes, e.g., packet classification, Access Control Lists (ACLs), and forwarding decisions. TCAMs are an essential building block in network devices. A typical example for the importance of TCAMs is OpenFlow [41, 18]. An OpenFlow switch performs its functionality using one or more flow tables, most commonly implemented by TCAMs (see, e.g., [102, 103]). The order of the entries in a TCAM determines their priority. Hence, installing a new TCAM entry often involves rearranging existing entries, yielding high overhead for each TCAM update. It has been shown [30] that the latency of a TCAM rule installation may vary from a few milliseconds to a few seconds. A recently introduced approach [13, 73] proposes to use accurate time and local clocks as a means to coordinate network updates. By using synchronized clocks, configuration changes can be scheduled in a way that guarantees a coordinated network-wide update, thereby reducing transient anomalies. One of the main challenges in this approach is to guarantee that scheduled updates are performed accurately according to the desired schedule. Even if the clocks in the network are perfectly synchronized, performing configuration changes requires a potentially complex procedure that may be completed at an uncertain time.

4.2.2 Introducing TimeFlips

In this paper we present a method that uses TimeFlips to perform accurate time-based network updates. We define a TimeFlip to be a scheduled update that is implemented using TCAM ranges to represent the scheduled time of operation. We analyze TCAM lookups (Fig. 4.1) that take place in network devices, such as switches and routers. We assume that the device maintains a local clock, and that a timestamp T recording the local arrival time is associated with every packet that is received by the device. Typically, TCAM search keys consist of fields from the packet header, as well as some additional metadata. In our setting, the metadata includes a timestamp T. Hence, a TCAM entry can specify a range relative to the timestamp T, as a way of implementing time-based decisions. The timestamp T is not integrated into the packet, as it is only used internally in the device, and thus does not compromise the traffic bandwidth of the network device.

Figure 4.1: TCAM lookup: conventional vs. TimeFlip. TimeFlip uses a timestamp field, representing the time range T ≥ T0.

Using a simple microbenchmark on a commercial switch, we show that TimeFlips can be performed by existing network devices, and analyze the achievable scheduling accuracy of TimeFlips. Using the Precision Time Protocol (PTP), based on the IEEE 1588 standard [21], network devices can typically synchronize with an accuracy on the order of 1 µsec [44, 15, 45]; accurate clock synchronization has become a common and affordable feature in network devices, and typical off-the-shelf switch silicons, including low-end devices, have native support for PTP. We show that the accuracy at which a TimeFlip is executed, relative to its scheduled time, is two orders of magnitude better than the clock synchronization accuracy, and hence that network-wide updates can be timed with a 1 µsec accuracy using PTP.

TimeFlips enable two important scenarios: (i) Atomic Bundle. It is sometimes desirable to reconfigure a network device by applying a set of configuration changes as a bundle, i.e., every packet should be processed either before any of the modifications have been applied, or after all have been applied. The Atomic Bundle feature in OpenFlow [18] defines such functionality; the OpenFlow 1.4 specification suggests that Atomic Bundles can be implemented either by temporarily queuing packets during the update, or by using double buffering techniques. Both approaches may incur significant cost in terms of resources. TimeFlips allow a clean and natural way to implement Atomic Bundles; the set of configuration changes can be enabled at all times T ≥ T0 for some chosen time T0, and the timestamp T defines when the bundle commands atomically come into effect. (ii) Network-wide coordinated updates. If network devices use synchronized clocks, then TimeFlip can be used for updating different devices at the same time (subject to the accuracy of the clock synchronization mechanism), or for defining a set of scheduled updates coordinated in a specific order [13].

TimeFlips require every TCAM entry to include a timestamp field. We show that this per-entry overhead is relatively small. Moreover, since TCAMs have fixed entry sizes, it is often the case that a portion of the TCAM entry is unused, and thus can be utilized for the timestamp field. For example, in many cases TCAM entries are used to store the IPv4 5-tuple, requiring 104 bits, while the smallest TCAM entry that can accommodate the 5-tuple is typically larger, e.g., 144 bits [104] or 160 bits [105], leaving a large number of unused bits that can be used for the timestamp field. As TCAM resources are scarce and costly, we aim to represent the timestamp field by as few bits as possible, and each TimeFlip by as few TCAM entries as possible. Optimal representation of TCAM ranges is a problem that has been widely studied in the literature (e.g., [106, 107]). The problem we address has two unique properties that, to the best of our knowledge, have not been previously analyzed:


Figure 4.2: Scheduling tolerance: T0 ∈ [Tmin, Tmax].

• Scheduling tolerance. State-of-the-art TCAM range analysis studies the number of TCAM entries required to represent a given range of values. In the context of this paper, the range values can be chosen in a way that minimizes the number of TCAM entries. If the time T0 for which a time-based network update is scheduled can be selected within a scheduling tolerance (Fig. 4.2), given by a range of time values [Tmin, Tmax], then the number of entries required to represent the range can be reduced. Notably, the scheduling tolerance does not compromise the accuracy of the TimeFlip. It only presents some flexibility in choosing T0; we assume that an SDN controller may choose any T0 within the given range. Once T0 is chosen, the TimeFlip should occur accurately at T0.

• Periodic ranges. If some of the most significant bits of the timestamp value are known to be constant during the TimeFlip, the network device can simply ignore these bits, by placing 'don't care' values in the respective bits in the TCAM. This property is unique to time ranges, and is not applicable to previously analyzed TCAM field ranges, such as TCP port ranges. For example, if the portion of the timestamp that represents the date is known to be constant during a TimeFlip, it can be ignored. We refer to time ranges that ignore some of the most significant bits as periodic ranges, and show that the use of periodic ranges allows the time ranges to be represented by fewer TCAM entries.

4.2.3 Contributions

The main contributions of this paper are:

• We introduce TimeFlips and show how to accurately perform coordinated network updates and Atomic Bundle updates using them.


• Our analysis provides a tight upper bound on the number of TCAM entries required for representing a TimeFlip. We show that by correctly choosing the update time, T0, the number of TCAM entries used for representing the timestamp range can be significantly reduced.

• We present an optimal scheduling algorithm; no other scheduling algorithm can produce a timestamp range that requires fewer TCAM entries.

• We analyze the number of bits required for representing the timestamp field in TCAM entries, and show that it is a function of the scheduling tolerance.

• We show that using periodic ranges, the timestamp field can be represented by a single bit in the TCAM entry, and every TimeFlip requires a single TCAM entry, provided that the scheduling tolerance is sufficiently relaxed.

• We use a microbenchmark to demonstrate that our approach can be effectively used to schedule accurate time-based updates with existing commercial network devices.

4.2.4 Related Work

Consistent network updates have been widely analyzed in the literature. A common approach to avoiding inconsistencies during topology updates in routing protocols is to use a sequence of configuration commands [28], whereby the order of execution guarantees that no anomalies are caused in intermediate states of the procedure. Another recently introduced approach for consistent updates [22] uses configuration version tags to guarantee consistency. Dynamic traffic engineering [33, 30] has been shown to require frequent topology updates that must be applied carefully to optimize the network utilization. In [22] the authors argued that a simultaneous network update does not guarantee consistency, since packets that are en-route during the update may be processed by a mixture of configurations. Indeed, the use of accurate time by itself does not guarantee consistency, but time can be used to implement update scenarios (e.g., [108]) that are not possible with the consistent updates of [22]. Specifically, accurate time can also be used for implementing timed update procedures that do guarantee consistent updates, as discussed in [8]. The current paper presents TimeFlip, a method for performing accurately timed updates.

Time and synchronized clocks are used in various distributed applications, from mobile backhaul networks [15] to distributed databases [88]. OpenFlow [18] uses timeouts for expiring old forwarding rules. The controller can define a timeout for a flow rule, causing it to be removed when the timeout expires. However, timeouts are defined with a coarse granularity of 1 second, and thus do not allow delicate coordination. Moreover, since timeouts are by nature relative, they do not allow the accurate coordination that absolute time can provide. In [40] the authors observed that it would be interesting to explore using time synchronization to reconfigure routers at a specific time. The Interface to the Routing System (I2RS) working group of the Internet Engineering Task Force (IETF) has recognized the value of time-based state installations [56], but decided not to pursue this concept, as the ability to accurately perform timed installations was not considered viable [109]. Contrarily, we show that accurate time-based updates are in fact viable, even when the TCAM rule installation latency is long and non-deterministic; the current paper proposes a novel approach that enables accurate time-based updates in switches and routers, and demonstrates their applicability to existing network devices.

The simplest way to encode a range in TCAMs is known as prefix encoding [106]. In this scheme the set of values that match a rule is represented as a union of prefix TCAM entries (with a sequence of don't cares as a suffix of the entry), representing disjoint sets of values. For instance, if we denote by W the number of bits used for encoding the range, then for W = 4 the range [4, 14] can be encoded by the four TCAM entries (01∗∗), (10∗∗), (110∗), (1110) that respectively represent the ranges [4, 7], [8, 11], [12, 13] and [14, 14], whose union is the requested range [4, 14]. With this encoding, any range defined on W bits can be encoded with at most 2W − 2 entries. Moreover, an extremal range of the form [T0, 2^W − 1] or [0, T0] can be represented using at most W entries. By using complementary ranges these two bounds were improved to W and ⌈(W + 1)/2⌉, respectively, in [110]. An alternative encoding named SRGE (Short Range Gray Encoding), which relies on the Gray-code representation of ranges, was shown to improve the
maximal expansion to 2W − 4 for general ranges [107]. While the above works focused on the encoding of a single range, a wide literature discusses efficient encodings of classifiers with an ordered list of range rules, [111, 112, 113, 114, 115, 116, 117, 118].
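The prefix encoding described above can be reproduced with a few lines of code. The following Python sketch is our own illustration (it is not taken from the cited works); it enumerates the prefix entries that cover a given range and reproduces the [4, 14] example.

def prefix_entries(lo, hi, w):
    # Return the prefix TCAM entries (strings over {0, 1, *}) whose disjoint
    # union covers the range [lo, hi] of w-bit values.
    entries = []

    def cover(prefix, bits_left, lo_p, hi_p):
        # [lo_p, hi_p] is the block of values matching prefix + bits_left don't cares.
        if hi < lo_p or hi_p < lo:          # block entirely outside the range
            return
        if lo <= lo_p and hi_p <= hi:       # block fully inside: emit one entry
            entries.append(prefix + "*" * bits_left)
            return
        mid = (lo_p + hi_p) // 2            # otherwise split on the next bit
        cover(prefix + "0", bits_left - 1, lo_p, mid)
        cover(prefix + "1", bits_left - 1, mid + 1, hi_p)

    cover("", w, 0, 2 ** w - 1)
    return entries

print(prefix_entries(4, 14, 4))   # ['01**', '10**', '110*', '1110']
print(prefix_entries(9, 15, 4))   # right extremal range: ['1001', '101*', '11**']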

4.3 Understanding TimeFlip via a Simple Example

4.3.1 Timestamp Format

In this example we use the 64-bit NTP timestamp format [119]. This time format represents the time elapsed since the base date, which is January 1st, 1900. The time format consists of two 32-bit fields: (i) Time.Sec: the integer part of the number of seconds since the base date, and (ii) Time.Frac: the fractional part of the number of seconds. We assume that the TCAM lookup key includes the 64-bit timestamp field, T, representing the time at which the packet was received by the switch. As we shall see in the example, only a small portion of this 64-bit field is used in practice. Recall (Sec. 4.2.4) that if T is represented by 64 bits, then every extremal range of the form T ≥ T0 requires at most 64 entries. For instance, if T0 = 1073741824 = 2^30, then the extremal range T ≥ T0 can be represented by two TCAM entries, as depicted in Fig. 4.3a.

Figure 4.3: Time range examples. (a) The extremal range T ≥ 2^30 sec, represented using the 31st and 32nd bits of Time.Sec; this makes use of two TCAM entries. (b) The update of Fig. 4.4 can be implemented by a TimeFlip using a single TCAM entry and a single unmasked bit.
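As an aside, the 64-bit NTP format is easy to manipulate programmatically. The following Python sketch is our own illustration: it packs a time value into (Time.Sec, Time.Frac) and notes the two entries of Fig. 4.3a.

import time

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 (NTP base date) and 1970-01-01 (Unix)

def to_ntp64(unix_time):
    # Pack a Unix time (seconds, possibly fractional) into the 64-bit NTP format:
    # the upper 32 bits hold Time.Sec, the lower 32 bits hold Time.Frac.
    ntp = unix_time + NTP_EPOCH_OFFSET
    sec = int(ntp) & 0xFFFFFFFF
    frac = int((ntp - int(ntp)) * (1 << 32)) & 0xFFFFFFFF
    return (sec << 32) | frac

def split_ntp64(t):
    # Return (Time.Sec, Time.Frac) of a 64-bit NTP timestamp.
    return t >> 32, t & 0xFFFFFFFF

sec, frac = split_ntp64(to_ntp64(time.time()))
print("Time.Sec=0x%08x  Time.Frac=0x%08x" % (sec, frac))

# The extremal range T >= 2^30 sec of Fig. 4.3a uses only bits 31-32 of Time.Sec:
#   Time.Sec = 01******...*, Time.Frac = don't care   (covers [2^30, 2^31 - 1])
#   Time.Sec = 1*******...*, Time.Frac = don't care   (covers [2^31, 2^32 - 1])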

4.3.2 A Path Reroute Scenario

Consider an SDN with five switches (Fig. 4.4). Two traffic flows, f1 and f2, are forwarded according to the 'before' configuration. Now, let us assume that the S1 → S2 path needs to be
shut down for maintenance. Flow f1 must be rerouted to the ‘after’ path, and thus f2 is also diverted to the ‘after’ path to avoid congestion on the S4 → S5 path. In order to reroute the two flows the SDN controller needs to update the configurations of switches S1 and S3 . Note that the maintenance task is not urgent, and thus the controller can tolerate a slight delay of the update; we assume a relaxed TOL = 1 sec. However, the two switches should be updated as close as possible to simultaneously, to avoid temporary congestion on S2 → S5 and S4 → S5 .

Figure 4.4: Flows need to be moved from the 'before' configuration to the 'after' configuration.

We assume that at time Tsend the controller is notified about the required maintenance task, and sends update messages to S1 and S3. In this example Tsend = 1111111110.1234 sec (see Fig. 4.5a). We assume that the update messages are guaranteed to be delivered and installed by the switches no later than 0.1 sec after Tsend. Thus, the controller can schedule the update to take place at or after Tmin = 1111111110.2234. Since we assumed that TOL = 1 sec, the controller can tolerate an update at any time between Tmin and Tmax = 1111111111.2234.

Figure 4.5: Scheduling timelines. (a) Scheduling the update. (b) Installation bounds.

If the controller chooses the scheduled time to be the integer T0 = 1111111111, then the 32-bit Time.Frac can be assigned don't care, and thus the range T ≥ T0 effectively uses only the
upper 32 bits, requiring up to 32 TCAM entries. Once a switch receives a message from the controller it can further reduce the time range representation; if it is guaranteed that the switch installs the new rule less than one second before T0, and removes the timestamp constraint less than a second after T0 (see Fig. 4.5b; after time T0 the switch can assign don't care in the timestamp field, making the TCAM rule permanent), then assigning don't care in the 31 most significant bits of the timestamp does not affect the range rule, requiring just a single TCAM entry (Fig. 4.3b).
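The single-entry encoding of Fig. 4.3b can be derived mechanically. The following Python sketch is our own illustration of the reasoning above (the helper name, the integer treatment of the installation bound, and the alignment assumption are ours); it returns the value/mask pair of the 32-bit Time.Sec field, in which only one bit is compared.

def single_bit_timestamp_entry(t0_sec, delta_sec):
    # Single-entry encoding of a rule that becomes active at time t0_sec, assuming
    # (i) the rule is inserted less than delta_sec before t0_sec and cleaned up
    # less than delta_sec after it, and (ii) t0_sec is a multiple of 2**(V - 1),
    # where V = ceil(log2(2 * delta_sec)).
    # Returns (value, mask) for the 32-bit Time.Sec field; Time.Frac is all don't care.
    # A mask bit of 1 means the bit is compared; 0 means don't care.
    v = max(1, (2 * delta_sec - 1).bit_length())   # V = ceil(log2(2 * delta_sec))
    assert t0_sec % (1 << (v - 1)) == 0, "t0 must be aligned to 2^(V-1)"
    mask = 1 << (v - 1)                            # only one bit of Time.Sec is compared
    value = t0_sec & mask                          # either 0 or 2^(V-1)
    return value, mask

value, mask = single_bit_timestamp_entry(1111111111, delta_sec=1)
print("Time.Sec value = %032d" % int(bin(value)[2:]))
print("Time.Sec mask  = %032d" % int(bin(mask)[2:]))   # a single '1' in the LSB

For T0 = 1111111111 and a one-second installation bound, the mask has a single '1' in the least significant bit of Time.Sec, matching the single unmasked bit of Fig. 4.3b.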

4.3.3 The Intuition Behind the Example

This example combines two key techniques that are unique to TimeFlip: (i) the controller chooses a time T0 within the scheduling tolerance that reduces the required number of entries, and (ii) the switch assigns don't care to the most significant bits that are guaranteed to be constant during the procedure. These two techniques, the former implemented by the controller and the latter by the switch, allow the TimeFlip to be implemented using a single TCAM entry, as shown in Fig. 4.3b. The scheduling tolerance captures the fact that there is no urgency in the required update, and thus the controller can tolerate some flexibility in the selection of T0. However, once T0 is chosen, the update is performed accurately at time T0 ± δ, where δ is the scheduling error of the switch, typically around 1 µsec. Thus, despite the relaxed scheduling tolerance, the accuracy is not compromised; both switches perform the update within the time window 1111111111 ± δ. A periodic time range is a range that is represented using don't care in some of the most significant bits of the timestamp field. In our example, assigning don't care to the 31 upper bits defines a periodic range with a 2-second period, effectively reducing the number of entries required to represent the range by 31. This technique is possible in the current example since it is guaranteed that the value of the upper 31 bits does not change during the lifetime of the TimeFlip.

4.4 Model and Notations

4.4.1 TCAM Entries

A TCAM is an associative memory device that allows fast classification. It compares a search string against a table of stored entries, and returns the address of the matching data. Each address is associated with a specific action. Each TCAM bit can have one of three possible values, 0, 1, or ∗, with the latter representing the don't care value. The order of entries in a TCAM determines their priority. A TCAM search returns the address of the first entry that matches the search key.

Our analysis focuses on a TCAM lookup that is performed by a network device, or a device for short. We assume that the device has access to a clock, and that before a TCAM lookup the device produces a timestamp T, which is obtained by capturing the value of the clock at some instant before the TCAM lookup. For example (Fig. 4.1), the device can capture the timestamp T for each received packet immediately upon its arrival. The timestamp, together with the packet header, will serve as an input to the TCAM lookup.

A TCAM entry is denoted by S → a, where S = (σU, ..., σ1) ∈ {0, 1, ∗}^U. The number of bits in a TCAM entry is denoted by U, where 0, 1 are bit values and ∗ stands for don't care. We denote a sequence of m don't care bits by (∗^m). A bit that is assigned the don't care value is said to be masked. The set of possible actions is denoted by A, where an individual action is denoted by a ∈ A. Specifically, throughout our analysis S will have the form (su, ..., s1, tW, ..., t1) such that u + W = U, and tW, ..., t1 represent the bits corresponding to the timestamp T. We denote the m most significant bits of the timestamp T by T^{m+}, and the k least significant bits of T by T^{k-}. We define a time-based TCAM entry to be an entry in which at least one of the bits tW, ..., t1 has a value in {0, 1}, whereas in a time-oblivious entry all the bits tW, ..., t1 are ∗. A time range is defined to be an interval [T1, T2], where T1 and T2 are W-bit integers. A time range rule is denoted by (su, ..., s1, [T1, T2]), or equivalently, (su, ..., s1, [T1 ≤ T ≤ T2]). Such a rule can be represented by one or more time-based TCAM entries. The rule expansion of a
range [T1 , T2 ] is the minimal number of entries that can be used for representing the range. In the context of this paper we focus on prefix-based expansions [106], in which only prefix entries are used. Further discussion about the use of prefix encoding and other encoding schemes is presented in Sec. 4.8.5. An extremal range is a range that has one of two possible forms, either a right extremal range, which has the form [T1 , 2W − 1], also denoted by T ≥ T1 , or a left extremal range, [0, T2 ]. We denote by r(T1 ) the prefix-based expansion of a right range, [T1 , 2W − 1], and by `(T2 ) the expansion of the left range [0, T2 − 1]. Note that `(0) is undefined.

4.4.2 TimeFlip: Theory of Operation

Consider a coordinated network update that is due to take place at time T0 and requires a TCAM update. The naïve approach to updating the TCAM is to schedule the device's management software (a network device typically runs a software layer that performs various tasks, including TCAM management) to perform the required modification as close as possible to T0. This approach allows a limited degree of accuracy, which depends on the operating system's ability to perform real-time operations, and on the load caused by other tasks that run on the CPU. In our approach the TCAM management software installs the time-based TCAM entries ahead of the activation time, T0, allowing the update to be applied precisely at T0. After time T0 the management software performs cleanup operations, e.g., removing rules that apply only to times T < T0. We address four main classes of TimeFlips (see Fig. 4.6): (i) Installation. A new TCAM rule S → a is installed, effective from time T0. A timed installation is a rule of the form (su, ..., s1, [T ≥ T0]) → a. (ii) Removal. An existing TCAM rule S → a is scheduled to be deactivated at time T0, using a rule of the form (su, ..., s1, [T < T0]) → a. (iii) Rule update. An existing rule S → a is modified to S′ → a′ at T0. A rule update can simply be represented as a pair of rules, one for removal and one for installation. (iv) Action update. The action of an existing TCAM rule is modified from a to a′ at time T0.


Figure 4.6: A timed TCAM update. Every line in the figure is a time range rule, represented by one or more TCAM entries. (i) Time-oblivious entry. (ii) Installation. (iii) Removal. (iv) Rule update. (v) Action update. (vi) Action update using a complementary timestamp range.

Hence, prior to T0 the management software installs a rule of the form (su, ..., s1, [T ≥ T0]) → a′ that precedes the existing S → a. The first-match behavior of TCAMs implies that if T ≥ T0, the search matches the newly installed rule, and a′ is performed, whereas at any time before T0 the S → a rule prevails. After time T0 the TCAM management software removes the excess rules: the rule S → a is deleted, and (su, ..., s1, [T ≥ T0]) → a′ is replaced by (su, ..., s1, ∗, ..., ∗) → a′, requiring a single entry. As shown in Fig. 4.6(vi), an alternative representation of the action update uses a rule that maps T < T0 to a, and S to a′. Complementary encoding [110, 120], also known as negative encoding, in some cases allows the rule to be represented more efficiently, as further discussed in Sec. 4.6. TimeFlip can be used in any of these four scenarios. We believe that the action update is the most interesting of the four, since a scenario that requires changing the behavior of an existing flow under traffic is more likely to require a delicate update procedure, and hence is more likely to require TimeFlips. Related literature [22, 30] also focuses on action update scenarios rather
than installation and removal scenarios. For the sake of clarity, we start by discussing the simpler case of timed installations (Sec. 4.5), and then extend these results to action updates (Sec. 4.6).
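To make the first-match behavior and the rule layouts of Fig. 4.6 concrete, the following Python sketch models a TCAM as an ordered list of (value, mask, action) entries whose search key includes the timestamp, and performs a timed action update as in Fig. 4.6(v). The class, field widths and values are our own illustrative assumptions, not an interface of any real device.

class Tcam:
    # An ordered list of entries; a lookup returns the action of the first match.
    def __init__(self):
        self.entries = []                  # (value, mask, action), highest priority first

    def add(self, value, mask, action, top=False):
        entry = (value, mask, action)
        if top:
            self.entries.insert(0, entry)
        else:
            self.entries.append(entry)

    def lookup(self, key):
        for value, mask, action in self.entries:
            if key & mask == value & mask:
                return action
        return None

W = 8                                      # toy timestamp width in bits
def key(header_bits, timestamp):           # search key = header bits followed by timestamp
    return (header_bits << W) | timestamp

tcam = Tcam()
flow = 0b1010                              # the s_u ... s_1 part of the rule
tcam.add(key(flow, 0), key(0xF, 0), "a")   # time-oblivious rule S -> a

# Timed action update: from time T0 = 192, action a' applies. With prefix encoding,
# T >= 192 = 0b11000000 is the single timestamp entry 11******.
tcam.add(key(flow, 0b11000000), key(0xF, 0b11000000), "a'", top=True)

print(tcam.lookup(key(flow, 100)))         # before T0 -> a
print(tcam.lookup(key(flow, 200)))         # at/after T0 -> a'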

4.4.3 Timed Installation: Formal Definition

Let R be a time range. Given a time-oblivious TCAM entry S → a with S = (su , . . . , s1 , ∗, . . . , ∗), we define a timed installation of S → a over R to be a TCAM rule SR → a, such that SR := (su , . . . , s1 , R). Hence, a is activated during the time range R. We define the expansion of a timed installation over a time range R to be the expansion of the range R. Since R is a time range, SR is represented by one or more entries in the TCAM. We note that even if more than one entry is used to represent SR , the excess entries are required only for a brief period of time; we assume that shortly after SR is activated the TCAM management software performs a cleanup, leaving only a single entry, representing S → a.

4.5 Optimal Time-based Rule Installation

4.5.1 Optimal Scheduling

It has been shown [106] that an extremal range of the form [T0, 2^W − 1] can be represented using at most W entries. However, we observe that a careful selection of the value T0 can significantly reduce the number of entries required for representing this update. The update time T0 may be tuned to an optimal value in the following scenarios: (i) In Atomic Bundle updates, a network device is required to perform a set of changes atomically, without strict timing constraints, and hence it is flexible to select the time T0 at which these changes are performed. (ii) In network-wide coordinated updates, optimal scheduling can be enforced by a central entity that determines the update time, e.g., a Network Management Station (NMS) or an SDN controller. The central entity's goal is to find a value of T0 (within a set of allowed values) that will minimize the timestamp range expansion; the underlying assumption is that all network devices use the same format to represent the timestamp, and the same TCAM range encoding scheme.

As depicted in Fig. 4.2, we assume that the value of T0 is determined by a scheduling algorithm, subject to the constraint Tmin ≤ T0 ≤ Tmax. We define the scheduling tolerance, denoted by TOL, to be Tmax − Tmin + 1. This is the number of allowable values for T0. We wish to study how T0 should be selected. As a first step we learn the expansion of a specific extremal range [T0, 2^W − 1]. Property 4.1, based on [106], shows that the expansion of this range is given by the number of '1'-s in a binary representation of 2^W − T0.

Property 4.1. The expansion r(T0) of a right range [T0, 2^W − 1] is given by the number of '1'-s in a binary representation of the number of values in the range, 2^W − 1 − T0 + 1 = 2^W − T0.

Proof Outline. The property follows from the definition of the prefix encoding as defined in [106]. The encoding is composed of entries that consider disjoint sets of inputs. The cardinality of each set is a power of 2, and distinct sets have different cardinalities. The sum of the cardinalities equals the number of values in the range.

Example 4.2. For W = 4, the range [T0, 2^W − 1] = [9, 15] includes 15 − 9 + 1 = 7 values. The binary representation of 7 is 0111, which has three '1'-s, and accordingly the range can be encoded using the three entries (1001), (101∗), (11∗∗). Likewise, the range [11, 15] has 15 − 11 + 1 = 5 = 0101 values and can be encoded by the two entries (1011), (11∗∗). The range [15, 15] has a single value (1 = 0001) and can be encoded by the single entry (1111).

By symmetry, we can show that the expansion ℓ(T0) of the range [0, T0 − 1] is given by the number of '1'-s in a binary representation of the number of values in the range, T0 − 1 − 0 + 1 = T0. The next theorem relates the expansion of a right range [T0, 2^W − 1] and a left range [0, T0 − 1].

Theorem 4.3. For W ≥ 1 and T0 ∈ [1, 2^W − 1], the expansion r(T0) of the right range [T0, 2^W − 1] and the expansion ℓ(T0) of the left range [0, T0 − 1] satisfy r(T0) + ℓ(T0) ≤ W + 1.

Proof. Let Br = 2^W − 1 − T0 + 1 = 2^W − T0 and Bℓ = T0 − 1 + 1 = T0 be the number of values in the right and the left ranges, respectively. Clearly, Br + Bℓ = 2^W. Let (b_{r,W}, ..., b_{r,1}) and
(b_{ℓ,W}, ..., b_{ℓ,1}) be the binary representations of Br, Bℓ. Let nr = Σ_{i∈[1,W]} b_{r,i} and nℓ = Σ_{i∈[1,W]} b_{ℓ,i}. By Property 4.1, we have that r(T0) = nr and ℓ(T0) = nℓ. We show the result by induction. First, if W = 1, we have that T0 = 1. The right range [1, 1] and the left range [0, 0] can both be encoded in a single entry and r(T0) + ℓ(T0) = 1 + 1 = 2 ≤ W + 1. For a general W, we distinguish between the two following sub-cases. If b_{r,1} = 1, i.e., Br is odd (and accordingly b_{ℓ,1} = 1, i.e., Bℓ is odd as well, since Br + Bℓ = 2^W), we have that (2^W − 1) − Br = Bℓ − 1. Since (2^W − 1) can be represented by a binary vector with W '1'-s, the number of '1'-s in (2^W − 1) − Br is W − nr. By the last equality this equals nℓ − 1, the number of '1'-s in Bℓ − 1. We then have that W − nr = nℓ − 1 and r(T0) + ℓ(T0) = nr + nℓ = W + 1. In the second sub-case b_{r,1} = b_{ℓ,1} = 0 and Br, Bℓ are even. Then, r(T0) = nr = Σ_{i∈[1,W]} b_{r,i} = Σ_{i∈[2,W]} b_{r,i} and ℓ(T0) = nℓ = Σ_{i∈[1,W]} b_{ℓ,i} = Σ_{i∈[2,W]} b_{ℓ,i}. We now consider T0′ = 0.5 · T0 with W′ = W − 1 and examine the expansions r(T0′) and ℓ(T0′) within the space [0, 2^{W′} − 1]. We have that 2^{W′} − T0′ = 0.5 · (2^W − T0) and the number of values in [T0′, 2^{W′} − 1] is represented by (b_{r,W}, ..., b_{r,2}), so r(T0′) = Σ_{i∈[2,W]} b_{r,i} = r(T0). Likewise, since T0′ = 0.5 · T0, the number of values in [0, T0′ − 1] is represented by (b_{ℓ,W}, ..., b_{ℓ,2}) and ℓ(T0′) = Σ_{i∈[2,W]} b_{ℓ,i} = ℓ(T0). Accordingly, r(T0) + ℓ(T0) (in W) equals the sum of the expansions r(T0′) + ℓ(T0′) in W′ = W − 1. By the induction hypothesis we have that r(T0′) + ℓ(T0′) ≤ W′ + 1 = W − 1 + 1 = W ≤ W + 1 and the result follows.

We can now introduce the SCHEDULE algorithm (Fig. 4.7), which computes an optimal value, TSCH, for a given range [Tmin, Tmax]. Throughout the paper we use the notation TSCH, defined by TSCH := SCHEDULE(Tmin, Tmax, W), and the range RSCH, defined by RSCH := [TSCH, 2^W − 1]. Intuitively, SCHEDULE performs a binary search over the range [0, 2^W − 1], and returns the first value that falls within [Tmin, Tmax]. Notably, we shall see that due to the nature of the binary search, SCHEDULE returns the value TSCH with the fewest '1'-s in the binary representation of 2^W − TSCH, and hence, by Property 4.1, minimizes r(TSCH). In terms of complexity, the number of iterations in SCHEDULE is bounded by W, as it is a binary search over a range of 2^W values. The following theorem states that SCHEDULE is optimal, i.e., that no other scheduling algorithm produces an extremal range with a lower expansion.


SCHEDULE(Tmin, Tmax, W)
1   t_0 ← 0; i ← 0
2   while t_i ∉ [Tmin, Tmax]
3       i ← i + 1
4       if t_{i−1} < Tmin
5           t_i ← t_{i−1} + 2^{W−i}
6       else
7           t_i ← t_{i−1} − 2^{W−i}
8   TSCH ← t_i
9   return TSCH

Figure 4.7: Optimal scheduling algorithm; no other scheduling algorithm produces an extremal range with a lower expansion.
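For concreteness, the following is a Python transcription of SCHEDULE together with the popcount expansion of Property 4.1. The brute-force check of the optimality claim of Theorem 4.4 below is our own illustration; the parameters W = 6 and TOL = 4 are arbitrary.

def expansion_right(t0, w):
    # r(T0): prefix expansion of the right extremal range [T0, 2^w - 1],
    # i.e., the number of '1'-s in the binary representation of 2^w - T0 (Property 4.1).
    return bin(2 ** w - t0).count("1")

def schedule(t_min, t_max, w):
    # Binary search over [0, 2^w - 1]; returns the first probe that falls
    # within [t_min, t_max] (the SCHEDULE algorithm of Fig. 4.7).
    t, i = 0, 0
    while not (t_min <= t <= t_max):
        i += 1
        t = t + 2 ** (w - i) if t < t_min else t - 2 ** (w - i)
    return t

W = 6
for t_min in range(2 ** W - 3):
    t_max = t_min + 3                          # scheduling tolerance TOL = 4
    t_sch = schedule(t_min, t_max, W)
    best = min(expansion_right(t, W) for t in range(t_min, t_max + 1))
    assert expansion_right(t_sch, W) == best   # SCHEDULE matches the brute-force optimum
print("SCHEDULE matched the brute-force optimum for all windows (W=6, TOL=4).")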

Theorem 4.4. If TSCH = SCHEDULE(Tmin, Tmax, W), then r(TSCH) ≤ r(T) for all T ∈ [Tmin, Tmax].

Proof. Clearly, if Tmin = 0, then TSCH = 0, and the range [TSCH, 2^W − 1] can be represented by a single entry, (∗^W). For Tmin > 0, without loss of generality, TSCH is determined by SCHEDULE after m iterations, i.e., i = m on line 8 of SCHEDULE. We prove the claim by induction on m ≥ 1. Denote 2^W − TSCH by B. By Property 4.1, r(TSCH) is equal to the number of '1'-s in the representation of B. For m = 1 we have that TSCH = 2^{W−1}, and thus B = 2^{W−1}. Since the binary representation of B is (10...0), we have r(TSCH) = 1, which is of course optimal. Now we assume the claim holds for every T′SCH that is computed when SCHEDULE returns after m iterations. Let TSCH be a value returned by SCHEDULE after m + 1 iterations. We distinguish between two cases: (i) TSCH > 2^{W−1}: we now ignore the most significant bit of the timestamp field, and reexamine the algorithm's outcome. The algorithm SCHEDULE(Tmin^{(W−1)-}, Tmax^{(W−1)-}, W − 1) returns TSCH^{(W−1)-} after m iterations, and thus by the induction hypothesis TSCH^{(W−1)-} is optimal in [0, 2^{W−1} − 1]. Now assume by way of contradiction that there exists a time Tmin ≤ T′ ≤ Tmax such that r(T′) < r(TSCH). Thus, the range [T′, 2^W − 1] can be represented by fewer entries than the expansion of [TSCH, 2^W − 1], and by removing the most significant bit of the rule [T′, 2^W − 1] we get that r(T′^{(W−1)-}) < r(TSCH^{(W−1)-}), contradicting the induction hypothesis. (ii) TSCH < 2^{W−1}: similarly to the first case, by observing the range [0, 2^{W−1} − 1] we deduce that TSCH^{(W−1)-} is obtained after m iterations, and is thus optimal. Assume by way of contradiction that there is a T′ ∈ [Tmin, Tmax] such that r(T′) < r(TSCH). Denote 2^W − T′ by B′. Note that [Tmin, Tmax] ⊂ [0, 2^{W−1} − 1], since otherwise we would have 2^{W−1} ∈ [Tmin, Tmax], and SCHEDULE would terminate after one iteration. Thus, T′ < 2^{W−1}. It follows that B′^{1+} = B^{1+} = 1. Since r(T′) < r(TSCH) in [0, 2^{W−1} − 1], by Property 4.1 we have that the number of '1'-s in B′ is smaller than the number of '1'-s in B. We conclude that there are fewer '1'-s in B′^{(W−1)-} than in B^{(W−1)-}, yielding r(T′^{(W−1)-}) < r(TSCH^{(W−1)-}), which is in contradiction to the induction hypothesis.

An interesting property of SCHEDULE is presented in Lemma 4.5: the output of the algorithm, TSCH, has a long sequence of least significant '0' bits. This property allows very efficient prefix encoding of extremal ranges of the form T ≥ TSCH.

Lemma 4.5. The ⌊log2(TOL)⌋ least significant bits of TSCH are all '0'.

Proof. We denote ⌊log2(TOL)⌋ by X. For TSCH = 0 we have W bits of '0', and the claim is satisfied. For TSCH > 0, we prove this claim by induction on m, the number of iterations in SCHEDULE. For m = 1 we have TSCH = 2^{W−1} with W − 1 least significant bits of '0', and since TOL < 2^W, we have X ≤ W − 1, and thus the claim is satisfied. We assume that the claim holds for m − 1, and prove for m. We consider two distinct cases: (i) TSCH > 2^{W−1}: in this case [Tmin, Tmax] ⊂ [2^{W−1}, 2^W − 1]. By considering the (W − 1)-bit shifted range [2^{W−1}, 2^W − 1], SCHEDULE produces TSCH^{(W−1)-} after m − 1 iterations, and thus by the induction hypothesis the X least significant bits are '0', which is true also for TSCH. (ii) TSCH < 2^{W−1}: this case is similar to (i), except that [Tmin, Tmax] ⊂ [0, 2^{W−1} − 1], and thus we can run SCHEDULE on [0, 2^{W−1} − 1], again concluding from the induction hypothesis that the X least significant bits are '0'.

The following lemma presents an upper bound on the expansion of the range T ≥ TSCH, as a function of the scheduling tolerance. This captures a tradeoff between the scheduling tolerance
and the time-based range expansion; it is possible to reduce the expansion of a timed installation by increasing the scheduling tolerance.

Lemma 4.6. If TOL < 2^W, then r(TSCH) ≤ W − ⌊log2(TOL)⌋.

Proof. Denote 2^W − TSCH by B, and ⌊log2(TOL)⌋ by X. By Lemma 4.5, the X least significant bits of TSCH are '0', and hence the X least significant bits of B are '0'. Thus, the number of '1'-s in the representation of B does not exceed W − X, and by Property 4.1 we have r(TSCH) ≤ W − X.

4.5.2 Average Expansion

In this section we study the influence of the scheduling tolerance on the average expansion. We concentrate on the average expansion of the prefix-based encoding of ranges of the form [T0, 2^W − 1] for a given W. Intuitively, for a larger scheduling tolerance TOL within which T0 should be selected, the flexibility is larger and the expansion for the best of the options is expected to be smaller. Our model is the following. For a given W and TOL ∈ [1, 2^W], we examine the possible [Tmin, Tmax] values that enable TOL possible options. These are the 2^W − TOL + 1 values [0, TOL − 1], [1, TOL], [2, TOL + 1], ..., [2^W − TOL − 1, 2^W − 2], [2^W − TOL, 2^W − 1]. As we described, for each [Tmin, Tmax] we use the SCHEDULE algorithm to calculate a T0 that has an expansion of min_{T0∈[Tmin,Tmax]} r(T0). Based on Property 4.1, for [Tmin, Tmax], TSCH is the value that minimizes the number of '1'-s in a binary representation of 2^W − T0 ∈ [2^W − Tmax, 2^W − Tmin]. We calculate the average value of r(TSCH) among the 2^W − TOL + 1 possible values of [Tmin, Tmax]. We denote this value by ρ(W, TOL).

Theorem 4.7. For W ≥ 1 and TOL ∈ [1, 2^W], let a = ⌈log2(TOL)⌉. The average value of the expansion of the easiest-to-encode range according to a set [Tmin, Tmax] with TOL possible values
is given by

ρ(W, TOL) = (1 / (2^W − TOL + 1)) · ( 1 − 2^{W−a} + 2^{W−a−1} · (W − a + 2) · (2^a − TOL + 1) + Σ_{i=0}^{W−a−1} 2^{i−1} · (i + 2) · (TOL − 1) ).

Proof. Denote 2^W − Tmin and 2^W − Tmax by Bmin and Bmax, respectively. Likewise, let B0 represent the value of 2^W − T0 for some T0. Intuitively, a value of [Tmin, Tmax] defines a range [Bmax, Bmin] from which B0 can be selected. Let d ∈ [0, W] be the maximal number of most significant bits such that Bmin^{d+} = Bmax^{d+}. We distinguish between several cases according to the value of d. We first assume that Tmin ≥ 1 (and accordingly Bmin ≤ 2^W − 1). There are TOL − 1 possible values of [Bmax, Bmin] with d = 0. These are the values that include 2^{W−1} − 1 and 2^{W−1}. The TOL − 1 options are [2^{W−1} − TOL + 1, 2^{W−1}], ..., [2^{W−1} − 1, 2^{W−1} + TOL − 2]. In each of these options we can select B0 = 2^{W−1}, T0 = 2^W − B0 = 2^{W−1}, and encode the range [T0, 2^W − 1] = [2^{W−1}, 2^W − 1] by a single entry. There are 2 · (TOL − 1) values of [Bmax, Bmin] for which d = 1. These are [x + 2^{W−2} − TOL + 1, x + 2^{W−2}], ..., [x + 2^{W−2} − 1, x + 2^{W−2} + TOL − 2] for x ∈ {0, 2^{W−1}}. The (TOL − 1) options with x = 0 can be encoded with a single entry while the (TOL − 1) options with x = 2^{W−1} require 2 entries. They require a total number of 1 · (TOL − 1) + 2 · (TOL − 1) = 2^d · (TOL − 1) · (1 + d/2) entries. More generally, for i ∈ [0, W − a − 1] there are 2^i · (TOL − 1) values of [Bmax, Bmin] with d = i that require a total number of 2^i · (TOL − 1) · (1 + i/2) = 2^{i−1} · (TOL − 1) · (2 + i) entries. For d = W − a, there are 2^{W−a} · (2^a − TOL + 1) values of [Bmax, Bmin]. In 2^{W−a} of them, B0 can be selected such that it has a last bits of 0. Thus these 2^{W−a} values of [Bmax, Bmin] require a total number of 2^{W−a} · ((W − a)/2) entries. Similarly to the previous detailed cases, the other 2^{W−a} · (2^a − TOL) are encoded each with ((W − a)/2 + 1) entries on average, requiring a total number of 2^{W−a} · (2^a − TOL) · ((W − a)/2 + 1) = 2^{W−a−1} · (2^a − TOL) · (W − a + 2) entries. By summing these requirements together with the single entry required for the case of Tmin = 0, we deduce the suggested average for the 2^W − TOL + 1 possible values of [Tmin, Tmax].

Example 4.8. Again, let W = 4. For TOL = 2, we consider the 2^W − TOL + 1 = 15 possible ranges [Tmin, Tmax]: [0, 1], [1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7], [7, 8], [8, 9], [9, 10], [10, 11], [11, 12], [12, 13], [13, 14], [14, 15]. For TOL = 2, we have a = ⌈log2(TOL)⌉ = 1. By Theorem 4.7, we have that the average expansion here is ρ(W, TOL) = (1/15) · (1 − 2^3 + 2^2 · (3 + 2) · (2 − 2 + 1) + Σ_{i=0}^{2} 2^{i−1} · (i + 2) · 1) = 25/15 = 5/3. Indeed, for the first of these 15 options, we can set T0 = 0 and encode the range [0, 15] by the single entry (∗∗∗∗). For [1, 2], [2, 3] we set T0 = 2 and encode [2, 15] by the three entries (001∗), (01∗∗), (1∗∗∗). Likewise, for [3, 4], [4, 5], [5, 6], [6, 7] two entries are required. For [7, 8], [8, 9] we set T0 = 8 and encode [8, 15] in a single range. For [9, 10], [10, 11] two entries are required (T0 = 10) while for [11, 12], [12, 13], [13, 14], [14, 15] a single entry is required (for T0 = 12 or T0 = 14). This yields an average number of (1/15) · (1 + 2 · 3 + 4 · 2 + 2 · 1 + 2 · 2 + 4 · 1) = 25/15 = 5/3 = ρ(W, TOL), as suggested by Theorem 4.7.
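The numbers in Example 4.8 can be reproduced by brute force; the following Python sketch (our own illustration) averages the best achievable expansion over all 15 windows, using the popcount expansion of Property 4.1.

def r(t0, w):
    # Prefix expansion of [t0, 2^w - 1], per Property 4.1.
    return bin(2 ** w - t0).count("1")

def average_expansion(w, tol):
    # Average, over all windows [Tmin, Tmax] with TOL values, of the best
    # achievable expansion r(T0) for T0 in the window, i.e., rho(W, TOL).
    windows = range(2 ** w - tol + 1)
    best = [min(r(t0, w) for t0 in range(t_min, t_min + tol)) for t_min in windows]
    return sum(best) / len(best)

print(average_expansion(4, 2))   # 1.666..., i.e., 25/15 = 5/3, as in Example 4.8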

4.5.3 Installation Bounds and Periodic Ranges

As noted in Sec. 4.4.2, setting up a timed installation rule requires a two-step procedure (see Fig. 4.8); in the insertion step, the TCAM management software installs the timestamp-dependent TCAM rule representing the configuration that should take place starting at time T0. In the cleanup step, the management software removes the timestamp dependency of the rules representing the new configuration, leaving a single time-oblivious TCAM entry. In this section we assume that the insertion and cleanup operations are performed within well-known installation bounds, denoted by ∆ (in practice, the installation bounds may be high in some network devices; we further discuss how this affects TimeFlip in Sec. 4.8.3), i.e., it is guaranteed that the time-based rule is inserted no sooner than ∆ before T0, and is cleaned up by time T0 + ∆ − 1. We shall show that guaranteed installation bounds significantly reduce the number of TCAM entries required for a TimeFlip; rather than defining a range T ≥ T0, one can define the range [T0, T1] for some T1 ≥ T0 + ∆ − 1, using fewer TCAM entries with effectively the same impact.

Two ranges, R1 and R2, are said to be ∆-similar, denoted R1 ∼∆ R2, if there exists a value T0 such that R1 ∩ R2 ⊇ [T0, T0 + ∆ − 1] and (R1 ∪ R2) ∩ [T0 − ∆, T0 − 1] = ∅.

Figure 4.8: Installation bounds.

Given an extremal range RT0 = [T0, 2^W − 1], since RT0 is only observed during the period [T0 − ∆, T0 + ∆ − 1], every range R that is ∆-similar to RT0 produces the same TCAM match results during this period. Hence, every timed installation over RT0 can be represented by an equivalent timed installation over R.

(i)

T1 time

n 2V

(n+1) 2V

T0

(ii)

T1 time

n 2V

(n+1) 2V -

-

Figure 4.9: Periodic ranges: the 2V -periodic continuation of [T0 , T1 ]. (i) For T1V > T0V . (ii) For -

-

T1V < T0V . In B OUNDED R ANGE T1 = T0 + 2V −1 − 1. -

We define the 2V -periodic continuation of a time range [T0 , T1 ], denoted by [T0 , T1 ]V , to be the range defined by masking the W − V most significant bits of the timestamp, i.e., R pc := 2W −V S −1 n=0

-

-

-

-

([T0V , T1V ]+n·2V ). Moreover, if T1V < T0V , then R pc =

2W −V S −1 n=0

-

-

(([T0V , 2V −1]∪[0, T1V ])+

n · 2V ). The expansion of a periodic range R pc is the number of entries used for representing the range. Intuitively, periodic ranges (Fig. 4.9) allow efficient representation of T IME F LIPs. A 2V periodic range is encoded with don’t care in its W −V most significant bits, and thus the number

4. T IME F LIP: Scheduling Updates with Timestamp-based TCAM Ranges

107

of bits required to represent such a range is V . We now introduce the B OUNDED R ANGE algorithm. Given a scheduling time, T0 , and an ∆

extremal range, RT0 = [T0 , 2W − 1], the algorithm computes a periodic range RBR ∼RT0 that, for a sufficiently small ∆, has a smaller expansion than RT0 . B OUNDED R ANGE(T0 , ∆,W ) 1 V ← dlog2 (2∆)e 2 return [T0 , T0 + 2V −1 − 1]V Figure 4.10: Determining a range with installation bounds ∆.

Lemma 4.9. If RBR = B OUNDED R ANGE(T0 , ∆,W ), then the expansion of every timed installation over RBR is bounded by dlog2 (2∆)e. -

Proof. We denote dlog2 (2∆)e by V . We analyze the periodic range [T0 , T0 + 2V −1 − 1]V , focusing on a single range of 2V values, and distinguish between two cases: (W −V )+

(i) T0

+

= (T0 + 2V −1 − 1)(W −V ) : in this case (depicted in Fig. 4.9(i)) we have a shifted -

-

V -bit range, [T0V , (T0 + 2V −1 − 1)V ]. We shall show that this range has a worst-case expansion -

-

of V . We analyze the two sub-ranges [T0V , 2V −1 − 1] and [2V −1 , (T0 + 2V −1 − 1)V ]. Both subranges are in fact (V − 1)-bit shifted extremal ranges. The expansions of these two sub-ranges (V −1)-

are r(T0

(V −1)-

) and `(T0

), respectively. By Lemma 4.3, over a (V − 1)-bit field we have

r(T ) + `(T ) ≤ V for all T . It follows that the expansion of the two sub-ranges is V . (W −V )+

(ii) T0

+

6= (T0 + 2V −1 − 1)(W −V ) : in this case (depicted in Fig. 4.9(ii)), by definition of -

-

a 2V -periodic continuation we have a shifted V -bit range [T0V , 2V − 1] ∪ [0, (T0 + 2V −1 − 1)V ]. As in case (i), we have two (V − 1)-bit complementary extremal ranges, and thus the worst-case expansion of the two sub-ranges is V . The following theorem states that when the scheduling tolerance, TOL, is sufficiently large, a timed installation can be represented by a single TCAM entry.

4. T IME F LIP: Scheduling Updates with Timestamp-based TCAM Ranges

108 ∆

Theorem 4.10. If TOL ≥ 2dlog2 (∆)e , then there exists a range R such that R∼RS CH , and the expansion of every timed installation over R is 1. Proof. Since TOL ≥ 2V −1 , it follows that blog2 (TOL)c ≥ 2V −1 . There exists an integer n such that TS CH = n · X, since by Lemma 4.5 the X least significant bits of TS CH are ‘0’. We define ∆

RBR := B OUNDED R ANGE(TS CH , ∆,W ). By definition of B OUNDED R ANGE, RBR ∼RS CH . Thus, at least one of the following must hold: • There exists an integer n such that TS CH = n · 2V . Using B OUNDED R ANGE we have RBR = -

[0, 2V −1 ]V , which can be encoded by a single entry where the timestamp field has the value ∗W −V , 0, ∗V −1 , i.e., the only unmasked bit is tV = 0. -

• There exists an integer n such that TS CH = (2n + 1) · 2V −1 . Thus, TS CHV = 2V −1 . By using -

B OUNDED R ANGE we have RBR = [2V −1 , 2V − 1]V , which can be encoded by the single entry ∗W −V , 0, ∗V −1 , i.e., the only unmasked bit is tV = 1.

The following theorem generalizes the observations about the scheduling tolerance and installation bounds, and provides the worst-case expansion as a function of TOL and ∆. Theorem 4.11. If RBR = B OUNDED R ANGE(TS CH , ∆,W ), and TOL < 2dlog2 (∆)e , then the expansion of every timed installation over RBR is bounded by dlog2 (2∆)e − blog2 (TOL)c. Proof Outline. The proof is based on Lemma 4.6 and Lemma 4.9.

4.5.4

Timestamp Field Size in Bits

In the analysis so far we have been assuming that the timestamp field is a W -bit field. This implies that in every TCAM that requires timed installations, W bits of every entry would be “wasted” on the timestamp field. In this section we analyze how the timestamp field can be significantly reduced, depending on the scheduling tolerance and installation bounds. We show

4. T IME F LIP: Scheduling Updates with Timestamp-based TCAM Ranges

109

that the size of the timestamp field (Fig. 4.11) is affected by two factors of the system: (i) If it is well-known that every T IME F LIP is scheduled with a scheduling tolerance TOL, then the X = blog2 (TOL)c least significant bits of the timestamp field are always don’t care, and thus can be omitted from the timestamp field. (ii) If there are guaranteed installation bounds, ∆, the use of a 2V -periodic range, for V = dlog2 (2∆)e, allows the W −V most significant bits to be omitted. X bits

W-V bits function of

timestamp field

* * * *

function of TOL

* * * * * * * W bits

Figure 4.11: Example of 1-bit timestamp, per Theorem 4.13.

Lemma 4.12. If TOL < 2W , then the range RS CH can be represented by a timestamp field of W − blog2 (TOL)c bits. Proof. We denote blog2 (TOL)c by X. By Lemma 4.5 the X least significant bits of TS CH are ‘0’. Thus, every prefix-based encoding of RS CH has don’t care on the X least significant bits. Hence, RS CH can be represented by the W − X most significant bits. ∆

Theorem 4.13. If TOL ≥ 2dlog2 (∆)e , then there exists a range R such that R∼RS CH , and R can be represented by a timestamp field of a single bit. ∆

Proof. The proof is very similar to the proof of Theorem 4.10; the range RBR satisfies RBR ∼RS CH , and can be represented by a single rule, where the only unmasked bit is tV , for V = dlog2 (2∆)e.

Theorem 4.14. If TOL < 2W , the installation bounds are given by ∆, and TOL < 2dlog2 (∆)e , then ∆

there exists a range R such that R∼RS CH , and R can be represented by a timestamp field of dlog2 (2∆)e − blog2 (TOL)c bits. ∆

Proof. The range RBR satisfies RBR ∼RS CH . By definition of B OUNDED R ANGE, the most significant W −V bits are masked, and can thus be omitted. By Lemma 4.12 the X least significant bits

4. T IME F LIP: Scheduling Updates with Timestamp-based TCAM Ranges

110

are masked and can thus be omitted. Hence, we are left with V − X = dlog2 (2∆)e − blog2 (TOL)c bits. It is well-known [106] that a W -bit extremal range has a worst-case expansion W , i.e., there is a tight coupling between the expansion of an extremal range and the number of bits used to represent it. Thus, it is not surprising that our results show that this coupling applies to timebased ranges as well, as seen in Theorems 4.11 and 4.14. Specifically, in a system that uses a 1-bit timestamp, per Theorem 4.13, every T IME F LIP is represented by a single entry, as shown in Theorem 4.10.

4.6

Optimal Time-based Action Updates

In the previous section we analyzed timed installations (Fig. 4.6(ii)). In this section we briefly discuss these results in the context of timed action updates (Fig. 4.6(v)), and show that the number of entries required to represent a time-based action update is, in the worst case, roughly half of the number of entries required to represent a timed installation. Significantly, timed action updates can be represented either by positive encoding (Fig. 4.6(v)) or by negative encoding (Fig. 4.6(vi)). It was shown [110] that a W -bit extremal range can be represented by d W 2+1 e entries, by choosing the best of the positive or the negative encoding. Let R be a time range, and define Rc := [0, 2W − 1] \ R to be the complementary range of R. Given a time-oblivious TCAM entry S → a with S = (su , . . . , s1 , ∗, . . . , ∗), we define a timed action update of S over R as a pair of TCAM rules (SR → aR , S → a), such that SR := (su , . . . , s1 , R). Hence, aR is activated during the time range R. Note that the order of the rules is of importance, since a TCAM lookup can match S only if it does not match SR . Given a timed action update, (SR → aR , S → a), we define its negative encoding as (SRc → a, S → aR ), such that SRc := (su , . . . , s1 , Rc ). We define the expansion of a timed action update over a time range R, denoted by e(R), as the expansion of R. In this context e(R) is the minimum between the positive encoding of R and its negative encoding, given by Rc . Note that in both cases, positive and negative, the expansion

4. T IME F LIP: Scheduling Updates with Timestamp-based TCAM Ranges

111

does not include the time-oblivious entry, S. The following theorem defines an upper bound on the expansion of a timed action update over a W -bit extremal range. Theorem 4.15. If e(RT0 ,W ) is the expansion of a timed action update over RT0 ,W = [T0 , 2W − 1], then e(RT0 ,W ) ≤ b W 2+1 c. Proof Outline. The proof is based on the d W 2+1 e result of [110], with the exception that, in contrast to [110], our definition of timed action updates excludes the entry that assigns don’t care to the timestamp field. Due to this minor difference, the expansion is b W 2+1 c rather than d W 2+1 e. The following lemma presents the worst-case expansion of using both positive and negative encoding, as a function of the scheduling tolerance. The result generalizes Theorem 4.15. Lemma

4.16. If

e(RS CH )

is

the

expansion

of

RS CH ,

and

TOL < 2W ,

then

e(RS CH ) ≤ b W −blog22(TOL)c+1 c. Proof Outline. The proof is similar to the proof of 4.6, but uses the result of Theorem 4.15 for the worst-case expansion when using both positive and negative encoding. In the previous section we presented the B OUNDED R ANGE algorithm (Fig. 4.10), and showed that the expansion of a timed installation using B OUNDED R ANGE is at most dlog2 (2∆)e. Indeed, B OUNDED R ANGE can be used for timed action updates as well, yielding the same expansion. However, we now present an algorithm that allows a lower expansion for timed action updates. The R EDUCED R ANGE algorithm computes a ∆-similar range for a given [T0 , 2W − 1]. As shown in Lemma 4.17, this algorithm has a worst-case expansion of b dlog2 (4∆)e+1 c. If ∆ is high 2 enough, the expansion of the range of R EDUCED R ANGE is roughly half of the expansion produced by B OUNDED R ANGE. However, R EDUCED R ANGE computes a range over a range of 4∆ values, and hence the number of bits required to represent the timestamp in R EDUCED R ANGE is one bit higher than required by B OUNDED R ANGE. Lemma 4.17. If RRE = R EDUCED R ANGE(T0 , ∆,W ), then e(RRE ) ≤ b dlog2 (4∆)e+1 c. 2
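As a small illustration of how the expansion of a timed action update is obtained, the sketch below (again an assumption-laden illustration, reusing the hypothetical prefix_cover helper from the earlier sketch rather than any TimeFlip code) takes the better of the positive and the negative encoding of an extremal range, in the spirit of Theorem 4.15.

    def action_update_expansion(t0, w):
        # e(R) for R = [t0, 2^w - 1]: the minimum of the positive encoding (R itself)
        # and the negative encoding (its complement [0, t0 - 1]); the time-oblivious
        # entry S is not counted, per the definition in Section 4.6.
        positive = len(prefix_cover(t0, (1 << w) - 1, w))
        negative = len(prefix_cover(0, t0 - 1, w)) if t0 > 0 else 1
        return min(positive, negative)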


ReducedRange($T_0, \Delta, W$)
1  $V' \leftarrow \lceil \log_2(4\Delta) \rceil$
2  $T_R \leftarrow T_0 + \Delta - 1$
3  $T_L \leftarrow T_0 - \Delta$
4  if $T_R^{(W-V')+} = T_0^{(W-V')+}$ and $T_L^{(W-V')+} = T_0^{(W-V')+}$
5      return $[T_0^{V'-}, 2^{V'} - 1]^{V'}$
6  else
7      return $[T_0^{V'-}, 2^{V'-1} - 1]^{V'}$

Figure 4.12: Algorithm for finding a reduced range with installation bounds.
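The following is a small Python rendering of the ReducedRange pseudocode above, offered only as a reading aid under stated assumptions: the bit-slicing notation is translated into shifts and masks, wrap-around of the W-bit clock is ignored, and the returned tuple describing the $2^{V'}$-periodic range $[lo, hi]$ is an invention of this sketch.

    def reduced_range(t0, delta, w):
        # V' = ceil(log2(4*Delta)); valid for delta >= 1.
        v = (4 * delta - 1).bit_length()
        t_r = t0 + delta - 1                    # assumes no wrap-around at 2^w
        t_l = t0 - delta
        t0_low = t0 % (1 << v)                  # T0^{V'-}: the V' least significant bits of T0
        same_block = (t_r >> v) == (t0 >> v) and (t_l >> v) == (t0 >> v)
        if same_block:
            return v, t0_low, (1 << v) - 1      # line 5: [T0^{V'-}, 2^{V'} - 1]^{V'}
        return v, t0_low, (1 << (v - 1)) - 1    # line 7: [T0^{V'-}, 2^{V'-1} - 1]^{V'}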

Proof. We denote $\lceil \log_2(4\Delta) \rceil$ by $V'$. We distinguish between two cases:

• The condition on line 4 of the algorithm is true (Fig. 4.13(a)), and ReducedRange returns on line 5. In this case the timed action update is represented by the periodic range $[T_0^{V'-}, 2^{V'} - 1]^{V'}$, which is encoded by the $V'$-bit range $[T_0^{V'-}, 2^{V'} - 1]$. By Theorem 4.15 the worst-case expansion in this case is $\lfloor \frac{V'+1}{2} \rfloor$.

• ReducedRange returns on line 7. We consider two distinct cases:

(i) $T_R^{(W-V')+} = T_0^{(W-V')+}$: in this case (Fig. 4.13(b-i)) the range of line 7, $[T_0^{V'-}, 2^{V'-1} - 1]^{V'}$, is encoded by the $V'$-bit range $[T_0^{V'-}, 2^{V'-1} - 1]$. Since $T_0^{V'-} \leq 2^{V'-1} - 1$, it follows that $T_0^{V'-} = T_0^{(V'-1)-}$. Thus, the range $[T_0^{V'-}, 2^{V'-1} - 1]$ can in fact be represented by the $(V'-1)$-bit range $[T_0^{(V'-1)-}, 2^{V'-1} - 1]$. By Theorem 4.15 the expansion of this $(V'-1)$-bit extremal range is bounded by $\lfloor \frac{(V'-1)+1}{2} \rfloor \leq \lfloor \frac{V'+1}{2} \rfloor$.

(ii) $T_L^{(W-V')+} = T_0^{(W-V')+}$: in this case (Fig. 4.13(b-ii)) the range of line 7, $[T_0^{V'-}, 2^{V'-1} - 1]^{V'}$, is encoded by two sub-ranges, $[T_0^{V'-}, 2^{V'} - 1] \cup [0, 2^{V'-1} - 1]$. Note that the sub-range $[T_0^{V'-}, 2^{V'} - 1]$ includes less than $\Delta$ values, since $T_R^{(W-V')+} \neq T_0^{(W-V')+}$. Thus, $[T_0^{V'-}, 2^{V'} - 1]$ is in fact a $(V'-2)$-bit shifted extremal range, which by Theorem 4.15 has a worst-case expansion of $\lfloor \frac{(V'-2)+1}{2} \rfloor$. The second sub-range, $[0, 2^{V'-1} - 1]$, requires a single entry, and thus we have a worst-case expansion of $\lfloor \frac{(V'-2)+1}{2} \rfloor + 1 \leq \lfloor \frac{V'+1}{2} \rfloor$.

In both cases the worst-case expansion is bounded by $\lfloor \frac{\lceil \log_2(4\Delta) \rceil + 1}{2} \rfloor$.

Figure 4.13: ReducedRange: proof of Lemma 4.17.

4.7

Experimental Evaluation

Our evaluation is composed of two parts: (i) A simulation-based analysis was used to evaluate the resources required for representing TimeFlips, and to verify our analytical results from the previous sections. (ii) A microbenchmark using a commercial switch was used to evaluate the accuracy of timed updates using our approach.

4.7.1

Simulation-based Evaluation

We implemented the Schedule and BoundedRange algorithms, and computed the respective range expansion and the required timestamp bit size in various cases. All of our simulations were performed with $W = 16$. We evaluated the expansion of an extremal range as a function of the scheduling tolerance, TOL. For each value of TOL we simulated all the possible values of $T_{min}$, and the graphs in Fig. 4.14 present both the worst-case expansion and the average expansion (as defined in Sec. 4.5.2). Fig. 4.14a depicts the results for timed installation, i.e., $r(T_0)$, while Fig. 4.14b illustrates the results for timed action updates. It can be seen that the expansion of the latter is roughly half of the former, since timed action updates make use of both the positive and the negative encoding.

(a) Timed installation: expansion as a function of TOL. The theoretical max is based on Lemma 4.6, the average is based on Theorem 4.7. (b) Timed action update: expansion as a function of TOL, using both positive and negative encoding. The theoretical max is based on Lemma 4.16.

Figure 4.14: Expansion as a function of TOL.

Fig. 4.15 depicts the effect of the installation bounds, $\Delta$, on the time range expansion. Schedule was used for computing $T_0$, and BoundedRange was used for selecting the time range. Fig. 4.15a illustrates the expansion for TOL = 1, and includes both the simulated values and the analytical values, based on Lemma 4.9. Fig. 4.15b depicts the worst-case expansion for several values of TOL. The star-shaped markers indicate the points where $TOL = 2^{\lceil \log_2(\Delta) \rceil}$, illustrating that, as stated in Theorem 4.10, if $\Delta$ is small enough, i.e., $TOL \geq 2^{\lceil \log_2(\Delta) \rceil}$, the time range can be represented by a single entry.

(a) Expansion as a function of $\Delta$ for $T_{max} = T_{min}$. Theoretical values are based on Lemma 4.9. (b) The simulated worst-case expansion as a function of $\Delta$ for various values of TOL. The star-shaped markers indicate the points where $TOL = 2^{\lceil \log_2(\Delta) \rceil}$.

Figure 4.15: Expansion as a function of $\Delta$ with BoundedRange in a timed installation.

Fig. 4.16 illustrates the effect of the scheduling tolerance and the installation bounds on the number of bits required to represent the timestamp field. Again, the star-shaped markers indicate the points where $TOL = 2^{\lceil \log_2(\Delta) \rceil}$, and thus by Theorem 4.13, if $\Delta$ has a smaller value than the star-shaped marker, the timestamp field requires only a single bit. Fig. 4.17 compares the BoundedRange algorithm and the ReducedRange algorithm for timed action updates; the latter requires fewer entries (Fig. 4.17a), whereas the former allows the timestamp field to be represented by fewer bits (Fig. 4.17b). The simulations confirm our theoretical results, and demonstrate the tradeoff between the two parameters, TOL and $\Delta$, and the TCAM resource consumption.

4.7.2

Microbenchmark

We performed a microbenchmark on a commercial switch in order to verify two key aspects of TimeFlip: (i) that the method presented in this paper is applicable to real-life switches, and (ii) that the method can effectively provide a high degree of accuracy. As mentioned above, when

Figure 4.16: The number of bits as a function of $\Delta$ for various values of TOL, using BoundedRange in a timed installation. The star-shaped markers indicate the points where $TOL = 2^{\lceil \log_2(\Delta) \rceil}$.

(a) The worst-case expansion as a function of $\Delta$ for $W = 16$ and $T_{max} - T_{min} = 64$. (b) Number of bits as a function of $\Delta$ for $W = 16$ and $T_{max} - T_{min} = 64$.

Figure 4.17: Timed action updates: ReducedRange vs. BoundedRange.


an update is scheduled for time $T_0$, it is performed in practice at some time $t \in [T_0 - \delta, T_0 + \delta]$. The scheduling error, $\delta$, is affected by two factors: the device's clock accuracy, which is the maximal offset between the clock value and the value of an accurate time reference; and the execution accuracy, which is a measure of how accurately the device can perform a timed update, given a clock that is perfectly synchronized to real time. The achievable clock accuracy strongly depends on the network size and topology, and on the clock synchronization method being used. For example, the achievable accuracy using the Precision Time Protocol [21] is typically on the order of 1 µsec [15, 45]. Our microbenchmark is focused on the execution accuracy of time-based TCAM updates.

(a) Experiment setup. (b) Empirical PDF of the time between TimeFlips (x-axis: time between flips, in microseconds).

Figure 4.18: Microbenchmark.

The experiment was performed using an evaluation board of the Marvell 98DX4251 [121] switch silicon. It is important to emphasize that we used the switch as-is, without modifications or extensions. We used a switch silicon evaluation board because it provides more flexible configuration options than an off-the-shelf pizza-box switch. Specifically, the evaluation board allows the flexibility to define the structure of the TCAM key. The TCAM key can be configured to include any of the packet header fields, as well as many metadata fields, including the ingress timestamp. The ingress timestamp is the time at which the packet was received by the switch.


This timestamp is measured using the switch's internal clock, which can be synchronized to other clocks using IEEE 1588 [21]. The 98DX4251 measures the ingress time of every packet that is received by the switch. This ingress timestamp is attached to the packet's internal metadata; thus it does not increase the packet length, and it does not reduce the throughput of the switch.

The experiment setup is illustrated in Fig. 4.18a. We used an IXIA XM12 packet generator, which was connected to ports 0, 1, and 2 of the switch, and was configured to continuously transmit 64-byte packets to port 0 of the switch at full wire speed of 10 Gbps. Thus, a packet was transmitted to the switch every 67.2 ns (nanoseconds). The switch was configured to perform a TCAM lookup on all incoming packets, with the following two entries:

• (in_port = 0, T = $(*, \ldots, *, 1, *^{15})$) → out_port = 1
• (in_port = 0, T = $(*, \ldots, *)$) → out_port = 2

The only unmasked bit in the timestamp field of the first entry was $t_{16} = 1$. T is measured in nanoseconds, and $t_{16}$ has a period of $2^{16}$ ns. Consequently, the two rules produce a periodic behavior in which each rule is matched for a duration of $2^{15}$ ns: the first rule is matched for $2^{15}$ ns, then the second rule is matched for $2^{15}$ ns, and so on. In the context of this experiment, every TimeFlip between out_port = 1 and out_port = 2 is a timed action update.

Our analysis focuses on how accurately the timed action updates occur. To answer this question, we measured the time between two consecutive TimeFlips from out_port = 1 to out_port = 2. We repeated this measurement 50 times, and the empirical Probability Density Function (PDF) of these measurements is illustrated in Fig. 4.18b. The expected mean time interval between TimeFlips was $2^{16} = 65536$ ns. The precision of our measurements was affected by two factors: (i) the packet generator timestamped the incoming packets with a 10 ns resolution, and (ii) the packet generator transmitted a packet exactly every 67.2 ns. Due to these two factors, the precision of the measurement was on the order of tens of nanoseconds. Notably, since the measurement is performed by the packet generator, as a difference between two TimeFlip events, no synchronization is required between the packet generator and the switch.
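To make the periodic matching concrete, here is a toy Python model of the two entries above (not a switch API); it assumes a 1-indexed bit numbering in which bit $t_i$ carries a weight of $2^{i-1}$ ns, so the selected output port alternates every $2^{15}$ ns and flips from port 1 to port 2 once every $2^{16}$ ns.

    def matched_out_port(ingress_ts_ns):
        # First entry matches while t16 is '1' (a 2^15 ns half-cycle of its 2^16 ns
        # period); otherwise the default entry matches.
        t16 = (ingress_ts_ns >> 15) & 1
        return 1 if t16 == 1 else 2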


As shown in Fig. 4.18b, the timed action updates were all performed within tens of nanoseconds of the expected time, which is well within the margin of error of our measurement method. Hence, the execution accuracy in our experiment was no worse than tens of nanoseconds, which is negligible compared to the clock accuracy in a typical network, on the order of 1 µsec. Thus, the microbenchmark indicates that using the method we present in this paper, updates can be timed in a typical network with microsecond accuracy.

Notes. In this experiment we observed that TimeFlip can be implemented by existing switch silicon with sub-1 µsec accuracy at full wire speed. Note that this microbenchmark provides no indication regarding large-scale networks or stress scenarios, and does not necessarily reflect real-life values of TOL and $\Delta$. We did not evaluate the clock accuracy, but we note that the switch we evaluated has hardware support for IEEE 1588 [21], a mature technology that provides sub-microsecond clock accuracy in typical network scenarios (e.g., [44]), with typically fewer than 100 packets per second per port [122], a negligible overhead in high-speed networks.

4.8 Discussion

4.8.1 Scheduling Accuracy

TimeFlip enables accurate scheduling with efficient TCAM resource consumption. As discussed in Section 4.7.2, the accuracy of TimeFlip scheduling is affected by two factors: the execution accuracy and the clock accuracy. Notably, the execution accuracy of TimeFlips is not affected by the scheduling tolerance, TOL, or by the installation bound, $\Delta$, whereas the required resources, namely the number of bits per timestamp and the number of entries per TimeFlip, are affected by these two parameters.

The execution accuracy in our microbenchmark was on the order of tens of nanoseconds (Sec. 4.7.2), and since the clock accuracy in large-scale networks has been shown to be on the order of 1 µsec [44], we deduced that the scheduling accuracy is on the order of 1 µsec. However, an accuracy of 1 µsec may not be sufficient in high-speed networks, where the network latency can be as low as tens of µsec. Fortunately, low network latency allows the synchronization protocol to achieve a higher accuracy [123], and thus in low-latency networks we expect the


scheduling accuracy to be better than 1 µsec.

4.8.2

Timestamp Size in Real-Life

The required size of the timestamp field in the TCAM is a function of TOL and $\Delta$, as per Theorems 4.11 and 4.14. Different SDN applications may yield different TOL values, and the value of $\Delta$ may vary according to the switch type. Therefore, the timestamp size should be designed according to the worst-case values of TOL and $\Delta$. Some switches, such as the 98DX4251 that was used in the experiment of Section 4.7.2, provide the flexibility to determine the size of the timestamp field in the TCAM entry. Clearly, the timestamp field should be as compact as possible, allowing the timestamp to fit into unused spare bits in the TCAM entry. If the timestamp size exceeds the number of spare bits in the TCAM entry, then using the timestamp requires increasing the size of the TCAM entries, causing a respective decrease in the number of TCAM entries. Example 4.18 provides some intuition as to what the timestamp field size should be in typical systems.

Example 4.18. If $\Delta$ is 10 seconds and TOL is 100 milliseconds, then by Theorems 4.11 and 4.14, a 9-bit timestamp can be used to represent any extremal time range with at most nine TCAM entries.

We believe that Example 4.18 presents a pragmatic real-life timestamp size, as even in highly stressed conditions the installation bounds are not expected to exceed 10 seconds, and a TOL of 100 milliseconds should be enough to satisfy urgent update requirements.
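As a worked instance of Example 4.18 (assuming, for concreteness, a nanosecond timestamp resolution, an assumption that is not stated in the example itself): $\Delta = 10\,\mathrm{s} = 10^{10}$ ns and $TOL = 100\,\mathrm{ms} = 10^{8}$ ns give $V = \lceil \log_2(2\Delta) \rceil = 35$ and $X = \lfloor \log_2(TOL) \rfloor = 26$, so by Theorem 4.14 the timestamp field requires $V - X = 9$ bits.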

4.8.3

TCAM Update Performance

Previous work [30] has demonstrated large fluctuations in TCAM rule installation latencies, since a TCAM update often requires multiple TCAM entries to be moved or reordered. These latencies have been shown to vary from a few milliseconds to a few seconds. In the context of our analysis, these high latencies are represented by a high value of $\Delta$. Notably, a high value of $\Delta$ may yield high resource consumption by each TimeFlip, but does not compromise the TimeFlips' execution accuracy.


When the rate of TCAM updates is high, we expect $\Delta$ to have a high value. Thus, a system should be designed assuming a sufficiently high value of $\Delta$, considering the most stressed update scenarios. Based on the analysis of [30, 64], it is safe to assume that $\Delta$ = 10 sec in typical systems (as in Example 4.18), as the worst-case installation latencies have been shown to be on the order of a few seconds. In some cases $\Delta$ may be lower, allowing the TCAM resources required for each TimeFlip to be further reduced.

Another aspect of the update performance is the TCAM's access throughput. Since a TCAM has limited throughput, in some switches TCAM rule installations or updates may temporarily suspend the traffic, causing a slight degradation in the switch's full-wire-speed performance. Each TimeFlip may require a few TCAM entries, which may reduce the TCAM's throughput compared to untimed update approaches.

4.8.4

Timed Updates of Non-TCAM Memories

The concepts presented in this paper can be used for applying timed updates to non-TCAM lookup tables in network devices. We provide an example of performing a timed update in an IP routing table. Assume that at time T0 a set of entries in the routing table should be updated to a new value. As shown in Fig. 4.19, a time-based TCAM range is used for defining the time range T ≥ T0 , and the corresponding action is a version metadata field, indicating whether routing should be performed based on the old version or on the new one. The version value is then used to access the routing table, along with the destination IP address. This approach bears some resemblance to the version tag approach of [22], although our approach uses the version indication internally in the network device, and it is not added to the packet header as in [22].
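As a toy illustration of the mechanism of Fig. 4.19 (the names and data structures below are invented for this sketch and are not a device API), the version tag selected by the time-based match keys the subsequent routing lookup:

    def route(dst_ip, now, t0, routes_old, routes_new):
        # A time-based match on T >= T0 yields a version indication, which is then
        # used, together with the destination IP address, to access the routing table.
        use_new_version = now >= t0
        table = routes_new if use_new_version else routes_old
        return table[dst_ip]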

4.8.5

On the TCAM Encoding Scheme

Using Prefix Encoding

Throughout this paper, the analysis assumes that time ranges are represented by prefix encoding [106] to describe ranges in TCAMs. While this is the simplest and most common coding scheme for TCAM ranges, other schemes might be considered. Although for some specific ranges alternative schemes might achieve improved expansions, this specific scheme has an important advantage: it implies an upper bound of $\lceil \frac{W+1}{2} \rceil$ on the maximal (worst-case) expansion of complementary extremal ranges [110]. As shown in [110], no encoding scheme can achieve a smaller bound on its worst-case expansion. Accordingly, as discussed in Sec. 4.6, the expansion of a timed action update equals this upper bound, i.e., the bound $\lceil \frac{W+1}{2} \rceil$ of the prefix encoding cannot be improved by any alternative scheme.

Figure 4.19: Timed updates in non-TCAM lookups.

Alternative Encoding Schemes

Notwithstanding the above, one might consider using alternative schemes to improve the representation of some specific ranges. We briefly discuss two such encodings: Short Range Gray Encoding (SRGE) [107] and Database Independent Range Pre-Encoding (DIRPE) [124].

SRGE. We show that the analysis and the optimal selection of $T_0$ (Fig. 4.7) apply also to SRGE, a common encoding scheme that relies on Gray code. This means that there are no ranges for which this encoding can improve on the performance of prefix encoding. We rely on the observation that any range of the form $[T_0, 2^W - 1]$ has exactly the same expansion in the two schemes. This is summarized in the following lemma.

Lemma 4.19. A right extremal range $[T_0, 2^W - 1]$ has the same expansion in the prefix encoding and in the SRGE encoding scheme.

Proof. The proof follows directly from the SRGE encoding algorithm. Generally, SRGE splits a range $[s, e]$ into two disjoint parts, $[s, p]$ and $[p+1, e]$, in two disjoint subtrees. It encodes the


shorter of them using prefix encoding, and uses the selection property of the Gray code to replace a digit in these entries by '*' to cover a subset of the same size from the longer part. Then, if required, it completes the encoding of the longer part using at least one additional entry. Given a range whose length is a power of two, both schemes require a single entry. Otherwise, in the case of a right extremal range of the form $[T_0, 2^W - 1]$, the right part is longer than the left and takes the whole relevant subtree. Then, both schemes are composed of the entries in a prefix encoding of the left part together with one additional entry for the right part.

Based on the latter lemma, we can deduce that the selection of $T_{SCH}$ based on the Schedule algorithm is optimal also for the SRGE scheme, and that the average expansion is equal in the two schemes.

DIRPE. Much like TimeFlip, several encoding schemes such as DIRPE make use of the unused bits that are available in TCAM entries due to the non-flexible width of TCAMs. These bits can be used to further reduce the range expansion. To avoid competing for the same resource, it is preferable not to use TimeFlip with these specific schemes. We demonstrate that in practice the observed expansion is often small even when using the prefix encoding or the SRGE encoding, which do not rely on these additional bits.

4.9

Conclusion

We introduced TimeFlip, a practical method for implementing accurate time-based network updates, and a natural implementation of Atomic Bundles, using time-based TCAM ranges. We have shown that under practical conditions, a small number of timestamp bits suffices to accurately perform a TimeFlip using a small number of TCAM entries. At the heart of our analysis lie two properties that are unique to time-based TCAM ranges. First, by carefully choosing the scheduled update time, the range values can be selected so as to minimize the required TCAM resources. Second, if there is a known bound on the installation time of the TCAM entries, then by using periodic time ranges, the expansion of the time range can be significantly reduced. We have shown that TimeFlips work on existing network devices, making accurate time-based updates a viable tool for network management.


Chapter 5

OneClock to Rule Them All: Using Time in Networked Applications

This chapter is an extended version of the paper:

[3] T. Mizrahi and Y. Moses, “OneClock to rule them all: Using time in networked applications,” in IEEE/IFIP Network Operations and Management Symposium (NOMS) miniconference, 2016.

5.1

Abstract

This paper introduces OneClock, a generic approach for using time in networked applications. OneClock provides two basic time-triggered primitives: the ability to schedule an operation at a remote host or device, and the ability to receive feedback about the time at which an event occurred or an operation was executed at a remote host or device. We introduce a novel prediction-based scheduling approach that uses timing information collected at runtime to accurately schedule future operations. Our work includes an extension to the Network Configuration protocol (NETCONF), which enables OneClock in real-life systems. This extension has been published as an Internet Engineering Task Force (IETF) RFC, and a prototype of our NETCONF time extension is publicly available as open source.


Experimental evaluation shows that our prediction-based approach allows accurate scheduling in diverse and heterogeneous environments, with various hardware capabilities and workloads. OneClock is a generic approach that can be applied to any managed device: sensors, actuators, Internet of Things (IoT) devices, routers, or toasters.

5.2 Introduction

5.2.1 Background

Motivation. Various distributed applications require the use of accurate time, including industrial automation systems [35], automotive networks [36], and accurate measurement [37]. Surprisingly, while these different applications typically use standard time synchronization methods (e.g., [21]), there is no standard method for using time, and thus each of these applications uses a proprietary management protocol that invokes time-triggered operations. In this paper we present a generic approach that allows the use of accurate time to manage various diverse devices, from routers to toasters.1

Why NETCONF? A formal announcement by the Internet Engineering Steering Group (IESG), released in March 2014 [126], declared that the IETF is encouraging the use of NETCONF [20], rather than the Simple Network Management Protocol (SNMP) [19]. Indeed, the networking community is quickly shifting from SNMP-based Management Information Bases (MIB) to modules based on YANG [127], the modeling language used by NETCONF. During the writing of this paper, the IETF Active Internet Draft list [128] consisted of 256 drafts that define YANG data models [129], and only 21 drafts that define MIBs. NETCONF and YANG are gaining momentum in the context of various diverse applications, not only in the traditional realm of routers and switches, but also in other applications, such as Virtualized Network Functions [130] and Internet of Things (IoT) devices [131, 132]. NETCONF is being adopted not only by the IETF, but also by other organizations, such as the Open Networking Foundation [42], and the Metro Ethernet Forum [133].

1 Paraphrasing the 25-year old gimmick of the network-managed toaster [125].


We chose to use NETCONF as a baseline for OneClock, due to its increasing adoption rate and diversity of applications and environments.

5.2.2

The OneClock Protocol

In this paper we introduce a generic protocol for using time in networked applications. The protocol is defined as an extension of NETCONF. A full specification of this extension, including an open-source YANG module that defines the extension, has been published as an RFC [134]. Our OneClock extension defines two basic time-related primitives: (i) scheduling: a NETCONF client2 can schedule a Remote Procedure Call (RPC) to be performed by a NETCONF server at a prescribed future time, and (ii) reporting: a NETCONF client can receive feedback about the time of execution of an RPC, or a notification about the time of occurrence of a monitored event. OneClock can be used in various important use cases, such as invoking scheduled operations in diverse applications, taking coordinated snapshots of a system, or performing network-wide atomic commits.

5.2.3

OneClock: Accurate Scheduling

One of the greatest challenges in our approach is to accurately schedule network operations. Even if a managed device (server) keeps an accurate clock, it is difficult to guarantee that scheduled operations are performed very close to their scheduled times. The actual execution time may depend on the processing power of the server, on its operating system, and on its load due to other tasks that run in parallel. We propose a prediction-based approach that allows a client to accurately schedule network operations without prior knowledge about the servers' performance. The approach is based on measuring the Elapsed Time of Execution (ETE) of each RPC, and using previous ETE measurements to predict the next ETE.

2 We follow the NETCONF terminology; managed devices, referred to as servers, are managed by one or more clients.


Figure 5.1: Elapsed Time of Execution (ETE): ETE = $T_e - T_s$.

The ETE is defined to be $T_e - T_s$ (see Fig. 5.2), where $T_s$ is the scheduled start time of the RPC, and $T_e$ is the actual completion time of the RPC. The actual start time of the RPC is denoted by $T_s'$. Hence, as depicted in Fig. 5.1, the ETE is affected by two non-deterministic factors: (i) the server's ability to accurately start the operation, and (ii) the running time of the RPC.

For each scheduled operation (see the numbered steps in Fig. 5.2):3

1. The client predicts the ETE of the next RPC based on previous measurements of the scheduled time and execution time.

2. For a given desired execution time, $T_d$, the client schedules the operation to be performed at $T_d - ETE$.

3. The server reports the actual time of execution, $T_e$, back to the client, allowing the client to use this feedback for scheduling future operations.

Notably, our scheduling approach allows a NETCONF client to accurately schedule network operations in a heterogeneous environment, where the performance of the managed servers is not necessarily known in advance.

5.2.4

Related Work

The use of the time-of-day in network management is a common practice. Time-of-day routing [38] routes traffic to different destinations based on the time-of-day. Scheduled operations [135, 39] allow various policies and configurations to be applied at specific time ranges.

3 We follow the notation of [20], where Remote Procedure Calls are denoted by uppercase RPC, and the messages that carry RPCs are denoted by lowercase rpc.


Figure 5.2: Prediction-based scheduling: by predicting the ETE, a client can control when the RPC will be completed.

The work of [8] analyzed the use of timed path updates in Software Defined Networks. This paper introduces a more general framework that allows time-triggered operations in any managed network device, and enables accurate scheduling of network operations in a heterogeneous environment. The work of [136] suggested a method for accurate scheduling in switches and routers using Ternary Content Addressable Memories (TCAM). Our scheduling scheme is more generic, as it makes no assumption about the hardware of the managed devices. The literature is rich with works that analyze and predict program running times, e.g., [137, 138, 139, 140, 141]. In this paper we use time series analysis to predict the execution time of a remote operation, allowing accurate scheduling of the requested operation.

5.2.5

Contributions

The main contributions of this paper are:

• We introduce OneClock, a generic approach for using time in networked applications. OneClock defines two basic primitives, schedule and report. Several use cases that demonstrate the merits of OneClock are presented.

• We present a scheduling approach that allows accurate scheduling by predicting the server's execution time. We analyze three prediction algorithms: two average-based algorithms, and a Kalman-Filter-based algorithm.

• We define a OneClock extension to NETCONF, which has been published as an IETF RFC.

• We have implemented a prototype of the NETCONF time extension. Our prototype is available as open source. Our experimental evaluation demonstrates how accurately events can be scheduled over a network.

5.3

Using OneClock in Practice

In this section we describe three use cases that illustrate how the two time-triggered primitives, scheduling and reporting, can be used in distributed systems.

5.3.1

Coordinated Operation

It is often desirable to coordinate a set of events or operations that should take place at different nodes in the system at the same time,4 or should occur according to a specific order. The schedule primitive can be used to coordinate events occurring in actuators in a factory product line [35], to coordinate a routing change in a network [142], or to orchestrate events in scientific experiments [143]. Using OneClock, a client can schedule a simultaneous event at multiple servers, or define a sequence of scheduled times that determine the order and relative timing of events.

5.3.2

Coordinated Snapshot

In many applications it is desirable to monitor events or statistics with respect to a common time reference. A client can perform a coordinated snapshot, i.e., capture the state of a monitored attribute at all the servers at the same time. While a simultaneous snapshot does not produce a consistent

4 In practical systems it is typically not possible to coordinate events to be performed exactly at the same time at different nodes. Throughout the paper, the term 'same time' should be read as 'same time within the accuracy limitations of the servers'.

(a) Coordinated operations: all servers perform the operation at the same time, T. (b) Coordinated snapshot: all servers send their state to the client at the same time.

Figure 5.3: Coordinated operations and coordinated snapshots.

distributed snapshot [144], it provides a coordinated snapshot of the servers' state. For example, when collecting statistics from all the servers, it is most useful to capture the information at the same time in all servers. In power grid networks [43], synchrophasor measurements are used for monitoring the operation of a power grid. These measurements must be synchronized, so as to allow correct system-wide processing. OneClock enables coordinated snapshots; using the schedule primitive, a client can schedule a NETCONF get-config operation [20] to be taken at time T, causing all the servers to send their response at the same time (Fig. 5.3b).

5.3.3

Network-wide Atomic Commit

The NETCONF commit [20] is an RPC that commits the candidate configuration, i.e., copies the candidate configuration to the running configuration. This operation allows the client to prepare a set of configuration updates in the candidate datastore, and then apply them at once with the commit operation. It is often desirable to perform a network-wide atomic commit, where either all the servers successfully perform the commit operation, or if some of the servers are not able to perform the commit, then none of the servers perform it. Atomic commits can be performed using NETCONF without our time extension, but potentially at the cost of a temporary state of inconsistency, where different servers use different

(a) Scheduled RPC. (b) Reporting the execution time. (c) Reporting the execution time of a scheduled RPC. (d) Scheduled RPC with notification.

Figure 5.4: The time capability in NETCONF.

configuration versions (Fig. 5.5). This can be done using the NETCONF confirmed commit procedure. This procedure requires two steps: (i) the client sends a first commit message to all the servers, causing them to switch to the candidate configuration, and (ii) the client sends a confirming commit message to all the servers, finalizing the commit procedure. If the two phases are not completed successfully, or if the client cancels the commit, the servers roll back to the previous configuration.

Figure 5.5: Atomic commit: (a) NETCONF confirmed commit, without using time. (b) Time-triggered commit.

The OneClock schedule primitive enables a clean and straightforward approach to network-


wide commits, as illustrated in Fig. 5.5. The client sends a scheduled commit message to the servers, to be performed at a future time T . If some of the servers fail to schedule the commit operation, the client can cancel the commit before time T , leaving all the servers at the current configuration.
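The following Python sketch illustrates this time-triggered commit procedure. The client object and its methods are placeholders invented for this sketch; they are not the NETCONF API and not our prototype's interface.

    def timed_atomic_commit(client, servers, t_commit):
        # Schedule a commit at every server for time T; if any server fails to
        # schedule it, cancel the already-scheduled commits before T, so that all
        # servers remain at the current configuration (Fig. 5.5(b)).
        scheduled = []
        for srv in servers:
            ok = client.schedule_commit(srv, scheduled_time=t_commit)
            if not ok:
                for s in scheduled:
                    client.cancel_schedule(s)
                return False
            scheduled.append(srv)
        return True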

5.4

NETCONF Time Extension

We introduce an extension to the NETCONF protocol that allows time-triggered operations. The extension is defined as a new capability [20]. Details are presented in [134].

5.4.1

Overview

The time capability provides two main functions:

• Scheduling. When a client sends an rpc message to a server, the message may include the scheduled-time parameter, denoted by $T_s$ in Fig. 5.4a. The server then starts to execute the RPC as close as possible to the scheduled time $T_s$, and once completed the server can respond with an rpc-reply message.

• Reporting. When a client sends an rpc message to a server, the message may include a get-time element (see Fig. 5.4b), requesting the server to return the execution time of the RPC. In this case, after the server performs the RPC it responds with an rpc-reply that includes the execution-time parameter, specifying the time $T_e$ at which the RPC was completed.

The two scenarios discussed above imply that a third scenario can also be supported (Fig. 5.4c), where the client sends an rpc message that includes a scheduled time, $T_s$, as well as the get-time element. This allows the client to receive feedback about the actual execution time, $T_e$. Ideally, $T_s = T_e$. However, the server may execute the RPC at a slightly different time than $T_s$, for example if the server is tied up with other tasks at time $T_s$.

The report abstraction, presented in Sec. 5.2.2, allows the client to receive information about the execution time of an RPC, or to receive notifications about the time of occurrence of events.


The former can be implemented using the get-time procedure we defined, while the latter is already supported in NETCONF by using notifications that include the eventTime parameter [145].

5.4.2

Applying the Time Primitives to Various Applications

The time capability specification we defined [134] includes a YANG module that adds the two new primitives, schedule and report, as two parameters in all the RPC types defined in [20]. For example, this YANG module enables scheduled commit and scheduled set-config RPCs. Notably, the time primitives are not limited to the RPCs defined in [20]. If a new YANG module defines a new RPC, the module can include the time parameters, allowing the new RPC to use the time primitives. Our open source code includes two such examples:

• We enhanced the well-known toaster YANG module [146] by allowing the make-toast operation to be a scheduled RPC.

• We created a new YANG module called test, which triggers the server to perform a configurable command line. Using the schedule parameter, the test RPC can be used as a remote variant of the well-known Cron [147] command in Linux.

The report primitive can be used not only by applying the get-time parameter, but also by other means that are inherently possible when using NETCONF. The time of occurrence of important events can be sent to the client using a NETCONF notification [145], or can be included in the NETCONF data model. For example, the YANG data model that defines a log entry may include the time-of-day in each log entry.

5.4.3

Notifications and Cancellation Messages

Notifications

As illustrated in Fig. 5.4a, after a scheduled RPC is executed the server sends an rpc-reply. The rpc-reply may arrive a long period of time after the rpc message was sent by the client,


leaving the client without a clear indication of whether the rpc was received. Therefore, we define an optional netconf-scheduled-message notification (Fig. 5.4d), which provides an immediate acknowledgment of the scheduled RPC. As illustrated in Fig. 5.4d, when the server receives a scheduled RPC it sends a notification that includes the message-id of the scheduled RPC.

Cancellation Messages

A client can cancel a scheduled RPC by sending a cancel-schedule RPC (Fig. 5.6). The cancel-schedule RPC, defined in our time capability extension [134], can be used to enforce the coordinated

network-wide commit described in Sec. 5.3.3.

Figure 5.6: Cancellation message.

5.4.4

Clock Synchronization

The time capability we defined requires clients and servers to maintain clocks. It is assumed that clocks are synchronized by a clock synchronization method, e.g., [119, 21].

5.4.5

Acceptable Scheduling Range

A server that receives a message that is scheduled to be performed at time $T_s$ verifies that the value $T_s$ is not too far in the past or in the future. As illustrated in Fig. 5.7, the server verifies that $T_s$ is within the acceptable scheduling range. If $T_s$ occurs in the past and within the acceptable scheduling range, the server performs the RPC as soon as possible.


The scheduling bound defined by sched-max-future guarantees that every scheduled RPC is restricted to a near future scheduling time, on the order of seconds, and not on the order of hours or days. This restriction significantly reduces the impact of potential coherency problems that may result from server failures, or from multiple clients trying to schedule conflicting operations.
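A minimal sketch of this check (illustrative only; the parameter names follow the text, but the function itself is not part of the specification):

    def in_acceptable_range(t_s, now, sched_max_past, sched_max_future):
        # A scheduled time Ts is accepted only if it falls within
        # [now - sched-max-past, now + sched-max-future], per Fig. 5.7.
        return now - sched_max_past <= t_s <= now + sched_max_future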

5.5

Prediction-based Scheduling

Our scheduling approach is based on using previous measurements of the Elapsed Time of Execution (ETE). Based on the ETE measurements, the client uses the prediction approach illustrated in Fig. 5.8. The prediction approach consists of three steps:

1. When a scheduled operation is required to take place at time $T_d$, the client uses previous ETE measurements, $x[1], \ldots, x[n-1]$, to predict the next ETE, denoted by $s[n|n-1]$.

2. The next scheduled time is $T_s = T_d - s[n|n-1]$.

3. The client updates its measurement set based on the feedback received from the server about the execution time $T_e$.

Figure 5.7: Acceptable scheduling range: defined by two configurable parameters, sched-max-future and sched-max-past.

Figure 5.8: Prediction-based scheduling approach.

In the rest of this section we describe the two main components of our scheduling approach: the ETE measurements and the ETE prediction.
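A compact sketch tying the three steps of Fig. 5.8 together (the client object, its send_scheduled_rpc method, and predict_ete are placeholders invented for this sketch, not a real NETCONF client API):

    def scheduled_rpc(client, server, rpc, t_desired, ete_history, predict_ete):
        ete_hat = predict_ete(ete_history)            # step 1: predict the next ETE, s[n|n-1]
        t_s = t_desired - ete_hat                     # step 2: Ts = Td - s[n|n-1]
        t_e = client.send_scheduled_rpc(server, rpc, scheduled_time=t_s)
        ete_history.append(t_e - t_s)                 # step 3: record the measured ETE, x[n]
        return t_e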

5.5.1

ETE Measurements

We analyze two measurement methods:

Periodic probing. This approach uses ETE measurements that are taken periodically at a constant frequency. In systems that require periodic operations, these ETE measurements are inherently available. In other systems, the client can proactively send periodic scheduled RPCs to every server in order to probe the ETE. The main drawback of periodic probing is that it can potentially consume unnecessary resources, both at the client and at the server.

Burst probing. The second approach uses an on-demand burst of probe RPCs; when a scheduled RPC is required, the client initiates a burst of N scheduled RPCs, performed at a fixed frequency. This approach does not require resource consumption in the absence of actual scheduled RPC requests, but the prediction is potentially less accurate, since it is based on a smaller number of measurements.

Note that both approaches require the probe RPCs to be similar in terms of performance and running time to the future RPC for which the prediction is required.

5.5.2

ETE Prediction Algorithms

We analyzed three prediction algorithms, Average, FT-Average, and Kalman. We now describe these algorithms.


Baseline. The baseline for comparison in our evaluation is the simplest approach, which assumes $s[n] = 0$, and therefore assigns $T_s = T_d$. In this approach the prediction error is equal to the ETE.

Average Algorithm. The Average algorithm performs an average of the last N measurements:

$s[n|n-1] = \frac{1}{N} \sum_{j=1}^{N} x[n-j]$    (5.1)

Fault-tolerant Average (FT-Average) Algorithm. The Fault-tolerant Average [148] performs an average of the last N measurement samples, after ignoring the highest and the lowest measurement values. Hence, this approach masks the most noisy or erroneous measurement of the N samples.

$s[n|n-1] = \begin{cases} \frac{1}{N} \sum_{j=1}^{N} x[n-j], & \text{if } N < 3 \\ \frac{1}{N-2} \left( \sum_{j=1}^{N} x[n-j] - \max_{1 \le j \le N} x[n-j] - \min_{1 \le j \le N} x[n-j] \right), & \text{otherwise} \end{cases}$    (5.2)
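The two average-based predictors translate directly into code. The following Python sketch (an illustration with hypothetical function names, not part of the prototype) mirrors Eq. (5.1) and Eq. (5.2):

    def average(x, n_win):
        # Eq. (5.1): mean of the last N ETE measurements.
        window = x[-n_win:]
        return sum(window) / len(window)

    def ft_average(x, n_win):
        # Eq. (5.2): mean of the last N measurements after discarding the single
        # highest and single lowest sample (for N >= 3).
        window = x[-n_win:]
        if len(window) < 3:
            return sum(window) / len(window)
        return (sum(window) - max(window) - min(window)) / (len(window) - 2)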

Kalman Filtering Algorithm. Kalman Filtering [149] is one of the most well-known data fusion and estimation methods. One of its significant advantages is that it is the optimal estimator in systems with white Gaussian noise. The algorithm we use is a one-dimensional Kalman Filter. Our terminology and notations are based on the standard literature, e.g., [150].

x[n]       The observed ETE of the nth sample.
s[n]       The estimated ETE at n, given the measurements up to n.
s[n|n-1]   The estimated ETE at n, given the measurements up to n-1.
w[n]       The ETE signal noise of the nth sample.
v[n]       The measurement noise of the nth sample.
P[n]       The estimated variance of the ETE.
P[n|n-1]   The estimated variance at n, given the measurements up to n-1.
K[n]       The Kalman gain.
W[n]       The estimated variance of w[n], given the measurements up to n-1.
V[n]       The estimated variance of v[n], given the measurements up to n-1.

Table 5.1: Kalman Filter notations.

Modeling the system. In the general Kalman Filtering model, the system equation is $s[n] = F \cdot s[n-1] + w[n]$, where $F$ is the state transition coefficient. In our context the client does not have any information about how the ETE changes as a function of time, and therefore it is assumed that the state transition coefficient is 1. Hence, the Kalman system equation is given by Eq. 5.3.

$s[n] = s[n-1] + w[n]$    (5.3)

The Kalman observation equation is given by:

$x[n] = s[n] + v[n]$    (5.4)

Based on the two equations above, we present the prediction equations and the update equations, which are the core of the Kalman Filtering algorithm.

Prediction equations. The client uses the prediction equations in step 1 of Fig. 5.8 to estimate the next ETE based on the first $n-1$ measurements.

$s[n|n-1] = s[n-1]$    (5.5)

$P[n|n-1] = P[n-1] + W[n]$    (5.6)

Update equations. The client uses the update equations in step 3 of Fig. 5.8 to update its state based on the new measurement, $x[n]$.

$s[n] = s[n|n-1] + K[n]\,(x[n] - s[n|n-1])$    (5.7)

$K[n] = \frac{P[n|n-1]}{P[n|n-1] + V[n]}$    (5.8)

$P[n] = (1 - K[n])\,P[n|n-1]$    (5.9)

Variance estimation. $W[n]$ is defined to be the estimated variance of $w[n]$, and $V[n]$ is the estimated variance of $v[n]$. By Eq. 5.3 and Eq. 5.4, we have $w[n] = s[n] - s[n-1]$, and $v[n] = x[n] - s[n]$. Hence, the variance of $w[n]$ and $v[n]$ can be estimated by the sample variance using the last N values of $x[\cdot]$ and $s[\cdot]$, as follows:

$W[n] = \frac{1}{N} \sum_{i=1}^{N} \left( (s[n-i] - s[n-i-1]) - \frac{1}{N} \sum_{j=1}^{N} (s[n-j] - s[n-j-1]) \right)^2$    (5.10)

$V[n] = \frac{1}{N} \sum_{i=1}^{N} \left( (x[n-i] - s[n-i]) - \frac{1}{N} \sum_{j=1}^{N} (x[n-j] - s[n-j]) \right)^2$    (5.11)
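For concreteness, a single predict-and-update step of this one-dimensional filter can be sketched as follows (illustrative Python, not the prototype's code; it assumes the noise variances W[n] and V[n] have already been estimated per Eqs. 5.10 and 5.11):

    def kalman_step(s_prev, p_prev, x_new, w_var, v_var):
        # Prediction (Eqs. 5.5-5.6)
        s_pred = s_prev
        p_pred = p_prev + w_var
        # Update (Eqs. 5.7-5.9)
        k = p_pred / (p_pred + v_var)
        s_new = s_pred + k * (x_new - s_pred)
        p_new = (1.0 - k) * p_pred
        return s_pred, s_new, p_new   # s_pred is the ETE prediction s[n|n-1]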

(a) Performance on different platforms. (b) Periodic measurement. (c) Bursty measurement. (y-axis: mean absolute prediction error, in seconds.)

Figure 5.9: Performance on various machine types (a). Type V machines were used in (b) and (c).

5.6

Evaluation

5.6.1

Background

We implemented a prototype of the NETCONF time capability. The prototype was implemented as an extension to OpenYuma [151], a NETCONF software implementation written in C over Linux. Our code is publicly available as open source [152].

Goal. The goal of the experiments was to evaluate our prediction-based scheduling approach over various machines, platforms, and under various workloads.

Method. We evaluated the three prediction algorithms (Sec. 5.5) on Linux-based servers in two academic testbeds, Emulab [24] and DeterLab [25], and in two public cloud platforms, Microsoft Azure [153] and Amazon Web Services (AWS) [154]. Our measurements were performed on over 100 servers, for a total duration of over 5000 hours, summing up to over 3 million measurement samples.

Our results are based on measurements that were performed using a commit RPC on the well-known toaster YANG module [146]. In each experiment a NETCONF client sent scheduled RPC messages to a server, and the client recorded the $T_s$ and $T_e$ values. The experiments produced log files (at the client) containing $T_s$ and $T_e$ values, and then the three prediction algorithms were run

offline.5 The three algorithms were run with N = 8 in most of the runs6, except for specific runs in which the value of N was different, as described below. We quantify the accuracy of our prediction by observing the mean absolute prediction error. The prediction error of an RPC is defined as the difference between the predicted ETE and the measured ETE.

(a) Performance on a shared machine with 20 VMs, compared to machines with one VM. (b) Performance on a stressed server compared to an unstressed server. (c) Occasional error spikes during a stress experiment.

Figure 5.10: Instantaneous prediction error viewed over a 150-second period. The behavior shows peaks under synthetic workload. (a) was measured on Azure, and (b), (c) on Type V machines.

5.6.2

Experiment I: Performance on different platforms

In this experiment (Fig. 5.9a) we compared the prediction error of the three prediction algorithms on various server types. The prediction in this experiment was based on periodic sampling, with a measurement period of 8 seconds.7 The list of servers we tested is presented in Table 5.2. As shown in Fig. 5.9a, the prediction algorithms significantly reduced the prediction error compared to the baseline approach. The experiment shows that in most of the cases FT-Average produces the lowest error.

5 In the current prototype we have not integrated the prediction algorithm logic into the NETCONF client. The prediction algorithms were run offline on the log files of the NETCONF client.
6 N is the number of measurement samples used in each prediction computation. For further details see Sec. 5.5.
7 The measurement period is the elapsed time between two consecutive measurements.

Type  Description                                    Platform / class
I     Public cloud (shared tenancy), 1GB memory      Amazon / t2.micro
II    Public cloud (shared tenancy), 768MB memory    Azure / A0
III   Xeon E3 LP 2.4 GHz, 16GB memory                DeterLab / MicroCloud
IV    Xeon 2.1 GHz, 4GB memory                       DeterLab / pc2133
V     Quad Core Xeon E5530 2.4 GHz, 12GB memory      Emulab / d710
VI    Dual Core Opteron 1.8 GHz, 4GB memory          DeterLab / bvx2200

Table 5.2: Machine types.

5.6.3

Experiment II: Periodic vs. bursty measurement

We compared periodic measurement and burst-based measurement (see Fig. 5.9b and 5.9c). The periodic measurement was performed at various measurement periods, and the burst measurement was performed with various burst sizes, and with a fixed period of one measurement per second. We note that in this experiment the error produced by the three algorithms is very similar. Interestingly, the results show that a burst of 4 samples suffices to produce similar results to a periodic measurement. In the periodic measurement (Fig. 5.9b) the lowest prediction error was achieved with a period of one measurement per second. We were not able to test lower measurement periods due to a performance limitation in the NETCONF client we used. Another interesting observation is that when the measurement period was on the order of one minute or more we observed slightly higher ETE values (Fig. 5.9b) than when the measurement period was on the order of a few seconds. This can be explained by the server’s cache policy, which allows better performance for operations that are performed frequently.

5.6.4

Experiment III: Performance under synthetic workload

In this experiment we studied the prediction error in stressed NETCONF servers, compared to the error in unstressed NETCONF servers. We used two methods to stress the machines: (i) We used the lookbusy [155] utility to inject synthetic workload (Fig. 5.10b). We configured the utility to run at a CPU utilization of 95% and at a memory utilization of 95%. (ii) We used the


Azure platform to run multiple VMs on the same physical machine (Fig. 5.10a). We ran 20 VMs on the same machine, where one of the VMs was the NETCONF server. During the stress experiments we observed that most of the ETE measurements were unaffected by the stress, but as depicted in Fig. 5.10a and 5.10b, there were occasional spikes in the ETE, causing temporary high prediction error. As shown in Fig. 5.10a and 5.10b, prediction error of the FT-Average algorithm during the ETE spikes is slightly lower than the baseline error, and during most of the run the prediction error of the FT-Average is significantly lower than the baseline error. Fig. 5.10c compares the three prediction approaches during an ETE spike. As illustrated in Fig. 5.10c, the FT-Average algorithm was the most resilient to these spikes, as it ignores the maximal and minimal measurement samples, and thus ignores the peak ETE value. As depicted in the figure, the two other algorithms were more sensitive to these spikes.

5.7 Discussion

Prediction method. As discussed in the previous sections, we analyzed three prediction algorithms. Kalman filtering was used as it is one of the most celebrated and popular data fusion algorithms. The Average approach was chosen due to its simplicity, and FT-Average due to its resilience to occasional isolated noisy measurements. The experimental results show that the prediction error offered by the three algorithms is similar, and is significantly lower than the baseline error. The FT-Average approach showed slightly lower prediction error in most of the experiments. FT-Average is especially advantageous in the presence of occasional spikes in the ETE, as it inherently ignores the erroneous measurement. Interestingly, even a short burst of N = 4 measurements allows the simple FT-Average algorithm to predict the ETE with a very low prediction error.

Measurement period. The measurement period is the elapsed time between two consecutive measurements. Frequent measurements may be more sensitive to changes in the ETE, and allow a more accurate prediction. On the other hand, if measurements are performed too frequently they may affect the server's performance. Since the client continuously monitors the prediction error, it can dynamically change the measurement period for each server to improve the prediction error. Thus, an interesting extension to our work would be to implement an algorithm that dynamically changes the measurement period for each server.

RESTCONF. An interesting next step would be to extend the scope of our work and apply it to the emerging RESTCONF [156]. This work would be especially interesting in the context of resource-constrained servers, such as IoT devices.

Multiple RPC types. The ETE of an RPC depends on the RPC type. Thus, the prediction method should be used on a per-RPC-type basis. A possible extension of our work would be to consider how to predict the ETE of RPC type A using measurements of RPC type B.

Time zone issues. The client and servers may be spread across multiple time zones. The NETCONF date-and-time format specifies the time zone for each timestamp, thereby avoiding ambiguity in the timestamp value. Moreover, to avoid problems that may arise during Daylight Saving Time (DST) changes, the client can invoke scheduled RPCs using the UTC time zone, which is not subject to DST changes (see the short example at the end of this section).

Impact of the network delay on the measurements. When a client sends a scheduled RPC message, the message must be sent in advance, allowing it to arrive at the server before the scheduled time. Thus, as the network delay increases, the client must send the scheduled RPC sooner. In our experiments we considered the network delay when planning the time at which RPCs are sent. Note that the ETE is a metric of the servers' performance, and is not affected by the network delay.

Inter-RPC influence. The ETE of an RPC may be affected by other RPCs that are running in parallel, or are scheduled to run in parallel. The prediction approach presented in this paper estimates the RPC's execution time given that the server is subject to workload from other tasks or other RPCs running in parallel. In our evaluation we mimicked these scenarios by synthetically creating additional workload on the servers.
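As a small illustration of the time-zone point above (a hedged sketch; the timestamp values are made up and the exact NETCONF date-and-time rendering may differ slightly), a client in a UTC+3 time zone can convert a local scheduled time to UTC before placing it in a scheduled RPC:

```python
from datetime import datetime, timezone, timedelta

# Schedule an RPC for 09:00 local time in a UTC+3 zone, but express the
# scheduled time in UTC so that DST changes cannot shift its meaning.
local_time = datetime(2016, 7, 14, 9, 0, 0, tzinfo=timezone(timedelta(hours=3)))
scheduled_utc = local_time.astimezone(timezone.utc)
print(scheduled_utc.isoformat())  # 2016-07-14T06:00:00+00:00
```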

5.8 Conclusion

OneClock is a generic approach for using accurate time in distributed applications. As NETCONF is gaining momentum and penetrating various new network applications, OneClock seems like a natural extension that can add the time dimension to network configuration and management. We analyzed three prediction algorithms, and found the simple FT-Average to be the most accurate algorithm in most of the experiments. Our experimental evaluation confirms that prediction-based scheduling provides a high degree of accuracy in various diverse environments, decreasing the prediction error by an order of magnitude compared to the naïve baseline approach.

5.9 Acknowledgments

We gratefully acknowledge Alon Schneider and Eylon Egozi for their help with the prototype implementation. We thank Amazon for supporting our AWS experiments by an AWS in Education Research Grant award. We gratefully acknowledge the Emulab project [24] and the DeterLab project [25] for the opportunity to perform our experiments on their testbeds. This work was supported in part by the ISF grant 1520/11.

Chapter 6

ReversePTP: A Clock Synchronization Scheme for Software Defined Networks

This chapter is a preprinted version of the paper: [5] T. Mizrahi, E. Saat and Y. Moses, “ReversePTP: A clock synchronization scheme for software defined networks,” International Journal of Network Management (IJNM), accepted, 2016. Preliminary versions of this paper appeared in HotSDN 2014 [11] and in IEEE ISPCS 2014 [10].

6.1 Abstract

We introduce ReversePTP, a clock synchronization scheme for Software Defined Networks (SDN). ReversePTP is based on the Precision Time Protocol (PTP), but is conceptually reversed; in ReversePTP all nodes (switches) in the network distribute timing information to a single software-based central node (the SDN controller), which tracks the state of all the clocks in the network. Hence, all computations and bookkeeping are performed by the central node, whereas the ‘dumb’ switches are only required to periodically send it their current time. In accordance with the SDN paradigm, the ‘brain’ is implemented in software, making ReversePTP flexible and programmable from an SDN programmer’s perspective. We present the ReversePTP architecture and discuss how it can be used in typical SDN architectures. Our experimental evaluation of a network with 34 ReversePTP-enabled nodes demonstrates the effectiveness and scalability of using ReversePTP.

6.2 Introduction

6.2.1 Background

Software Defined Networking (SDN) is an architecture in which a network is managed by a centralized controller. The controller provides an Application Programming Interface (API) that allows SDN programmers to manage the network using a high-level programming language. The SDN approach defines a clear distinction between the data plane and the control plane; on the data plane forwarding decisions are taken locally by each switch in the network, while the control plane is managed by a centralized entity called the controller, overcoming the need for complicated distributed control protocols and providing the network operator with powerful and efficient tools to manage the data plane.

In recent years, as accurate time synchronization has become an accessible and affordable tool, it is used in various different applications; mobile backhaul networks use clock synchronization protocols to synchronize mobile base stations [15]. Google’s Spanner [16] uses synchronized clocks as a tool for synchronizing a distributed database. Industrial automation systems [35] use synchronized clocks to allow deterministic response times of machines to external events, and to enforce coordinated orchestration in factory product lines. The Time Sensitive Networking (TSN) technology [36] is used in automotive networks and in audio/video networks. The well-known nuclear research labs at CERN use state-of-the-art synchronization [37], allowing sub-nanosecond accuracy in response to management messages that control the experiments.

The rise of SDN raises two interesting use cases for accurate clock synchronization:

• Distributing time over SDN-enabled networks. Accurate time is required in various environments in which SDN is being considered (e.g., [60]). Hence, an SDN can be used as a vessel for distributing accurate time between endpoints. Notably, accurate synchronization between endpoints requires the SDN-enabled network equipment to take part in the synchronization protocol.

• Timed network operations. A recently introduced approach [13, 1, 7] suggests the usage of accurate time to coordinate network updates in an OpenFlow-enabled network in a simple and scalable manner, reducing packet loss and anomalies during configuration or topology changes. Accurate time can be a useful tool in various scenarios in the dynamic SDN setting, e.g., coordinated topology changes, resource allocation updates, and accurate timestamping for event logs and statistics collection. Our previous work in this context includes an extension [73] to the OpenFlow [41, 67] protocol that enables timed operations using OpenFlow. This new feature has been approved by the Open Networking Foundation (ONF) and integrated into OpenFlow 1.5 [67], and into the OpenFlow 1.3.x extension package [68].

A network that either distributes time between its endpoints or makes use of timed operations requires a clock synchronization protocol. The Precision Time Protocol (PTP), defined in the IEEE 1588-2008 standard [21], is a natural candidate, as it can provide a very high degree of accuracy, typically on the order of 1 microsecond (e.g. [44]) or less, and is widely supported in switch silicons.

The challenge of using a standard synchronization protocol such as PTP in an SDN environment lies in the fundamental difference between these two technologies. A key property of SDN is its centralized control plane, whereas PTP is a decentralized control protocol; a master clock is elected by the Best Master Clock Algorithm (BMCA), and each of the slaves runs a complex clock servo algorithm that continuously computes the accurate time based on the protocol messages received from the master clock. Thus, if SDN switches function as PTP slaves, then in contrast to the SDN philosophy they are required to run complex algorithmic functionality, and to exchange control messages with other switches. Indeed, a hybrid [67] approach can be taken, where the SDN operates alongside traditional control-plane protocols such as PTP. Our approach is to adapt PTP to the SDN philosophy by shifting the core of its functionality to the controller.

6.2.2 ReversePTP in a Nutshell

In this paper we introduce ReversePTP, a novel approach that addresses the above challenge; in contrast to the conventional PTP paradigm, where a master clock distributes its time to multiple slave clocks (Fig. 6.1a), in ReversePTP (Fig. 6.1b) there is a single node (the SDN controller) that runs multiple instances of a PTP slave and multiple masters (the switches). Every switch runs a separate instance of PTP with the controller, where the switch is the master and the controller is the slave. Hence, every switch distributes its time to the controller. The controller keeps track of the offset, skew, and drift of each of the switches’ clocks with respect to the controller’s clock. This relieves the switches of any complicated computations, and does not require message exchange between switches, as all protocol messages are exchanged with the controller.

Figure 6.1: Time distribution in PTP and ReversePTP. (a) Conventional PTP. (b) ReversePTP-enabled SDN.

The significant advantage of ReversePTP is that the complex algorithmic functionality that acquires the accurate time using the information received in the PTP messages is implemented in the controller. This algorithmic logic can be modified or reprogrammed at the controller without upgrading switches in the network, or can be dynamically tuned and adapted to the topology and behavior of the network based on a network-wide perspective. Moreover, an operator that uses switches from different vendors may experience inconsistent behavior when using conventional PTP, as different switches may use different clock servo algorithms, while ReversePTP guarantees uniform behavior as all the algorithmic logic runs on the controller.


Notably, the main difference between PTP and ReversePTP is the direction of time distribution; in ReversePTP time is distributed in an all-to-one fashion, in contrast to PTP’s one-to-all nature. Hence, the accuracy of conventional PTP and ReversePTP should be similar, given that all other aspects of the network are the same. ReversePTP is defined as an IEEE 1588 profile.^1 Since switches function as conventional PTP masters, our approach is applicable to existing implementations of PTP-enabled switches. We note that the ReversePTP approach can be applied to other synchronization protocols, e.g., the Network Time Protocol (NTP) [119]. In a precise sense, ReversePTP is not a clock synchronization protocol. Instead, ReversePTP is a tool for coordinating synchronized actions in an SDN environment. All the timing information is maintained in the ‘brain’ of the network, the controller, whereas the ‘dumb’ SDN switches do not need to be synchronized.

^1 A profile [21] is a specific selection of features and modes of PTP.

6.2.3 Related Work

In preliminary versions of this paper [10, 11] we introduced ReversePTP and its main principles. The current paper describes ReversePTP in detail, and includes detailed experimental evaluation results.

A topic that has been thoroughly studied in the literature is software-based implementation of accurate clock synchronization [157, 158], not to be confused with software-defined networking (SDN), which is the network architecture that the current paper focuses on. The PTIDES project [159, 160] defines a programming model for time-aware programs. The current paper presents ReversePTP, and focuses on how time-aware SDN applications can use it. The programming model is beyond the scope of this paper.

In conventional synchronization protocols, multiple time sources are sometimes used to improve the accuracy and security of the protocol [119], or for redundancy, allowing fast recovery when the primary master fails [161]. Contrary to conventional synchronization protocols, ReversePTP is not used for clock synchronization, but for many-to-one time distribution, and for coordinating actions at different sites in a timely manner.

6.2.4 Contributions

The main contributions of this paper are as follows.

• We introduce ReversePTP. To the best of our knowledge, our work is the first to present a clock synchronization scheme for SDN.

• We show that ReversePTP can be defined as a PTP profile, i.e., a subset of the features of PTP. Consequently, ReversePTP can be implemented by existing PTP-enabled switches.

• We show that ReversePTP is applicable to two main scenarios: (i) an SDN that uses time to coordinate configuration updates and notifications, and (ii) an SDN that distributes accurate time between its endpoints or attached networks.

• We present experimental results that analyze and demonstrate the accuracy and scalability of ReversePTP, and show that they are comparable to those of PTP.

6.3 Preliminaries

6.3.1 A Brief Overview of PTP

PTP is a protocol that allows the distribution of time and frequency over packet-switched networks. Three types of nodes are defined in PTP (see Fig. 6.2a): Ordinary Clocks (OC), Transparent Clocks (TC) and Boundary Clocks (BC). TCs and BCs are either switches or routers, whereas OCs are typically endpoints. An OC can either be a master or a slave. A master distributes information about its local time to slave clocks in the network using PTP messages. The Best Master Clock Algorithm (BMCA) is used to elect the most accurate clock in the network as the master. A set of PTP clocks that synchronize to a common master forms a PTP domain. A network may consist of several domains, where each domain is dominated by a different master.

A master periodically sends Sync messages to its slaves, incorporating information about its current time. Delay Request and Delay Response messages (see Fig. 6.2b) are used to measure the network delay between the master and slave. At the end of this message exchange the slave has four time values, $T_1$, $T_2$, $T_3$ and $T_4$, as illustrated in Fig. 6.2b, allowing it to compute the offset, $o$, between its clock and the master’s clock as follows:

\[ o = T_2 - T_1 - d_{MS} \tag{6.1} \]

The parameter $d_{MS}$ is the delay between the master and the slave, and is given by:

\[ d_{MS} = \frac{(T_4 - T_1) - (T_3 - T_2)}{2} \tag{6.2} \]

Sync and Delay messages are transmitted periodically, allowing slaves to synchronize to the master’s clock based on multiple measurements of the o and dMS values. Each slave typically uses a servo algorithm, which filters and combines multiple measurements of Eq. 6.1 and 6.2, and synchronizes the slave’s clock.
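As a concrete illustration of Eq. 6.1 and 6.2 (a toy numeric sketch of our own; the timestamp values are made up), the following computes the master-to-slave delay and the offset from a single Sync/Delay Request exchange:

```python
def ptp_offset(t1, t2, t3, t4):
    """Compute the master-to-slave delay d_MS (Eq. 6.2) and the slave's
    offset o from the master's clock (Eq. 6.1), given the four timestamps:
      t1: Sync transmission time (master clock)
      t2: Sync reception time (slave clock)
      t3: Delay Request transmission time (slave clock)
      t4: Delay Request reception time (master clock)
    """
    d_ms = ((t4 - t1) - (t3 - t2)) / 2.0
    offset = t2 - t1 - d_ms
    return offset, d_ms

# Toy example: the slave's clock is 5 time units ahead of the master's and
# the one-way network delay is 2 time units in each direction.
offset, d_ms = ptp_offset(t1=100.0, t2=107.0, t3=110.0, t4=107.0)
print(offset, d_ms)  # 5.0 2.0
```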

Figure 6.2: The Precision Time Protocol (PTP). (a) PTP clocks in a network. (b) PTP message exchange.

The usage of TCs and BCs in PTP is referred to as on-path support. A BC is a node that functions as a slave on one of its ports, and as a master on its other ports. TCs are simple intermediate nodes; they are neither masters nor slaves. Their role is to relay PTP protocol messages between the master and slaves, and to compute a correction field for each message. The correction field represents the aggregated internal delay of all the TCs on the path between the master and slave; when a TC relays a protocol message it measures the delay of the message from reception to transmission, and adds the value of this delay to the correction field. The slave can then use the correction field to eliminate the non-deterministic delays that result from the TCs’ transmission queues. Since TCs only compute the difference between the time of transmission and the time of reception, they are not required to synchronize their clocks to the master’s time. A syntonized TC is a TC that is frequency-synchronized to the master’s clock, allowing a more accurate correction field than when using a non-syntonized TC. PTP supports two timestamping modes: one-step mode, where each Sync message includes its time of transmission, $T_1$, and two-step mode, where a Sync message is sent without a timestamp, followed by a Follow-Up message that incorporates the timestamp $T_1$.

6.3.2 A Model for using Time in SDN

As is standard in the literature (e.g., [162, 163]), we distinguish between real time, which is not observable by nodes in our system, and clock time, as measured by the nodes. Real time values are denoted in lower case, whereas clock time variables and constants are denoted in upper case. We assume that each node in our system maintains a clock. As in [164, 163], the value of a clock $T(t)$ at time $t$ is assumed to be a quadratic function of $t$, as follows:

\[ T(t) = t + o(t_0) + \rho(t_0) \cdot [t - t_0] + 0.5 \cdot d \cdot [t - t_0]^2 \tag{6.3} \]

Here $t_0$ is some previous reference time value, $o(t_0) = T(t_0) - t_0$ is the offset at time $t_0$, the clock skew at $t_0$ is denoted by $\rho(t_0)$, also known as the frequency error, and $d$ is the drift, which is the first derivative of the skew. As in [164, 163], we assume that the drift, $d$, is constant.

Our system consists of $n + 1$ nodes: a controller $c$, and a set $S$ of $n$ switches.^2 We define two possible timed operations:

• Timed notification: We define a set $B$ of notifications that a switch can send to the controller. A notification sent to the controller indicates an event that occurred at the switch, and is accompanied by a timestamp, indicating when the event occurred. A switch can send a notification message of the form $M_{sc}(i, \beta, T^i)$ to the controller, denoting that the message includes a notification $\beta \in B$, and a timestamp $T^i$ in terms of $i$’s clock.

• Timed command: We define a set $A$ of possible commands that the controller can send to switches. The controller can send a timed command message to switch $i$, denoted by $M_{cs}(i, \alpha, T^i)$, implying that $i$ should perform $\alpha \in A$ at time $T^i$ in terms of $i$’s clock.

^2 Throughout the paper we assume that the controller is also the ReversePTP slave.

TimeConf($T_e$, $A_M$)
1  for $i \in S$ do
2      send $M_{cs}(i, \alpha_i, T_e)$

Figure 6.3: A protocol for coordinated network updates.

A controller can use timed commands to invoke a coordinated action at multiple switches. TimeConf [13], formally defined in Fig. 6.3, is a simple protocol for time-based updates, where the controller defines a single execution time, $T_e$, for all switches,^3 and $A_M := \{\alpha_1, \alpha_2, \ldots, \alpha_n\}$ is a set of $n$ commands, such that the controller assigns the action $\alpha_i$ to switch $i$ for each $i \in S$.

^3 The controller should take care to schedule an execution time that allows enough time for the update message to be sent and propagated to all switches in the network.

6.4 ReversePTP: Theory of Operation

As described in Section 6.2.2, in ReversePTP a central node (the SDN controller) is the ‘brain’ of the network, and performs two main tasks: (i) running the PTP protocol as a slave separately with each of the SDN switches (which function as PTP masters), and (ii) performing the necessary translation between the masters’ times and the slave’s time. Each of these tasks is discussed in detail below.

Running the PTP protocol. The ReversePTP slave is bound (concurrently and independently) to $n$ ReversePTP masters. The slave exchanges PTP messages with each of the masters, and based on these messages maintains four parameters for each master $i$:

$T^s_{i_\text{last}}$ : The time at which the latest Sync message from master $i$ was received. The superscript ‘s’ indicates that this timestamp is measured in terms of the slave’s clock.
$\hat{o}_i$ : The estimated offset between the clocks of master $i$ and the slave at time $T^s_{i_\text{last}}$.
$\hat{\rho}_i$ : The estimated skew between master $i$ and the slave at $T^s_{i_\text{last}}$.
$\hat{d}_i$ : The estimated drift between master $i$ and the slave.

Table 6.1: ReversePTP slave parameters.

The offset $\hat{o}_i$, the skew $\hat{\rho}_i$, and the drift $\hat{d}_i$ are computed by the slave based on the latest measurement of $T^s_{i_\text{last}}$, as well as on previous measurements. Various well-known algorithms can be used for computing these parameters, e.g., [164, 158].

Time translations. The parameters of Table 6.1 allow the slave to translate any timestamp $T^s$ in terms of the slave’s clock to a timestamp $T^i(T^s)$ in terms of master $i$’s clock,^4 or vice versa. Notably, every timed operation that can be performed in a clock-synchronized network is also possible in a ReversePTP-enabled network:

• Given a timed notification received from master $i$ with a timestamp $T^i$, the slave translates $T^i$ to $T^s(T^i)$ in terms of its local clock.

• When the slave invokes a timed command that needs to take place at $T^s$ in terms of the slave’s clock, the slave translates $T^s$ to $T^i(T^s)$, and uses $T^i(T^s)$ in the timed command message it sends to master $i$.

^4 We use the notation $T^i(T^s)$, referring to the value of master $i$’s clock at the time instant when the value of the slave $s$’s clock is $T^s$.

The following theorem presents the translation of $T^i(T^s)$ and $T^s(T^i)$, given the estimated parameters of Table 6.1.

Theorem 6.1. Let $t_{i_\text{last}}$ be the time at which the value of the slave’s clock is $T^s_{i_\text{last}}$, and let $o_i(t_{i_\text{last}})$, $\rho_i(t_{i_\text{last}})$, and $d_i$ be the offset, skew and drift at time $t_{i_\text{last}}$. If $o_i(t_{i_\text{last}}) = \hat{o}_i$, $\rho_i(t_{i_\text{last}}) = \hat{\rho}_i$, and $d_i = \hat{d}_i$, then:

(i) $T^i(T^s) = T^s + \hat{o}_i + \hat{\rho}_i \cdot (T^s - T^s_{i_\text{last}}) + 0.5 \cdot \hat{d}_i \cdot (T^s - T^s_{i_\text{last}})^2$, and

(ii) $T^s(T^i) = T^s_{i_\text{last}} + \dfrac{-(1+\hat{\rho}_i) + \sqrt{(\hat{\rho}_i+1)^2 + 2\hat{d}_i\,(T^i - T^s_{i_\text{last}} - \hat{o}_i)}}{\hat{d}_i}$

Proof. We first prove (i). Since $o_i(t_{i_\text{last}}) = \hat{o}_i$, $\rho_i(t_{i_\text{last}}) = \hat{\rho}_i$, and $d_i = \hat{d}_i$, by Eq. 6.3 we obtain (i). We now prove (ii). First, we observe that when the drift $\hat{d}_i$ is negligible, by (i) we have $\lim_{\hat{d}_i \to 0} T^i = T^s + \hat{o}_i + \hat{\rho}_i \cdot (T^s - T^s_{i_\text{last}})$. It follows that:

\[ \lim_{\hat{d}_i \to 0} T^s = \frac{T^i - \hat{o}_i + \hat{\rho}_i \cdot T^s_{i_\text{last}}}{1 + \hat{\rho}_i} \tag{6.4} \]

Now, based on (i) we have:

\[ T^i = T^s(1 + \hat{\rho}_i) + \hat{o}_i - \hat{\rho}_i \cdot T^s_{i_\text{last}} + 0.5\hat{d}_i \left( T^{s\,2} - 2 T^s T^s_{i_\text{last}} + T^{s\,2}_{i_\text{last}} \right) \]

Reorganizing the terms we obtain:

\[ 0.5\hat{d}_i \cdot T^{s\,2} + \left(1 + \hat{\rho}_i - \hat{d}_i \cdot T^s_{i_\text{last}}\right) \cdot T^s + \hat{o}_i - \hat{\rho}_i \cdot T^s_{i_\text{last}} + 0.5 \cdot \hat{d}_i \cdot T^{s\,2}_{i_\text{last}} - T^i = 0 \]

We isolate $T^s$ by solving the quadratic equation above, and obtain:

\[ T^s = \frac{1}{\hat{d}_i}\left(\hat{d}_i \cdot T^s_{i_\text{last}} - \hat{\rho}_i - 1\right) \pm \frac{1}{\hat{d}_i}\sqrt{\left(\hat{d}_i T^s_{i_\text{last}} - \hat{\rho}_i - 1\right)^2 - 2\hat{d}_i \left(\hat{o}_i - \hat{\rho}_i T^s_{i_\text{last}} + 0.5\hat{d}_i T^{s\,2}_{i_\text{last}} - T^i\right)}, \]

and thus:

\[ T^s = \frac{1}{\hat{d}_i}\left(\hat{d}_i T^s_{i_\text{last}} - \hat{\rho}_i - 1\right) \pm \frac{1}{\hat{d}_i}\sqrt{(\hat{\rho}_i + 1)^2 + 2\hat{d}_i\,(T^i - T^s_{i_\text{last}} - \hat{o}_i)} \tag{6.5} \]

The ‘±’ implies that there are two possible solutions. However, only one of these solutions is valid. We show this by observing the limit of the latter equation when $\hat{d}_i \to 0$:

\[ \lim_{\hat{d}_i \to 0} T^s = \lim_{\hat{d}_i \to 0} \frac{(\hat{d}_i T^s_{i_\text{last}} - \hat{\rho}_i - 1) \pm \sqrt{(\hat{\rho}_i + 1)^2 + 2\hat{d}_i\,(T^i - T^s_{i_\text{last}} - \hat{o}_i)}}{\hat{d}_i} \]

Now, by applying L’Hôpital’s rule to the latter, we obtain:

\[ \lim_{\hat{d}_i \to 0} T^s = \lim_{\hat{d}_i \to 0} \frac{\left((\hat{d}_i T^s_{i_\text{last}} - \hat{\rho}_i - 1) \pm \sqrt{(\hat{\rho}_i + 1)^2 + 2\hat{d}_i\,(T^i - T^s_{i_\text{last}} - \hat{o}_i)}\right)'}{(\hat{d}_i)'} = \lim_{\hat{d}_i \to 0} T^s_{i_\text{last}} \pm \frac{0.5 \cdot 2 \cdot (T^i - T^s_{i_\text{last}} - \hat{o}_i)}{\sqrt{(\hat{\rho}_i + 1)^2 + 2\hat{d}_i\,(T^i - T^s_{i_\text{last}} - \hat{o}_i)}}, \]

and thus:

\[ \lim_{\hat{d}_i \to 0} T^s = \frac{T^s_{i_\text{last}}(1 + \hat{\rho}_i) \pm (T^i - T^s_{i_\text{last}} - \hat{o}_i)}{1 + \hat{\rho}_i} \tag{6.6} \]

By comparing Eq. 6.6 and Eq. 6.4, we conclude that from the two ‘±’ solutions, only the ‘+’ solution is valid, and thus by Eq. 6.5 we have:

\[ T^s(T^i) = T^s_{i_\text{last}} + \frac{-(1 + \hat{\rho}_i) + \sqrt{(\hat{\rho}_i + 1)^2 + 2\hat{d}_i\,(T^i - T^s_{i_\text{last}} - \hat{o}_i)}}{\hat{d}_i} \tag{6.7} \]

Intuitively, Theorem 6.1 states that if the slave’s estimated parameters are accurate, then $T^i(T^s)$ is given by (i), and $T^s(T^i)$ is given by (ii). Notably, Corollaries 6.2 and 6.3, which follow directly from the theorem, provide a practical method for the slave to compute master $i$’s clock value at a given time $T^s$, or to compute the slave’s clock value for a given value $T^i$ of master $i$’s clock.


Corollary 6.2. For any time $T^s$ measured in terms of the ReversePTP slave’s clock, the slave can estimate the corresponding value of master $i$’s clock by:

\[ \hat{T}^i(T^s) = T^s + \hat{o}_i + \hat{\rho}_i \cdot (T^s - T^s_{i_\text{last}}) + 0.5 \cdot \hat{d}_i \cdot (T^s - T^s_{i_\text{last}})^2 \tag{6.8} \]

Corollary 6.3. For any time $T^i$ in terms of master $i$’s clock, the slave can estimate the corresponding value of its clock by:

\[ \hat{T}^s(T^i) = T^s_{i_\text{last}} + \frac{-(1 + \hat{\rho}_i) + \sqrt{(\hat{\rho}_i + 1)^2 + 2\hat{d}_i\,(T^i - T^s_{i_\text{last}} - \hat{o}_i)}}{\hat{d}_i} \tag{6.9} \]

A first-order approximation. Interestingly, it is possible to use the following first-order approximation of Eq. 6.8, neglecting $\hat{\rho}_i \cdot (T^s - T^s_{i_\text{last}}) + 0.5 \cdot \hat{d}_i \cdot (T^s - T^s_{i_\text{last}})^2$:

\[ \hat{T}^i(T^s) = T^s + \hat{o}_i \tag{6.10} \]

This approximation is especially useful in cases where $T^s - T^s_{i_\text{last}}$ is negligible, i.e., when the timed operation occurs close to the time at which the last Sync message was received.
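The per-master state of Table 6.1 and the translations of Eq. 6.8–6.10 map directly to code. The following is a minimal sketch of our own (the class and method names are hypothetical, and this is not the prototype implementation):

```python
import math

class MasterClockState:
    """State kept by the ReversePTP slave for one master i (Table 6.1):
    the slave-clock arrival time of the latest Sync, and the estimated
    offset, skew and drift of master i relative to the slave's clock."""

    def __init__(self, t_s_last, offset, skew, drift):
        self.t_s_last = t_s_last  # T^s_{i_last}
        self.offset = offset      # estimated offset \hat{o}_i
        self.skew = skew          # estimated skew \hat{\rho}_i
        self.drift = drift        # estimated drift \hat{d}_i

    def master_time(self, t_s):
        """Eq. 6.8: estimate master i's clock value at slave time t_s."""
        dt = t_s - self.t_s_last
        return t_s + self.offset + self.skew * dt + 0.5 * self.drift * dt ** 2

    def master_time_approx(self, t_s):
        """Eq. 6.10: first-order approximation, useful when t_s is close
        to the arrival time of the last Sync message."""
        return t_s + self.offset

    def slave_time(self, t_i):
        """Eq. 6.9: estimate the slave's clock value at master time t_i."""
        if abs(self.drift) < 1e-15:
            # Drift-free limit (Eq. 6.4), avoiding a division by ~0.
            return (t_i - self.offset + self.skew * self.t_s_last) / (1.0 + self.skew)
        disc = (self.skew + 1.0) ** 2 + 2.0 * self.drift * (t_i - self.t_s_last - self.offset)
        return self.t_s_last + (-(1.0 + self.skew) + math.sqrt(disc)) / self.drift

# Example: translating a scheduled time back and forth for one master.
state = MasterClockState(t_s_last=1000.0, offset=0.5, skew=1e-6, drift=1e-10)
t_i = state.master_time(1015.0)          # controller time -> switch time
print(round(state.slave_time(t_i), 6))   # ~1015.0, the original slave time
```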

6.5 The ReversePTP Profile

In this section we show that ReversePTP can be defined as a PTP profile, i.e., as a subset of the features defined in the IEEE 1588 [21] standard. Significantly, since ReversePTP is defined as a subset of PTP, it follows that ReversePTP can be implemented by existing PTP-enabled switches. The ReversePTP profile has two interesting properties with respect to SDN: (i) to the extent possible it requires very simple functionality from the switches, and (ii) all PTP messages are exchanged between a controller and a switch, and thus no messages are exchanged between switches. As per IEEE 1588 [21], the selected modes of operation of the ReversePTP profile are specified in Table 6.2. We now briefly describe the set of attributes and modes that define this profile.

Property and behavior in ReversePTP:
Best Master Clock Algorithm: none (not used)
Path delay mechanism: End-to-end (E2E)
Transport protocol: UDP over IPv4 or UDP over IPv6
One-step vs. two-step: Both one-step and two-step modes are supported
On-path support: On-path support is optional, using transparent clocks
Multicast vs. unicast: Sync: either unicast or multicast. Delay Req and Delay Resp: always unicast.
Domains: Each master-slave pair forms a different domain. The domain number is the same for all clocks, as in [161].
Management messages: Management messages are optional

Table 6.2: ReversePTP Profile Properties.

PTP domains. A set of PTP clocks that synchronize to a common master forms a PTP domain. In ReversePTP, every switch forms a master-slave pair with the controller, in which the switch serves as master and the controller as slave; in particular, each master-slave pair forms a separate PTP domain (Fig. 6.4). When a slave receives a PTP message it identifies the packet’s domain based on its source address. Thus, as in [161], all domains can use the same domain number.^5

^5 A less scalable solution to identify the packet’s domain is by using the domain number field in the PTP header. Since this field is 8 bits long, this solution would limit the number of ReversePTP masters to 256.

Figure 6.4: ReversePTP: each master determines a separate domain. (a) ReversePTP without on-path support. (b) ReversePTP with on-path support, where a node may act as both a master and a Transparent Clock (TC).

On-path support. A PTP-enabled network may use Transparent Clocks (TC) or Boundary Clocks (BC), which are intermediate switches or routers that take part in the protocol, allowing high end-to-end accuracy. The usage of TCs or BCs is referred to as on-path support. On-path support in ReversePTP is optional. TCs may be used, allowing to improve the accuracy of the protocol using the PTP correction field [21]. This implies that a ReversePTP-enabled switch may function both as a master, distributing its time to the slave, and as a TC, which relays PTP messages from other Ordinary Clocks (Fig. 6.4b). For simplicity, BCs are not used in ReversePTP (note that an SDN can function as a ‘big BC’, as described in Section 6.6.3, but no BCs are used in ReversePTP domains), since a BC must, by definition, keep a synchronized clock and run a clock servo algorithm. It is interesting to note that if Transparent Clocks are used, they can be either syntonized [21] or non-syntonized. A syntonized TC is a TC that is frequency-synchronized to the master’s clock, allowing a more accurate correction field than a non-syntonized TC. Our centralized paradigm requires TCs to be as simple as possible, and hence non-syntonized; a non-syntonized TC is simply required to compute the residence time of en-route PTP messages, and is thus not required to run a complex servo algorithm. Moreover, TCs are not required to exchange PTP messages; a TC only updates the correction field of en-route messages, indicating the internal latency of each message as it is transmitted through the TC. In Section 6.8.3 we briefly discuss how ReversePTP can be extended to allow syntonized TCs.

Best Master Clock Algorithm (BMCA). PTP uses the BMCA to choose a master. In contrast, ReversePTP does not use the BMCA; during the network initialization all switches are configured as masters and the controller is configured as a slave. This approach is aligned with the SDN paradigm, where a bootstrapping procedure (e.g., [42]) is typically used for configuring basic attributes, such as the controllers’ IP addresses.

Unicast and multicast transmission. Sync messages in ReversePTP can be sent either as unicast or as multicast. During network initialization, switches need to be configured with the controller’s unicast address. If multiple SDN controllers are used, Sync messages are distributed to a multicast address that represents the controllers. The operation of ReversePTP in SDNs with multiple controllers is further discussed in Section 6.8.4. Delay Request and Delay Response messages are always sent as unicast.

One-step vs. two-step. Both one-step and two-step modes can be used in ReversePTP.

Peer delay mechanism. The delay measurement mechanism in PTP has two possible modes of operation: End-to-End (E2E) mode, where delay request and response messages are exchanged between the master and slave, and Peer-to-Peer (P2P) mode, where intermediate TCs perform delay measurement on a one-hop basis. While the two modes can provide the same level of accuracy, P2P mode requires PTP messages to be exchanged between TCs. Hence, ReversePTP uses the E2E mode, as this paradigm implies that all delay messages are exchanged between a controller and a switch, and no messages need be exchanged between switches.

6.6 Using ReversePTP in SDNs

6.6.1 The ReversePTP Architecture in SDN

A typical SDN architecture is illustrated in Fig. 6.5a. The network operating system is a logical entity that manages the control plane of the network, and communicates with switches using an SDN protocol such as OpenFlow [67]. The controller may run one or more SDN Applications, modules that perform a network function such as routing or access control, using the SDN control plane. Every switch uses one or more flow tables, which determine the switch’s data plane decisions, such as forwarding and security decisions. A switch’s control plane agent is responsible for managing the flow tables based on commands received from the controller.

We now describe an SDN architecture that uses ReversePTP, as illustrated in Fig. 6.5b. The logical blocks in the figure are described as follows.

Figure 6.5: The ReversePTP architecture in SDNs: every switch runs a ReversePTP master, and the controller runs multiple ReversePTP slave instances. In an SDN that runs conventional PTP, a typical approach would be for the controller to run a PTP master, and for each switch to be a PTP slave. (a) Typical SDN architecture. (b) ReversePTP-enabled SDN.

Clocks. Every switch maintains a clock, which keeps the local wall-clock time and allows the switch to perform time-triggered actions and to timestamp notifications. ReversePTP does not require the switches’ clocks to be synchronized or initialized to a common time reference. The controller maintains a local clock. The controller’s clock is used as the reference for scheduling network-wide coordinated updates, and for measuring timestamped events. Thus, in some systems the controller’s clock may be required to be synchronized to an accurate external reference source such as a GPS receiver.

ReversePTP master. Each switch functions as a PTP master, and periodically sends Sync messages to the controller (its PTP slave), containing a local timestamp. We emphasize that the PTP master functionality is typically supported by existing implementations of PTP-enabled switches.

ReversePTP slave. The controller maintains $n$ ReversePTP slave modules, where $n$ is the number of switches in the network. Each slave module periodically receives Sync messages from one of its masters, switch $i$, and based on these messages it maintains the offset, skew, and drift of $i$ w.r.t. the controller’s clock (Table 6.1).


Timestamp conversion. This module performs the required translation between the controller’s clock time and the switches’ clock time, as described in Section 6.4; a timestamp $T^s$ in terms of the controller’s clock is translated into $T^i$ in terms of $i$’s clock using Eq. 6.8. Similarly, a notification from switch $i$ that contains a timestamp $T^i$ can be converted to the controller’s timescale by Eq. 6.9.

It is a key observation that the timestamp conversion module allows SDN applications that run at the controller to implement any time-based protocol that would require switches to be synchronized using a conventional synchronization protocol. This interesting property applies generally to SDN applications; coordination is not required directly between switches, but only through the controller. We now describe two interesting examples of SDN applications that use ReversePTP: the time-based update application, and the time distribution application.

6.6.2 Time-based Updates using ReversePTP

Time-based update application. This simple SDN application performs time-based network updates using the TimeConf protocol (Fig. 6.3). When the application sends a time-based update, denoted by $M_{cs}(i, \alpha_i, T_e)$, on line 2 of Fig. 6.3, the time conversion module translates $T_e$ to a time $T^i$ in the domain of $i$’s clock. Conceptually, the joint operation of the time-based update application and the timestamp conversion block performs the following protocol:

ReverseTimeConf($T_e$, $A_M$)
1  for $i \in S$ do
2      $T^i \leftarrow T_e + \hat{o}_i + \hat{\rho}_i \cdot (T_e - T^s_{i_\text{last}}) + 0.5 \cdot \hat{d}_i \cdot (T_e - T^s_{i_\text{last}})^2$
3      send $M_{cs}(i, \alpha_i, T^i)$ to $i$

Figure 6.6: Coordinated updates using ReversePTP.

As in Eq. 6.10, a first-order approximation of ReverseTimeConf, where line 2 is replaced by $T^i \leftarrow T_e + \hat{o}_i$, can be used when the higher order terms, $\hat{\rho}_i \cdot (T_e - T^s_{i_\text{last}})$ and $0.5 \cdot \hat{d}_i \cdot (T_e - T^s_{i_\text{last}})^2$, are negligible.

Switch scheduling. When switch $i$ receives a scheduled message, $M_{cs}(i, \alpha, T^i)$, from the controller, this module schedules the command $\alpha$ to time $T^i$ in terms of the switch’s local clock.
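A minimal controller-side sketch of ReverseTimeConf follows (our own illustration; the controller object, its master_state map and its send_timed_command method are hypothetical placeholders, and the sketch reuses the MasterClockState class from the example at the end of Section 6.4):

```python
def reverse_timeconf(controller, switches, t_e, commands):
    """Sketch of ReverseTimeConf (Fig. 6.6): for every switch i, translate
    the common execution time t_e (controller clock) into switch i's clock
    using Eq. 6.8, and send the timed command alpha_i."""
    for i in switches:
        state = controller.master_state[i]   # Table 6.1 parameters for i
        t_i = state.master_time(t_e)         # line 2 of Fig. 6.6 (Eq. 6.8)
        # With negligible skew and drift terms this reduces to
        # t_i = t_e + state.offset (Eq. 6.10).
        controller.send_timed_command(i, commands[i], t_i)  # line 3
```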

6.6.3 Time Distribution over SDNs using ReversePTP

In some cases time must be distributed between end stations or networks that are attached to an SDN. For instance, an SDN-based mobile backhaul network must allow time distribution between base station sites, enabling the base stations to be synchronized. In this section we present an SDN application, denoted by time dist. app in Fig. 6.5b, that allows time distribution over an SDN.

In conventional PTP-enabled networks time is distributed over one or more PTP Boundary Clocks (BCs) [21], as shown in Fig. 6.7a. A BC is a switch or a router that maintains an accurate clock based on Sync messages that it receives from the PTP master, and distributes its time to the PTP slaves. When a BC receives a Sync message^6 from the master (step 1 in Fig. 6.7a), its ingress time is accurately measured. Based on the Sync message and its ingress timestamp, the BC adjusts its clock. When the BC generates a Sync message to one of the slaves, the message is accurately timestamped when it is transmitted through the egress port (step 2 in Fig. 6.7a).

Our approach is illustrated in Fig. 6.7b; ReversePTP is used within the SDN, allowing the controller to maintain the time offset to each of the switches. An SDN is often viewed as a single ‘big switch’. Similarly, in our approach the SDN is a distributed BC that functions as a single logical ‘big BC’. When the master sends a Sync message, switch 1 accurately measures its ingress time, $T^1$ (step 1 in Fig. 6.7b), and sends the packet and $T^1$ to the controller for further processing. The controller converts $T^1$ to $T^s$ using the timestamp conversion module, and the time dist. app (Fig. 6.5b) adjusts the controller’s clock based on the incoming Sync message and $T^s$.

^6 The Sync message procedure is presented as an example. The procedure for other types of PTP messages is similar.

Figure 6.7: SDN as a Boundary Clock. (a) A conventional BC. (b) An SDN as a ‘big BC’.

When the time dist. app sends a Sync message through switch 2, it uses the packet’s correction field [21] to reflect the offset $o_2$ between switch 2 and the controller, and the packet is timestamped by switch 2 as it is transmitted (step 2 in Fig. 6.7b). This procedure can be implemented in OpenFlow [67], using Packet-In and Packet-Out messages between the controller and the switches. Note that there is currently no standard means for the ingress port of switch 1 to convey $T^1$ to the controller. A similar problem has been raised in the IEEE 1588 working group (e.g. [165]), and proposals that address it are currently under discussion there.

The significant advantage of the ‘big BC’ approach compared to the conventional PTP approach is that it enables the programmability of ReversePTP, while presenting standard PTP behavior to external non-SDN nodes.
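A hedged sketch of the controller-side ‘big BC’ steps described above (our own illustration; the controller and application objects, their fields, and the packet_out call are hypothetical placeholders rather than OpenFlow API calls, and the correction-field update is shown only schematically since the sign convention depends on the implementation):

```python
def on_sync_packet_in(controller, ingress_switch, sync_pkt, t_ingress):
    """Step 1 in Fig. 6.7b: the ingress switch timestamped the incoming Sync
    in its own clock (t_ingress); translate it to the controller's timescale
    (Eq. 6.9) and let the time-distribution app adjust the controller clock."""
    state = controller.master_state[ingress_switch]
    t_controller = state.slave_time(t_ingress)
    controller.time_dist_app.process_sync(sync_pkt, t_controller)

def send_sync_packet_out(controller, egress_switch, sync_pkt):
    """Step 2 in Fig. 6.7b: before handing the Sync to the egress switch,
    account for the offset between that switch and the controller via the
    correction field; the switch timestamps the packet on transmission."""
    offset = controller.master_state[egress_switch].offset
    sync_pkt.correction_field += offset
    controller.packet_out(egress_switch, sync_pkt)
```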

6.7 Evaluation

We have implemented a prototype of ReversePTP, based on the open-source PTPd [75], and evaluated its performance in a testbed with 34 nodes (Fig. 6.8). The goals of our experiments were: (i) To evaluate the effectiveness of scheduling a simultaneous event in the network using ReversePTP, (ii) To verify that ReversePTP and conventional PTP provide a similar degree of accuracy under the same network conditions, and specifically, to verify that ReversePTP can schedule events with a high degree of accuracy using the approximation of Eq. 6.10, and (iii) To analyze the scalability of ReversePTP.

The experiments were conducted on the DeterLab testbed [25], where every testbed machine (computer) served as a PTP clock running the software-based PTPd. Note that our experiments used software-based PTP clocks in a network with up to two hops without on-path support, and we observed that the achievable accuracy in this software-based environment was on the order of tens to hundreds of microseconds.^7

^7 Typical hardware-based PTP implementations allow an accuracy on the order of 1 microsecond.

Accuracy measurement method. The accuracy of a set of clocks is typically measured in a controlled environment by connecting identical fiber cables from each of the clocks to a measurement device; each of the clocks transmits a 1 Pulse-Per-Second (1 PPS) signal to the measurement device, allowing the device to measure the offset between the clocks. Such a setting was not possible in the DeterLab testbed. Therefore, we measured the accuracy by using two types of time-triggered events, as described in the next subsection.

6.7.1 Time-triggered Events

This experiment included two scenarios: performing a time-triggered coordinated event, and recording a timestamped event. Both scenarios were implemented using ReversePTP, and we compare the results to an implementation of the same scenarios using PTP. ReversePTP masters were implemented by running PTPd in master-only mode, with the BMCA disabled. Our n-slave ReversePTP node ran n instances of PTPd in slave-only mode, each at a different PTP domain, using n PTP domain numbers. PTPd, when in slave mode, periodically provides the current Offset From Master. This value is typically used by the PTPd servo algorithm [158]. Our experiments used the Offset From Master output to perform the first order approximation of Eq. 6.10. Nodes 3 to 34 played the role of ReversePTP masters, node 1 was the ReversePTP slave, and node 2 served as a measurement probe. The experiment consisted of two parts:

(i) Coordinated event. We scheduled nodes 3 to 34 to send a Ping message to node 2 at the same time. The scheduling was based on ReverseTimeConf using the offsets computed by node 1, the ReversePTP slave, with the approximation of Eq. 6.10. To simplify the experiment we did not use a control protocol such as OpenFlow. Instead, we used a simple doorbell-based method to distribute the scheduling to nodes 3 to 34; after the scheduling times were computed, they were written to the Network File System (NFS), and each of the nodes 3 to 34 read its scheduling time from the NFS. In node 2 we monitored the distribution of the Ping message arrival times; the arrival time of each packet was captured by Wireshark [166] using the machine’s Linux clock. Interestingly, this experiment is the message-based variant of a 1 Pulse Per Second (PPS) signal sent from each of the 32 clocks to a single testing device. We repeated the experiment with PTP instead of ReversePTP, using TimeConf (Fig. 6.3).

Figure 6.8: Network setup. (a) Coordinated event: nodes 3 to 34 simultaneously send a Ping message to node 2. (b) Timestamped event: node 2 sends a broadcast Ping message to nodes 3 to 34.

The distribution of the arrival times of the 32 Ping messages at node 2 is illustrated in Fig. 6.9a. The value ‘0’ on the time axis represents the median of the arrival times. As shown in Fig. 6.9a, the arrival times were spread over a period of about 8 milliseconds. In Section 6.4 we observed that the approximation of Eq. 6.10 is valid when $T^s - T^s_{i_\text{last}}$ is negligible. In our experiments, the Ping messages were scheduled to be sent 15 seconds in the future, and hence $T^s - T^s_{i_\text{last}}$ was on the order of 15 seconds.

Figure 6.9: Accuracy measurements of a coordinated Ping. The timestamped event experiment (b) provides a rough estimate of the clock accuracy. (a) Coordinated event: PDF of the Ping arrival time when nodes 3 to 34 send a Ping to node 2 simultaneously. (b) Timestamped event: PDF of the Ping arrival time when node 2 sends a broadcast Ping to nodes 3 to 34. Both panels compare ReversePTP and PTP; the horizontal axis is the Ping arrival time in milliseconds.

The results show that 15 seconds is a sufficiently low value to allow Eq. 6.10 to produce an accurate approximation.

(ii) Timestamped event. We sent a broadcast Ping message from node 2 to nodes 3 to 34, and measured its arrival times at each of these nodes using Wireshark. We then used Eq. 6.10 to align the reception times to a common ReversePTP-based time reference. We repeated the experiment using conventional PTP. The empirical PDF of the arrival times measured at nodes 3 to 34 is depicted in Fig. 6.9b. Notably, the distribution of the arrival time in this experiment provides a rough estimate of the clock accuracy; since the broadcast Ping message is replicated and distributed by the testbed’s hardware switches, we expect very low delay variation among the replicas. Thus, Fig. 6.9b provides an estimate of the clock accuracy.

We observed that in the coordinated event experiment the time elapsed from when the Ping message was scheduled to be transmitted until it was transmitted in practice varied at different nodes on the order of a few milliseconds. Hence, in the coordinated event experiment the accuracy of our measurement was affected by the internal delay of the sending hosts’ operating systems, thus explaining the fact that the arrival time in Fig. 6.9a ranges over a period of about 8 milliseconds, a significantly wider range than the one shown in Fig. 6.9b.

The experiment demonstrates how ReversePTP can be effectively used to coordinate events, or to accurately measure the occurrence time of events. It shows that ReversePTP provides the same level of accuracy as conventional PTP.

6.7.2 Scalability

In the second experiment we studied the scalability aspects of ReversePTP, with an emphasis on the load placed on a ReversePTP slave compared to a conventional PTP master.

Rate of protocol messages. Fig. 6.10 compares the protocol message rate of ReversePTP with that of conventional PTP. The Sync rate in our experiments was 1 Sync message per second, with a Delay Req message rate of 1 per second as well. We used two-step mode, and thus Follow Up messages were used as well. We ran the experiment for a duration of 100 seconds.

Figure 6.10: ReversePTP vs. PTP: rate of PTP messages sent or received by each node, as a function of the number of nodes. (a) ReversePTP slave vs. PTP master. (b) ReversePTP master vs. PTP slave.

As depicted in Fig. 6.10b, the protocol message rates of a ReversePTP master and a PTP slave are similar, around 4 messages per second. Fig. 6.10a illustrates the message rate at the ReversePTP slave and at the PTP master; the number of messages processed by the ReversePTP slave is roughly twice the number processed by the PTP master. The reason is that we ran PTP in hybrid mode, where Sync and Follow Up were multicast messages, whereas in ReversePTP all messages were unicast. The fact that some of the messages were sent as multicast allowed the conventional PTP to use roughly half the number of messages used by ReversePTP.

Figure 6.11: CPU utilization in ReversePTP and in PTP as a function of the number of nodes. The figures are presented for two machine types: Type I is a low-performance machine, and Type II is high-performance. (a) ReversePTP scalability: CPU utilization of a ReversePTP slave. (b) PTP scalability: CPU utilization of a PTP master. (c) TimeConf scalability: CPU utilization of a controller running TimeConf on a Type II machine.

CPU Utilization. A more important metric of scalability is CPU utilization. We analyzed the difference between ReversePTP and PTP in terms of the percentage of the CPU that is utilized by the PTP process. We repeated the experiment for two types of machines: Type I, a Dual Core AMD Opteron running at 1.8 GHz, and Type II, an Intel Xeon E3 LP running at 2.4 GHz. Note that the two types we chose are at the extremes of the performance scale; at the time of the experiment Type II was the most powerful machine in the DeterLab testbed, while Type I was the least powerful one. The CPU utilization measured on ReversePTP masters and PTP slaves (i.e., switches in an SDN environment) was negligible, well below 0.1% on both machine types. A significantly higher utilization was measured on PTP masters and ReversePTP slaves (controllers in an SDN environment), as illustrated in Fig. 6.11a and 6.11b. These two figures illustrate the CPU utilization of the PTPd process as a function of the number of nodes, either ReversePTP masters or PTP slaves. As expected, the CPU utilization of the ReversePTP slave (Fig. 6.11a) is significantly higher than that of the PTP master (Fig. 6.11b), since the ReversePTP slave runs $n$ servo algorithm instances. Nevertheless, ReversePTP consumes only 5% of the resources of machine Type I, and significantly less than 1% of the resources of the more powerful Type II.

Fig. 6.11c provides some insight into the effect of ReversePTP on an SDN application such as the TimeConf protocol. The figure illustrates the CPU utilization of an SDN controller running ReverseTimeConf vs. an SDN controller running TimeConf with conventional PTP. In the experiment we used the time-enabled OpenFlow prototype of [1]; the controller sent timed update messages to the switches at a rate of 1 message per second. The figure illustrates the CPU utilization as a function of the number of SDN switches. Only the utilization of the SDN controller task was considered in Fig. 6.11c, without considering the PTP task. As shown in the figure, ReverseTimeConf yields a higher CPU load, as each timed message sent by the controller requires the time conversion of line 2 of Fig. 6.6. These results indicate that although ReversePTP incurs a higher load on the controller’s CPU than conventional PTP, the ReversePTP scheme can easily scale to hundreds of nodes, and can scale to thousands of nodes when running a ReversePTP slave on a powerful machine.

6.8 Discussion

6.8.1 Accuracy

The clock accuracy in a network depends on a number of factors, including the accuracy of the timestamping mechanisms, the quality of the clock oscillators, and whether on-path support is used. As discussed in the Introduction, given the same network characteristics we expect ReversePTP and PTP to provide the same degree of accuracy. The experimental evaluation of Section 6.7 confirmed that ReversePTP, even with the approximation of Eq. 6.10, can provide an accuracy that is comparable to that of conventional PTP. As stated in Section 6.5, the ReversePTP paradigm implies that TCs are non-syntonized. Although the IEEE 1588 standard does not specify that TCs must be syntonized, syntonized TCs have been shown [167] to provide a higher degree of accuracy, especially over a large number of hops. A possible extension to ReversePTP that mitigates this limitation is discussed in Section 6.8.3.

6.8.2 Scalability-Programmability Tradeoff

The experimental evaluation in Section 6.7.2 showed that ReversePTP loads the SDN controller’s resources more than traditional PTP. This emphasizes a tradeoff that is at the heart of SDN; the controller takes on tasks that are traditionally performed by switches in non-SDN networks, allowing high flexibility and programmability, at the cost of controller resources. The experiments show that although ReversePTP requires more resources than conventional PTP, the ReversePTP scheme can easily scale to networks with thousands of SDN switches.

6.8.3 Synchronizing Clocks using ReversePTP

The concept we presented does not require switches to be synchronized to a common wall-clock time. However, ReversePTP can be extended to allow switches to be time-synchronized. PTP allows masters to query slaves about the master-slave offset using PTP management messages. Using these messages in ReversePTP, switches can synchronize their clocks with the controller’s clock. Note that the offset only allows switches to get a first-order approximation, as per Eq. 6.10. It is possible to extend PTP to allow slaves to periodically send the three parameters $T^s_{i_\text{last}}$, $\hat{o}_i$, and $\hat{\rho}_i$, allowing ReversePTP masters to maintain an accurately synchronized clock. Note that this extension allows switches to be synchronized at the cost of additional complexity and message exchanges. The main benefit of this approach compared to the conventional PTP is that the algorithmic logic is centralized and programmable. The latter extension can similarly be used to allow syntonized TCs; a non-syntonized TC may perform less accurate updates of the correction field due to its inaccurate frequency, whereas a TC that periodically receives the three parameters above can use these parameters to accurately compute the correction field, using well-known methods (e.g., [167]).

6.8.4 ReversePTP in an SDN with Multiple Controllers

SDNs often use multiple controllers in an active-standby mode to provide survivability. In other cases, an SDN is sliced into multiple virtual networks (e.g. [42]), each of which is governed by a separate controller. Interestingly, the ReversePTP architecture is well-suited for multi-controller configurations; each of the switches (masters) distributes its time to all the controllers, allowing each controller to monitor its own offset information regarding the switches. Thus, in sliced networks, ReversePTP allows each slice to be managed according to a different time reference, by allowing each controller to be synchronized to a different reference source. Notably, this slicing property is exclusive to ReversePTP, and is not possible in conventional clock synchronization methods. Time distribution to multiple controllers can be performed efficiently by using a multicast group that consists of the controllers, thereby reducing the rate of PTP messages, since each switch sends its Sync messages to the multicast group. Moreover, when the controllers act in an active-standby mode, the switches only need to distribute their time to the currently active controller.
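A minimal sketch of the per-controller bookkeeping this enables is shown below; it is purely conceptual and is not taken from the ReversePTP implementation. Each controller keeps its own table of offsets, updated from the Sync messages it receives, with an assumed (possibly zero) path-delay correction.

class ControllerClockView:
    """Per-controller bookkeeping of switch clock offsets (conceptual only)."""

    def __init__(self):
        self.offsets = {}  # switch_id -> latest controller-minus-switch offset estimate

    def on_sync(self, switch_id, t_switch, t_arrival, path_delay=0.0):
        # t_switch:   origin timestamp carried in the switch's Sync message
        # t_arrival:  local receive timestamp at this controller
        # path_delay: assumed or separately measured one-way delay (0.0 if ignored)
        self.offsets[switch_id] = t_arrival - t_switch - path_delay

# Each controller (active, standby, or per-slice) keeps its own view; all of them
# can be fed from the same multicast Sync stream.
active_view = ControllerClockView()
standby_view = ControllerClockView()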

6.8.5 Security aspects

The potential security vulnerabilities of ReversePTP are similar to those of conventional synchronization protocols [168, 101]. In PTP, a successful attack results in one or more slaves not being accurately synchronized to the correct time, whereas in ReversePTP, a successful attack causes the controller to have an inaccurate view of the offset to one or more of the switches. An application that requires accurate time is similarly affected in both cases.

6.9 Conclusion

Clock synchronization protocols are not ‘one size fits all’, as different applications may have different requirements and constraints. We introduced ReversePTP, a clock synchronization scheme suitable for SDN. While ReversePTP is tailored for SDN, it can be valuable in many other centralized and software-controlled architectures, such as industrial automation systems [35], power-grid networks [43], and various network-managed applications [3, 4]. ReversePTP provides the same level of accuracy as conventional synchronization protocols, including PTP, while its novel architecture shifts the complex functionality from the switches to the controller, facilitating the agility and programmability that are of key importance in SDNs.


Acknowledgment

The authors would like to thank Wojciech Owczarek for his dedicated help and support with PTPd. We gratefully acknowledge the DeterLab project [25] for the opportunity to perform our experiments on the DeterLab testbed. This work was supported in part by the ISF grant 1520/11.


Chapter 7

Conclusion

7.1 Summary of Results

In this dissertation we analyzed the use of accurate time for triggering network updates. Our work shows that time is a valuable tool for coordinating network devices in centrally managed networks such as SDNs. The main contributions of this thesis fall into four areas, as illustrated by the four rows of Fig. 7.1.

Figure 7.1: Summary of results. (Figure content: four rows of contributions. Scenarios that benefit from using time: flow swaps (Time4) and multi-phase updates (timed consistent updates). Scheduling protocols: OpenFlow Scheduled Bundles [OpenFlow 1.5] and the NETCONF Time Capability [RFC 7758]. Accurate scheduling methods: TimeFlip (time-based TCAM ranges) and OneClock (prediction-based scheduling). Clock synchronization: ReversePTP.)


Scenarios that benefit from using time

Two key scenarios that greatly benefit from using time are analyzed: flow swaps and multi-phase updates.

Flow swaps. We introduce Time4, which is an update approach that performs multiple changes at different switches at the same time. We consider a class of network update scenarios called flow swaps, and show that Time4 is the optimal approach for implementing them. In contrast, existing approaches for consistent updates (e.g., [22, 30]) are not applicable to flow swaps, and other update approaches such as SWAN [33] and B4 [59] can perform flow swaps, but at the expense of increased resource overhead. We present the lossless flow allocation (LFA) problem, and use a game-theoretic analysis to formally show that flow swaps are inevitable given the dynamic nature of SDN. We present the design, implementation and evaluation of a prototype that performs timed updates in OpenFlow. We present experimental results that demonstrate the advantage of timed updates over existing approaches; Time4 allows fewer packet drops and higher scalability than state-of-the-art approaches. Moreover, we show that existing update approaches (SWAN and B4) can be improved by using accurate time. Our experiments include an emulation of an SDN-controlled video swapping scenario, a real-life use case that has been shown in [63] to be infeasible with previous versions of OpenFlow, which did not include our time extension.

Multi-phase updates. The approach presented in this thesis proposes to perform time-triggered multi-phase network updates. We show that timed multi-phase updates can guarantee consistency while requiring a shorter duration than existing consistent update methods. This approach is shown to reduce the expensive overhead of maintaining duplicate configurations. We also discuss hybrid approaches that combine the advantages of timed updates with those of other update methods. In this work we also define an inconsistency metric, which quantifies how consistent a network update is. We show that accurate time provides the SDN programmer with a knob for fine-tuning the tradeoff between consistency and scalability. We present experimental results that demonstrate the significant advantage of timed updates over other update methods. Our evaluation is based on experiments performed on a 50-node testbed, as well as simulation results.
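Both scenarios rely on the same basic primitive: the controller selects one future execution time and attaches it to every switch's update. The sketch below illustrates this primitive only; the send_timed_update callback and the guard interval are hypothetical placeholders, not part of the actual Time4/OpenFlow prototype.

import time

GUARD_SECONDS = 0.05  # assumed margin covering message delivery and rule installation

def schedule_flow_swap(updates, send_timed_update, guard=GUARD_SECONDS):
    """Send every switch its new rule together with one common execution time.

    updates           -- dict mapping a switch handle to the rule change it should apply
    send_timed_update -- hypothetical transport callback (e.g., committing a
                         scheduled bundle); supplied by the caller
    """
    t_exec = time.time() + guard          # common wall-clock execution time
    for switch, change in updates.items():
        send_timed_update(switch, change, execute_at=t_exec)
    return t_exec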

Scheduling protocols

This dissertation defines extensions to standard network protocols, enabling practical implementations of the concepts we present. We define a new feature in OpenFlow called Scheduled Bundles (Chapter 2), and a similar capability in NETCONF (Chapter 5). These two features enable time-triggered operations in OpenFlow and in NETCONF, respectively. As a result of our work, the capability to perform time-triggered updates has been incorporated into the OpenFlow 1.5 protocol [67], and has been defined for the NETCONF protocol in an IETF RFC [4]. Open source prototypes are available for these extensions [69].

Accurate scheduling methods

One of the main challenges in the use of accurate time is to implement accurate execution of events, i.e., guaranteeing that scheduled network updates are executed as close as possible to the time for which they were scheduled. For example, a scheduling mechanism that relies on the switch's software may be affected by the switch's operating system and by other running tasks. Two accurate scheduling methods are presented and analyzed in this dissertation, TimeFlip and OneClock.

TimeFlip. We introduce TimeFlips; a TimeFlip is a timestamp-based TCAM range in a hardware switch. We show that TimeFlip is a practical method of implementing accurate time-based network updates and a natural implementation of Atomic Bundles, using time-based TCAM ranges. We show that in practical conditions, a small number of timestamp bits are required to accurately perform a TimeFlip using a small number of TCAM entries. At the heart of our analysis lie two properties that are unique to time-based TCAM ranges. First, by carefully choosing the scheduled update time, the range values can be selected to minimize the required TCAM resources. We refer to this flexibility as the scheduling tolerance. Second, if there is a known bound on the installation time of the TCAM entries, then by using periodic time ranges, the expansion of the time range can be significantly reduced. We present an optimal scheduling algorithm; no other scheduling algorithm can produce a timestamp range that requires fewer TCAM entries. We analyze the amount of TCAM resources required to encode a TimeFlip, and show that if there is enough flexibility in determining the scheduled time, a TimeFlip can be encoded by a single TCAM entry, using a single bit to represent the timestamp, while allowing a very high degree of accuracy. Using a microbenchmark on a real-life network device, we show that TimeFlips work on existing network devices, making accurate time-based updates a viable tool for network management.

OneClock. We introduce a prediction-based scheduling approach that uses timing information collected at runtime to accurately schedule future operations. We analyze three prediction algorithms: (i) an average-based algorithm, (ii) a fault-tolerant average (FT-Average), and (iii) a Kalman-Filter-based algorithm. In our experimental evaluation we found the simple FT-Average to be the most accurate algorithm in most of the experiments. The evaluation confirms that prediction-based scheduling provides a high degree of accuracy in diverse and heterogeneous environments, decreasing the prediction error by an order of magnitude compared to the naïve baseline approach.
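As a rough illustration of prediction-based scheduling, the sketch below shows one plausible reading of a trimmed, fault-tolerant average; the exact FT-Average algorithm and the window handling used by OneClock may differ.

def ft_average(samples, trim=1):
    """Predict the next execution latency from a non-empty history of measurements.

    samples -- recent command-execution latencies, in seconds
    trim    -- how many of the smallest and largest samples to discard
    """
    if len(samples) <= 2 * trim:
        return sum(samples) / len(samples)
    kept = sorted(samples)[trim:len(samples) - trim]
    return sum(kept) / len(kept)

A scheduler would then instruct the device to start executing at the requested time minus the predicted latency, so that the visible effect of the operation lands close to the requested time.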

Clock synchronization

In Chapter 6 we introduced ReversePTP, a clock synchronization scheme that is adapted to the centralized SDN environment; in ReversePTP all nodes (switches) in the network distribute timing information to a single software-based central node (the SDN controller), which tracks the state of all the clocks in the network. Thus, all computations and bookkeeping are performed by the central node, whereas the ‘dumb’ switches are only required to periodically send it their current time. In accordance with the SDN paradigm, the ‘brain’ is implemented in software, making ReversePTP flexible and programmable from an SDN programmer’s perspective. We show that ReversePTP can be defined as a PTP profile, i.e., a subset of the features of PTP. Consequently, ReversePTP can be implemented by existing PTP-enabled switches. We present experimental results that analyze and demonstrate the accuracy and scalability of ReversePTP compared to the conventional PTP.
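The kind of centralized computation ReversePTP enables can be pictured with the following sketch, which fits a per-switch offset and drift to Sync timestamps by simple linear regression. This is an assumed illustration, not the PTPd-based implementation evaluated in Chapter 6; in particular, it ignores path-delay compensation.

def fit_offset_and_drift(pairs):
    """Fit t_controller ~ t_switch + offset + drift * (t_switch - t0) by least squares.

    pairs -- list of (t_switch, t_controller) samples for one switch, where
             t_switch is the Sync origin timestamp and t_controller the receive time.
    Returns (offset, drift), with offset estimated at the first sample time t0.
    """
    t0 = pairs[0][0]
    xs = [ts - t0 for ts, _ in pairs]
    ys = [tc - ts for ts, tc in pairs]   # instantaneous controller-minus-switch offsets
    n = len(pairs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return mean_y, 0.0
    drift = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    offset = mean_y - drift * mean_x
    return offset, drift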

7.2 Future Work

This dissertation opens the door for several potential research directions. We focused on the use of accurate time in centralized network architectures. An interesting future direction would be to study the use of accurate time in distributed control protocols. For instance, it would be interesting to explore whether the performance of traditional routing protocols can be improved by using accurate time.

Another possible research direction is to analyze timed update protocols that use a store-and-forward approach to guarantee consistency; Time4 does not guarantee consistent forwarding, as packets may be forwarded inconsistently during a short transition period around the scheduled update time. However, consistency can be guaranteed if the switches temporarily store the traffic slightly before the scheduled update, and then forward it slightly after the update is complete. This approach enables consistent updates without the need for the configuration version tags that are used in the approach of [22] and Chapter 3.

Finally, the current work focused on SDN. An interesting next step would be to consider the use of accurate time in virtualized environments; Network Function Virtualization (NFV) has significantly evolved over the past few years, but the use of time in these environments is yet to be considered.


Bibliography

[1] T. Mizrahi and Y. Moses, “Software Defined Networks: It’s about time,” in IEEE INFOCOM, 2016.
[2] T. Mizrahi and Y. Moses, “Time4: Time for SDN,” IEEE Transactions on Network and Service Management (TNSM), under major revision, 2016.
[3] T. Mizrahi and Y. Moses, “OneClock to rule them all: Using time in networked applications,” in IEEE/IFIP Network Operations and Management Symposium (NOMS) miniconference, 2016.
[4] T. Mizrahi and Y. Moses, “Time Capability in NETCONF,” RFC 7758, IETF, 2016.
[5] T. Mizrahi and Y. Moses, “ReversePTP: A clock synchronization scheme for software defined networks,” International Journal of Network Management (IJNM), accepted, 2016.
[6] T. Mizrahi and Y. Moses, “The case for data plane timestamping in SDN,” in IEEE INFOCOM Workshop on Software-Driven Flexible and Agile Networking (SWFAN), 2016.
[7] T. Mizrahi, E. Saat, and Y. Moses, “Timed consistent network updates in software defined networks,” IEEE/ACM Transactions on Networking (ToN), 2016.
[8] T. Mizrahi, E. Saat, and Y. Moses, “Timed consistent network updates,” in ACM SIGCOMM Symposium on SDN Research (SOSR), 2015.
[9] T. Mizrahi, O. Rottenstreich, and Y. Moses, “TimeFlip: Scheduling network updates with timestamp-based TCAM ranges,” in IEEE INFOCOM, 2015.


[10] T. Mizrahi and Y. Moses, “Using ReversePTP to distribute time in software defined networks,” in International IEEE Symposium on Precision Clock Synchronization for Measurement Control and Communication (ISPCS), 2014. [11] T. Mizrahi and Y. Moses, “ReversePTP: A software defined networking approach to clock synchronization,” in ACM SIGCOMM Workshop on Hot topics in Software Defined Networks (HotSDN), 2014. [12] T. Mizrahi and Y. Moses, “On the necessity of time-based updates in SDN,” in Open Networking Summit (ONS), 2014. [13] T. Mizrahi and Y. Moses, “Time-based updates in software defined networks,” in ACM SIGCOMM Workshop on Hot topics in Software Defined Networks (HotSDN), 2013. [14] “Topology Zoo,” http://topology-zoo.org/, 2015. [15] ITU-T G.8271/Y.1366, “Time and phase synchronization aspects of packet networks,” ITU-T, 2012. [16] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, et al., “Spanner: Google’s globally distributed database,” ACM Transactions on Computer Systems (TOCS), vol. 31, no. 3, p. 8, 2013. [17] L. Lamport, “Using time instead of timeout for fault-tolerant distributed systems.,” ACM Trans. Program. Lang. Syst., vol. 6, pp. 254–280, Apr. 1984. [18] Open Networking Foundation, “OpenFlow switch specification,” Version 1.4.0, 2013. [19] J. Case, M. Fedor, M. Schoffstall, and C. Davin, “A simple network management protocol (SNMP),” RFC 1157, IETF, 1990. [20] R. Enns, M. Bjorklund, J. Schoenwaelder, and A. Bierman, “Network configuration protocol (NETCONF),” RFC 6241, IETF, 2011.


[21] IEEE TC 9, “1588 IEEE Standard for a Precision Clock Synchronization Protocol for Networked Measurement and Control Systems Version 2,” IEEE, 2008.
[22] M. Reitblatt, N. Foster, J. Rexford, C. Schlesinger, and D. Walker, “Abstractions for network update,” in ACM SIGCOMM, 2012.
[23] D. G. Malcolm, J. H. Roseboom, C. E. Clark, and W. Fazar, “Application of a technique for research and development program evaluation,” Operations research, vol. 7, no. 5, pp. 646–669, 1959.
[24] Emulab — Network Emulation Testbed http://www.emulab.net, 2015.
[25] The DeterLab project http://deter-project.org/about_deterlab, 2015.
[26] “PingER,” http://pinger.fnal.gov/, 2014.
[27] “AMP Measurements,” http://erg.wand.net.nz, 2014.
[28] P. François and O. Bonaventure, “Avoiding transient loops during the convergence of link-state routing protocols,” IEEE/ACM Transactions on Networking (TON), vol. 15, no. 6, pp. 1280–1292, 2007.
[29] S. Bryant, C. Filsfils, S. Previdi, and M. Shand, “IP Fast Reroute using tunnels,” draft-bryant-ipfrr-tunnels, work in progress, IETF, 2004.
[30] X. Jin, H. H. Liu, R. Gandhi, S. Kandula, R. Mahajan, J. Rexford, R. Wattenhofer, and M. Zhang, “Dionysus: Dynamic scheduling of network updates,” in ACM SIGCOMM, 2014.
[31] L. Vanbever, S. Vissicchio, C. Pelsser, P. Francois, and O. Bonaventure, “Seamless network-wide IGP migrations,” in ACM SIGCOMM Computer Communication Review, vol. 41, pp. 314–325, 2011.
[32] H. H. Liu, X. Wu, M. Zhang, L. Yuan, R. Wattenhofer, and D. Maltz, “zUpdate: updating data center networks with zero loss,” in ACM SIGCOMM, pp. 411–422, 2013.


[33] C.-Y. Hong, S. Kandula, R. Mahajan, M. Zhang, V. Gill, M. Nanduri, and R. Wattenhofer, “Achieving high utilization with software-driven WAN,” in ACM SIGCOMM, 2013. [34] L. Lamport, “Time, clocks, and the ordering of events in a distributed system,” Communications of the ACM, vol. 21, no. 7, pp. 558–565, 1978. [35] K. Harris, “An application of IEEE 1588 to industrial automation,” in International IEEE Symposium on Precision Clock Synchronization for Measurement Control and Communication (ISPCS), 2008. [36] IEEE, “Time-Sensitive Networking Task Group,” http://www.ieee802.org/1/pages/ tsn.html, 2012. [37] P. Moreira, J. Serrano, T. Wlostowski, P. Loschmidt, and G. Gaderer, “White rabbit: Subnanosecond timing distribution over ethernet,” in International IEEE Symposium on Precision Clock Synchronization for Measurement Control and Communication (ISPCS), 2009. [38] G. R. Ash, “Use of a trunk status map for real-time DNHR,” in International TeleTraffic Congress (ITC-11), 1985. [39] K. Watsen, “Conditional Enablement of Configuration Nodes,” draft-kwatsen-conditionalenablement, work in progress, IETF, 2013. [40] A. Greenberg, G. Hjalmtysson, D. A. Maltz, A. Myers, J. Rexford, G. Xie, H. Yan, J. Zhan, and H. Zhang, “A clean slate 4D approach to network control and management,” ACM SIGCOMM Computer Communication Review, vol. 35, no. 5, pp. 41–54, 2005. [41] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner, “Openflow: enabling innovation in campus networks,” ACM SIGCOMM Computer Communication Review, vol. 38, no. 2, pp. 69–74, 2008. [42] Open Networking Foundation, “Openflow management and configuration protocol (ofconfig 1.2),” 2014.


[43] B. Dickerson, “Time in the power industry: how and why we use it,” Arbiter Systems, technical report, http://www.arbiter.com/ftp/datasheets/TimeInThePowerIndustry.pdf, 2010.
[44] H. Li, “IEEE 1588 time synchronization deployment for mobile backhaul in China Mobile,” keynote presentation, International IEEE Symposium on Precision Clock Synchronization for Measurement Control and Communication (ISPCS), 2014.
[45] IEEE Std C37.238, “IEEE Standard Profile for Use of IEEE 1588 Precision Time Protocol in Power System Applications,” IEEE, 2011.
[46] “ONF SDN Product Directory,” https://www.opennetworking.org/sdn-resources/onf-products-listing, January, 2015.
[47] “Broadcom BCM56840+ Switching Technology,” product brief, http://www.broadcom.com/collateral/pb/56840_PLUS-PB00-R.pdf, 2011.
[48] “Broadcom BCM56850 StrataXGS Trident II Switching Technology,” product brief, http://www.broadcom.com/collateral/pb/56850-PB03-R.pdf, 2013.
[49] “Centec CTC6048 Advanced Product Brief,” http://www.valleytalk.org/wp-content/uploads/2010/12/CTC6048-Product-Brief_v2.5.pdf, 2010.
[50] “Intel Ethernet Switch FM5000/FM6000 Datasheet,” http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/ethernet-switch-fm5000-fm6000-datasheet.pdf, 2014.
[51] “LSI/Avago Axxia Communication Processor AXM5500 Family,” https://www.lsi.com/downloads/Public/Communication%20Processors/Axxia%20Communication%20Processor/LSI_PB_AXM5500_E.pdf, 2014.
[52] “Mellanox SwitchX-2 Product Brief,” http://www.mellanox.com/related-docs/prod_silicon/SwitchX-2_EN_SDN.pdf, 2013.
[53] “Tilera TILE-Gx8072 Processor Product Brief,” http://www.tilera.com/sites/default/files/productbriefs/TILE-Gx8072_PB041-04_WEB.pdf, 2014.
[54] “Marvell ARMADA XP Functional Spec,” http://www.marvell.com/embedded-processors/armada-xp/assets/ARMADA-XP-Functional-SpecDatasheet.pdf, 2014.
[55] “Marvell Xelerated HX4100,” product brief, http://www.marvell.com/network-processors/assets/Xelerated_HX4100-02_product%20brief_v8.pdf, 2014.
[56] “Interface to the Routing System (I2RS) working group,” https://datatracker.ietf.org/wg/i2rs/charter/, IETF, 2016.
[57] “Forwarding and Control Element Separation (ForCES) working group,” https://datatracker.ietf.org/wg/forces/charter/, IETF, 2016.
[58] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat, “Hedera: Dynamic flow scheduling for data center networks,” in NSDI, 2010.
[59] S. Jain et al., “B4: Experience with a globally-deployed software defined WAN,” in ACM SIGCOMM, 2013.
[60] Open Networking Foundation, “OpenFlow-enabled mobile and wireless networks,” ONF Solution Brief, 2013.
[61] Metro Ethernet Forum, “Mobile backhaul - phase 2 implementation agreement,” MEF 22.1, 2012.
[62] Metro Ethernet Forum, “Carrier ethernet class of service - phase 2 implementation agreement,” MEF 23.1, 2012.
[63] T. G. Edwards and W. Belkin, “Using SDN to facilitate precisely timed actions on real-time data streams,” in ACM SIGCOMM Workshop on Hot topics in Software Defined Networks (HotSDN), 2014.


[64] C. Rotsos, N. Sarrar, S. Uhlig, R. Sherwood, and A. W. Moore, “OFLOPS: An open framework for openflow switch evaluation,” in Passive and Active Measurement, 2012.
[65] S. Kandula, I. Menache, R. Schwartz, and S. R. Babbula, “Calendaring for wide area networks,” in ACM SIGCOMM, 2014.
[66] N. Pippenger, “On rearrangeable and non-blocking switching networks,” Journal of Computer and System Sciences, vol. 17, no. 2, pp. 145–162, 1978.
[67] Open Networking Foundation, “Openflow switch specification,” Version 1.5.0, 2015.
[68] Open Networking Foundation, “Openflow extensions 1.3.x package 2,” 2015.
[69] “Time4 source code,” https://github.com/TimedSDN, 2014.
[70] J. M. Kleinberg, “Single-source unsplittable flow,” in 37th Annual Symposium on Foundations of Computer Science, FOCS ’96, Burlington, Vermont, USA, 14-16 October, 1996, pp. 68–77, 1996.
[71] T. Mizrahi and Y. Moses, “Time Capability in NETCONF,” RFC 7758, IETF, 2016.
[72] T. Mizrahi and Y. Moses, “Time4: Time for SDN,” technical report, arXiv preprint arXiv:1505.03421v2, 2016.
[73] T. Mizrahi and Y. Moses, “Time-based updates in OpenFlow: A proposed extension to the OpenFlow protocol,” technical report, CCIT Report #835, July 2013, EE Pub No. 1792, Technion – Israel Institute of Technology, 2013.
[74] “CPqD OFSoftswitch,” https://github.com/CPqD/ofsoftswitch13, 2014.
[75] “Precision Time Protocol daemon,” version 2.3.0, http://ptpd.sourceforge.net/, 2013.
[76] B. Lantz, B. Heller, and N. McKeown, “A network in a laptop: rapid prototyping for software-defined networks,” in HotNets, 2010.


[77] A. Tavakoli, M. Casado, T. Koponen, and S. Shenker, “Applying NOX to the datacenter.,” in HotNets, 2009. [78] A. Tootoonchian, S. Gorbunov, Y. Ganjali, M. Casado, and R. Sherwood, “On controller performance in software-defined networks,” in USENIX Workshop on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services (Hot-ICE), 2012. [79] SMPTE Standard 2022-6, “Transport of High Bit Rate Media Signals Over IP Networks (HBRMT),” 2012. [80] “Iperf - The TCP/UDP Bandwidth Measurement Tool,” https://iperf.fr/, 2014. [81] ITU-T Y.1563, “Ethernet frame transfer and availability performance,” ITU-T, 2009. [82] D. Arnold and H. Gerstung, “Enterprise Profile for the Precision Time Protocol With Mixed Multicast and Unicast Messages,” draft-ietf-tictoc-ptp-enterprise-profile-05, work in progress, IETF, 2015. [83] N. P. Katta, J. Rexford, and D. Walker, “Incremental consistent updates,” in ACM SIGCOMM workshop on Hot topics in Software Defined Networks (HotSDN), 2013. [84] S. K. Fayazbakhsh, L. Chiang, V. Sekar, M. Yu, and J. C. Mogul, “Enforcing networkwide policies in the presence of dynamic middlebox actions using FlowTags,” in NSDI, 2014. [85] M. Canini, P. Kuznetsov, D. Levin, and S. Schmid, “A distributed and robust SDN control plane for transactional network updates,” in IEEE INFOCOM, 2015. [86] Z. Guo, M. Su, Y. Xu, Z. Duan, L. Wang, S. Hui, and H. J. Chao, “Improving the performance of load balancing in software-defined networks through load variance-based synchronization,” Computer Networks, vol. 68, pp. 95–109, 2014. [87] X. Wen, C. Diao, X. Zhao, Y. Chen, L. E. Li, B. Yang, and K. Bo, “Compiling minimum incremental update for modular SDN languages,” in ACM SIGCOMM workshop on Hot topics in Software Defined Networks (HotSDN), 2014.


[88] J. C. Corbett et al., “Spanner: Google’s globally-distributed database,” in Proceedings of OSDI, vol. 1, 2012. [89] L. Lamport and P. M. Melliar-Smith, “Synchronizing clocks in the presence of faults,” Journal of the ACM (JACM), vol. 32, no. 1, pp. 52–78, 1985. [90] A. Mukherjee, “On the dynamics and significance of low frequency components of internet load,” Technical Reports (CIS), p. 300, 1992. [91] O. Gurewitz, I. Cidon, and M. Sidi, “One-way delay estimation using network-wide measurements,” IEEE/ACM Transactions on Networking (TON), vol. 14, no. SI, pp. 2710– 2724, 2006. [92] Metro Ethernet Forum, “Ethernet services attributes - phase 3,” MEF 10.3, 2013. [93] Network Test Inc., “Virtual Chassis Performance: Juniper Networks EX Series Ethernet Switches,” white paper, http://www.networktest.com/, 2010. [94] “Connectivity Fault Management,” IEEE Std 802.1ag, 2007. [95] J. Mirkovic and T. Benzel, “Teaching cybersecurity with DeterLab,” Security & Privacy, IEEE, vol. 10, no. 1, pp. 73–76, 2012. [96] Cisco, “Cisco’s Massively Scalable Data Center,” http://www.cisco.com/c/dam/en/ us/td/docs/solutions/Enterprise/Data_Center/MSDC/1-0/MSDC_AAG_1.pdf, 2010. [97] ITU-T G.144, “One-way transmission time,” ITU-T, 2003. [98] D. Kreutz, F. Ramos, and P. Verissimo, “Towards secure and dependable software-defined networks,” in ACM SIGCOMM workshop on Hot topics in Software Defined Networks (HotSDN), 2013. [99] S. Scott-Hayward, G. O’Callaghan, and S. Sezer, “SDN security: A survey,” in IEEE SDN For Future Networks and Services (SDN4FNS), 2013.


[100] M. Ambrosin, M. Conti, F. D. Gaspari, and R. Poovendran, “LineSwitch: Efficiently managing switch flow in software-defined networking while effectively tackling DoS attacks,” in ACM Symposium on Information, Computer and Communications Security, ASIA CCS, 2015. [101] T. Mizrahi, “Security requirements of time protocols in packet switched networks,” drafttictoc-security-requirements, work in progress, IETF, 2014. [102] A. R. Curtis, J. C. Mogul, J. Tourrilhes, P. Yalagandula, P. Sharma, and S. Banerjee, “Devoflow: scaling flow management for high-performance networks,” in ACM SIGCOMM Computer Communication Review, vol. 41, pp. 254–265, ACM, 2011. [103] J. Naous, D. Erickson, G. A. Covington, G. Appenzeller, and N. McKeown, “Implementing an OpenFlow switch on the NetFPGA platform,” in ACM/IEEE ANCS, 2008. [104] F. Long, Z. Sun, Z. Zhang, H. Chen, and L. Liao, “Research on TCAM-based OpenFlow switch platform,” in IEEE ICSAI, 2012. [105] Renesas, “R8A20410BG QUAD-Search TCAM,” datasheet, 2010. [106] V. Srinivasan, G. Varghese, S. Suri, and M. Waldvogel, “Fast and scalable layer four switching,” in ACM SIGCOMM, 1998. [107] A. Bremler-Barr and D. Hendler, “Space-efficient TCAM-based classification using gray coding,” IEEE Trans. Computers, vol. 61, no. 1, pp. 18–30, 2012. [108] T. Mizrahi and Y. Moses, “Software Defined Networks: It’s about time,” in IEEE INFOCOM, 2016. [109] C. Pignataro, “Minutes for IETF87, SDNRG meeting,” meeting minutes, IETF, 2013. [110] O. Rottenstreich, R. Cohen, D. Raz, and I. Keslassy, “Exact worst case TCAM rule expansion,” IEEE Trans. Computers, vol. 62, no. 6, pp. 1127–1140, 2013.


[111] H. Liu, “Efficient mapping of range classifier into ternary-cam,” in IEEE Hot Interconnects, 2002. [112] H. Che, Z. Wang, K. Zheng, and B. Liu, “DRES: Dynamic range encoding scheme for TCAM coprocessors,” IEEE Trans. Computers, vol. 57, no. 7, pp. 902–915, 2008. [113] C. R. Meiners, A. X. Liu, and E. Torng, “TCAM Razor: A systematic approach towards minimizing packet classifiers in TCAMs,” in IEEE ICNP, 2007. [114] A. Bremler-Barr, D. Hay, and D. Hendler, “Layered interval codes for TCAM-based classification,” Computer Networks, vol. 56, no. 13, pp. 3023–3039, 2012. [115] K. Kogan, S. I. Nikolenko, O. Rottenstreich, W. Culhane, and P. Eugster, “SAX-PAC (Scalable And eXpressive PAcket Classification),” in ACM SIGCOMM, 2014. [116] E. Norige, A. X. Liu, and E. Torng, “A ternary unification framework for optimizing TCAM-based packet classification systems,” in ACM/IEEE ANCS, 2013. [117] C. R. Meiners, A. X. Liu, E. Torng, and J. Patel, “Split: Optimizing space, power, and throughput for TCAM-based classification,” in ACM/IEEE ANCS, 2011. [118] C. R. Meiners, A. X. Liu, and E. Torng, “Bit Weaving: A non-prefix approach to compressing packet classifiers in TCAMs,” IEEE/ACM Trans. Networking, vol. 20, no. 2, pp. 488–500, 2012. [119] D. Mills, J. Martin, J. Burbank, and W. Kasch, “RFC 5905: Network time protocol version 4: Protocol and algorithms specification,” IETF, 2010. [120] O. Rottenstreich, I. Keslassy, A. Hassidim, H. Kaplan, and E. Porat, “On finding an optimal TCAM encoding scheme for packet classification,” in IEEE INFOCOM, 2013. [121] “Marvell Prestera 98DX4251 Product Brief,” http://www.marvell.com/switching/ assets/Marvell_Prestera_98DX4251-02_product_brief_final2.pdf, 2013.


[122] ITU-T G.8275.1, “Precision time protocol telecom profile for phase/time synchronization with full timing support from the network,” 2014.
[123] D. L. Mills, “Internet time synchronization: the network time protocol,” Communications, IEEE Transactions on, 1991.
[124] K. Lakshminarayanan, A. Rangarajan, and S. Venkatachary, “Algorithms for advanced packet classification with ternary cams,” in ACM SIGCOMM, 2005.
[125] “The Internet Toaster,” http://www.livinginternet.com/i/ia_myths_toast.htm.
[126] “Writable MIB Module IESG Statement,” https://www.ietf.org/iesg/statement/writable-mib-module.html, 2014.
[127] M. Bjorklund, “YANG - A Data Modeling Language for the Network Configuration Protocol (NETCONF),” RFC 6020, IETF, 2010.
[128] “Active Internet-Drafts,” https://datatracker.ietf.org/doc/active/, 2015.
[129] B. Claise, “YANG data models statistics,” http://www.claise.be/YANGPageMain.html, 2015.
[130] R. Penno, P. Quinn, D. Zhou, and J. Li, “YANG Data Model for Service Function Chaining,” draft-penno-sfc-yang-13, work in progress, IETF, 2015.
[131] A. Sehgal, V. Perelman, S. Kuryla, and J. Schönwälder, “Management of resource constrained devices in the internet of things,” Communications Magazine, IEEE, vol. 50, no. 12, pp. 144–149, 2012.
[132] J. Schönwälder and A. Sehgal, “Management of the Internet of Things,” http://cnds.eecs.jacobs-university.de/slides/2013-im-iot-management.pdf, 2013.
[133] Metro Ethernet Forum, “Service OAM Fault Management YANG Modules,” MEF 38, 2012.


[134] T. Mizrahi and Y. Moses, “Time Capability in NETCONF,” RFC 7758, IETF, 2016.
[135] D. Levi and J. Schoenwaelder, “Definitions of managed objects for scheduling management operations,” RFC 3231, IETF, 2002.
[136] T. Mizrahi, O. Rottenstreich, and Y. Moses, “TimeFlip: Scheduling network updates with timestamp-based TCAM ranges,” in IEEE INFOCOM, 2015.
[137] M. A. Iverson, F. Özgüner, and L. C. Potter, “Statistical prediction of task execution times through analytic benchmarking for scheduling in a heterogeneous environment,” IEEE Trans. Computers, vol. 48, no. 12, pp. 1374–1379, 1999.
[138] E. Ipek, B. R. de Supinski, M. Schulz, and S. A. McKee, “An approach to performance prediction for parallel applications,” in Euro-Par, 2005.
[139] P. Giusto, G. Martin, and E. A. Harcourt, “Reliable estimation of execution time of embedded software,” in Conference on Design, Automation and Test in Europe (DATE), 2001.
[140] M. A. Iverson, F. Özgüner, and G. J. Follen, “Run-time statistical estimation of task execution times for heterogeneous distributed computing,” in International Symposium on High Performance Distributed Computing (HPDC), 1996.
[141] G. Bontempi and W. Kruijtzer, “A data analysis method for software performance prediction,” in Conference on Design, Automation and Test in Europe (DATE), 2002.
[142] S. Hares and M. Chen, “Summary of I2RS Use Case Requirements,” draft-ietf-i2rs-usecase-reqs-summary-01, work in progress, IETF, 2015.
[143] M. Lipinski, “White Rabbit - Ethernet-based solution for sub-ns synchronization and deterministic, reliable data delivery,” http://maciejlipinski.pl/myPage/docs/presentations/WR_IEEE802-Tutorial-Geneve2013.pdf, 2013.
[144] K. M. Chandy and L. Lamport, “Distributed snapshots: Determining global states of distributed systems,” ACM Trans. Comput. Syst., vol. 3, no. 1, pp. 63–75, 1985.


[145] S. Chisholm and H. Trevino, “NETCONF Event Notifications,” RFC 5277, IETF, 2008. [146] Toaster YANG Module http://www.netconfcentral.org/modulereport/toaster, 2015. [147] “Cron - Linux man page,” http://linux.die.net/man/8/cron, 2015. [148] J. Lundelius and N. A. Lynch, “A new fault-tolerant algorithm for clock synchronization,” in Symposium on Principles of Distributed Computing (PODC), 1984. [149] R. E. Kalman, “A new approach to linear filtering and prediction problems,” Journal of Fluids Engineering, vol. 82, no. 1, pp. 35–45, 1960. [150] A. Papoulis and S. U. Pillai, Probability, random variables, and stochastic processes. Tata McGraw-Hill Education, 2002. [151] OpenYuma https://github.com/OpenClovis/OpenYuma, 2015. [152] “OneClock source code,” https://github.com/TimedSDN/Yuma-Time, 2015. [153] Microsoft Azure https://azure.microsoft.com, 2015. [154] Amazon Web Services http://aws.amazon.com, 2015. [155] D. Carraway, “lookbusy — a synthetic load generator,” https://www.devin.com/ lookbusy, 2015. [156] A. Bierman, M. Bjorklund, and K. Watsen, “RESTCONF Protocol,” draft-ietf-netconfrestconf, work in progress, IETF, 2015. [157] D. Veitch, J. Ridoux, and S. B. Korada, “Robust synchronization of absolute and difference clocks over networks,” IEEE/ACM Transactions on Networking (TON), vol. 17, no. 2, pp. 417–430, 2009.


[158] K. Correll, N. Barendt, and M. Branicky, “Design considerations for software only implementations of the IEEE 1588 precision time protocol,” in Conference on IEEE 1588, pp. 10–12, 2005. [159] Y. Zhao, J. Liu, E. Lee, et al., “A programming model for time-synchronized distributed real-time systems,” in IEEE Real Time and Embedded Technology and Applications Symposium (RTAS), 2007. [160] P. Derler, J. C. Eidson, S. Goose, E. A. Lee, S. Matic, and M. Zimmer, “Using ptides and synchronized clocks to design distributed systems with deterministic system wide timing,” in International IEEE Symposium on Precision Clock Synchronization for Measurement Control and Communication (ISPCS), pp. 41–46, IEEE, 2013. [161] ITU-T G.8265.1/Y.1365.1, “Precision time protocol telecom profile for frequency synchronization,” 2010. [162] J. Y. Halpern, B. Simons, R. Strong, and D. Dolev, “Fault-tolerant clock synchronization,” in PODC, pp. 89–102, ACM, 1984. [163] T. Srikanth and S. Toueg, “Optimal clock synchronization,” Journal of the ACM (JACM), vol. 34, no. 3, pp. 626–645, 1987. [164] D. L. Mills, “Improved algorithms for synchronizing computer network clocks,” IEEE/ACM Transactions on Networking (TON), vol. 3, no. 3, pp. 245–254, 1995. [165] R. Tse and C. Ong, “Proposal for a standardized mechanism to transfer timing information from an ingress port to an egress port of a PTP transparent clock,” International IEEE Symposium on Precision Clock Synchronization for Measurement Control and Communication (ISPCS), special session on proposed revisions of IEEE 1588-2008, 2012. [166] Wireshark http://www.wireshark.org/, 2014. [167] G. M. Garner, “Effect of a frequency perturbation in a chain of syntonized transparent,” tech. rep., 2007.


[168] T. Mizrahi, “Time synchronization security using IPsec and MACsec,” in International IEEE Symposium on Precision Clock Synchronization for Measurement Control and Communication (ISPCS), pp. 38–43, IEEE, 2011.


Abstract

The use of synchronized clocks began in the 19th century, with the British railways. Clock synchronization has evolved considerably since then and has become a mature technology, used in a wide variety of applications ranging from cellular networks to distributed databases.

Network configuration updates are performed routinely, and they must be carried out in a way that minimizes transient effects caused by intermediate states of the network. This challenge is especially critical in Software Defined Networks (SDN), in which the control plane is managed by a central controller that frequently sends configuration updates to the switches in the network. These updates modify the forwarding rules of the switches, and therefore directly affect the way packets traverse the network. The controller must minimize anomalies, such as packet loss or incorrect forwarding, that may occur during an update as a result of temporary inconsistency. Update procedures must be scalable and of low complexity.

This work analyzes the use of time and synchronized clocks as a tool for performing network updates. Although the use of time in distributed systems is an idea that has been studied extensively in previous work, time-triggered changes have never been considered practical in the context of network management, due to the inaccuracy of network time synchronization mechanisms. Prior to this work, network management protocols such as OpenFlow, SNMP, and NETCONF did not use accurate time to schedule network updates. At the same time, network time synchronization protocols have evolved significantly in recent years. The Precision Time Protocol (PTP) makes it possible to synchronize clocks in a network with high accuracy, typically on the order of 1 microsecond. Accurate time has therefore become a useful and accessible tool that can be used to coordinate network updates.

This work makes four main contributions, as follows.

First, accurate time is a powerful tool for coordinating network devices in centralized architectures such as SDN; the use of time enables not only simultaneous updates of multiple switches in the network, but also multi-phase updates, in which each phase is scheduled for a different time. It is demonstrated that the use of accurate time improves network performance during updates, making it possible to reduce packet loss during updates, to decrease resource overhead, and to improve the scalability of frequent updates.

Second, this work defines extensions to the standard network management protocols OpenFlow and NETCONF that enable time-triggered updates. Following this work, the extension defined here for the OpenFlow protocol has become part of version 1.5 of the protocol, and the extension defined for the NETCONF protocol has been published as an RFC of the Internet Engineering Task Force (IETF).

Third, one of the main challenges in using accurate time is the ability to accurately execute an operation that is scheduled for a given time. This work presents a practical and efficient method for accurately executing scheduled operations in network switches, using time-based ranges in Ternary Content Addressable Memories (TCAMs).

Finally, network synchronization protocols such as PTP are distributed by nature, in contrast to the centralized approach of SDN. This work presents a novel synchronization scheme that is adapted to the centralized SDN environment.

There is significant potential for future work that builds on this dissertation. First, this work focused on centralized network architectures such as SDN. It would be interesting to analyze to what extent the performance of distributed network protocols can be improved by using synchronized clocks and accurate time. Second, this work has a high potential to be applied in real-world networks; the core of the work has been incorporated as an extension to widely used network protocols, and the experimental results presented here demonstrate the feasibility of the idea and the benefit that can be derived from it. Moreover, open-source prototypes are available for the main building blocks defined in this work, allowing researchers and developers to experiment with the main ideas. The ideas presented here can be expected to be implemented in commercial network equipment and to come into use in real networks.
