
Load Shedding in Network Monitoring Applications

Pere Barlet-Ros
Advisor: Dr. Gianluca Iannaccone
Co-advisor: Prof. Josep Solé-Pareta

Universitat Politècnica de Catalunya (UPC)
Departament d'Arquitectura de Computadors

December 15, 2008


Outline

1. Introduction (motivation, related work, contributions)
2. Prediction and Load Shedding Scheme (case study: Intel CoMo; prediction methodology; load shedding scheme; evaluation and operational results)
3. Fairness of Service and Nash Equilibrium
4. Custom Load Shedding
5. Conclusions and Future Work




Motivation

Network monitoring is crucial for operating data networks
- Traffic engineering, network troubleshooting, anomaly detection, ...

Monitoring systems are prone to dramatic overload situations
- Link speeds, anomalous traffic, the bursty nature of traffic, ...
- Complexity of traffic analysis methods

Overload situations lead to uncontrolled packet loss
- Severe and unpredictable impact on the accuracy of applications
- ... precisely when results are most valuable

Load Shedding Scheme
- Efficiently handle extreme overload situations
- Over-provisioning is not feasible


Related Work

Load shedding in data stream management systems (DSMS)
- Examples: Aurora, STREAM, TelegraphCQ, Borealis, ...
- Based on declarative query languages (e.g., CQL)
- Small set of operators with known (and constant) cost
- Maximize an aggregate performance metric (utility or throughput)

Limitations
- Restrict the type of metrics and possible uses
- Assume explicit knowledge of the operators' cost and selectivity
- Not suitable for non-cooperative environments

Resource management in network monitoring systems
- Restricted to a pre-defined set of metrics
- Limit the amount of allocated resources in advance


Contributions

1. Prediction method
   - Operates without knowledge of application cost and implementation
   - Does not rely on a specific model for the incoming traffic

2. Load shedding scheme
   - Anticipates overload situations and avoids packet loss
   - Relies on packet and flow sampling (equal sampling rates)

Extensions:

3. Packet-based scheduler
   - Applies different sampling rates to different queries
   - Ensures fairness of service with non-cooperative applications

4. Support for custom-defined load shedding methods
   - Safely delegates load shedding to non-cooperative applications
   - Still ensures robustness and fairness of service





Case Study: Intel CoMo

CoMo (Continuous Monitoring)[1]
- Open-source passive monitoring system
- Framework to develop and execute network monitoring applications
- Open (shared) network monitoring platform

Traffic queries are defined as plug-in modules written in C (an illustrative sketch follows below)
- Contain complex computations

Traffic queries are black boxes
- Arbitrary computations and data structures
- Load shedding cannot use knowledge of the queries

[1] http://como.sourceforge.net
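Purely as an illustration of the black-box nature of queries, a plug-in might look roughly like the sketch below; the callback names and types are hypothetical and are not CoMo's actual plug-in interface. The point is that the core only hands packets to opaque callbacks and cannot see what they compute.

    #include <stdint.h>
    #include <stdlib.h>

    struct pkt { uint64_t ts; uint32_t len; const uint8_t *payload; };

    /* Arbitrary private state, invisible to the monitoring core. */
    struct query_state { uint64_t pkts, bytes; };

    static void *query_init(void)
    {
        return calloc(1, sizeof(struct query_state));
    }

    /* Called once per captured packet; may run arbitrarily complex code. */
    static void query_update(void *self, const struct pkt *p)
    {
        struct query_state *s = self;
        s->pkts++;
        s->bytes += p->len;
        /* ...pattern matching, flow tables, per-flow statistics, etc... */
    }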


Load Shedding Approach

Working Scenario
- Monitoring system supporting multiple arbitrary queries
- Single resource: CPU cycles

Approach: real-time modeling of the queries' CPU usage
1. Find a correlation between traffic features and CPU usage
   - Features are query agnostic, with deterministic worst-case cost (a feature-extraction sketch follows below)
2. Exploit the correlation to predict the CPU load
3. Use the prediction to decide the sampling rate
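For illustration, here is a minimal sketch (not CoMo code) of how such query-agnostic features could be extracted in a single pass over a batch; the struct layout and the fixed-size hash table are assumptions, chosen only to show that each feature has a deterministic O(1) per-packet worst-case cost.

    #include <stdint.h>
    #include <string.h>

    #define FLOW_SLOTS (1u << 16)   /* fixed table size bounds the worst-case cost */

    struct ftuple   { uint32_t src, dst; uint16_t sport, dport; uint8_t proto; };
    struct features { uint64_t packets, bytes, new_5tuple_flows; };

    /* Illustrative hash; real code would hash the individual header fields. */
    static uint32_t hash_5tuple(const struct ftuple *t)
    {
        const uint8_t *p = (const uint8_t *)t;
        uint32_t h = 2166136261u;                            /* FNV-1a */
        for (size_t i = 0; i < sizeof(*t); i++) { h ^= p[i]; h *= 16777619u; }
        return h;
    }

    /* One pass over the batch: every feature costs O(1) per packet. */
    static void extract_features(const struct ftuple *tuples, const uint32_t *lens,
                                 size_t n, struct features *f)
    {
        static uint8_t seen[FLOW_SLOTS];
        memset(seen, 0, sizeof(seen));
        memset(f, 0, sizeof(*f));

        for (size_t i = 0; i < n; i++) {
            f->packets++;
            f->bytes += lens[i];
            uint32_t slot = hash_5tuple(&tuples[i]) % FLOW_SLOTS;
            if (!seen[slot]) {         /* approximate: hash collisions undercount */
                seen[slot] = 1;
                f->new_5tuple_flows++;
            }
        }
    }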


System Overview

Figure: Prediction and Load Shedding Subsystem



Traffic Features vs CPU Usage

Figure: CPU usage compared to the number of packets, bytes and 5-tuple flows over time


Traffic Features vs CPU Usage

Figure: CPU cycles versus the new_5tuple_hashes traffic feature


Appendix: Backup Slides


Feature selection (FCBF)[6]
- If the correlation between two predictors exceeds the threshold r, the predictor lower in the list is removed (see the sketch below)
- Cost: O(n p log p)

[6] L. Yu and H. Liu. Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution, Proc. of ICML, 2003.
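A simplified sketch of the redundancy-removal step described above, not the full FCBF algorithm of [6]: predictors are ranked by their correlation with the target cost, and a predictor is dropped when a higher-ranked kept predictor correlates with it above the threshold r. The correlation() helper uses absolute Pearson correlation as a stand-in for the symmetric uncertainty measure used by FCBF; the fixed MAX_FEAT bound is an assumption for illustration.

    #include <math.h>
    #include <stddef.h>

    /* Absolute Pearson correlation, stand-in for symmetric uncertainty. */
    static double correlation(const double *x, const double *y, size_t n)
    {
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (size_t i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
        }
        double cov = sxy - sx * sy / n;
        double vx = sxx - sx * sx / n, vy = syy - sy * sy / n;
        return (vx > 0 && vy > 0) ? fabs(cov / sqrt(vx * vy)) : 0.0;
    }

    #define MAX_FEAT 64   /* assumes at most 64 candidate predictors */

    /* Returns how many predictors are kept; keep[i] is cleared for removed ones. */
    static size_t filter_predictors(const double *feat[], size_t n_feat, size_t n,
                                    const double *target, double r, int keep[])
    {
        size_t rank[MAX_FEAT];
        double rel[MAX_FEAT];

        for (size_t i = 0; i < n_feat; i++) {
            rel[i] = correlation(feat[i], target, n);   /* relevance to CPU cost */
            rank[i] = i;
            keep[i] = 1;
        }
        /* sort indices by decreasing relevance (selection sort is enough here) */
        for (size_t i = 0; i + 1 < n_feat; i++)
            for (size_t j = i + 1; j < n_feat; j++)
                if (rel[rank[j]] > rel[rank[i]]) {
                    size_t t = rank[i]; rank[i] = rank[j]; rank[j] = t;
                }

        size_t kept = 0;
        for (size_t i = 0; i < n_feat; i++) {
            size_t fi = rank[i];
            if (!keep[fi]) continue;
            kept++;
            for (size_t j = i + 1; j < n_feat; j++) {
                size_t fj = rank[j];
                if (keep[fj] && correlation(feat[fi], feat[fj], n) > r)
                    keep[fj] = 0;   /* the predictor lower in the list is removed */
            }
        }
        return kept;
    }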

Resource Usage Measurements

Time-stamp counter (TSC)
- 64-bit counter incremented by the processor at each clock cycle
- rdtsc: reads the TSC (a measurement sketch follows below)

Sources of measurement noise
1. Instruction reordering: serializing instruction (e.g., cpuid)
2. Context switches: discard the sample from the MLR history (rusage structure, sched_setscheduler with SCHED_FIFO)
3. Disk accesses: query memory accesses compete with CoMo disk accesses
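A minimal sketch of cycle measurement with the TSC, assuming an x86 processor and GCC-style inline assembly; cpuid is issued first as a serializing instruction so that out-of-order execution does not blur the measurement boundary. The run_query() call in the usage comment stands for whatever query code is being charged.

    #include <stdint.h>

    static inline uint64_t read_tsc(void)
    {
        uint32_t lo, hi, eax = 0;
        /* cpuid serializes the pipeline before the TSC is read */
        __asm__ __volatile__("cpuid" : "+a"(eax) :: "ebx", "ecx", "edx", "memory");
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Usage: charge a query for the cycles it consumes on one batch.
     *   uint64_t start = read_tsc();
     *   run_query(batch);
     *   uint64_t query_cycles = read_tsc() - start;
     */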

Prediction parameters (n and FCBF threshold)

Figure: Prediction error versus cost as a function of the amount of history used to compute the Multiple Linear Regression (left) and as a function of the Fast Correlation-Based Filter threshold (right)

Prediction parameters (broken down by query)

Figure: Prediction error broken down by query (application, counter, pattern-search, top-k, trace, flows, high-watermark) as a function of the amount of history used to compute the Multiple Linear Regression (left) and as a function of the Fast Correlation-Based Filter threshold (right)

Prediction Accuracy (CESCA-II trace)

Figure: Prediction error over time (5 executions, 7 queries); average error: 0.012364, max error: 0.13867

Query              | Mean   | Stdev  | Selected features
application        | 0.0110 | 0.0095 | packets, bytes
counter            | 0.0048 | 0.0066 | packets
flows              | 0.0319 | 0.0302 | new dst-ip, dst-port, proto
high-watermark     | 0.0064 | 0.0077 | packets
pattern-search     | 0.0198 | 0.0169 | bytes
top-k destinations | 0.0169 | 0.0267 | packets
trace              | 0.0090 | 0.0137 | bytes, packets

Prediction Accuracy (CESCA-I trace)

Figure: Prediction error over time (5 executions, 7 queries); average error: 0.0065407, max error: 0.19061

Query              | Mean   | Stdev  | Selected features
application        | 0.0068 | 0.0060 | repeated 5-tuple, packets
counter            | 0.0046 | 0.0053 | packets
flows              | 0.0252 | 0.0203 | new dst-ip, dst-port, proto
high-watermark     | 0.0059 | 0.0063 | packets
pattern-search     | 0.0098 | 0.0093 | packets
top-k destinations | 0.0136 | 0.0183 | new 5-tuple, packets
trace              | 0.0092 | 0.0132 | packets

Prediction Overhead

Prediction phase   | Overhead
Feature extraction | 9.070%
FCBF               | 1.702%
MLR                | 0.201%
TOTAL              | 10.973%

Table: Prediction overhead (5 executions)

Load Shedding Scheme[7]

When to shed load
- When the prediction exceeds the available cycles
- avail_cycles = (0.1 × CPU frequency) − overhead
- Overhead is measured using the time-stamp counter (TSC)
- Corrected according to the prediction error and buffer space

How and where to shed load
- Packet and flow sampling (hash based), as sketched below
- The same sampling rate is applied to all queries

How much load to shed
- Maximum sampling rate that keeps CPU usage < avail_cycles:
  srate = avail_cycles / pred_cycles

[7] P. Barlet-Ros et al. "On-line Predictive Load Shedding for Network Monitoring", Proc. of IFIP-TC6 Networking, 2007.
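A rough sketch of these two decisions in C. The thresholding scheme and function names are illustrative assumptions; the point of hashing the 5-tuple is that a flow is kept or dropped as a whole, whereas packet sampling would decide per packet instead.

    #include <stdint.h>

    /* Flow sampling: keep the flow if its 5-tuple hash falls below srate * 2^32. */
    static int keep_flow(uint32_t flow_hash, double srate)
    {
        return (double)flow_hash < srate * 4294967296.0;
    }

    /* srate chosen so that the predicted usage fits the available cycles. */
    static double sampling_rate(double avail_cycles, double pred_cycles)
    {
        double srate = avail_cycles / pred_cycles;
        if (srate > 1.0) srate = 1.0;
        return srate;
    }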


Load Shedding Algorithm

Load shedding algorithm (simplified version):

    pred_cycles = 0;
    foreach qi in Q do
        fi = feature_extraction(bi);
        si = feature_selection(fi, hi);
        pred_cycles += mlr(fi, si, hi);
    if avail_cycles < pred_cycles × (1 + error) then
        foreach qi in Q do
            bi = sampling(bi, qi, srate);
            fi = feature_extraction(bi);
    foreach qi in Q do
        query_cycles_i = run_query(bi, qi, srate);
        hi = update_mlr_history(hi, fi, query_cycles_i);

(The mlr() prediction step is sketched below.)
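To make the mlr() step concrete, here is a sketch of how the prediction could be applied once regression coefficients have been fitted over the history window (e.g., by ordinary least squares); the struct layout and the fixed maximum number of features are assumptions for illustration.

    /* Predicted cycles for one query: intercept plus a weighted sum of the
       selected traffic features, with weights fitted over the recent history
       of (features, measured cycles) pairs. */
    #define MAX_FEATURES 16

    struct mlr_model {
        double intercept;
        double coef[MAX_FEATURES];   /* one weight per selected feature */
        int    n_selected;
    };

    static double mlr_predict(const struct mlr_model *m, const double *features)
    {
        double pred = m->intercept;
        for (int j = 0; j < m->n_selected; j++)
            pred += m->coef[j] * features[j];
        return pred;                  /* predicted CPU cycles for this batch */
    }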

Reactive Load Shedding

Sampling rate at time t:

    srate_t = min(1, max(α, srate_{t−1} × (avail_cycles_t − delay) / consumed_cycles_{t−1}))

- consumed_cycles_{t−1}: cycles used in the previous batch
- srate_{t−1}: sampling rate applied to the previous batch
- delay: cycles by which avail_cycles_{t−1} was exceeded
- α: minimum sampling rate to be applied
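A direct transcription of the reactive rule above into C, mainly to show how few inputs it needs compared to the predictive scheme; variable names follow the slide.

    /* Reactive fallback: adjust the sampling rate from the previous batch's
       actual consumption instead of a prediction. */
    static double reactive_srate(double srate_prev, double avail_cycles,
                                 double consumed_prev, double delay, double alpha)
    {
        double s = srate_prev * (avail_cycles - delay) / consumed_prev;
        if (s < alpha) s = alpha;     /* never go below the minimum rate */
        if (s > 1.0)   s = 1.0;       /* cannot sample more than everything */
        return s;
    }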

Alternative Prediction Approaches (DDoS attacks)[8]

Figure: Prediction accuracy under DDoS attacks (flows query); actual vs. predicted CPU cycles and relative error over time for (a) EWMA, (b) SLR and (c) MLR

[8] P. Barlet-Ros et al. "Robust Resource Allocation for Online Network Monitoring", Proc. of IT-NEWS (QoSIP), 2008.

Impact of DDoS Attacks[9]

Figure: CPU usage and accuracy under DDoS attacks (flows query); (a) CPU usage with and without load shedding against the CPU threshold (5% bounds), (b) error in the query results with no load shedding, packet sampling and flow sampling

[9] P. Barlet-Ros et al. "Robust Resource Allocation for Online Network Monitoring", Proc. of IT-NEWS (QoSIP), 2008.

Notation and Definitions

Symbol  | Definition
Q       | Set of q continuous traffic queries
C       | System capacity in CPU cycles
d_q     | Cycles demanded by the query q ∈ Q (prediction)
m_q     | Minimum sampling rate constraint of the query q ∈ Q
c       | Vector of allocated cycles
c_q     | Cycles allocated to the query q ∈ Q
p       | Vector of sampling rates
p_q     | Sampling rate applied to the query q ∈ Q
a_q     | Action of the query q ∈ Q
a_{−q}  | Actions of all queries in Q except a_q
u_q     | Payoff function of the query q ∈ Q

Max-min fair share allocation policy
A vector of allocated cycles c is max-min fair if it is feasible and, for each q ∈ Q and every feasible c̄ for which c_q < c̄_q, there is some q′ with c_q ≥ c_{q′} and c_{q′} > c̄_{q′} (a progressive-filling sketch follows below).
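A sketch of one standard way to compute a max-min fair allocation under the above definition, by progressive filling: remaining capacity is granted in rounds, in equal shares, to the queries whose demand is not yet satisfied. The function signature is an illustrative assumption; the slides only use the allocation as an abstract mmfs term.

    #include <stddef.h>

    /* Progressive filling: alloc[q] <= demand[q], total at most capacity, and no
       allocation can be increased without decreasing a smaller or equal one. */
    static void maxmin_fair_share(const double *demand, double *alloc,
                                  size_t n, double capacity)
    {
        size_t unsat = n;
        for (size_t q = 0; q < n; q++) alloc[q] = 0.0;

        while (unsat > 0 && capacity > 1e-9) {
            double share = capacity / unsat;   /* equal share for unsatisfied queries */
            double used = 0.0;
            unsat = 0;
            for (size_t q = 0; q < n; q++) {
                double want = demand[q] - alloc[q];
                if (want <= 0.0) continue;
                double give = (want < share) ? want : share;
                alloc[q] += give;
                used += give;
                if (alloc[q] < demand[q]) unsat++;  /* still wants more next round */
            }
            capacity -= used;
        }
    }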

Nash Equilibrium

Payoff function

    u_q(a_q, a_{−q}) =
        a_q + mmfs_q( C − Σ_{i: u_i > 0} a_i ),   if Σ_{i: a_i ≤ a_q} a_i ≤ C
        0,                                         if Σ_{i: a_i ≤ a_q} a_i > C

Definition: Nash Equilibrium (NE)
A NE is an action profile a∗ with the property that no player i can do better by choosing an action different from a∗_i, given that every other player j adheres to a∗_j.

Theorem
Our resource allocation game has a single NE, in which all players demand C/|Q| cycles.

Proof

a∗ is a NE: a∗ is a NE if u_i(a∗) ≥ u_i(a_i, a∗_{−i}) for every player i and action a_i, when all other players keep their actions fixed to C/|Q|
- a_i > C/|Q|: u_i(a_i, a∗_{−i}) = 0
- a_i < C/|Q|: u_i(a_i, a∗_{−i}) ≤ u_i(a∗)

No other action profile a is a NE:
- Σ_i a_i > C: the queries with the largest demands have an incentive to deviate and obtain a non-zero payoff
- Σ_i a_i < C: queries have an incentive to capture the spare cycles
- Σ_i a_i = C and ∃i: a_i ≠ C/|Q|: incentive to disable the query with the largest demand

Accuracy Function

Figure: Accuracy of a generic query as a function of the sampling rate (accuracy 1 − ε_q at the minimum sampling rate constraint m_q)

Accuracy with Minimum Constraints

Query          | m_q  | no_lshed   | reactive   | eq_srates  | mmfs_cpu   | mmfs_pkt
application    | 0.03 | 0.57 ±0.50 | 0.81 ±0.40 | 0.99 ±0.04 | 1.00 ±0.00 | 1.00 ±0.03
autofocus      | 0.69 | 0.00 ±0.00 | 0.00 ±0.00 | 0.05 ±0.12 | 0.97 ±0.06 | 0.98 ±0.04
counter        | 0.03 | 0.00 ±0.00 | 0.02 ±0.12 | 1.00 ±0.00 | 1.00 ±0.00 | 0.99 ±0.01
flows          | 0.05 | 0.00 ±0.00 | 0.66 ±0.46 | 0.99 ±0.01 | 0.95 ±0.07 | 0.95 ±0.06
high-watermark | 0.15 | 0.62 ±0.48 | 0.98 ±0.01 | 0.98 ±0.01 | 1.00 ±0.01 | 0.97 ±0.02
pattern-search | 0.10 | 0.66 ±0.08 | 0.63 ±0.18 | 0.69 ±0.07 | 0.20 ±0.08 | 0.41 ±0.08
super-sources  | 0.93 | 0.00 ±0.00 | 0.00 ±0.00 | 0.00 ±0.00 | 0.95 ±0.04 | 0.95 ±0.04
top-k          | 0.57 | 0.42 ±0.50 | 0.67 ±0.47 | 0.96 ±0.09 | 0.99 ±0.03 | 0.96 ±0.07
trace          | 0.10 | 0.66 ±0.08 | 0.63 ±0.18 | 0.68 ±0.01 | 0.64 ±0.17 | 0.41 ±0.08

Table: Sampling rate constraints (m_q) and average accuracy (mean ± stdev) with K = 0.5

Example of Minimum Sampling Rates

Figure: Accuracy (1 − ε_q) as a function of the sampling rate (p_q) for the high-watermark, top-k and p2p-detector queries using packet sampling

Results: Performance Evaluation

Figure: Average (left) and minimum (right) accuracy with fixed m_q constraints for no_lshed, reactive, eq_srates, mmfs_cpu and mmfs_pkt at increasing overload levels (K)

Overload level (K)

    C′ = C × (1 − K),   K = 1 − C / Σ_{q∈Q} d̂_q

Enforcement Policy

Figure: Actual vs predicted CPU usage as a function of the load shedding rate (selfish, custom and custom-corrected queries)

Figure: Actual vs expected CPU usage as a function of the load shedding rate (k_q = 0.82)

k_q parameter
- Corrected load shedding rate: r_q = min(1, r′_q / k_q)
- k_q = 1 − d_{q, r_q=1} / d_{q, r_q=0}

Performance Evaluation

Figure: Average (left) and minimum (right) accuracy at increasing K levels for no_lshed, eq_srates, mmfs_pkt and custom

Performance Evaluation

Figure: Performance of a system that supports custom load shedding: cycles available to CoMo vs. predicted and actual cycles (after sampling), prediction error, delay compared to real time, load shedding overhead, and system accuracy over time

Performance Evaluation

Figure: Performance with a selfish query (every 3 minutes): cycles available to CoMo vs. predicted and actual cycles (after sampling), prediction error, delay compared to real time, and system accuracy over time

Online Performance

Figure: Stacked CPU usage (left: CoMo + load shedding overhead, query cycles, predicted cycles, CPU limit) and buffer occupation (right: total packets, buffer occupation in KB, DAG drops)

Publication Statistics

USENIX Annual Technical Conference
- Estimated impact: 2.64 (rank: 7 of 1221)[10]
- Acceptance ratio: 26.5% (31/117)

Computer Networks Journal
- Impact factor: 0.829 (25/66)[11]

IFIP-TC6 Networking
- Acceptance ratio: 21.8% (96/440)

ITNEWS-QoSIP
- Acceptance ratio: 46.3% (44/95)

IEEE INFOCOM (under submission)
- Estimated impact: 1.39 (rank: 133 of 1221)[10]
- Acceptance ratio: 20.3% (236/1160)

[10] According to CiteSeer: http://citeseer.ist.psu.edu/impact.html
[11] Journal Citation Reports (JCR)
