LTE NFV Rollback Recovery
Muhammad Taqi Raza, Zhaowei Tan, Yuanjie Li, Songwu Lu, and Mario Gerla
Computer Science Department, UCLA
Abstract—Network Function Virtualization (NFV) migrates the carrier-grade LTE Evolved Packet Core (EPC) that runs on commodity boxes to the public cloud. In this new virtualized environment, the LTE EPC must offer high availability to its mobile users upon failures. Achieving high service availability is challenging because the failover procedure must keep latency-sensitive, control-plane procedures intact during failures. Through our empirical study, we show that existing recovery mechanisms on the cloud and standardized LTE solutions are coarse-grained and thus unable to quickly recover from failures. They incur LTE service outage, lost network connectivity, and slow recovery in tens of seconds. To address these issues, we describe a new design for a fault-tolerant LTE EPC. It provides quick failure detection and timely recovery from failed operations. To reduce failure detection time, it leverages frequent retransmission of LTE control-plane signaling within the EPC as an indication of failure. To recover from failures, it adopts a checkpointing-based rollback recovery approach in the LTE context and addresses the shortcomings known in classic checkpointing approaches. Our design is LTE standard compliant and works plug-and-play without modifying existing LTE implementations. Our results show that this approach recovers from failures in 2.6 seconds in most cases, which is 6.5× faster than existing approaches, and incurs only tens of milliseconds of overhead.
I. INTRODUCTION
The LTE network must be highly available to provide its mobile subscribers with services (voice, data, AR/VR, and others). Telecom vendors have long promised five-nines high availability by using their vendor-specific solutions over carrier-grade boxes. Each LTE “network function” is developed by a single telecom vendor addressing the specific requirements of that network function. They employ (a) hardware-level mechanisms to tolerate occasional failures of internal components and modules; (b) software-level techniques to ensure redundancy, both for error detection and error recovery; and finally (c) management instruments that ensure strong coupling between the underlying hardware and software [1], [2], [3], [4]. Network Function Virtualization (NFV), however, moves away from proprietary solutions and brings LTE implementations to public clouds. Providing high LTE service availability over general-purpose hardware and software platforms is challenging [5], [6], [7], [8]. In this paper, we study fault tolerance for highly available LTE NFV. We first perform an empirical assessment and reveal that current LTE and cloud-based mechanisms perform poorly during failures. Existing approaches cause LTE service disruption and loss of network connectivity, and take tens of seconds to recover LTE network service. This is mainly because existing NFV failure recovery solutions are too coarse-grained for LTE. These solutions periodically take
snapshots of the virtual machine and do not log LTE control-plane signalling messages; hence they fail to recover LTE service during failover. Further, contemporary approaches rely on heartbeat messages – implemented over the cloud platform (such as ZooKeeper [9]) and LTE protocols (such as SCTP [10]) – to detect the failure. This is too slow for LTE control-plane procedures, which time out in a few seconds. We next put forward our design to address the slow failure detection and inefficient failure recovery mechanisms in existing LTE-NFV. Our design exploits LTE-specific information to achieve high availability. To perform timely recovery, our design proposes a checkpoint-based rollback recovery algorithm with receive determinism. Because LTE control-plane procedures have a causality relationship (one message triggers the execution of the next, and so on), blocking-mode execution (a device can initiate only one procedure at a time), and receive determinism (the same message saved in the checkpoint is replayed during rollback), our design can mitigate the issues found in classical checkpointing-based rollback recovery approaches [11], [12], [13]. Finally, our design makes use of the available LTE path-management feature to its advantage to provide quick failure detection. It uses retransmission of signalling messages within the EPC as a prediction of failure and starts frequent probing toward the non-responding EPC NF to confirm the failure. We implement our design and show that it does not require changes to the existing LTE standard and source code. It can be deployed plug-and-play, using system-level mechanisms to monitor and perform rollback recovery while fully implementing our algorithms. We implement our design over OpenAirInterface [14], an open-source LTE implementation.
Our evaluation shows that our approach: (1) detects failure in approximately 2.5 seconds with no false positives, which is a 4× improvement over the existing cloud mechanism, (2) reduces recovery time by up to 6.5× compared to state-of-the-art implementations, and (3) incurs a negligible checkpointing overhead of tens of milliseconds. Our approach promises high availability of LTE control-plane procedures by ensuring that these procedures never terminate even under worst-case failure scenarios.
II. BACKGROUND
The LTE network consists of three main components: User Equipment (UE), evolved Node Base-station (eNodeB), and Evolved Packet Core (EPC), as shown in Figure 1. The eNodeB acts as the radio interface between the UE and the EPC. The EPC communicates with packet data networks in the outside world, such as the Internet. The EPC is composed of a number of Network Functions (NFs): the Serving Gateway (SGW),
TABLE I: Downtime allowed for high availability.

Availability %                     | Downtime/year | Downtime/week | Downtime/day
99.95%  ("three and a half nines") | 4.38 hours    | 5.04 minutes  | 43.2 seconds
99.99%  ("four nines")             | 52.56 minutes | 1.01 minutes  | 8.64 seconds
99.999% ("five nines")             | 5.26 minutes  | 6.05 seconds  | 864.3 milliseconds
the PDN Gateway (PGW), the Mobility Management Entity (MME), the Home Subscriber Server (HSS), and a few others. These LTE EPC NFs handle control-plane and data-plane traffic through separate network interfaces and protocols. As shown in Figure 1, the control-plane traffic from the radio network is sent to the MME, whereas data-plane traffic is forwarded to the SGW. The MME acts as a central management entity that authenticates and authorizes the UE, handles device procedures (such as device registration, handover, location update, and service provisioning), and maintains SGW and PGW connections for data-plane traffic. In this paper, we focus on LTE control-plane reliability. This is consistent with research efforts in software-defined networking and Network Function Virtualization (NFV) that solve control-plane issues [5], [15], [16], [17], [18], [19], [6] and let OpenFlow routers handle the data-plane traffic [20], [21], [22], [23].
Fig. 1: LTE architecture over NFV: an overview.
Network Function Virtualization (NFV) for 4G LTE: In LTE–NFV, an LTE software implementation is deployed over cloud platforms. There exist a number of LTE software implementations, including OpenEPC [24], OpenAirInterface [14], OpenLTE [25], and others, that can be deployed over cloud platforms such as OpenStack [26], OPNFV [27], vSphere [28], etc. The cloud platform acts as a manager and provides virtualization of hardware resources to LTE NFs. Softwarization of LTE NFs accelerates innovation by lowering operational and capital expenditures [29], [30]. Network operators, such as Verizon [31], Vodafone [32], SK Telecom [33], and others, are currently considering deploying LTE–NFV for existing and future 5G applications.
III. AVAILABILITY IN VIRTUALIZED LTE
The availability of an LTE network is usually expressed as a percentage of uptime in a given day, week, or year. Table I shows the downtime allowed for a particular percentage of availability, presuming that the system is required to operate continuously. LTE networks are expected to achieve five-nines (99.999%) high availability [6], [34]. This availability is defined under the assumption of a highly available channel
(i.e., the unavailability due to bad signals or no coverage is not considered) [35], [34]. Failure recovery mechanisms today: LTE network operators aim to provide all-time service access to their subscribers. In order to provide end-to-end LTE services, LTE NFs must remain operative even during faults. Conventionally, carrier-grade LTE NFs employ sufficient mechanisms in their hardware and software to tolerate faults and provide high availability of network resources. These include the Cisco ASR 5500 platform [1], Ericsson Blade System MkX platform [2], Alcatel-Lucent 5620 SAM platform [3], Huawei’s ATCA platform [4], and a few others. NFV, which moves away from proprietary solutions, relies on the 3GPP standard and cloud platforms to meet high availability requirements. Failure model: Similar to these efforts [5], [36], [37], [38], we consider the fail-stop failure, where under failure “the component changes to a state that permits other components to detect that a failure has occurred and then stops”. We focus on network function (NF) failures, in which the process of an LTE function (e.g., MME, S-GW/P-GW) crashes. Scope: LTE–NFV efforts span virtualizing the LTE EPC [33], [19] and the LTE radio access network [39], [40]. In this paper, we limit our scope to NFV of the LTE EPC.
IV. EMPIRICAL ASSESSMENT OF THE STATE-OF-THE-ART
In this section, we unveil the deficiencies of current cloud-based (§IV-A) and LTE built-in failover mechanisms (§IV-B). We summarize these deficiencies in Table II. We discover that both mechanisms are insufficient for five-nines LTE high availability in NFV. The causes are diverse, including the mismatch between cloud- and NF-level failure recovery, coarse-grained recovery, long recovery time, and slow failure detection. They motivate us to devise new fault-tolerance mechanisms. Methodology: We run experiments on an open-source LTE platform, OpenAirInterface [14].
We consider the shortest timer values and best possible solutions available among (1) the 3GPP standard, (2) 3GPP-NFV documents, and (3) the Cisco vendor platform1 [10]. We do not assume a hot-standby configuration, a naïve solution used by the telecom industry to address fault tolerance [41], [42], [43] in NFV, which is resource inefficient and doubles the cost of the network [21], [44].
A. Deficiencies in Cloud-Level Recovery
We discover that existing cloud-based failure recovery mechanisms are neither sufficient nor efficient for LTE control-plane functions in NFV.
1) Coarse-Grained Failure Recovery: We find that the existing cloud-based failure recoveries are coarse-grained and thus potentially waste storage resources. We analyze both the open-source platform (OpenStack) and the commercial platform (VMware vSphere).
1 The reason we chose Cisco as our vendor choice for comparison is that its configuration and implementation specifications are available on the Internet.
TABLE II: Summary of limitations of existing approaches.

Recovery at        | Recovery mechanisms | Failure resilience | Resource efficient | Key limitations
VMware             | Yes                 | No                 | No                 | Snapshots are expensive; VM stalls during snapshot; not scalable.
OpenStack          | Yes                 | No                 | No                 | Work in progress; procedures abort on failure; re-attach required after recovery.
3GPP standards     | Yes                 | No                 | No                 | Slow; procedures abort on failure; devices re-attach after recovery.
ETSI NFV documents | Yes                 | No                 | No                 | Procedures abort; recovery happens after several timeouts.
• OpenStack [45]: We find that OpenStack solutions for compute-node high availability are immature and still in progress [45]. These solutions cannot recover subscribers' session information during failure and require all UEs to re-attach to the network after failure recovery. • VMware vSphere [46]: It provides a failure recovery mechanism through periodic snapshots. We find that vSphere's snapshot approach has several limitations and cannot be used for delay-sensitive LTE NFV failure recovery. First, when it takes the snapshot, the entire state of the VM is frozen; VMware calls this a "stun" operation, and the VM becomes non-responsive [46]. Second, the snapshot requires a complete VM memory backup, even if there are few changes in memory since the last snapshot. Figure 2 shows that the VMware snapshot approach takes at least 43,100 MBs for backup purposes compared to the required 110 MB – which is at least 40× more expensive in terms of storage space. Impact: OpenStack detects the failure through heartbeat messages (which are on the order of a couple of seconds [6]), and the failure is recovered by assigning an alternative VM instance. As a result, there is a long service disruption and loss of network connectivity. Similar to OpenStack, the VMware solution has two major impacts. First, VMware's stun operation [46] results in temporary LTE service disruption, which can take up to 60 seconds [47]. Second, there is a tradeoff between the frequency of snapshots and fine-grained recovery of failed LTE procedures. LTE procedures require snapshots to be taken at the granularity of milliseconds (as one LTE procedure takes a few seconds to conclude), which is not possible with the VMware approach. Root cause: Current cloud systems do not provide fault tolerance for session-level resilience.
Although the VMware snapshot approach is available for cloud-based service robustness, it is designed for VM backup purposes rather than for the high service availability of VM applications. Issue 1: State-of-the-art cloud-based approaches are coarse-grained and thus consume more storage resources.
2) Cloud Availability ≠ LTE NF Availability: We find that today's five-nines clouds [6] can only ensure three-and-a-half-nines availability for the LTE control plane in NFV. Table III shows the downtime of the cloud nodes and the corresponding LTE network functions (using OpenAirInterface [14]). Table III shows the downtime per day for MME, SGW/PGW, and HSS under the Attach Request, Tracking Area Update (TAU), and Service Request (SR) LTE procedures. LTE failure recovery time includes the time taken by the cloud platform, the LTE NF bootstrap and connection setup time, and the time the UE waits for a timeout to re-initiate the failed procedure. We can see
from Table III that the total user-perceived2 downtime for Attach/TAU and SR is 10 seconds and 5 seconds, respectively. Although the cloud's downtime meets the five-nines criterion, the LTE control functions do not (as is the case for other applications [48]): The bootstrap and connection setup times for MME, SGW/PGW, and HSS are 2 seconds, 1 second, and 2.5 seconds, respectively. Further, these NFs need to wait either 5 seconds or 10 seconds for the SR or TAU/Attach request procedure to be re-initiated from the device. As a result, the availability of LTE procedures is only 'three and a half nines'3, even if the cloud platforms could achieve five nines of infrastructure availability. Root cause: The cloud's failure recovery only ensures high availability at the VM level, not the NF level. Issue 2: Cloud-level failure recovery is insufficient for a five-nines LTE control plane, because of the network-side NF interplays and device-side slow failure detection. TABLE III: Downtime per day for LTE control-plane procedures. One failure per day per EPC NF is considered.
Proc   | MME failure (Cloud 5-nines / MME / UE) | SGW/PGW failure (Cloud 5-nines / S/P-GW / UE) | HSS failure (Cloud 5-nines / HSS / UE)
Attach | 864.3 ms / 2 s / 10 s                  | 864.3 ms / 1 s / 10 s                         | 864.3 ms / 2.5 s / 10 s
TAU    | 864.3 ms / 2 s / 10 s                  | 864.3 ms / 1 s / 10 s                         | 864.3 ms / 2.5 s / 10 s
SR     | 864.3 ms / 2 s / 5 s                   | 864.3 ms / 1 s / 5 s                          | 864.3 ms / 2.5 s / 5 s
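The downtime and availability figures above follow from simple arithmetic. The sketch below (ours, not from any standard) re-derives the Attach/MME row of Table III and the "three and a half nines" conclusion from the timer values quoted in the text:

```python
# Per-day downtime and resulting availability when an EPC NF fails once
# a day, re-deriving the "three and a half nines" claim from Table III.

DAY_S = 24 * 3600

def perceived_downtime_s(cloud_s, bootstrap_s, ue_timer_s):
    """End-to-end recovery time: cloud recovery + NF bootstrap and
    connection setup + UE timeout before it re-initiates the procedure."""
    return cloud_s + bootstrap_s + ue_timer_s

def availability_pct(downtime_per_day_s):
    """Availability implied by a given downtime per day."""
    return 100.0 * (1.0 - downtime_per_day_s / DAY_S)

# MME failure, Attach: 864.3 ms cloud + 2 s MME bootstrap + 10 s timer T3411
d = perceived_downtime_s(0.8643, 2.0, 10.0)   # ~12.86 s of downtime per day
assert availability_pct(d) < 99.99            # misses "four nines" (8.64 s/day)
assert availability_pct(d) > 99.95            # lands in ~three-and-a-half-nines territory
```

The ~12.86 s/day exceeds the 8.64 s/day budget of four nines (Table I), which is why the paper concludes that a five-nines cloud alone cannot deliver five-nines LTE control-plane availability.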
B. Deficiencies Rooted in LTE Design
We discover that, upon failure, LTE standardized mechanisms also cause LTE service disruption during the failure and loss of network connectivity after failure recovery.
1) Loss of Network Connectivity after Failure Recovery: LTE-NFV solutions from 3GPP (refer to the 3GPP specification reports that discuss a scalable LTE-NFV architecture for reliability [49] and end-to-end LTE-NFV reliability models [50]) are coarse-grained. When a failure occurs, the VNF failure recovery procedure selects an alternative VNF (which does not have the UE contexts) [50], [49]. To quantify this, we consider the SR procedure and induce a fail-stop failure. Once the LTE service is back, in about 32 seconds, an alternative MME takes charge of the failed MME. The UE sends an SR message, which is returned with error cause #9 (UE identity cannot be derived by the network) (refer to 3GPP NAS specification section 5.6.1.5: Service request procedure not accepted by the network [51]). This is shown in the device-side log snapshot in Figure 4. On receiving the service reject message, the UE enters the EMM-DEREGISTERED state and automatically initiates the registration procedure. Once the device is registered with the LTE network, it can get LTE service. Impact: There are several issues in the LTE NF restoration procedure after failure. First, available LTE recovery mechanisms result in loss of network connectivity. Second, the EPC NF failure is propagated to the device, which is against the
2 The time after which the user device will re-initiate the failed procedure, which is typically equivalent to the LTE procedure timeout value.
3 Assuming good radio conditions with no radio failures.
Fig. 2: For failure recovery, the VMware approach takes several thousand MBs of snapshot, even though it is sufficient to back up only the subscriber-related messages, states, and variables of a few MBs.
Fig. 3: Slow failure detection results in LTE service disruption.
Fig. 4: Device-side logs show that the service request was rejected on failure recovery:
    prot_disc = EPS mobility management messages
    msg_type = Service request
    lte_emm_msg emm_serv_req_msg

    prot_disc = EPS mobility management messages
    msg_type = Service reject
    lte_emm_msg emm_serv_rej emm_cause
    cause_value = 9 (0x9) (UE identity cannot be derived by the network)
philosophy of fault tolerance [52]. Third, the EPC relies on the device to recover from the failure by performing the re-registration procedure with the LTE network. Fourth, an LTE control-plane failure not only terminates any control-plane session; data-plane traffic is also aborted, and the device must initiate the re-registration procedure with the LTE network. Root cause: The standardized mechanisms are designed assuming highly reliable EPC NFs (five-nines availability), where failure is only anticipated for vendor-specific platforms that have built-in redundancy for fault tolerance [1], [2], [3], [4]. The 3GPP standards provide only load balancing, load re-balancing [53], and service restoration [54] mechanisms to distribute workloads. They do not standardize LTE control-plane procedure-level recovery mechanisms. Issue 3: Existing failure recovery brings loss of network connectivity and results in termination of on-going data-plane traffic.
2) Slow Failure Detection: We find that LTE recovery mechanisms are also slow to detect the failure. As shown in Figure 3, before any failure our MME has the processing capacity to handle 1,000 UEs (considering both their control- and data-plane traffic). In our experiment, we launch LTE control-plane procedures and stop the MME instance (imitating a fail-stop failure of the MME). As soon as the failure occurs (at the 10th second), the MME stops responding to LTE control-plane signalling messages. The subscriber devices time out and retry their LTE procedures. They keep retrying until a response is received to their requests. It takes them 32 seconds to receive any response (when the VM has recovered), after which the alternative MME serves the subscriber requests. Impact: Slow failure detection has two major impacts. First, the subscriber devices go through a service outage during the failure, which takes around 32 seconds. This value is not acceptable for delay-sensitive LTE control-plane procedures.
The retry intervals for the Attach/TAU and SR procedures are 10 seconds (timer T3411 [51]) and 5 seconds (timer T3417 [51]), respectively. Another major issue is that the eNodeB receives a higher signaling load. All the devices retry the failed service (e.g., SR is retried every 5 seconds), bringing signaling spikes at the eNodeB. Root cause: We find that the 3GPP standard does not provide an explicit failure detection mechanism. Rather, it relies on the communication interfaces' heartbeats to record the failure. When the MME goes down, the failure is detected by both the eNodeB and the
SGW. The eNodeB detects the MME failure through the heartbeat mechanism of the SCTP protocol interface, whose timeout is 30 seconds (according to the Cisco implementation [10]). The SGW also detects the MME failure through the path management procedure [19], [55]. The SGW takes 25 seconds (the T3-Response timer value chosen by Cisco [56]) to detect the MME failure and mark it unreachable. However, the SGW cannot select an alternative MME; it relies on the eNodeB to assign a different MME from its MME pool [57]. This new MME contacts the SGW and starts serving the subscribers. We should add that, although the cloud platforms provide mechanisms (such as ZooKeeper [9] or Pacemaker [58]) that are faster than eNodeB failure detection through the SCTP interface, these solutions do not work in LTE, because no mechanism exists by which an alternative MME can ask the eNodeB to associate with it. Issue 4: Fail-stop failure detection is slow, which causes service disruption during failure.
V. DESIGN RATIONALE AND OVERVIEW
We want to achieve the following three goals in our design. 1) Fine-grained reliability: We want to provide reliability to delay-sensitive LTE procedures. 2) Low overhead: We aim to provide a solution that has low system overhead. 3) 3GPP standard compliance: We want our solution to be fully compliant with existing LTE standards.
A. Design Insights
We leverage the way LTE procedures are executed in designing our solution. Before we describe our solution, we highlight the key LTE-specific design insights. Message causality (I1): To execute one LTE procedure, EPC NFs exchange messages with each other. All these messages have a causality, or happens-before, relationship. That is, a message m1 may causally affect another message m2 if and only if m2 can execute only once m1 has been executed. This means m1 happens before m2, forming the causality relationship m1 → m2.
For example, in the device attach procedure, the device is first authenticated (an authentication request, followed by an authentication response message) at the MME; thereafter, the SGW creates the device session when it receives the create session request message from the MME, processes it, and sends the create session request message to the PGW; the PGW processes the request (creates the PDP context, assigns bearers and QoS, and enables charging
operation) and replies with a create session response message to the SGW; the SGW replies with a create session response message to the MME. All these messages have a happens-before relationship. Device procedures execute in blocking mode (I2): We refer to the LTE NAS standard and find that LTE NAS procedures from a device execute in blocking mode. That is, when one procedure is initiated by a device, it cannot initiate any other procedure until the previous one results in success or failure. Refer to Figure 5.1.3.2.2.7.1: EMM main states in the UE in the NAS spec [51], which explains our insight through transitions between procedures' init states and the device's registered state. Replayed messages are receive-deterministic (I3): LTE states are non-deterministic, where the non-determinism comes from the decision logic and actions due to timers. However, this is not the case for LTE signalling messages that need to be replayed (when previously executed control-plane messages are resent for failure recovery). The receiver will get the same message that it has executed before, and it produces the same result sent earlier. Therefore, we say that the replayed messages are deterministic at the receiver (which generates the same response message that was sent earlier). Split and join relationship of concurrent message execution (I4): Certain messages are allowed by the LTE standard to execute concurrently (they are not causally related according to the standard), such as the Authentication Request and Security Mode Command messages, TAU-related and Identity Request messages, and a few others (refer to message collisions in [51]). When concurrency happens, the main procedure execution splits into two parts; both parts execute concurrently. Later, when both parts finish their tasks, they join together and create one execution path again; afterwards, the rest of the messages are executed following the happens-before relationship.
Because we already know from the standard which messages can be executed concurrently, we can employ a specific mechanism for those few messages. The LTE path-management mechanism retransmits messages on timeout (I5): The LTE EPC leverages its path-management mechanism to retransmit GTP-C signalling messages in case the receiving EPC NF does not respond in time. This retry interval is calculated based on the echo-request and echo-response values between two EPC NFs [55]. We can leverage this insight to quickly detect the failure.
B. Solution Overview
Solution components: Our solution has two main components: (a) retransmission-induced quick failure detection, and (b) receive-deterministic rollback recovery through checkpointing on failure. Our design architecture is shown in Figure 5. During normal operation, every NF independently takes periodic checkpoints. Each checkpoint saves, since the preceding checkpoint, the subscribers' session states, the control-plane procedures' messages and their execution, and the NF configurations into the highly reliable HSS database [59], [60], [61], [62]. Our solution leverages the LTE control-plane message retransmission mechanism between two EPC NFs to predict the failure, in case one NF stops responding to control-plane signalling messages. The
predicted failure is reported to the recovery engine. The recovery engine confirms the failure by frequently probing the non-responding NF. If the failure is confirmed, it replaces the failed NF with an alternative one. The alternative NF retrieves the last saved checkpoint and rolls back to it. It then replays the control-plane messages to reach the state at which the failure occurred. Once all the recorded messages are replayed, the halted control-plane procedures resume without breaking connectivity.
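The detection step above, predicting a failure from retransmissions and then confirming it by probing, can be sketched as follows. The probe count and the treatment of retransmissions as a prediction signal are illustrative choices, not values from the 3GPP standard:

```python
# Retransmission-induced failure detection in miniature: a signalling
# retransmission toward a peer NF is treated as a failure *prediction*;
# the recovery engine then confirms the failure with frequent probes.

def detect_failure(retransmissions, probe_peer, probes=3):
    """retransmissions: pending GTP-C retransmissions toward the peer.
    probe_peer: callable returning True if the peer answers a probe."""
    if retransmissions == 0:
        return "healthy"
    # Prediction: the peer stopped answering signalling; confirm by probing.
    for _ in range(probes):
        if probe_peer():
            return "healthy"          # false alarm, peer is responsive
    return "failed"                   # fail-stop failure confirmed

assert detect_failure(0, lambda: False) == "healthy"
assert detect_failure(2, lambda: True) == "healthy"    # transient loss
assert detect_failure(2, lambda: False) == "failed"    # confirmed failure
```

Separating prediction from confirmation is what keeps the false-positive rate low: a single retransmission never declares a failure by itself.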
Fig. 5: Our design architecture.
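The rollback-and-replay step relies on receive determinism (insight I3): replaying the checkpointed message log against a deterministic handler reproduces the pre-failure state. A minimal sketch, with hypothetical message names and hand-written handler logic (not OpenAirInterface code):

```python
# Receive-determinism in miniature: the same state plus the same received
# message always yields the same state transition and the same reply, so
# replaying a checkpointed log restores the NF exactly.

def handle(state, msg):
    """Deterministic handler: same state + same message -> same reply."""
    if msg == "Authentication Request":
        state["authenticated"] = True
        return "Authentication Response"
    if msg == "Create Session Request":
        state["session"] = True
        return "Create Session Response"
    return "Unknown"

def replay(saved_msgs):
    """Re-execute the checkpointed message log from scratch."""
    state = {}
    return state, [handle(state, m) for m in saved_msgs]

log = ["Authentication Request", "Create Session Request"]  # checkpointed log
assert replay(log) == replay(log)   # identical state and replies every time
```

Because the replies are byte-identical to those sent before the failure, peers that already consumed them can simply discard the duplicates.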
Our solution addresses the shortcomings of the existing failure recovery mechanisms (discussed in §IV). Our approach takes periodic checkpoints to provide failure resilience. These checkpoints only record LTE-specific information instead of a complete VM snapshot, addressing issue 1. Our solution makes use of the LTE path management mechanism to quickly detect the failure, solving issue 4. After failure detection, the failed LTE control-plane procedures are quickly restored by rolling back to the last saved checkpoint and replaying the checkpointed messages, solving issues 2 and 3. Roadmap: Our design consists of two solution components: (1) enabling efficient LTE control-plane rollback recovery, and (2) quick LTE failure detection. For (1), we first describe why the existing rollback recovery mechanism using checkpointing does not work. We then discuss LTE control-plane recovery through (a) our independent checkpointing approach, and (b) ensuring NF state consistency after rollback recovery. For (2), we first briefly discuss the limitations of existing failure detection mechanisms, followed by our retransmission-induced failure detection approach. Scope: We limit our scope to LTE NAS procedures initiated by the device (Attach, Detach, TAU, and SR).
VI. DESIGN
We put forward a design to achieve efficient LTE control-plane rollback recovery and quick failure detection.
A. Enabling LTE Control-Plane Rollback Recovery
We first describe the limitations of the existing rollback recovery mechanism in the LTE context. Thereafter, we discuss our approaches of communication-induced checkpointing and receive-deterministic rollback recovery.
1) Limitation of Existing Rollback Recovery Mechanisms: Conceptually, the simplest solution is that different LTE NFs take their checkpoints more or less independently. When a failure occurs, the LTE NF is rolled back to the latest checkpoint and resumes the failed LTE control-plane procedure.
However, this simple approach can cause the domino effect during rollback recovery [12], [63], [64]. The domino effect appears when a subset of LTE NFs, which have to be resumed after a failure, roll back unboundedly while determining a set of mutually consistent checkpoints among all NFs. In the worst case, all the NFs have to roll all the way back to the beginning (the very first checkpoint, taken when these NFs started) [12], [63], [64]. We find that in the LTE context the domino effect can be even worse: it can take the rollback recovery to the very first checkpoint, taken when the device was initially registered – days or even weeks before the failure.
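A toy model makes the cascade concrete. Each NF keeps a list of checkpoints recording how many messages it has sent to and received from each peer; an orphan message (one a peer recorded as received but the rolled-back sender no longer records as sent) forces the receiver to roll back further. The NF names, counts, and data layout here are hypothetical, not part of the paper's algorithm:

```python
# The domino effect in miniature: NFs checkpoint independently, and
# orphan messages force cascading rollbacks toward the first checkpoint.

def cascade_rollback(checkpoints):
    """checkpoints[i] is NF i's list of snapshots, oldest first; each
    snapshot is (sent_counts, recv_counts) indexed by peer NF.
    Returns the checkpoint index each NF must roll back to."""
    n = len(checkpoints)
    idx = [len(c) - 1 for c in checkpoints]          # start at the latest
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for j in range(n):
                sent_ij = checkpoints[i][idx[i]][0][j]  # i's recorded sends to j
                recv_ji = checkpoints[j][idx[j]][1][i]  # j's recorded receives from i
                if recv_ji > sent_ij and idx[j] > 0:    # orphan: j got a message
                    idx[j] -= 1                         # i "never sent" -> j rolls back
                    changed = True
    return idx

# SR example: SGW (NF 0) checkpointed before sending m5; PGW (NF 1)
# checkpointed after receiving it. Rolling SGW back orphans m5 at the
# PGW, which must then fall back to its earlier checkpoint too.
sgw = [([0, 0], [0, 0])]
pgw = [([0, 0], [0, 0]), ([0, 0], [1, 0])]
assert cascade_rollback([sgw, pgw]) == [0, 0]
```

With longer histories the same loop keeps decrementing indices, which is exactly the unbounded cascade the paper's forced-checkpoint design avoids.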
ALGORITHM 1: Checkpointing for LTE control-plane procedures.

UESessions = session states of all devices
NFConfigs  = NF_j configurations
UEProcMsgs = messages of all procedures of devices
Let CkPt = UESessions + NFConfigs + UEProcMsgs

if the concurrent-message execution is starting for procedure_k then
    take checkpoint CkPt_i at NF_j; SEND(i, NF_j, CkPt_i) to HSS
if the concurrent-message execution has finished for procedure_k then
    take checkpoint CkPt_i at NF_j; SEND(i, NF_j, CkPt_i) to HSS
if a certain time has passed at NF_j then
    take checkpoint CkPt_i at NF_j; SEND(i, NF_j, CkPt_i) to HSS
end
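A runnable sketch of Algorithm 1 follows. The HSS is mocked with a dictionary, and the class name and timer interval are our own illustrative choices; only the checkpoint contents and trigger conditions come from the algorithm:

```python
# Algorithm 1 in miniature: NF_j checkpoints independently, forcing a
# checkpoint at each concurrent-block boundary and otherwise on a timer.
import time

class CheckpointingNF:
    def __init__(self, name, hss, interval_s=1.0):
        self.name, self.hss, self.interval_s = name, hss, interval_s
        # CkPt = UESessions + NFConfigs + UEProcMsgs
        self.ue_sessions, self.nf_configs, self.ue_proc_msgs = {}, {}, []
        self.ckpt_id = 0
        self.last_ckpt = time.monotonic()

    def _take_checkpoint(self):
        ckpt = {"sessions": dict(self.ue_sessions),
                "configs": dict(self.nf_configs),
                "msgs": list(self.ue_proc_msgs)}
        self.ckpt_id += 1
        self.hss[(self.name, self.ckpt_id)] = ckpt   # SEND(i, NF_j, CkPt_i) to HSS
        self.last_ckpt = time.monotonic()

    def on_concurrent_boundary(self):
        # Forced checkpoint at both the start and end of a concurrent
        # block, isolating it so it is replayed all-or-nothing.
        self._take_checkpoint()

    def on_message(self, msg):
        self.ue_proc_msgs.append(msg)
        if time.monotonic() - self.last_ckpt >= self.interval_s:
            self._take_checkpoint()                   # periodic checkpoint

hss = {}                                              # stand-in for the HSS database
mme = CheckpointingNF("MME", hss, interval_s=0.0)     # checkpoint on every message
mme.on_message("Service Request")
mme.on_concurrent_boundary()                          # a concurrent block starts
assert hss[("MME", 1)]["msgs"] == ["Service Request"]
assert ("MME", 2) in hss
```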
Fig. 6: Service Request as an example of the receive-determinism principles. Message exchange among eNB, MME, SGW, PGW, and PCRF ("C" marks a checkpoint): m1: Service Req; m2: Initial Context Setup Req; m3: Initial Context Setup Resp; m4: Modify Bearer Req (MME→SGW); m5: Modify Bearer Req (SGW→PGW); m6: CCR; m7: CCA; m8: Modify Bearer Resp. Every EPC NF takes its checkpoint independently of other NFs. Before/after sending/receiving messages, every NF may have some internal execution of messages (not shown in the figure). Only the SR message exchange is shown.
Service Request procedure as an example: We illustrate the domino effect during rollback in LTE using the LTE SR control-plane procedure as an example. As shown in Figure 6, assume a failure at the SGW occurs after it sends message m5 and just before it receives message m8. The SGW rolls back to its last checkpoint (taken between messages m4 and m5). From the SGW's point of view, message m5 was never sent (the checkpoint was taken before m5 was sent); from the PGW's point of view, m5 was sent by the SGW, and the PGW has progressed towards message m8. This inconsistency between the SGW and PGW gives rise to orphan messages – messages that are not mutually claimed by two NFs and cannot establish a send/receive relationship. Furthermore, the SGW will not re-send m5 because it requires message m4 to be received before sending m5. To deal with this problem, checkpointing-based protocols force the NFs holding orphan messages (i.e., PCRF and PGW) to roll back to the checkpoint preceding the orphan message's reception. This rollback may trigger further rollbacks at other NFs (such as the MME) to find a consistent global state. In the worst case, all NFs end up rolled back to their initial checkpoints. This initial state may be the start of the LTE EPC NFs, meaning no control-plane procedure was ever executed. This approach also incurs storage overhead: all checkpoints must be retained because the recovery procedure could roll back to the very first checkpoint (due to the domino effect). In this paper, we use LTE-specific insights to design checkpointing and rollback recovery algorithms that avoid the domino effect without requiring different NFs to coordinate their checkpoints (coordination being the usual remedy for the domino effect [11], [12], [13]).
2) Our Checkpointing and Rollback Recovery Mechanisms: Independent checkpointing In our approach, every NF takes its own checkpoint whenever it wants. Our checkpointing approach does not restrict an NF on when it should take a checkpoint or how many checkpoints it may take in a given period, except in the case of concurrent message execution. To ensure that after a failure the concurrently executed messages are replayed in exactly the same order as before the failure, our approach guarantees that during rollback recovery either all of the concurrent messages are re-executed or none. To achieve this, our approach takes a forced checkpoint as soon as concurrent message execution starts, and another forced checkpoint when the concurrent execution stops; that is, we isolate the block of concurrent message execution. Our approach can do so because concurrency is achieved through a split-and-join pattern (Insight: I4). For example, when concurrency arises in an LTE control-plane procedure, the main thread executing the procedure creates a new thread (a child thread) and delegates the mutually exclusive part of the procedure's execution to it. This is the point where the procedure execution splits. Both parent and child threads execute their parts of the procedure independently. When the child thread finishes its task, it joins the parent thread, and the parent thread then executes the rest of the procedure. Because we can easily tell when the procedure splits and joins, we can enforce the checkpoints (Insight: I4). We propose that all checkpoints be stored in the HSS database. The HSS database is a critical component of LTE that hosts all subscriber records and location information, and is available at all times; cloud operators provide extra mechanisms for the high availability of these database records. This assumption is consistent with both industry and academic research [59], [60], [61], [62].
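The forced checkpoints that bracket a split/join block can be sketched as follows. This is a minimal illustration with invented names (CheckpointingNF, take_checkpoint) and an in-memory list standing in for the HSS-backed checkpoint store, not the paper's implementation:

```python
import threading

class CheckpointingNF:
    """Sketch of an EPC NF that takes independent checkpoints, plus
    forced checkpoints isolating a concurrent (split/join) block."""

    def __init__(self, name):
        self.name = name
        self.checkpoints = []   # stand-in for the HSS-backed store
        self.state = {}

    def take_checkpoint(self, reason):
        # Save only session state / configs / procedure messages,
        # never a full VM snapshot.
        self.checkpoints.append({"reason": reason, "state": dict(self.state)})

    def run_concurrent(self, parent_part, child_part):
        # Forced checkpoint when the procedure splits ...
        self.take_checkpoint("before-split")
        child = threading.Thread(target=child_part)
        child.start()          # split(): child runs its exclusive part
        parent_part()          # parent continues independently
        child.join()           # join(): concurrency ends here
        # ... and another when the threads rejoin, so the whole
        # concurrent block replays all-or-nothing.
        self.take_checkpoint("after-join")

nf = CheckpointingNF("SGW")
nf.run_concurrent(lambda: nf.state.update(bearer="active"),
                  lambda: nf.state.update(qos="default"))
```

Because the block is bracketed on both sides, a rollback either lands before "before-split" (re-executing all concurrent messages) or at "after-join" (re-executing none).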
Checkpointing algorithm We explain our checkpointing procedure in Algorithm 1. A checkpoint is taken when the procedure splits or joins, or after a certain time has elapsed. Every checkpoint saves the following three key records (instead of a complete VM snapshot – discussed in issue 1): (1) device-specific session states, (2) NF configurations, and (3) device control-plane procedures and their messages. We summarize the desired checkpointing information in Table IV. Note that our checkpointing algorithm does not take a separate checkpoint for every device; rather, it checkpoints the NF.
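The trigger condition just described (checkpoint on a split, on a join, or once a certain time has passed) might be sketched as below. The function name and the 30-second period are illustrative assumptions, not values from the paper:

```python
import time

# Illustrative periodic-checkpoint interval (not from the paper).
CKPT_PERIOD_S = 30.0

def should_checkpoint(event, last_ckpt_time, now=None):
    """Decide whether to checkpoint now.
    event: "split" / "join" force a checkpoint; any other event
    only triggers one if the checkpoint interval has elapsed."""
    now = time.monotonic() if now is None else now
    if event in ("split", "join"):                   # forced checkpoints
        return True
    return (now - last_ckpt_time) >= CKPT_PERIOD_S   # periodic checkpoint
```

For example, `should_checkpoint("split", last_ckpt_time=0.0, now=1.0)` forces a checkpoint regardless of how recently the last one was taken.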
ALGORITHM 2: Receive-deterministic rollback recovery through checkpointing.

An alternative NF is assigned; checkpoints are read; UE sessions are restored;
Let CkPt = UEProcMsgs;
while all messages are not replayed from CkPt do
    for each neighboring NF, NFj do
        if CkPt_i is a message to be resent then
            compute SENT_i→j(CkPt_i);
            send a Rollback(i, SENT_i→j(CkPt_i)) message to NFj;
        else if CkPt_i is a message that was not sent then
            let the message timeout happen;
            compute SEND_ON_TIMEOUT_i→j(CkPt_i);
            send a Rollback(i, SEND_ON_TIMEOUT_i→j(CkPt_i)) message to NFj;
    end
    for every Rollback(j, c) message received from a neighbor NFj do
        if RCVD_i←j(CkPt_i) > c then
            discard the received message;
            compute SENT_i→j(CkPt_i);
            send a Rollback(i, SENT_i→j(CkPt_i)) message to NFj;
    end
end
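The core of Algorithm 2 might be sketched roughly as follows. The transport object, its method names, and the per-message "sent before failure" flag are illustrative stand-ins for the GTP-C machinery, not the paper's API:

```python
def replay_checkpoint(msgs, transport):
    """Replay checkpointed messages in order (sketch of Algorithm 2).
    msgs: ordered entries {"id", "sent_before_failure"}.
    transport: resend / send_on_timeout / discard / inbox operations."""
    sent = set()
    for m in msgs:                        # first loop of Algorithm 2
        if m["sent_before_failure"]:
            transport.resend(m["id"])     # replay a message already sent
        else:
            transport.send_on_timeout(m["id"])  # unsent: goes out on timeout
        sent.add(m["id"])
        for resp in transport.inbox():    # second loop: neighbor replies
            if resp not in sent:
                transport.discard(resp)   # a future message in a past state
    return sent

class FakeTransport:
    """Test double that just logs what the replay loop does."""
    def __init__(self, early_responses):
        self.log = []
        self.early = list(early_responses)
    def resend(self, mid): self.log.append(("resend", mid))
    def send_on_timeout(self, mid): self.log.append(("timeout-send", mid))
    def discard(self, mid): self.log.append(("discard", mid))
    def inbox(self):
        msgs, self.early = self.early, []
        return msgs

# A response to a request never (re)sent arrives early and is discarded:
t = FakeTransport(early_responses=["m8"])
done = replay_checkpoint(
    [{"id": "m4", "sent_before_failure": True},
     {"id": "m5", "sent_before_failure": False}], t)
```

The discard branch corresponds to the `RCVD > c` case of Algorithm 2: a reply that outruns the replay position is dropped and recovered later on timeout.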
TABLE IV: Example of information to be saved during checkpointing.

Type              | Required Information
UE Sessions       | Auth vector, ciphering keys, IP, APN, QoS parameters, bearer & session IDs, GUTI, IMSI and GUMI, charging ID, user location, and others.
NF Configurations | MME-FQ-CSID, SGW-FQ-CSID, S1-U eNB F-TEID, S5/S8-U SGW F-TEID, local IP address, local port #, protocol configs, and others.
Procedures        | LTE procedures, their messages & values, e.g., auth. req/resp, create/modify session req/resp, etc.; message timers, input/output values.
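Table IV's three record types might be grouped into one per-NF checkpoint record. The field names and sample values below are illustrative (elided values are left as "..."), not the paper's schema:

```python
def make_checkpoint(ue_sessions, nf_config, procedures):
    """Bundle the three record types of Table IV into one checkpoint,
    taken per NF (not per device)."""
    return {
        "ue_sessions": ue_sessions,   # auth vectors, keys, IP/APN/QoS, GUTI/IMSI...
        "nf_config": nf_config,       # FQ-CSIDs, F-TEIDs, local IP/port...
        "procedures": procedures,     # in-flight messages, timers, I/O values
    }

ckpt = make_checkpoint(
    ue_sessions={"imsi-001": {"ip": "10.0.0.7", "qos": "qci-9", "guti": "..."}},
    nf_config={"local_ip": "192.168.1.10", "sgw_fq_csid": "..."},
    procedures=[{"proc": "service_request", "msg": "create_session_req",
                 "timer_ms": 3000}],
)
```

A record of this shape is what gets written to the HSS database, in contrast to a multi-gigabyte full-VM snapshot.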
State consistency after rollback recovery Our checkpointing rollback recovery solution can simply replay the messages one by one (Insights: I1 and I2). It avoids the domino effect by ensuring that all states remain consistent after a failure. Three key ideas achieve this. First, the LTE messages replayed during rollback are "receive deterministic" (Insight: I3): the receiver of a replayed message knows that the message was executed before, so instead of actually re-executing it, the receiver NF simply produces the same response it generated earlier. Second, we let the alternative NF re-send a message on timeout (i.e., an orphan message will eventually be re-sent on timeout). This ensures that all checkpointed messages are eventually replayed, bringing the EPC NFs to a consistent state. The third key idea concerns a message that the alternative NF perceives as not sent. This arises when the alternative NF recovers quickly and receives the response to a message that was sent before the failure. Because the alternative NF has rolled back to a checkpoint that does not capture that sent message, the response would introduce state inconsistencies. Our solution is for the alternative NF to discard such a message – a future message in a past state. The message will eventually be recovered during rollback recovery. Checkpoint replay algorithm We explain our replay procedure in Algorithm 2. The while() loop ensures that all checkpointed messages are replayed. On each iteration (the first for() loop), we check whether the message had been sent before the failure: a message to be resent is resent immediately, whereas a message that had not been sent goes out when its timeout fires. The second for() loop checks all messages received from the neighboring NFs.
If we receive the response to a message that has not been sent (according to the checkpoint), we discard the received response. Once all the messages are replayed, the system reaches the state at which the failure occurred and the control-plane procedure's progress had halted. This is a consistent state among all EPC NFs, and the halted procedures can start progressing again (solving issues 2 and 3).

ALGORITHM 3: Quick LTE failure detection.

T_monitor = monitoring thread; P_thread = probing thread;
T_monitor finds the re-transmission of signalling message(s) from NFi to NFj;
if T_monitor finds signalling re-transmissions beyond a threshold then
    mark NFj as failure-prone;
    report NFj to the Recovery Engine;
if the Recovery Engine receives a marker from T_monitor then
    create a new thread P_thread;
    while NFj has not been probed a certain number of times do
        P_thread keeps probing NFj;
    end
    if P_thread receives no response from NFj for a certain number of probes then
        start rollback recovery (Algorithm 2);
    end

B. Quick LTE Failure Detection
Limitation of existing failure detection mechanisms To provide high LTE availability, failures must be detected quickly. We refer to §IV and recall that both cloud-platform and LTE-standardized mechanisms are slow and can take tens of seconds to detect a failure (issues 2 and 4). To address this limitation, we propose our retransmission-induced quick failure detection approach.
Retransmission-induced quick failure detection Our failure detection solution exploits LTE signalling messages for failure prediction and quick failure detection. LTE uses GTP-C signalling messages for control-plane path management, tunnel management, and mobility management [55]. We argue that these messages can also be used for quick failure detection (Insight: I5). For example, when the MME sends a create session request to the SGW and does not receive a response message (create session response) within a certain amount of time, it re-sends that signalling message. The retry interval is on the order of a few milliseconds [19]. The MME keeps re-sending the create session request until the SGW failure is detected by GTP path-management heartbeat timer expiration, or the device times out and re-initiates the control-plane procedure. We propose to leverage the absence of signalling-message responses as a prediction of failure. The NF predicted to be failing is further probed to confirm the prediction; the failover procedure then starts if the failure is confirmed.
Failure detection algorithm We explain our failure detection solution using Algorithm 3.
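The predict-then-probe flow of Algorithm 3 can be sketched as follows. The thresholds mirror the implementation choices reported later (5 retransmissions, 5 probes), but the function names are illustrative:

```python
RETX_THRESHOLD = 5   # retransmissions before an NF is marked failure-prone
PROBE_ATTEMPTS = 5   # unanswered probes before the failure is confirmed

def monitor(retransmissions):
    """T_monitor's check: True once the retransmission count of a
    GTP-C signalling message reaches the threshold (failure-prone)."""
    return retransmissions >= RETX_THRESHOLD

def probe_and_confirm(probe_fn):
    """P_thread's check: probe_fn() -> True if the peer answered.
    Confirm failure only if every probe goes unanswered; any answer
    clears the prediction (the peer was merely congested)."""
    for _ in range(PROBE_ATTEMPTS):
        if probe_fn():
            return False   # peer is alive: false positive averted
    return True            # confirmed: start rollback recovery (Algorithm 2)

# A congested-but-alive peer answers the third probe, so no failover:
answers = iter([False, False, True, True, True])
assert monitor(6) is True
assert probe_and_confirm(lambda: next(answers)) is False
```

The two-stage structure is what keeps retransmission-based prediction cheap while letting the probing stage filter out congestion-induced false positives.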
Our approach launches a monitoring thread, T_monitor, in every NF that monitors whether the peer NF has responded to a GTP-C signalling message (see Footnote 4). Note that it is sufficient to monitor the signalling-message transmissions of any single device: a fail-stop failure of a peer NF for one device is a failure for all devices communicating with that peer NF. When the monitoring thread does not see a signalling-message response for a certain number of retries (5 retry attempts in

Footnote 4: To detect a failure between the MME and HSS, T_monitor monitors the pending message queue, which grows significantly during a failure; see section 5.5.4 in [65].
our implementation), it marks the non-responsive peer NF as failure-prone. T_monitor then reports the failure-prone NF to the recovery engine. On receiving the failure-prediction message, the recovery engine creates an NF-specific thread, P_thread, which frequently probes the failure-prone NF. If P_thread does not receive a response for a certain number of retries, the probed NF is declared failed. Thereafter, failure recovery starts using our rollback recovery mechanism (Algorithm 2).
Handling false positives Our approach can cause a false positive when an NF fails to respond due to congestion and is wrongly marked as failed. We argue that declaring such a non-responsive peer NF as failed only improves efficiency: an alternative NF takes charge of serving LTE subscribers. Our rollback recovery mechanism ensures that subscribers do not lose LTE service; therefore, these false positives are not perceived at the UEs.
C. Correctness of Our Approach
We prove that our approach resumes the failed operation without causing the domino effect.
Proof: We model an NF as a set of states, an initial state (from this set), and an ordered set of events Es. An event is a message send send(m), a message receive receive(m), or the internal computation the NF performs on sending/receiving a message. Let R be the set of rolled-back messages of NFj belonging to the latest checkpoint Ck. Once NFj fails, it rolls back to Ck. The alternative NF, NFa, takes charge of NFj. Events send(m) ∈ Es and receive(m) ∈ Es form a causality relationship mi → mj. As a rollback recovery protocol, all n messages in Ck are replayed in order. If the next event is a send(m) ∈ Es, the message is sent. If it is a receive(m) ∈ Es, a response from the peer NFp must be received for the rollback recovery to progress. NFp thus receives an execution request for a message it has already executed.
NFp simply produces the same output (from its cache) that was generated earlier, instead of re-executing the message mj. When there exists a message mi such that mi ∉ send(m) and mi ∉ receive(m), mi is resent on timeout. During rollback recovery, it is possible that NFa receives a response to a message that was never sent. In that case, the response is discarded, and the discarded message is replayed on timeout. To understand this, assume that NFj has sent the message mi when the failure occurs. NFa comes up (replacing NFj) and rolls back to Ck. Now assume that NFa receives the response to mi before replaying the messages from Es. For NFa, this is a response to a non-sent request, which is invalid; therefore, NFa discards the received response to mi. This correctness argument shows that our approach does not lead to the orphan-message phenomenon and can successfully replay the rolled-back messages. Hence, it does not cause any domino effect.
D. Discussion
Our design is complementary to approaches that address fail-stop failures caused by the hypervisor, memory, storage, I/O
operations, and other system component failures [66], [67], [8]. In all these cases, the EPC NF stops responding, which is quickly detected by our recovery engine. The recovery engine selects an alternative NF (possibly on a different hypervisor) and the LTE procedure resumes. Further, our design performs timely garbage collection by deleting old checkpoints. For this, we must ensure that we never delete a checkpoint that will be used for rollback recovery in the future. Because our approach does not incur the domino effect, a failed NF does not force its neighboring NFs to roll back. This makes our garbage collection simple: we keep the latest two checkpoints and delete the rest.
VII. IMPLEMENTATION
In our implementation, we use OpenAirInterface (OAI) [14], an open-source LTE platform, to implement EPC functionality according to the 3GPP standard. OAI provides a basic implementation of the EPC NFs (such as MME, SGW/PGW, and HSS) that can be deployed on Unix-based platforms such as Linux, BSD, or Solaris. Being open source, OAI provides full flexibility in how the LTE EPC NFs are managed on a cloud platform. It abstracts common hardware resources through virtualization and meets the compute, networking, and storage demands of different NF applications. We spent significant effort modifying OAI to be 3GPP compliant. In our implementation, we followed Cisco EPC configurations [56], [10] to match real-life LTE EPC deployments.
A. Implementation of the State-of-the-Art
We modified the EPC NFs' network configuration and network interface files so that the MME can communicate with more than one SGW NF to perform failover. We modified the SCTP implementation to set the SCTP maximum number of retransmissions and timeouts according to the Cisco implementation [10].
We enhanced OAI's path-management implementation by ensuring that packet retransmissions follow the GTP-C tunneling protocol [55], and used the GTP-C timer values defined by the Cisco implementation [56].
B. Implementation of Proposed Design
The two main goals in implementing our solution are: (1) ease of implementation, and (2) incremental deployment without requiring LTE EPC source-code changes. This is mainly because network operators (such as AT&T, Verizon, and others) are interested in NFV for new and expanding deployments, but may be less enthusiastic about rewriting their EPC implementations.
Overview: As shown in Figure 7, our implementation has a recovery engine that consists of (a) a checkpoint manager, which prepares an alternative NF and asks that NF to retrieve the desired checkpoint during failover, and (b) a failure detector, which runs P_thread and probes the NF predicted to have failed. We implement a separate middleware (we call it a stub) alongside each EPC NF. The stub has three main components: (i) a checkpoint agent, responsible for taking checkpoints, (ii) a local
Fig. 7: Implementation components of our design (a recovery engine with checkpoint manager and failure detector; per-VM NFs, each with a stub containing a local monitor, checkpoint agent, and virtual replayer).
monitor, which runs the failure prediction thread, T_monitor, and (iii) a virtual replayer, which generates messages from its cache when a peer NF is performing rollback recovery. The virtual replayer prevents the EPC NF from re-executing messages that are sent during a peer NF's rollback recovery. This choice avoids state inconsistencies and the domino effect.
1) Rollback Recovery: For rollback recovery, as discussed in §VI-A2, every VM takes periodic checkpoints that are saved in stable storage (i.e., a MySQL database). To take checkpoints without modifying the source code, we developed a checkpoint agent that takes checkpoints following our Algorithm 1. It records only the information needed for rollback recovery – UE messages, their states, and NF configurations – and does not take a complete VM snapshot. The agent stores the checkpoint information directly in the HSS's MySQL database and informs the checkpoint manager in the recovery engine. The checkpoint manager keeps track of all checkpoints and performs garbage collection by deleting old checkpoints from the MySQL database. Once a failure is detected, the checkpoint manager selects the alternative NF and forwards the failing NF's checkpoint to it. It also informs the stubs of all other EPC NFs to be ready to handle (through the virtual replayer) rolled-back messages. The alternative NF rolls back the LTE procedures to the last saved checkpoint and replays the logs. During the replay, the alternative NF sends messages to its peer NFs. The stub at each peer NF intercepts these replayed messages and forwards them to the virtual replayer, which generates the response from its cache (the same response that was sent earlier). Once all rolled-back messages are replayed, all NFs reach the steady state (i.e., the point at which the failure happened).
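The virtual replayer's behavior amounts to a response cache, which might be sketched as below. Class and message names are illustrative, not the stub's actual interface:

```python
class VirtualReplayer:
    """Sketch: answer replayed requests from cache so the real EPC NF
    never re-executes a message during a peer's rollback recovery."""

    def __init__(self):
        self.cache = {}                 # request id -> response sent earlier

    def record(self, req_id, response):
        # Called during normal operation, when the NF first answers.
        self.cache[req_id] = response

    def handle(self, req_id, execute):
        # During a peer's rollback: cached (replayed) requests are served
        # from cache; only genuinely new requests reach the NF (execute()).
        if req_id in self.cache:
            return self.cache[req_id]   # receive-deterministic replay
        response = execute()
        self.cache[req_id] = response
        return response

vr = VirtualReplayer()
vr.record("create_session_req#42", "create_session_resp#42")
# A replayed message is served from cache; the NF is not re-invoked:
assert vr.handle("create_session_req#42",
                 execute=lambda: "SHOULD-NOT-RUN") == "create_session_resp#42"
```

Serving identical responses from cache is what makes the replay "receive deterministic": peers observe the same exchange as before the failure, so no state diverges.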
In the steady state, messages are executed at the peer NFs rather than by the virtual replayer; hence, the control-plane procedure halted by the NF failure resumes its operation. The checkpoint manager is also responsible for deleting old messages that will never be used by the virtual replayer.
2) Quick Failure Detection: Fast failure detection is implemented in two steps. First, a local monitor determines whether GTP-C messages of a UE are being retransmitted (it randomly selects one UE and observes its GTP-C send/receive messages). If the same GTP-C message is retransmitted 5 times (in our implementation), it informs the recovery engine of a predicted failure. In the second step, once the failure detector receives the failure-prediction flag from the local monitor, it creates a
new thread that frequently probes (at an interval of 500 milliseconds in our implementation) the EPC NF predicted to have failed. If it does not receive a response to 5 probes (our implementation choice), the failure is confirmed and the rollback recovery procedure starts, as described in §VII-B1.
VIII. EVALUATION
We evaluate the fault-tolerance mechanisms of our proposed approach. The baseline LTE mechanisms are those described in §VII-A. We run our tests on Ubuntu Server 14.04.3 LTS as a virtualized instance on an Intel Core i7-3517U processor x 2 with a maximum turbo frequency of 3.00 GHz, 4 MB SmartCache, and 6 GB DDR3 RAM. We discuss our results regarding (1) failure detection, (2) failure recovery, and (3) overhead. Finally, we briefly discuss the overall failure recovery and high availability of our design.
A. Failure Detection
How quickly do we detect the failure? To find out how quickly our approach can detect a failure, we initiate the control-plane procedures and let one of the EPC NFs crash. The monitoring thread (T_monitor) counts five retransmissions and reports the failure prediction to the recovery engine. The probing thread (P_thread) at the recovery engine probes the failure-prone NF at a fixed interval of 500 milliseconds for 5 retries; thereafter, the failure is confirmed. Figure 8 shows that, on average, our approach detects the failure in about 2.7 seconds, and within 3.5 seconds for 80% of the cases. In the worst case, it takes about 4.9 seconds to detect the failure. This is mainly due to varying retransmission times in LTE path management under network congestion (i.e., echo requests and responses take longer if a peer NF is severely congested).
False positives? We are also interested in how many false positives our failure prediction can trigger. Our failure prediction relies on the number of retransmissions between two EPC NFs (we treat 5 retransmissions as a failure prediction).
This retransmission value is derived from the LTE path-management technique (see Footnote 5). If we take the number of retransmissions as is (calculated from the path-management value), our approach can produce up to 12.4% false positives (see Footnote 6), as shown in Table V. However, when we relax our failure prediction by doubling the retransmission interval, we significantly reduce the false positives to 3.8%. We can reduce false positives to zero by exponentially backing off the retransmission interval used to predict the failure; in our results, we achieved zero false positives at the second back-off. In short, in the worst case, our false positives do not exceed 12.4%.
B. Failure Recovery
How quickly do we recover from failure? Figure 9 compares failure recovery between our approach and the

Footnote 5: the echo-req/echo-resp value plus some alpha (we set alpha to 20% of the echo-req/echo-resp value).
Footnote 6: We observe these false positives when our EPC NF is overloaded and some response messages take unusually long.
Fig. 8: Failure detection time comparison (CDF of failure detection time in seconds; our approach vs. baseline).

Fig. 9: Failure recovery time comparison (CDF of recovery time in ms; our approach for SR/TAU/Attach vs. baseline SR and Attach/TAU).

Fig. 10: Time overhead of keeping checkpoints (CDF of Attach procedure time in ms, with and without checkpointing).

Fig. 11: Breakdown of latency overhead for keeping checkpoints (latency in ms for enabling ciphering and establishing a session, with and without checkpointing).

TABLE V: Relaxing the failure-prediction retransmission time in path management leads to no false positives.

Criteria                                                   | False Positive
Taking path-management retransmission values as is         | 12.4%
Relaxing (doubling) path-management retransmission values  | 3.8%
Using path-management retransmission values with back-off  | 0%

TABLE VI: Storage overhead per checkpoint and comparison among different approaches.

Approach     | Lightly loaded | Medium loaded | Fully loaded
Our Approach | 0.2 MB         | 52 MB         | 110 MB
VMware       | 8031 MB        | 4251 MB       | 4311 MB
OpenStack    | 8031 MB        | 8091 MB       | 8251 MB
baseline when the SR and Attach/TAU control-plane procedures are used. In the baseline, the device's control-plane procedure always times out – costing 5 seconds for SR and 10 seconds for Attach/TAU – before recovering from the failure (through re-attach). In contrast, our approach takes a few hundred milliseconds to recover from the failure using our rollback recovery mechanism. In our approach, the recovery engine instructs the alternative VM (in cold standby) to read the logs from stable storage, and the alternative NF then simply replays the logs. Because these logs are replayed through the virtual replayer, recovery incurs no radio transmission delays.
1) Negligible Overhead: Checkpointing operation overhead during normal operation Our design takes periodic checkpoints so that it can roll back and replay the checkpoint logs during the failure recovery process. We show that our checkpointing approach adds little overhead to LTE control-plane procedures before a failure. Figure 10 shows the overhead for the Attach Request control-plane procedure; we consider Attach Request because the attach procedure requires more signalling-message exchanges than SR and TAU. As Figure 10 shows, our checkpointing approach costs at most 50 milliseconds of overhead per LTE control-plane procedure. This 50-millisecond overhead is small compared to the overall timeout values of 5,000 milliseconds for SR and 10,000 milliseconds for Attach/TAU.
Latency breakdown during checkpointing Figure 11 shows the latency breakdown, as a micro-benchmark, for checkpointing before a failure. We consider the authentication process (which includes rounds of communication with the HSS to retrieve the authentication vector), the ciphering process, and the session creation process (which includes assigning an IP address, setting up QoS, etc.).
As Figure 11 shows, on average our checkpointing approach adds only tens of milliseconds (i.e., 5% to 30%) of processing while executing an LTE control-plane procedure.
Storage overhead of checkpointing and its comparison Table VI shows the checkpoint storage of our approach compared to VMware's vSphere [46] and OpenStack [26]. We consider three cases: lightly loaded (few users accessing the network), medium loaded (system resource utilization reaches 50%), and fully loaded (the LTE NF is almost overloaded). Our approach, which records only procedure messages, states, and configuration-related parameters (see Footnote 7), incurs 39× and 75× less overhead, in the worst case, compared to VMware and OpenStack, respectively. OpenStack, which takes a full VM snapshot, has an almost constant overhead of about 8 GB per checkpoint. vSphere uses a disk-chain concept and stores a delta VM image and configurations (vmdk, vmsd, and vmsn) as well as a snapshot of VM memory (4 GB in our case).
Fig. 12: Overall failover time (from failure occurrence to failure recovery) for lightly loaded, medium loaded, and highly loaded EPC NFs, broken into detection and recovery latency (in seconds): comparison of our approach with the baseline.
2) Discussion: We finally show the overall recovery time (from failure occurrence to recovery) and discuss the high availability achieved by our failure recovery design.
How fast can we recover from failure? Figure 12 shows the failover procedure (the time between the failure occurring and service being restored) for lightly loaded, medium loaded, and highly loaded EPC NF cases. We consider the SR LTE procedure and compare our approach with the baseline. Figure 12 shows that, on average, our approach fully recovers from a failure in about 3.5 seconds, and takes about 8 seconds when failure-detection probing is delayed due to overload. The baseline approach, on the other hand, always takes more than 14 seconds to recover, mainly due to long failure detection, full VM recovery, and device re-attach.
How our approach achieves high availability In this paper, we define high availability in terms of LTE service provisioning. Even if the system recovers in less than a second (the five-nines availability case), requiring the device to re-attach fails to provide high availability. We argue that, in the LTE context, highly available systems are those that do not let LTE control-plane procedures abort. We show that our design achieves user-perceived high availability 80% of the time (no LTE procedure timeout). In the worst-case scenario (when the system is overloaded), UEs experience a timeout only once for their LTE procedures and never terminate their procedures.

Footnote 7: We assume that network operators save other related parameters, such as the EPC NF overload threshold and dynamic resource utilization, in their configuration files [68].

IX. RELATED WORK
We present related work on fault tolerance and high availability in telecom NFV, cloud, and distributed computing. The most recent works on telecom-NFV reliability are [6], [7], [19], [18], [69], [70], [35]. [6] re-designs the EPC architecture for public-cloud deployment and guarantees reliable LTE operations; its main idea is to break EPC functionality into stateless and stateful components, with the stateful information stored in highly reliable storage to achieve high availability. Contrary to [6], our solution does not re-design the LTE EPC architecture and works as plug-and-play with the existing EPC architecture; it uses rollback recovery in the LTE context to recover from failures. [19], [18] consolidate LTE processing for fast execution of LTE procedures. [7], [69] provide reliability for the IMS core by exploiting IMS domain knowledge and refactoring IMS software modules to achieve fault tolerance; they cannot recover LTE failures. [70] argues for a reliable LTE design in NFV; our paper addresses these concerns by providing fault tolerance to LTE procedures. [35] shows that high availability in cellular networks is hard to achieve and provides recommendations to improve it. Making a similar argument to [35], our work achieves user-perceived high availability.
Works on cloud reliability include [5], [71], [72], [73]. [5] discusses rollback recovery for middleboxes: it records middlebox states and releases them in parallel to improve recovery.
In the case of LTE, we must ensure message causality and avoid the domino effect; [5] cannot achieve these goals. [71] studies server failures and hardware repairs in large data centers. [72], [73] show that failures are common in both Microsoft and Google data centers and that high availability is hard to achieve. More recent works have explored fault tolerance in cloud storage [74], gateways [75], the hypervisor [66], and enterprise services [76]. Other works, such as [77], [78], and [79], discuss NF placement problems to improve the performance of LTE NFV. Different from these works, our work aims to provide fault tolerance to the LTE core NFs.
Finally, failures and their impact have also been studied in distributed and high-performance systems. These approaches are broadly categorized into uncoordinated checkpointing, coordinated checkpointing, and model-based checkpointing. Uncoordinated checkpointing is simple – distributed nodes do not communicate to take checkpoints – but it causes the domino effect [11], [12], [13]. Recent works have used domain-specific knowledge to address the domino effect in MPI applications [80], [81], [82]. Our approach, which is similar to uncoordinated checkpointing,
avoids the domino effect in LTE. Coordinated checkpointing [83], [84] addresses the issues of uncoordinated checkpointing but incurs checkpointing overhead, which is not feasible for delay-sensitive LTE control-plane procedures. Model-based checkpointing, such as [85] and [86], requires a complete system model to ensure optimal checkpoint locations; however, it is not possible to find/create a complete model of LTE (a complex system), which may lead to an incorrect checkpointing strategy.
X. CONCLUSION
This paper studies the high availability of LTE control-plane procedures in NFV. It addresses three critical challenges in providing high availability in LTE: fast failure detection, recovering lost states and messages, and incurring low overhead. We solve these challenges by leveraging LTE-specific design insights. Failures are detected through LTE packet retransmissions and probing, only when required. Failed control-plane procedures are restored through checkpointing and rollback recovery mechanisms. Our solution exploits LTE message causality and split/join mechanisms to take checkpoints. It uses receive-deterministic rollback recovery and ensures that the EPC NF states remain consistent after failover. Our approach stores only the information needed for rollback recovery, saving several orders of magnitude of storage space compared with traditional snapshot approaches.
REFERENCES
[1] Cisco ASR 5500: elastic packet core network from Cisco. https://www.cisco.com/c/en/us/products/wireless/asr-5500/index.html.
[2] Ericsson Blade System (EBS) for EPC. https://www.ericsson.com/ourportfolio/digital-services-solution-areas/sgsn-mme?nav=fgb_101_0909%7Cfgb_101_256.
[3] Alcatel-Lucent 5620 SAM. http://users.ipfw.edu/chansavj/ebook/Alcatel_Lucent_epc_solution.pdf.
[4] Huawei eCNS300 solution for EPC. http://m.huawei.com/enmobile/enterprise/products/wireless/gsm-r/gsm-r/hw-200203.htm.
[5] J. Sherry, P. X. Gao, S. Basu, A. Panda, A. Krishnamurthy, C. Maciocco, M. Manesh, J. Martins, S. Ratnasamy, L. Rizzo, et al. Rollback-recovery for middleboxes. In ACM SIGCOMM, August 2015.
[6] Binh Nguyen et al. A reliable distributed cellular core network for public clouds. Technical Report, Microsoft Research, 2018.
[7] Raza, Muhammad Taqi and Lu, Songwu. Enabling low latency and high reliability for IMS-NFV. In IEEE CNSM, 2017.
[8] Carbonari, Amanda and Beschastnikh, Ivan. Tolerating Faults in Disaggregated Datacenters. In ACM HotNets, 2017.
[9] Apache ZooKeeper. https://zookeeper.apache.org.
[10] Cisco – SCTP Parameter Template Configuration Mode Commands. https://www.cisco.com/c/en/us/td/docs/wireless/asr_5000/20/CLI/books/R-Z/20_R-Z_CLI-Reference/20_R-Z_CLI-Reference_chapter_010110.pdf.
[11] Cao, Guohong and Singhal, Mukesh. On coordinated checkpointing in distributed systems. IEEE Transactions on Parallel and Distributed Systems, 9(12):1213–1225, 1998.
[12] Koo, Richard and Toueg, Sam. Checkpointing and rollback-recovery for distributed systems. IEEE Transactions on Software Engineering, (1):23–31, 1987.
[13] Kim, Junguk L. and Park, Taesoon. An efficient protocol for checkpointing recovery in distributed systems. IEEE Transactions on Parallel and Distributed Systems, 4(8):955–960, 1993.
[14] Open source LTE platform. http://www.openairinterface.org.
[15] Khalid, Junaid et al. Paving the Way for NFV: Simplifying Middlebox Modifications Using StateAlyzr. In USENIX NSDI.
[16] Khalid, Junaid et al. A standardized southbound API for VNF management. In ACM Hot Topics in Middleboxes and Network Function Virtualization, 2016.
[17] Palkar, Shoumik et al. E2: a framework for NFV applications. In ACM SOSP, 2015.
[18] Qazi, Zafar Ayyub et al. A High Performance Packet Core for Next Generation Cellular Networks. In ACM SIGCOMM, 2017.
[19] M. T. Raza et al. Rethinking LTE network functions virtualization. In IEEE ICNP, 2017.
[20] Panda, Aurojit et al. NetBricks: Taking the V out of NFV. In USENIX OSDI, 2016.
[21] Kablan, Murad et al. Stateless Network Functions: Breaking the Tight Coupling of State and Processing. In NSDI, 2017.
12
[22] Jaehyun Nam and et al. Probius: Automated Approach for VNF and Service Chain [54] Analysis in Software-Defined NFV. ACM SOSR, 2018. [55] [23] Ramakrishnan, KK. Software-Based Networks: Leveraging high-performance NFV [56] platforms to meet future communication challenges. In IEEE ICDCS, 2016. [24] Open EPC - Software LTE implementation. http://www.openepc.net/. [57] [25] OpenLTE: Open source LTE implementation. [58] http://openlte.sourceforge.net. [26] OpenStack Open Source Cloud Computing Software. [59] https://www.openstack.org/software/. [60] [27] Open Platform for NFV (OPNFV). https://www.opnfv.org/. [61] [28] NF virtualization using VMware’s vSphere: The Efficient and Secure Platform. https://www.vmware.com/products/vsphere/features/. [62] [29] R. Mijumbi and et al. Network function virtualization: State-of-the-art and research challenges. IEEE Communications Surveys & Tutorials, 18(1):236–262, 2016. [30] Bringing Network Function Virtualization to LTE, November 2014. [63] http://www.4gamericas.org/files/1014/1653/1309/4G Americas - NFV to LTE November 2014 - FINAL.pdf. [64] [31] Boost agility with virtual networks that deliver quickly on demand. http://www.verizonenterprise.com/products/networking/sdn-nfv/ [65] virtual-network-services/. [66] [32] Exploring the opportunities in NFV. http://www.vodafone.com/business/group-enterprise/ [67] software-defined-networking-and-network-function-virtual. [33] SK Telecom Creates its Own In-House NFV MANO. [68] https://www.sdxcentral.com/articles/news/sk-telecom-creates-its-own-in-house-nfv-mano/ 2017/07/. [34] High Availability is more than five nines. [69] https://archive.ericsson.net/service/internet/picov/get?DocNo=10/ 28701-FGB101256&Lang=EN&HighestFree=Y. [70] [35] Elmokashfi, Ahmed and et al. Adding the Next Nine: An Investigation of Mobile Broadband Networks Availability. In ACM Mobicom, 2017. [71] [36] E. Zhai and et al. Heading Off Correlated Failures through Independence-as-aService. In OSDI, 2014. 
[72] [37] D. Yuan and et al. Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems. [38] K. Bhat and et al. OSIRIS: Efficient and consistent recovery of compartmentalized [73] operating systems. [39] DOCOMO Verifies Advanced C-RAN in Outdoor Commercial Environment. [74] https://www.nttdocomo.co.jp/english/info/media center/pr/2015/0205 00.html. [40] Foukas, Xenofon and et al. Orion: RAN Slicing for a Flexible and Cost-Effective [75] Multi-Service Mobile Network Architecture. In ACM Mobicom, 2017. [76] [41] ALCATEL-LUCENT VIRTUALIZED EPC DELIVERING ON THE PROMISE OF NFV AND SDN. [77] http://www.tmcnet.com/tmc/whitepapers/documents/whitepapers/2014/ 10743-alcatel-lucent-virtualized-epc-delivering-the-promise-nfv.pdf. [42] Ericsson Virtual Router. https://archive.ericsson.net/service/internet/picov/get?DocNo=1/ [78] 28701-FGB1010557&Lang=EN&HighestFree=Y. [43] OPNFV – Building fault management into NFV deployments . https://www.opnfv.org/wp-content/uploads/sites/12/2016/11/opnfv faultmgt final. [79] pdf. [44] S. Rajagopalan, D. Williams, and H. Jamjoom. Pico replication: A high availability framework for middleboxes. In ACM Symposium on Cloud Computing, page 1, [80] 2013. [45] Configuring the compute node – work is in progress. [81] https://docs.openstack.org/ha-guide/compute-node-ha.html. [46] Understanding VM snapshots. [82] https://kb.vmware.com/s/article/1015180. [47] VMWare virtual machines become unresponsive for over 30 minutes. [83] https://kb.vmware.com/s/article/2039754. [48] J. Kim and et al. Database high availability using SHADOW systems. In ACM Symposium on Cloud Computing, 2015. [84] [49] ETSI GS NFV-REL 002: Report on Scalable Architectures for Reliability Management, 2017. [50] ETSI GS NFV-REL 003: Report on Models and Features for End-to-End Reliability, [85] 2017. [51] 3GPP. TS24.301: Non-Access-Stratum (NAS) protocol for Evolved Packet System (EPS); Stage 3, Jun. 2013. 
[86] [52] C. Guidi and et al. Dynamic error handling in service oriented applications. Fundamenta Informaticae, 95(1):73–102, 2009. [53] 3GPP. TS23.401: General Packet Radio Service (GPRS) enhancements for Evolved Universal Terrestrial Radio Access Network (E-UTRAN) access, 2011.
3GPP TR 23.857: Study of Evolved Packet Core (EPC) nodes restoration, 2012. 3GPP. TS29.274: Tunnelling Protocol for Control plane (GTPv2-C), 2014. CISCO – Troubleshooting GTPC and GTPU and associated path failures. https://www.cisco.com/c/en/us/support/docs/wireless/asr-5000-series/ 200026-ASR-5000-Series-Troubleshooting-GTPC-and.html. 3GPP. TS23.401: GPRS Enhancements for E-UTRAN Access, 2011. Making Nova Database Highly Available with Pacemaker. https://wiki.openstack.org/wiki/HAforNovaDB. R. Aranha and et al. Database system with active standby and nodes. Lang, Willis and et al. Microsoft azure SQL database telemetry. In ACM Symposium on Cloud Computing, 2015. Lin, Wei and et al. StreamScope: Continuous Reliable Distributed Processing of Big Data Streams. In USENIX NSDI, volume 16, pages 439–453, 2016. Zhang, Yiying and et al. Mojim: A reliable and highly-available non-volatile memory system. In ACM SIGARCH Computer Architecture News, volume 43, pages 3–18, 2015. Strom, Rob and Yemini, Shaula. Optimistic recovery in distributed systems. ACM Transactions on Computer Systems (TOCS), 3(3):204–226, 1985. Czaja, Ludwik. Failures and Damages in Distributed Systems. In Introduction to Distributed Computer Systems, pages 157–185. Springer, 2018. Fajardo, et al. Diameter Base Protocol, 2012. RFC 6733. Gopalan, Kartik and et al. Multi-hypervisor virtual machines: enabling an ecosystem of hypervisor-level services. In USENIX Annual Technical Conference, 2017. Nikolaev, Ruslan and Back, Godmar. VirtuOS: an operating system with kernel virtualization. In ACM Symposium on Operating Systems Principles, 2013. State-Location Information Retrieval for Cisco MME. https://www.cisco.com/c/en/us/td/docs/wireless/asr 5000/21-5 N5-8/MME/ 21-5-MME-Admin/21-5-MME-Admin chapter 0110100.html. Raza, Muhammad Taqi and et al. Modular Redundancy for Cloud based IMS Robustness. In ACM MobiWac, 2017. Gonzalez, Andres and et al. Service Availability in the NFV Virtualized Evolved Packet Core. 
In IEEE Globecomm, 2015. R. Chen and et al. Towards Reliable Application Deployment in the Cloud. In ACM CoNEXT, 2017. Gill, Phillipa and Jain, Navendu and Nagappan, Nachiappan. Understanding network failures in data centers: measurement, analysis, and implications. In ACM SIGCOMM, 2011. Govindan, Ramesh and et al. Evolve or die: High-availability design principles drawn from googles network infrastructure. In ACM SIGCOMM, 2016. T. T. Chekam and et al. On the synchronization bottleneck of OpenStack Swift-like cloud storage systems. In IEEE INFOCOM, 2016. K. Tan and et al. Protego: cloud-scale multitenant IPsec gateway. 2017. D. Lugones and et al. AidOps: a data-driven provisioning of high-availability services in cloud. In ACM Symposium on Cloud Computing, 2017. M. Bagaa and T. Taleb and A. Laghrissi and A. Ksentini and H. Flinck. Coalitional Game for the Creation of Efficient Virtual Core Network Slices in 5G Mobile Systems. IEEE Journal on Selected Areas in Communications, 36(3):469–484, 2018. I. Benkacem and T. Taleb and M. Bagaa and H. Flinck. Optimal VNFs Placement in CDN Slicing Over Multi-Cloud Environment. IEEE Journal on Selected Areas in Communications, 36(3):616–627, 2018. A. Laghrissi and T. Taleb and M. Bagaa. Conformal Mapping for Optimal Network Slice Planning Based on Canonical Domains. IEEE Journal on Selected Areas in Communications, 36(3):519–528, 2018. Guermouche, Amina and et al. Uncoordinated checkpointing without domino effect for send-deterministic mpi applications. In IEEE IPDPS, 2011. Lefray, Arnaud and et al. Replication for send-deterministic MPI HPC applications. In ACM Workshop on Fault-tolerance for HPC at extreme scale, 2013. O. Subasi and et al. Toward a General Theory of Optimal Checkpoint Placement. In IEEE Cluster Computing (CLUSTER), 2017. Z. Zheng, L. Yu, and Z. Lan. Reliability-aware speedup models for parallel applications with coordinated checkpointing/restart. IEEE Transactions on Computers, 64(5):1402–1415, 2015. K. B. 
Ferreira and et al. Understanding the effects of communication and coordination on checkpointing at scale. In IEEE Conference for High Performance Computing, Networking, Storage and Analysis, pages 883–894, 2014. K. Sato, Mohror, and et al. A user-level infiniband-based file system and checkpoint strategy for burst buffers. In IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pages 21–30, 2014. S. Gao, B. He, and J. Xu. Real-time in-memory checkpointing for future hybrid memory systems. In ACM Conference on Supercomputing, pages 263–272, 2015.