An Application of Discrete Event Simulation on Wireless Network Reliability

Robert Jakubek
Abstract—In this paper, a discrete event simulation is used to study the reliability and potential failure points of a wireless network during a disaster. The simulation configuration is based on a real wireless network for a mid-sized city and is configured as a 112-cell-site network that supports LTE data and code division multiple access (CDMA) voice services. SimPy, a general event-driven simulation framework, is used as the simulation software. Each network component (eNode-B, CDMA base transceiver station (BTS), cell site router, antenna, and generator) is created as a model object. Simulation design attributes were developed from real network engineering parameters, analysis, and academic research. Four periods were simulated: a 7-day pre-disaster period, a 4-day disaster (during which no repairs are allowed), a 6-hour no-travel period, and a post-disaster repair period. The simulation provides an understanding of total network reliability, individual failure points, lost voice calls due to failed cell sites, and the value of design choices. It shows the complex interactions of the backup systems and the attempted restoration by simulated engineers. This work demonstrates that simulations can help engineers understand the effects of natural disasters on wireless networks, and allows scenario testing to improve network reliability.

Keywords—Network reliability, Simulation, Wireless
Dr. Robert Jakubek is the Sr. Director of Network Operations and Engineering, Central Region, at U.S. Cellular, Madison, WI 53718 (email: [email protected]).

I. INTRODUCTION

Many problems are too complex to solve deterministically. These problems often have multiple variables, each with differing degrees of uncertainty. A more effective method to understand the problem in such situations is to use a simulation. In a simulation, the engineer develops a model that captures the parameters required to represent the real situation. In the wireless field, simulations can be effective tools for modeling LTE demand at a specific event, the performance of a new technology that has yet to be implemented, the failure rate of a complex wireless network, and even the effects of team workload after a reorganization. The objective of this paper is to demonstrate a method to simulate wireless network reliability by using a model that represents an existing network design; the model parameters are taken directly from a live commercial network. While it is very common to model RF coverage, modeling of network reliability is rare.

Computer simulations can be divided into three categories: Monte Carlo, continuous, and discrete event simulations [1]. In a Monte Carlo simulation, the problem inputs are varied by a random process [1]. The output is then analyzed. To produce valid results, this process of feeding randomized input into an algorithm is repeated many times [1]. This research paper focuses on a discrete event simulation. Similar to a Monte Carlo simulation, a discrete event simulation varies the input to the model by a random process and then measures the output; in a discrete event simulation, the system changes states at points in time, processes run, resources are requested, and events occur as the clock ticks forward [1]. This analysis is an example of a discrete event simulation in which the simulation model represents the important parameters of a real wireless network in Madison, Wisconsin. The model consists of 112 cell sites that provide the wireless service for Madison. Many of the key parameters, e.g., battery backup levels, generators, fuel levels, failure rates, the basic equipment configuration, and the voice traffic load, represent the real wireless network in that area. The model moves through a series of events including normal operation, damage caused by severe storms, loss of commercial power, and return to normal operation. The model is developed in Python's SimPy package and the data are analyzed with R [2][3]. The objective of this simulation is to demonstrate the value of modeling the reliability of an existing wireless network, with the end goal of improving disaster reliability and response.

II. PREVIOUS RESEARCH

From 2004 to 2006, Sandia National Laboratories and Bell Labs performed a series of simulations of natural disasters that affected critical infrastructure [4][5][6][7]. The simulation tool used was N-SMART. The most significant aspect of this project was the work done to forecast the possible impact that Hurricane Rita would have on Galveston, Texas [7]. Shortly after the Hurricane Katrina disaster in 2005, a second storm formed on the Gulf Coast. Hurricane Rita had the potential to cause severe damage to the Galveston, Texas area. O'Reilly et al. developed a simulation to understand the impact that Rita could have on the critical infrastructure of the area [7]. The Galveston community supports 4.1 million wireless and wireline customers. The researchers built three different simulations. First, the baseline simulation established a normal working environment. The second simulation was of a large-scale power outage. The final scenario was a large-scale power outage with disruptions due to flooding. The results of these simulations showed a substantial impact on the telecommunications network. Twelve mobile switching centers, 18 local telephone exchanges, access tandems, and long-distance switching offices all failed.
From the customer viewpoint, this meant that there were 6 hours during which all calls were blocked, and 95% blocking was experienced for many additional hours. This was caused by both failed equipment and call attempts at 20 times the normal rate [7].

Other significant disaster research often focuses on post-disaster results. Examples of this type of research include a study of the physical destruction of telecommunication equipment by Hurricane Katrina by Kwasinski et al. [8], the power and flooding effects on telecommunication equipment caused by Hurricanes Isaac and Sandy [9], and the research on the effects of the ice storm of February 1998 by Lecomte et al. [10]. This type of research often focuses on the physical destruction of telecommunication equipment and on methods to limit the impact of similar future disasters on that equipment.

The simulation in this paper is designed to build a model that can predict the impact of a disaster before it occurs, thereby allowing wireless companies to modify their designs and policies to make the network more reliable when a disaster does occur. Based on this simulation methodology, designs can be created that protect specific parts of the network, prevent total network collapse, or measure the resilience of the existing design. The simulation model uses many of the real-world engineering parameters for the wireless network in Madison to provide the most accurate representation of the behavior of that network. This provides an example of predicting the resilience of an existing network design.

III. METHODOLOGY

The simulation model is designed to represent the key parameters of a Long-Term Evolution (LTE) data and code division multiple access (CDMA) voice wireless network. To accomplish this, a SimPy class called "part" was created. The part class has attributes that describe the characteristics of an individual part; most importantly, it contains the time to failure, the dependence on other parts, the state of the part, the expected repair time, and, finally, the probability distribution that represents its failure characteristics.

There are two types of parts. The first type is a part that has infinite capacity. These parts have one in-service state and multiple out-of-service states. They never run out of capacity and, unless they are out of service, they are always running. The second type of part is a consumption-type part. These parts have all the same characteristics as a standard part but also have a consumption characteristic. Non-consumption parts are assigned to items like cell site routers, antennas, eNode-Bs, and other devices that do not consume resources. Consumption parts are items that have a limited amount of some resource; examples include batteries and propane tanks. Commercial power is considered a non-consumption part because the consumption required to create the power occurs outside the model's world view.

A number of random distributions were used to build this model. The beta distribution was used for the consumption parts to determine the current levels of individual consumption parts. This distribution has values ranging between 0 and 1, and its shape and spread are determined by two parameters, alpha and beta. Using these two parameters, a random distribution can be created to describe the actual environment, which allows the beta distribution to fit a wide variety of situations.

The exponential distribution was used to model the time to failure for the equipment. The exponential distribution often accurately represents time to failure for equipment in the steady-state part of its lifecycle [11]. Other advantages of the exponential distribution are that it is easy to calculate and that it has a memoryless property. This simulation is designed to model a relatively short period of time and assumes that all equipment is within the steady-state stage of its lifecycle. Other distributions, such as the Weibull distribution, would be used for longer simulations in which the equipment progresses from a higher initial failure rate, through a steady-state failure rate, and finally to a higher end-of-life failure rate [11].

Each generator object contains a fuel tank object with a specified amount of fuel that determines how long the generator will run. The tank size is the actual size of the fuel tank at the physical cell site used for the model, extracted from the appropriate engineering database. In the simulation, the fuel level was established by random assignment: a draw from a beta distribution multiplied by the maximum capacity of the fuel tank. The beta distribution used has alpha of 4 and beta of 1. This equates to roughly 60% of all fuel tanks having a fuel level greater than 80%, and 80% of all fuel tanks having a level greater than 60% of the maximum level, as shown in Fig. 1. The beta distribution provides estimates of the actual fuel levels of tanks under an organizational policy of being diligent about keeping these fuel tanks full. Alternate scenarios could be created; from this analysis, the best-case fuel tank policy was selected.
Fig. 1. Estimated capacity of the generator fuel tanks.
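As an illustration of this initialization step, the following minimal sketch draws starting fuel levels from a Beta(4, 1) distribution and scales them by tank capacity. The tank sizes and variable names are placeholders for this example, not values from the engineering database.

import random

random.seed(42)  # fixed seed so the example is reproducible

# Hypothetical tank capacities (gallons); the real values come from the
# engineering database and are not reproduced here.
tank_sizes = [250, 500, 1000]

fuel_levels = []
for size in tank_sizes:
    # Beta(alpha=4, beta=1) is skewed toward 1, i.e., mostly-full tanks:
    # P(fraction > 0.8) = 1 - 0.8**4, roughly 0.59.
    fraction_full = random.betavariate(4, 1)
    fuel_levels.append(fraction_full * size)

for size, level in zip(tank_sizes, fuel_levels):
    print("tank of {} gal starts with {:.0f} gal".format(size, level))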
In the simulation, every cell site has two battery parts. This was done to model both battery consumption and the possibility of total battery failure. The battery failure part has two states, working or failed, while the battery level part is a consumption part that cannot fail and can only be consumed. Together, these two parts represent one physical piece of equipment. The capacity of each battery plant is the actual battery plant capacity at the physical cell site used in this model. As with the fuel in the propane tanks, the battery capacity will vary. All batteries are assumed to be fully charged, because that is the way the physical network is designed. However, as batteries age, their maximum capacities degrade. To model this variable, the beta distribution was used again. The wireless company that this simulation was modeled on replaces its battery strings if the capacity of a string degrades to less than 70%. To simulate that real-world maintenance policy, a beta distribution with alpha of 10 and beta of 0.5 was used, as shown in Fig. 2. This is a much tighter distribution than the one used for the propane levels: there is a less than 1% chance that a battery plant has a capacity of 70% or less, a 3.9% chance of 80% or less, and a 15.2% chance of 90% or less.
Fig. 2. Estimated capacity of a battery system. A value of 1 represents 100% of engineered capacity.
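The quoted probabilities can be checked empirically with a short sketch like the one below, which simply samples the Beta(10, 0.5) distribution; it is an illustration, not part of the paper's model code.

import random

random.seed(1)

# Draw a large sample of battery-plant capacities from Beta(alpha=10, beta=0.5).
samples = [random.betavariate(10, 0.5) for _ in range(100000)]

for threshold in (0.7, 0.8, 0.9):
    share = sum(1 for s in samples if s <= threshold) / float(len(samples))
    print("P(capacity <= {:.0%}) ~ {:.1%}".format(threshold, share))
# With a sample this size the shares come out near the values quoted above
# (roughly <1%, ~4%, and ~15%).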
Each part contains a state variable that represents its operational state. The state variable describes three different operating conditions. State level one represents a fully functional part. State level two represents a part that is out of service because of a failure that requires repair. State level three is reserved and was not used in this simulation. Finally, state level four is a condition where the part is out of service because a part it depends on is out of service; when that part is repaired, the part returns to state level one. As an example, the cell site router is dependent on both the backhaul and the electricity supply. If the backhaul is at state level two (i.e., the part failed), the cell site router would be at state level four. In this example, the whole cell site would be marked out of service with a state level four, because operation of the cell site depends on the cell site router. Hardware failure rates, including battery failure rates, follow an exponential distribution, each with its own mean time between failures (MTBF). Table 1 gives the MTBF and mean time to repair (MTTR) values for each part in normal operation. The MTBF was calculated for each network element from empirical data covering the network failures of the previous 12 months.
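The state-level convention just described can be expressed as a small dependency check. The sketch below is illustrative only; the part names, the dependency map, and the helper function are assumptions made for this example and are not taken from the simulation code.

# State levels as described in the text.
IN_SERVICE, FAILED, DEPENDENCY_DOWN = 1, 2, 4

# Example dependency map: each part lists the parts it relies on.
depends_on = {
    "cell_site_router": ["backhaul", "commercial_power"],
    "enodeb": ["cell_site_router"],
    "cdma_bts": ["cell_site_router"],
}

status = {name: IN_SERVICE for name in
          ["backhaul", "commercial_power", "cell_site_router", "enodeb", "cdma_bts"]}

def refresh_states(status, depends_on):
    """Set state 4 on any part whose dependency is down; restore state 1 otherwise."""
    changed = True
    while changed:            # repeat so outages cascade down the dependency chain
        changed = False
        for part, deps in depends_on.items():
            if status[part] == FAILED:
                continue      # a hard failure (state 2) is cleared only by a repair
            want = DEPENDENCY_DOWN if any(status[d] != IN_SERVICE for d in deps) else IN_SERVICE
            if status[part] != want:
                status[part] = want
                changed = True

status["backhaul"] = FAILED    # the backhaul fails (state level two) ...
refresh_states(status, depends_on)
print(status)                  # ... and the router, eNode-B, and BTS drop to state level four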
TABLE 1
TIME BETWEEN FAILURE AND MEAN TIME TO REPAIR

Part Name              MTBF (hours)   MTTR (hours)
LTE eNode B                  35,040              5
Router                       84,096              5
Antenna                      78,840             12
Radio Unit                   67,890              5
Generator                    26,280             12
Power Company                30,660              6
Battery                      61,320             24
CDMA BTS                      8,760              5
Misc Cell Equipment         105,120              5

This table includes the time to fail and time to repair parameters used during the normal period of the simulation.
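As a small illustration of how these means drive the model, the sketch below draws one time to failure and one time to repair from exponential distributions parameterized by a few of the Table 1 values; it is an example only, not the simulation's code.

import random

random.seed(7)

# A few of the Table 1 means, in hours (the full table drives the real model).
MTBF = {"LTE eNode B": 35040, "Generator": 26280, "Power Company": 30660}
MTTR = {"LTE eNode B": 5, "Generator": 12, "Power Company": 6}

part = "Generator"
time_to_failure = random.expovariate(1.0 / MTBF[part])  # exponential with mean = MTBF
time_to_repair = random.expovariate(1.0 / MTTR[part])   # exponential with mean = MTTR
print("{}: fails after {:.0f} h, repaired in {:.1f} h".format(
    part, time_to_failure, time_to_repair))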
The MTTR is also described with an exponential distribution. The MTTR is the sum of all activities between identification of a part failure and restoration of the part, including acquisition of technical repair resources, drive time, troubleshooting, and component replacement. In this simulation, no repair dispatch is allowed during the disaster or for 6 hours after the disaster. When dispatch is allowed, the MTTR follows an exponential distribution where cell site equipment is fixed with an average MTTR of 5 hours, power and backhaul are fixed in 6 hours, external vendor equipment is fixed in 24 hours, and hard-to-replace items are also fixed in 24 hours. This again represents the best-case scenario; alternate analyses can be performed with different MTTRs. There are a total of five resources that repair equipment. In the physical world, each resource can be thought of as a technician or repairperson with the skills to repair failures.

The simulation software used was Python 2.7 with the SimPy 3.0 modules. With SimPy, each part is represented by an object, and the object has the characteristics of interest. For a non-consumption part, the characteristics of interest are the MTBF and MTTR. For consumption parts, the items of interest also include the maximum level, the current level, and the burn rate. Each part has two loops. The first is the working loop; while the part is working, each pass through this loop represents one period of work. The second is the break loop, which interrupts the working loop. When the working loop is interrupted, the simulation requests a resource to fix it, and one of the five available resources restores the working loop after a random period defined by the MTTR. Figure 3 shows concept code for what has been described above. This code is incomplete and is used only for display purposes, but it provides a general idea of the simulation. The code for consumption parts is very similar; the difference is that in the working loop, resources are consumed or replenished. One example is the battery level: if the commercial power part's loop is running, the battery resource is increased until it reaches its maximum capacity; if the commercial power part has failed, the resource is consumed.
import simpy

class part(object):
    def __init__(self, env, nfe, name="devicename", ttf=99999, mttr=3600):
        self.name = name
        self.env = env
        self.nfe = nfe
        self.ttf = ttf      # mean time to failure
        self.mttr = mttr    # mean time to repair
        self.work = 0

    def workingloop(self, nfe):
        while True:
            try:
                self.parent.partstatus[self.name] = self.watchdog()
                self.work = self.work + 1
                yield self.env.timeout(1)
            except simpy.Interrupt:
                self.parent.partstatus[self.name] = 2  # 2 = failed, awaiting repair
                with nfe.request(priority=1) as req:
                    yield req
                    mttr = TimeToRepair(self.env, req, self.mttr).getTTR()
                    # ... repair completion handling omitted in this concept code

    def break_machine(self):
        """Break the machine every now and then."""
        while True:
            yield self.env.timeout(self.time_to_failure())
            if self.parent.partstatus[self.name] == 1:  # 1 = in-service part
                # Only break the part if it is currently working.
                self.partprocess.interrupt()
Fig. 3. Example of simulation code. This example contains two processes, the working loop and the break machine.
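The consumption behavior described above (the battery resource replenished while the commercial power part runs and drawn down when it has failed) is not shown in Fig. 3. The following minimal SimPy sketch shows one way that loop could look; the class name, rates, and outage timing are illustrative assumptions, not the paper's actual code.

import simpy

class BatteryPlant(object):
    """Illustrative consumption part: the level rises while commercial power is
    up and falls while it is down (names and rates are example values)."""
    def __init__(self, env, capacity_hours, burn_rate=1.0, charge_rate=4.0):
        self.env = env
        self.capacity = capacity_hours
        self.level = capacity_hours        # assume a fully charged plant
        self.burn_rate = burn_rate
        self.charge_rate = charge_rate
        self.power_up = True               # mirrors the commercial power part's state
        env.process(self.working_loop())

    def working_loop(self):
        while True:
            yield self.env.timeout(1)      # one simulated hour per pass
            if self.power_up:
                self.level = min(self.capacity, self.level + self.charge_rate)
            elif self.level > 0:
                self.level = max(0.0, self.level - self.burn_rate)
                if self.level == 0.0:
                    print("hour %d: batteries exhausted" % self.env.now)

def power_outage(env, battery, start, duration):
    """Drive the sketch by failing and then restoring the commercial power part."""
    yield env.timeout(start)
    battery.power_up = False
    yield env.timeout(duration)
    battery.power_up = True

env = simpy.Environment()
plant = BatteryPlant(env, capacity_hours=8)
env.process(power_outage(env, plant, start=24, duration=48))
env.run(until=96)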
The simulation is broken into four periods. The first period is the 7 days before the disaster. The second period is the disaster itself, which lasts 4 days. There is then a 6-hour no-travel period, and finally there is a 7-day restoration period. This simulates a high-impact, medium-duration natural disaster. The profile that was developed is one where there may or may not be an early warning of the disaster, and the period of destruction is measured in multiple hours or days. If people are not evacuated in the early-warning period, then the recommendation is that they stay indoors and do not travel. After the disaster, there is a period where it remains unsafe to travel. Disasters that fit this profile include severe snowstorms, ice storms, hurricanes, flooding, wildfires, and prolonged severe thunderstorms. During the disaster period, the MTBF parameters are changed to estimate the impact of the disaster. At the end of this period, all MTBF parameters are returned to their normal-operation values. During period four, the restoration period, all failed equipment is repaired and the cell sites operate with their normal MTBF parameters (an illustrative sketch of this period schedule is shown after Fig. 5).

IV. RESULTS

The simulation consisted of 112 cell sites that cover Dane County (Madison), Wisconsin. Days 1 to 6 use the normal failure rates of the hardware, as shown in Table 1. The disaster starts at 00:00 on day 7 and ends at 23:59 on day 10. During days 11 to 14, there is a backlog of failed equipment. Beginning on day 15, the system returns to normal operation. The results show that the number of cell site outages increases at the start of the disaster, as shown in Fig. 4. A cell site outage is defined as a failure that completely stops voice call processing or data service. This is expected, and was coded into the simulation. The number of outages peaked at 43 cell sites out of service (OOS) on day 11, the day after the disaster ended. The first 6 hours of day 11 are the period of no travel. It took 4 additional days to clear the work queue and get all cell sites back into service. Normal operations resumed on day 15.

Fig. 4. Numbers of cell sites out of service each day.

However, not all part failures caused a cell site to go out of service. Figure 5 shows a complete listing of parts that failed. The total part failures peaked at 99 failures on day 11. All equipment that failed during the disaster was repaired by the end of day 14. At this point, the repair crews had cleared their queue of work.

Fig. 5. Total numbers of cell site parts that were out of service each day.
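The four-period schedule described at the start of this section could be driven by a single SimPy process along the following lines. The scale factor, dictionary keys, and crew handling are illustrative assumptions for this sketch, not the parameters used in the study.

import simpy

HOURS_PER_DAY = 24

def disaster_schedule(env, params):
    """Illustrative driver for the four periods: 7 pre-disaster days, a 4-day
    disaster with elevated failure rates and no dispatch, a 6-hour no-travel
    window, and then normal restoration."""
    params["mtbf_scale"] = 1.0            # normal failure rates
    params["dispatch_allowed"] = True

    yield env.timeout(7 * HOURS_PER_DAY)  # pre-disaster period
    params["mtbf_scale"] = 0.1            # example: failures ~10x more likely
    params["dispatch_allowed"] = False    # no repairs during the disaster

    yield env.timeout(4 * HOURS_PER_DAY)  # disaster period
    params["mtbf_scale"] = 1.0            # failure rates return to normal

    yield env.timeout(6)                  # 6-hour no-travel period
    params["dispatch_allowed"] = True     # restoration period begins

env = simpy.Environment()
repair_crews = simpy.PreemptiveResource(env, capacity=5)  # the five repair resources
params = {}
env.process(disaster_schedule(env, params))
env.run(until=18 * HOURS_PER_DAY)

In a sketch like this, each part process would consult the shared flags when drawing its next time to failure and before requesting a repair crew.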
All 112 cell sites have battery backup, and 22 also have fixed generators installed. The company's engineering standard is that each cell site has at least 8 hours of battery backup, while some cell sites have more than 24 hours of battery backup. The objective of the battery backup and the generators is to delay the cell sites from going OOS. In the simulation, a total of 42 cell sites lost commercial power and 36 went off the air because of a lack of power; so while the cell sites had battery backup and many also had generators, the duration of the disaster meant that these backup systems were exhausted. Within the first 24 hours, 11 sites lost commercial power and five exhausted their batteries. On day 8, 13 additional cell sites lost commercial power and an additional seven exhausted their batteries. Two days into the disaster, 50% of the cell sites were without commercial power and were running on either generators or batteries. On day 9, an additional 11 sites lost commercial power and 12 exhausted their batteries. On day 10, the last day of the disaster, 42 cell sites were without commercial power, the highest count of the simulation. Cell site outages due to battery exhaustion peaked on day 11, when 36 sites were out of service for that reason. In this network design, every cell site had at least 8 hours of battery backup and 20% of the cell sites had a fixed generator. By the end of the simulation, 85.7% of all cell sites were without commercial power and 37.5% of the total number of cell sites had exhausted their batteries and had gone OOS. In this simulation, commercial power failure leading to battery exhaustion was the single largest cause of cell site failure. Based on the collected simulation data, it is possible to forecast the point at which all cell sites would have total power failure. Figure 6 shows the probability of cell site failure because of total loss of power over time. This life-expectancy plot shows that if the disaster lasted for 150 hours (6.25 days), then 50% of the cell sites would experience total power failure. After 400 hours (16 days), all of the cell sites would suffer total power failure.
Fig. 6. Weibull analysis of the power system failure.
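A life-expectancy curve such as Fig. 6 can be produced by fitting a Weibull distribution to the simulated hours until each site loses all power. The sketch below uses SciPy and synthetic placeholder data (not the simulation output) purely to illustrate the fitting step.

import numpy as np
from scipy import stats

np.random.seed(0)

# Placeholder data: synthetic hours-to-total-power-failure for 112 sites.
# In practice these times would come from the simulation's output logs.
hours_to_power_failure = stats.weibull_min.rvs(c=1.8, scale=170, size=112)

# Fit a two-parameter Weibull (location fixed at zero) to the observed times.
shape, loc, scale = stats.weibull_min.fit(hours_to_power_failure, floc=0)

for t in (150, 400):
    prob = stats.weibull_min.cdf(t, shape, loc=loc, scale=scale)
    print("P(total power failure by {} h) = {:.2f}".format(t, prob))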
While loss of commercial power was the largest cause of cell sites being OOS, the second largest cause was backhaul failure. The backhaul is fragile and often experiences outages during disasters. In this simulation, 11 cell sites, or 9.8% of the total number of sites, experienced backhaul failure. The last of the backhaul failures was repaired on day 14. Given that this simulation was based on the parameters of a live wireless network, an estimate of the lost traffic can be calculated. By aligning the simulation time to 6/1/2014, the total amount of voice traffic that would be lost because of failed cell sites is 3,505,108 voice minutes of use. This assumes that the disaster will not change the calling pattern.
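The lost-traffic figure is a straightforward accumulation over the outage intervals. The sketch below shows the shape of that calculation with hypothetical per-site traffic profiles; the numbers are placeholders and do not reproduce the 3,505,108-minute result.

# Illustrative accounting of lost voice minutes. The traffic profile and the
# outage hours are hypothetical; the real calculation aligns the simulated
# outage intervals with the network's recorded traffic starting 6/1/2014.
minutes_per_site_hour = {"site_001": 35.0, "site_002": 12.5, "site_003": 48.0}
outage_hours = {"site_001": 30, "site_002": 0, "site_003": 72}

lost_minutes = sum(minutes_per_site_hour[s] * outage_hours[s]
                   for s in minutes_per_site_hour)
print("estimated lost voice minutes: {:.0f}".format(lost_minutes))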
In this model, a number of assumptions were made. First, the time to failure under normal conditions was estimated. The time to failure in the disaster period was designed based on a storm of moderate impact. One way to improve this research is to correlate the equipment failure rates with the intensity of the storm; that information is currently unavailable, and thus the failure rates in the disaster period are estimates. The simulation's world view ended at the backhaul and the commercial power company's network; it did not look into these vendors' networks. It is possible that there are additional failure points inside the power company's network, or that the interdependency of other telecommunication networks could lead to additional failures. This has been demonstrated in other simulations [5].

V. CONCLUSION

Discrete event simulation is a useful tool to aid understanding of the behavior of complex networks. In this example, a natural disaster affected a wireless network of 112 cell sites. The outcome at the peak was that 43 of the cell sites failed because of either damage at the cell site or damage to dependent services (the backhaul or the commercial power supply). In this example, the network was designed with each cell site engineered to have at least 8 hours of battery backup, and 20% of the cell sites had fixed generators, but 37.5% of the cell sites still experienced a failure because of a lack of commercial power. It took 4 days post-disaster to clear the backlog of failures. The power of this analysis lies in the multitude of additional scenarios that can quickly be created. What if 50% of all cell sites had generators, or what if the disaster was more severe and more equipment failed? How many repair crews are required to restore service within 24 hours of a failure? All of these questions can be answered quickly using this model. Rerunning the simulations and comparing the results offers many possible areas of future research.

REFERENCES AND FOOTNOTES
[1] Nance RE. History of Discrete Event Simulation Programming Languages. Proc. 2nd ACM SIGPLAN History of Programming Languages Conf., Cambridge, MA, April 20–23, 1993. Reprinted in ACM SIGPLAN Notices 1993; 28(3): 149–175.
[2] R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2014. Retrieved from http://www.R-project.org. Last accessed August 2014.
[3] SimPy Development Team. SimPy: Simulating Systems in Python. Retrieved from https://simpy.readthedocs.org/en/latest/index.html. Last accessed June 2014.
[4] Beyeler WE, Conrad SH, Corbet TF, O'Reilly GP, Picklesimer DD. Inter-infrastructure modeling—Ports and telecommunications. Bell Labs Tech J 2004; 9(2): 91–105.
[5] Jrad A, Uzunalioglu H, Houck DJ, O'Reilly G, Conrad S, Beyeler W. Wireless and wireline network interactions in disaster scenarios. Proc. IEEE Military Commun Conf 2005; 1: 357–363.
[6] O'Reilly GP, Houck DJ, Kim E, Morawski TB, Picklesimer DD, Uzunalioglu H. Infrastructure simulations of disaster scenarios. Proc 11th Int Telecom Netw Strategy and Planning Symp 2004; 205–210. Retrieved from http://ieee.org.
[7] O'Reilly GP, Jrad A, Nagarajan R, Brown T, Conrad S. Critical infrastructure analysis of telecom for natural disasters. Presented at the Telecom Netw Strategy and Planning Symp, New Delhi, India, November 2006.
[8] Kwasinski A, Weaver WW, Chapman PL, Krein PT. Telecommunications power plant damage assessment caused by Hurricane Katrina—Site survey and follow-up results. Proc 28th Int Telecom Energy Conf (INTELEC '06) 2006; 1–8. doi:10.1109/INTLEC.2006.251644
[9] Kwasinski A. Effects of Hurricanes Isaac and Sandy on data and communications power infrastructure. Proc 35th Int Telecom Energy Conf: Smart Power and Efficiency (INTELEC), 13–17 Oct. 2013; 1–6.
[10] Lecomte E, Pang A, Russell J. Ice Storm '98. Institute for Catastrophic Loss Reduction and Institute for Business and Home Safety, Toronto, Canada; 1998.
[11] Ayers ML. Telecommunications System Reliability Engineering, Theory, and Practice. Hoboken, NJ: John Wiley & Sons, Inc.; 2012.
Dr. Robert Jakubek, DCS, is the Senior Director of Engineering and Network Operations at U.S. Cellular, Central Region. Dr. Jakubek is responsible for the engineering, operation, and maintenance of U.S. Cellular's wireless network in Wisconsin, Illinois, Iowa, Nebraska, and Minnesota. Having served in his current role since 2005, Dr. Jakubek continues to increase the network's reliability and performance for U.S. Cellular. Since assuming the role, Robert and his team have won 16 J.D. Power awards for call quality. He is committed to providing the customer with the most reliable wireless network in the industry by ensuring the engineers are close to the customer and follow best-in-class practices.
Dr. Jakubek earned his Bachelor's degree in electronic management from Southern Illinois University, his Master's degree in business administration from the University of Colorado, a certificate in mid-level management from the University of Wisconsin, and a doctorate in computer science from Colorado Technical University's Institute of Advanced Studies. Dr. Jakubek is a senior member of the IEEE. His area of research is disaster preparation and recovery of wireless networks. Dr. Jakubek is a member of the industry advisory boards of the Milwaukee School of Engineering and Milwaukee Area Technical College. He currently lives in Cottage Grove, Wisconsin. His hobbies are marathon running, triathlons, rock climbing, ice climbing, biking, and spending time outdoors.