COVER FEATURE
Reliability and Survivability of Wireless and Mobile Networks

As wireless and mobile services grow, weaknesses in network infrastructures become clearer. Providers must now consider ways to decrease the number of network failures and to cope with failures when they do occur.
Andrew P. Snow, Upkar Varshney, and Alisha D. Malloy, Georgia State University
The world is becoming more dependent on wireless and mobile services, but the ability of wireless network infrastructures to handle the growing demand is questionable. Failures not only affect current voice and data use but could also limit emerging wireless applications such as e-commerce and high-bandwidth Internet access. As wireless and mobile systems play greater roles in emergency response, including 911 and enhanced 911 services, network failures take on life-or-death significance.

For wireless (and wireline) networks, a network's ability to avoid or cope with failure is measured in three ways:
• Reliability is a network's ability to perform a designated set of functions under certain conditions for specified operational times.
• Availability is a network's ability to perform its functions at any given instant under certain conditions. Average availability is a function of how often something fails and how long it takes to recover from a failure.
• Survivability is a network's ability to perform its designated set of functions given network infrastructure component failures that result in a service outage, which can be described by the number of services affected, the number of subscribers affected, and the duration of the outage.

Reliability, availability, and survivability have long been important areas of research for wireline networks, such as the public switched telephone network (PSTN)1,2 and asynchronous transfer mode (ATM)
networks.3 This research has led to many improvements and to increased regulator, carrier, and vendor focus on the design and implementation of serviceable PSTN switching, transmission, and signaling systems. The focus on PSTN reliability is partly the result of a Federal Communications Commission requirement for reporting of all major failures affecting 30,000 or more customers for 30 minutes or longer.4

Similar attention, including FCC reporting requirements, has not been directed toward wireless and mobile networks, even though they are more prone to failure and loss of access. Their vulnerability can be seen in the 1998 failure of the Galaxy IV satellite, which disabled 90 percent of US pagers, several data networks, and many credit-card verification systems. Most paging companies using the satellite—including PageNet with 10 million customers and SkyTel with 1.4 million customers—did not have a backup system through other satellites. The failure has since spurred some companies to use more than one satellite provider, and even satellite providers have begun distributing paging traffic through multiple satellites.

In the near future, three factors will likely pressure service providers to seriously examine their infrastructures and evaluate how best to handle new demands:

• tremendous competition, in which reliability and survivability become major competitive advantages (or even requirements);
• increased user awareness and control over services provided; and
• possible changes in FCC reporting requirements.

July 2000
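Average availability, as defined above, is commonly computed from mean time between failures (MTBF) and mean time to restore (MTTR). A minimal sketch, with illustrative figures rather than measured ones:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Average availability: fraction of time the network can perform
    its functions, given how often it fails (MTBF) and how long
    recovery takes (MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A switch that fails once a year (8,760 hours) and takes 4 hours to restore:
a = availability(8760, 4)
print(f"availability   = {a:.5f}")                            # about 0.99954
print(f"downtime/year  = {(1 - a) * 8760 * 60:.0f} minutes")  # about 240 minutes
```

At this MTBF and MTTR the switch is unavailable for roughly four hours a year; halving MTTR halves that figure, which is why the mitigation strategies in Table 1 target restoration time as much as failure frequency.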
Figure 1. Wireless infrastructure and components. (BS: base station; BSC: base station controller; HLR: home location register; MSC: mobile switching center; SS7: signaling system 7; VLR: visiting location register. Base stations connect through base station controllers to mobile switching centers, which link to one another and to the PSTN over SS7 signaling and high-capacity trunks.)
COMPONENT FAILURES

As Figure 1 shows, a typical cellular or personal communications services network infrastructure consists of a number of components, any of which could fail, affecting different numbers of users:5

• A base station serves hundreds of mobile users in a given area (cell) by allocating resources that allow users to make new calls or continue their calls if they move to the cell.
• A base station controller provides switching support for several neighboring base stations, serving thousands of users. Links between base station controllers and base stations have usually been wireline or fiber, but they can also be wireless microwave links.
• A mobile switching center is a larger switch capable of serving more than 100,000 users. Links between mobile switching centers and base station controllers are also increasingly wireless.
• Home location registers and visiting location registers keep track of users who are permanently registered or who are just visiting the area, respectively.
• Signaling system 7 (SS7) performs call setup between mobile switching centers and to the PSTN.
• High-capacity trunks (T1 or T3) carry calls between mobile switching centers and the PSTN.
In this configuration, a failure in a mobile switching center, a home location register/visiting location register, a mobile switching center-PSTN link, an SS7 link, or a PSTN trunk could affect nearly all customers under a mobile switching center—perhaps hundreds of thousands of people. Failure in other components would be less severe but still significant. Even if an individual component is unlikely to fail over months or years, a large network with thousands of components will experience failures. Of course, varying degrees of redundancy in these different types of wireless components affect failure frequency.

User susceptibility to component failure depends on three factors:

• mean time between failures, obtainable from vendor estimates and testing or from the carrier's field experiences,
Table 1. Wireless component failure impact and mitigation strategies.

| Failure | Reasons | Services affected | No. of users affected | Time to fix | Ways to improve | Time to fix after improvements |
|---|---|---|---|---|---|---|
| Satellite | Hardware, cosmic factors | Paging, voice, data | 1,000-10 million | Weeks to months | Hot standby or backup satellites | Minutes to hours |
| Mobile switching center | Hardware, software, operators | Voice, data | 100,000 | Hours to days | Redundant components, redundant power, smaller switches, Sonet ring, training | Seconds to minutes |
| Base station/base station controller | Hardware, software, nature | Voice, data | 1,000-20,000 | Hours | Overlay base stations, redundant components, Sonet rings | Seconds to minutes |
| Home location register/visiting location register | Hardware, software | Voice, data, other services | 100,000 | Hours to days | Replicated databases, redundant components | Seconds to minutes |
| Device interface | Hardware | Voice, data, paging, other services | 1 | Hours to days | Multiple interfaces to access different wireless networks | Seconds to minutes |
Table 2. Examples of the ANSI wireless outage index.

| Outage | Outage description | Registration blocking (IRB) | Call blocking (ICB) | Wireless outage index (IRB + ICB) |
|---|---|---|---|---|
| A | Peak hour; moderate duration (1/2 hour); users blocked from registration: 5,000; blocked calls: 60,000 | 0.053 | 0.320 | 0.373 |
| B | Peak hour; long duration (3 hours); users blocked from registration: 5,000; blocked calls: 60,000 | 0.101 | 0.606 | 0.707 |
| C | Peak hour; short duration (1/6 hour); users blocked from registration: 5,000; blocked calls: 120,000 | 0.005 | 0.115 | 0.120 |
| D | Nonpeak hour; very long duration (6 hours); users blocked from registration: 5,000; blocked calls: 4,800 | 0.116 | 0.004 | 0.120 |
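The structure of such an index can be sketched as a sum of weighted service-component contributions. The weight functions below are hypothetical stand-ins chosen only to illustrate the nonlinearity; the actual ANSI T1A1.2 mappings differ:

```python
import math

def magnitude_weight(customers: int) -> float:
    # Hypothetical nonlinear mapping from customers affected; the real
    # ANSI weight tables are different.
    return math.log10(1 + customers / 1000)

def duration_weight(minutes: float) -> float:
    # Hypothetical: grows sublinearly, so a sixfold-longer outage does
    # not score six times worse (compare outages A and B in Table 2).
    return math.sqrt(minutes / 60)

def outage_index(components) -> float:
    """components: (service_importance, customers_affected, minutes),
    one entry each for registration blocking and call blocking."""
    return sum(w * magnitude_weight(c) * duration_weight(m)
               for w, c, m in components)

# Outage A vs. outage B from Table 2: same magnitudes, 30 vs. 180 minutes.
a = outage_index([(0.3, 5000, 30), (0.7, 60000, 30)])
b = outage_index([(0.3, 5000, 180), (0.7, 60000, 180)])
print(f"A={a:.2f}  B={b:.2f}  ratio={b/a:.2f}")  # sixfold duration, sqrt(6) ~ 2.45x index
```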
• potential number of wireless users impacted by a particular wireless component failure, and
• mean time to restore, which includes fault isolation, repair or replacement, and testing times.

Wireless carriers will naturally focus on components that impact the most users, but they should also focus on increasing mean time between failures and decreasing mean time to restore, to minimize the frequency of failures and to recover faster from those failures that do occur. Table 1 shows customer impact, services affected, and high-level strategies for decreasing mean time to restore.

Switch failure will usually have a high impact, with most subscribers unable to access service and many user connections severed. Cellular and PCS users will lose initial registration and call delivery. Fortunately, a number of central office switches, which are very reliable when properly deployed and operated, are configured and deployed as mobile switching centers.

Wireless and mobile network databases, base stations, mobile devices, and links can all fail. Every network has a database to store, maintain, and update location information for mobile users. To guard against database failure, the carrier can mirror and replicate databases at multiple places in the network. Base station failures can be mitigated through redundant components, an overlay architecture, or interconnections using a Sonet ring.
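One way to rank components for attention is to combine the three susceptibility factors into expected user-outage-minutes per year. The fleet sizes, MTBF, and MTTR figures below are hypothetical, not taken from the article:

```python
HOURS_PER_YEAR = 8760

def annual_user_outage_minutes(mtbf_h: float, mttr_h: float,
                               users: int, count: int = 1) -> float:
    """Expected user-outage-minutes per year for a component type:
    (count * hours/year / MTBF) failures, each lasting MTTR hours,
    each affecting `users` subscribers."""
    failures_per_year = count * HOURS_PER_YEAR / mtbf_h
    return failures_per_year * mttr_h * 60 * users

# Hypothetical fleet: 5 mobile switching centers vs. 400 base stations.
components = {
    "mobile switching center": annual_user_outage_minutes(50000, 8, 100000, count=5),
    "base station":            annual_user_outage_minutes(20000, 4, 500, count=400),
}
for name, impact in sorted(components.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {impact:,.0f} user-outage-minutes/year")
```

Under these assumed figures the handful of mobile switching centers still dominates the aggregate impact, which matches the article's advice to focus first on components that affect the most users.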
OUTAGE INDEX One possible reason for the lack of published survivability research on wireless networks is that the FCC does not require wireless carriers to report outages, so nonproprietary outage data simply is not readily available. The ANSI T1A1.2 committee has defined an outage index for both wireline and wireless networks,6 and the wireline industry extensively uses this index to measure and publicize the impact of
large-scale FCC-reportable outages. However, if wireless carriers use this index, the information is not available to the public. Like companies in other industries, most wireless companies are reluctant to publicly report such data since they aren't required to do so.

The wireless outage index quantifies an outage's impact, taking into account both magnitude—the number of customers impacted—and duration—the number of minutes that service is impacted. Calculating the index for each outage incident (or nonsurvivable event) allows a variety of analyses, such as time series of aggregate impact per month or aggregation of impact by causal component category, such as mobile switching center or base station. The index has two service components:

• registration blocking, when the customer is unable to initially register with the wireless network (as required in cellular and PCS networks for making or receiving calls); and
• blocking of call attempts to and from registered customers.

An outage could involve one or both of these service components—each perhaps affecting different numbers of subscribers for different amounts of time. The outage index for each event is the sum of the outage index contribution for each service component impacted. The outage contribution involves magnitude and duration weights and the importance of the affected service component. Magnitude and duration weights are nonlinear mappings from the number of customers affected and the amount of time they are affected, respectively.6 They produce one scalar number that is often not intuitive and cannot easily be compared to indices from other outages.

Table 2 shows how counterintuitive the ANSI wireless index can be. Outage A is of moderate duration and outage B is of long duration, both during
peak periods. In each, network failures have prevented 5,000 subscribers from registering and have blocked 60,000 calls. The two outages differ only in duration, from one-half hour to three hours—a sixfold increase—but the outage index did not even double. Outage C, occurring during a peak period, has a short duration and a large magnitude; outage D, occurring during a nonpeak period, has a long duration and a small magnitude. The differences balance out, giving outages C and D the same outage index.

The outage index could also be inaccurate because it does not account for the growing use of dual-function/mode phones, including AMPS/CDMA, AMPS/digital cellular, and cellular-PCS/Iridium. When they experience a failure, most users of these phones will simply switch to their alternative network instead of making several attempts to access the network. Therefore, measured registration and call-blocking attempts may be significantly lower, leading to underestimation of the impact of failure. Although attempts have been made to develop more appropriate outage indices for wireless and PCS systems, including the addition of traffic mobility assumptions,7 additional changes in this arena are still needed.

Figure 2. Using multimode devices to improve reliability. (A single handset combines wireless LAN, satellite, cellular/PCS, and cordless/fixed radio access adapters.)

ARCHITECTURAL CHANGES

Several configurations and architectures can improve survivability, including Sonet rings interconnecting base stations and mobile switching centers,8 multifunction/multimode devices, and overlay networks. Base-station architectures can also enhance survivability through an improved signal-to-noise ratio, resulting in fewer radio link failures.
Sonet rings

Adding redundancy is one way to enhance reliability throughout the end-to-end connection. One way to do this is to use a fault-tolerant Sonet ring to link the switched network with multiple base stations serving the same geographic area. Such ring deployments can tolerate either a single fiber cut or a single transceiver failure because a counterrotating ring provides multiple paths (similar to a fiber distributed data interface, or FDDI, ring deployed in a fault-tolerant configuration).
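The fault tolerance of a counterrotating ring can be checked with a simple connectivity test; a sketch comparing a ring of sites against the same sites daisy-chained:

```python
def connected_after_cut(edges, n, cut):
    """Is the n-node graph still connected after removing edge `cut`?"""
    remaining = [e for e in edges if e != cut]
    seen, stack = {0}, [0]
    while stack:  # depth-first search from node 0
        u = stack.pop()
        for a, b in remaining:
            for v in ((b,) if a == u else (a,) if b == u else ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
    return len(seen) == n

n = 5
ring = [(i, (i + 1) % n) for i in range(n)]   # base stations + switch on a ring
chain = [(i, i + 1) for i in range(n - 1)]    # same sites daisy-chained

print(all(connected_after_cut(ring, n, cut) for cut in ring))    # True
print(all(connected_after_cut(chain, n, cut) for cut in chain))  # False
```

Any single cut leaves the ring connected through the opposite direction, while any single cut partitions the chain, which is the essence of the ring's survivability advantage.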
Figure 3. Using an overlay network to improve reliability. (The user reaches an overlay network through universal access points, UAPs, that span wireless LAN and cordless/fixed radio access networks.)
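A universal access point's network-selection logic might be sketched as follows; the policy, network names, and quality-of-service scale here are hypothetical, not drawn from any standard:

```python
def choose_network(candidates, min_qos, preferences):
    """Pick a wireless network for the user.

    candidates:  {name: (available, qos_level)} for each reachable network
    min_qos:     user's specified quality-of-service floor
    preferences: user-specified choices, in order
    """
    usable = [name for name, (up, qos) in candidates.items()
              if up and qos >= min_qos]
    for name in preferences:          # honor user-specified choices first
        if name in usable:
            return name
    return usable[0] if usable else None  # else any usable network, or none

nets = {"wireless LAN": (True, 3), "cellular/PCS": (True, 2), "cordless": (False, 3)}
print(choose_network(nets, min_qos=2, preferences=["cordless", "cellular/PCS"]))
# cellular/PCS: the preferred cordless network is down, hiding its failure
```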
Figure 4. Ensuring end-to-end connection reliability through heterogeneous networks. (A fixed user reaches a mobile user through: 1, access link; 2, switched network; 3, link from switched network to base station; 4, base station; 5, wireless link.)
Multifunction/multimode devices
Another way to improve survivability is to use multifunction/multimode devices, in which a single terminal offers multiple interfaces, as shown in Figure 2. Early examples of this architecture include the dual-function advanced mobile phone system (analog cellular)/code division multiple access (PCS standard) phone; the satellite/cell phone; and the emerging Global System for Mobile Communications and Digital Enhanced Cordless Telephony, the European PCS standard. This architecture provides overlapping services to ensure wireless coverage in case of network, link, or switch failure. It may also increase the effective coverage area. Each network could deploy a database that tracks network conditions and the user's location, device capabilities, and preferences, and then adapt the content before transmitting information over the network. The network needs location information to complete calls to the user, to alert services sending short messages to users, and to implement enhanced 911 services.

Overlay network

Yet another way to improve survivability and hide network failure is to deploy an overlay network. As Figure 3 shows, in this architecture a user accesses an overlay network consisting of several universal access points, which choose a wireless network for the user based on availability, specified quality of service, and user-specified choices. A universal access point performs protocol and frequency translation, as well as content adaptation.

All of these techniques involve capital investment. It is up to each carrier to evaluate the trade-off between the increased expenditure and customer satisfaction—a difficult decision-making process that will become more necessary in the future as dependence on wireless grows.

END-TO-END RELIABILITY AND SURVIVABILITY

We should not limit our view of wireless network reliability to the context of mobile communications. Most mobile applications involve connections through fixed networks. A connection usually consists of the concatenation of fixed and mobile network infrastructure circuits, so any consideration of reliability must cover the entire end-to-end connection, as Figure 4 shows. This perspective has not received much attention, for two reasons:
• The mobile provider is often a separate carrier from the fixed service carrier and has little influence over the fixed carrier's reliability engineering and investment—and vice versa (even if in the same company, they are often in different divisions with separate profit and loss centers).
• An end-to-end connection's reliability is no better than its worst component's reliability—traditionally the radio frequency path.

So far, poor radio path reliability has shaped user perceptions of wireless reliability. This end-to-end reliability perspective will, however, require more focus because users will demand it. In addition, as a result of the US Telecommunications Act of 1996, carriers will begin offering more end-to-end connections through assets under their total control. And, with improved encoding techniques and increased investment in base station coverage, radio frequency path reliability will improve. The end-to-end perspective is real, as evidenced by the nationwide calling plans offered by companies such as Sprint PCS and AT&T Wireless. These plans involve roaming agreements with multiple carriers because a single carrier cannot provide service in every location. In this way, only one carrier's poorly
Table 3. Reliability and redundancy of end-to-end connection components.

| Item | Description | Degree of redundancy |
|---|---|---|
| Access link | Local loop; copper twisted-pair for single user; high-capacity fiber for some customers; wireless and cable local loops emerging | Little to none: Single users have no redundancy; high-capacity links may offer alternate circuits. |
| Switched network | Switches interconnected by fiber links; could be multiple networks in tandem; circuit-switched or ATM | High: Usually rich link and switch redundancy to ensure rapid connection recovery. |
| Link from switched network to base station | Multichannel line-of-sight microwave or multichannel fiber-optic link | Low: Line-of-sight and fiber systems can have hot standby elements; loss of path means link failure (radio frequency loss or cable cut). |
| Base station | A switch and a radio frequency transceiver | Low: May have battery backup and some hot standby electronics. |
| Wireless link | Radio frequency path covering a particular geographic area | Little to none: No redundancy unless there is overlapping cell coverage. |
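Table 3's redundancy categories translate into simple availability arithmetic: serial components multiply availabilities, while redundant replicas fail only if every replica fails. A sketch with illustrative, unmeasured figures:

```python
def serial(*avails):
    """All components must work: availabilities multiply."""
    p = 1.0
    for a in avails:
        p *= a
    return p

def parallel(a, n):
    """n redundant replicas of a component: fails only if all n fail."""
    return 1 - (1 - a) ** n

# Illustrative figures (not measured values): duplicating a 0.999 element
# adds three nines; a serial chain is dragged down by its weakest member.
print(f"duplicated element:  {parallel(0.999, 2):.6f}")          # 0.999999
print(f"three-element chain: {serial(0.9999, 0.999, 0.995):.4f}")
```

This is why the access and wireless links, with little to no redundancy, dominate end-to-end connection reliability even when the switched network is highly redundant.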
Figure 5. End-to-end wireless transport service connections between two subscribers. Services 1, 2, and 3 are likely to be served by the same wireless carrier, while 4 could involve two or more carriers. The single-thread infrastructure for each subscriber transport service (S: subscriber) is:

1. Intracell (customers covered by the same base station in a cell): Si-BS-Sj
2. IntraMSC (customers covered by the same mobile switching center): Si-BSi-MSC-BSj-Sj
3. InterMSC (customers covered by the same anchor switch): Si-BSi-MSCi-Anchor-MSCj-BSj-Sj
4. Interanchor (customers covered by different anchor switches): Si-BSi-MSCi-Anchori-PSTN-Anchorj-MSCj-BSj-Sj
designed wireless network can impact the customers of several different carriers.

Each component that the end-to-end connection traverses is a potential point of failure, in which reliability depends on the degree of built-in redundancy. Table 3 shows how reliability and redundancy differ between components. For example, switched networks involve circuit switching, ATM, or a concatenation of several different networks. Such networks have a high degree of redundant paths and switches, assuring fast restoration or establishment of a new connection if one fails. Access links and radio frequency links, however, typically have little or no redundancy, making them the weak link in end-to-end connections. Having more reliable components in the connections increases their reliability. It is possible to add redundancy in access and radio links, but this is typically not done because of cost and limited frequency spectrum.

Figure 5 further illustrates the escalating vulnerability of wireless transport services to single points of failure. The service reliability threads, shown by dotted lines, illustrate the impact of individual component failures on survivability. The longer the thread, the greater the probability of individual connection disruption or service denial. There are even longer and more complex threads than those depicted, such as the intercarrier wireless transport service.
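The escalation is easy to quantify: if each single-thread component fails with probability p over some interval, a k-component thread is disrupted with probability 1 - (1 - p)^k. The component counts below follow the threads listed for Figure 5 (subscribers excluded); p is illustrative:

```python
p = 0.001  # illustrative per-component failure probability, not a measured value

# Infrastructure components per thread, per the Figure 5 legend:
# intracell: BS; intraMSC: BS, MSC, BS; and so on.
threads = {"intracell": 1, "intraMSC": 3, "interMSC": 5, "interanchor": 7}

for name, k in threads.items():
    disruption = 1 - (1 - p) ** k
    print(f"{name:12s} k={k}: P(disruption) = {disruption:.4f}")
```

Moving from an intracell to an interanchor connection roughly multiplies the disruption probability by seven, before even counting the second carrier's infrastructure in the intercarrier case.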
Currently, wireless and mobile networks are more prone to failure and loss of access than their wired counterparts. A failure can involve one or more of a wireless or mobile network's components—switches, base stations, databases, mobile devices, and wireless links. So far, users seem willing to expect less, but with increasing usage, diversity, and emphasis on subscriber services, expectations will change as users compare wireless and mobile networks with the high dependability of telephone and Internet access. Therefore, in addition to designing survivable wireless and mobile networks, developers must keep in mind that increasingly pervasive and demanding services will further escalate the importance of reliability and survivability requirements. In the future, in addition to making wireless networks more survivable, carriers' engineering and operational goals should focus on providing end-to-end service for voice and data mobile users in a mixed wireline/wireless infrastructure. ❖
References
1. A. Snow, "A Survivability Metric for Telecommunications: Insights and Shortcomings," Proc. Information Survivability Workshop, IEEE CS Press, Los Alamitos, Calif., Oct. 1998; http://www.cert.org/research/isw98/all_the_papers/no32.html.
2. F. Schneider and S. Bellovin, "Evolving Telephone Networks," Comm. ACM, Jan. 1999, p. 160.
3. R. Kawamura, "Architectures for ATM Network Survivability," IEEE Comm. Survey, 1999; http://www.comsoc.org/pubs/surveys.html.
4. Alliance for Telecommunications Industry Solutions, "Network Reliability Steering Committee," Macro Analysis: Second Quarter, 1999, Arlington, Va., 1999.
5. T.S. Rappaport, Wireless Communications, Prentice Hall, Upper Saddle River, N.J., 1996. 6. ANSI T1A1.2 Working Group, A Technical Report on Network Survivability Performance, Tech. Report 24A, ANSI, 1997; http://www.t1.org/t1a1/_a12-hom.htm. 7. S. Ramaswamy, A Framework for Survivability Analysis for Cellular and PCS Networks, master’s thesis, Dept. of Information Science and Telecommunications, Univ. of Pittsburgh, 1997. 8. U. Varshney, A. Snow, and A. Malloy, “Designing Survivable Wireless and Mobile Networks,” Proc. IEEE Int’l Wireless Comm. Networking Conf., IEEE CS Press, Los Alamitos, Calif., 1999, pp. 30-34.
Andrew P. Snow is an assistant professor in the Department of Computer Information Systems at Georgia State University. His research interests include network survivability, software reliability, and software engineering. Snow received a PhD in information science from the University of Pittsburgh. He is a member of the IEEE and the IEEE Computer, Communications, and Engineering Management Societies. Contact him at [email protected].
Upkar Varshney is an assistant professor in the Department of Computer Information Systems at Georgia State University. His research interests include wireless technologies for mobile networks, multicasting, e-commerce, and network reliability. Varshney received a PhD in telecommunications and computer networking from the University of Missouri-Kansas City. He is a member of the ACM and the IEEE Computer, Communications, and Vehicular Technologies Societies. Contact him at [email protected].
Alisha D. Malloy is a doctoral student in the Department of Computer Information Systems at Georgia State University. Her research interests include quality of service, reliability, and survivability of wireless networks. Malloy received an MS in engineering management from Old Dominion University. She is a student member of the IEEE and the ACM. Contact her at [email protected].