Automated Troubleshooting in WLAN Networks

A. Samhat (1), R. Skehill (2), and Z. Altman (1)

(1) France Telecom R&D Division, RESA-NET, 38-40 rue du Général Leclerc, 92794 Issy Moulineaux, France.
(2) Wireless Access Research Center, Foundation Building, University of Limerick, Ireland.

Abstract: The automation of troubleshooting in mobile and wireless networks is essential to guarantee efficient network operation with high performance and low cost. In this paper, we focus on the automated diagnosis of faults in WLAN networks and propose a framework based on Bayesian Networks to relate faults to symptoms. Candidate lists of faults and symptoms for the diagnosis are proposed, and the construction of the diagnosis model is described. To illustrate the proposed approach, some results based on statistics from a WLAN network simulator are examined.

I. INTRODUCTION

Wireless infrastructure is evolving towards a heterogeneous wireless access network encompassing various radio technologies including GERAN, UTRAN, and WLAN. In this context of increasing network complexity, automated management is essential to provide high-quality services and to achieve optimal cooperation between radio access networks. Legacy management methods relied on manual processes, whereas operators are now interested in automating management tasks in order to reduce operational cost and to increase operational efficiency.

Troubleshooting (TS), or fault management, is an important building block in wireless network resource management. It consists of three steps: detecting problems (fault detection), identifying the cause (cause diagnosis) and fixing the problem (problem solving). When a fault is detected by a diagnosis system, the difficult and time-consuming task is to locate the cause of the fault. In current cellular networks, this step is accomplished manually by highly skilled staff dedicated to this process. One or more possible causes are identified and the corresponding healing actions are tested to fix the problem.

As automated troubleshooting helps to dramatically lower operational cost, intense research activities have already been carried out in this field for GSM networks [1], [2], [3], and extensions are currently being investigated for UMTS networks [4]. In [1], an autonomic diagnosis model for cellular networks is proposed. It identifies the fault cause based on the values of performance indicators. Performance indicators are modelled as continuous random variables, and the statistical relations between symptoms and causes are specified by means of the parameters of beta probability density functions (pdf). The proposed model was tested in a live GSM network and showed high diagnosis accuracy.

In [2] and [3], Bayesian Networks (BN) were selected as the reasoning method for automatic diagnosis in cellular networks. A similar approach based on Bayesian Networks is presented in [4] to provide automated diagnosis and TS tasks in UMTS networks. BN is a knowledge-based method, often referred to as a Bayesian cognitive approach, that learns how to build a model from data. The data consist of the cumulative TS experience, namely the list of observed symptoms and alarms with the corresponding faults that have been inserted by the radio expert into a database. The BN uses the data to create a statistics-based mathematical model which relates symptoms to causes of faults. The quality of the model improves with the volume of data in the database.

To our knowledge, little work investigating auto-troubleshooting in WLAN has been reported in the literature [5]. In this work, we focus on troubleshooting in WLAN networks, mainly those based on the IEEE 802.11 standard. WLAN-based systems are considered a potential wireless access system: they are an interesting option for cellular operators to cover hotspot areas, and an increasing number of Internet Service Providers (ISPs) provide wireless network connectivity using WLAN. This surge of WLAN deployment has created a new breed of performance and troubleshooting challenges. As in cellular networks, automated troubleshooting should be applied to ensure that the WLAN network provides maximum performance.

The aim of this paper is to show how to adapt the Bayesian Network model to automated diagnosis in WLAN networks. A list of the faults encountered in 802.11 WLANs and different quality indicators are given. Particular attention is given to identifying the fault/symptom relationships using simulations. This work is carried out within the European CELTIC Gandalf project, which focuses on automating management tasks in heterogeneous radio access networks.

The rest of the paper is organized as follows. In Section II, we present the diagnosis system model and formulate the mathematical model for automated diagnosis based on a Bayesian Network (BN). In Section III, we identify the WLAN input information for the model, i.e. symptoms, faults, and their relationships. In Section IV, a case study showing how to estimate the parameters of the diagnosis model is given, followed by concluding remarks in Section V.

II. SYSTEM DIAGNOSIS MODEL

As stated above, the first step in troubleshooting is fault detection, which identifies the network elements with poor performance based on symptoms and alarms. Symptoms are quality indicators or Key Performance Indicators (KPIs) that are used to monitor the quality of service perceived by the user, together with performance indicators characterizing the network functioning. Alarms can be triggered when there is a hardware failure or when a certain KPI value exceeds a pre-determined threshold.

The second step, the diagnosis, is the cause identification, that is, the automatic reasoning mechanism used to find the cause of the problems. Several parameters and possible hardware failures could deteriorate the WLAN network performance and cause alarms. Furthermore, one fault could often trigger a series of alarms. To achieve a conclusive diagnosis, not only alarms but also performance indicators should be taken into account. The cause of poor WLAN network performance could be a hardware problem or a bad parameter value. A parameter value could be too large or too small (with respect to an optimal value) and not suitable for the current network state.

Finally, problem solving is the execution of the actions that solve the problems and get the network element up and running normally again. In this work, we investigate the automatic diagnosis step, and we present in the next subsection the mathematical framework to perform diagnosis using Bayesian networks.

A. Mathematical framework

Denote by $C_i$ a cause for bad functioning or alarms in the WLAN network; by $S_j$ a symptom that can contribute to assessing the WLAN network quality; and by $E$ a set of $N$ symptoms $\{S_j\}$. The diagnosis consists of identifying the cause with the highest probability given a set of symptoms. More than one cause can be identified, each with a corresponding probability. This process can be seen as a classification process in which each class corresponds to a given cause. Using Bayes’ rule, one can calculate the probability for the cause $C_i$ to occur given the set of observed symptoms $E$:

$$P(C_i \mid E) = \frac{P(C_i)\,P(E \mid C_i)}{P(E)} \qquad (1)$$

$P(C_i \mid E)$ is known as the posterior (conditional) density and $P(C_i)$ as the prior density. For a fixed $C_i$, $P(E \mid C_i)$ is known as the likelihood function or distribution. Typically, certain conditions on the diagnosed network are known a priori, such as the type of services provided, the number of frequencies used by the AP, or any other a priori knowledge. We denote the set of conditions by $D$. Adding the set $D$ to (1) gives

$$P(C_i \mid E, D) = \frac{P(C_i \mid D)\,P(E \mid C_i)}{P(E)} \qquad (2)$$

It is assumed in (2) that the symptoms are independent of the conditions. When this assumption does not hold, the conditions should be added to the likelihood distribution and to the $P(E)$ term. Calculating the joint likelihood distribution $P(E \mid C_i)$ is difficult and impractical. We therefore use the Naive Bayesian Network, which assumes that the symptoms are independent given the causes. With this assumption, one only needs to specify the likelihood distribution of each symptom, $P(S_j \mid C_i)$, separately. This assumption is an approximation; however, the Naive Bayesian classifier remains efficient even with significant dependencies between the symptoms [6]. Hence, the classification in the diagnosis process is performed using the approximation:

$$P(C_i \mid E, D) = \frac{P(C_i \mid D)\,\prod_{j=1}^{N} P(S_j \mid C_i)}{P(E)} \qquad (3)$$

At this stage, the use of the model requires the following inputs for the WLAN network: a list of KPIs (symptoms), a list of faults (causes), and the statistical relations between a fault and a set of KPIs, which give the conditional probability of each KPI given each fault.

III. MODEL REQUIREMENTS

A. Key Performance Indicators in WLAN

In current infrastructure-mode WLAN systems, the following KPIs (symptoms) are considered:
• Receive Signal Strength Indicator (RSSI): a coverage indicator that gives an indication of the perceived signal strength at the transmitter (TxRSSI) and at the receiver (RxRSSI). The higher the RSSI, the larger the AP coverage.
• Noise level: noise refers to the background RF radiation present in the receiver’s environment. Every environment has some noise and sources of interference, such as cordless phones and Bluetooth devices. The Signal to Interference and Noise Ratio (SINR) compares peak signal strength to noise. The higher the SINR, the more stable and usable the WLAN service.
• Number of users: the number of stations currently associated with the AP.
• Throughput or inverse of throughput: the throughput is defined as the transmitted data divided by the transmission duration. The inverse of the throughput indicates the service time of a data unit.

The KPIs described above are considered to be generic 802.11 KPIs and can be extracted from any AP. However, extended KPIs, in the form of tallies, are available with the Prism II based APs [7] used in our local testbed [8]. These KPIs can be used to differentiate further between hardware, configuration and error situations. The extended KPIs are classified into Transmission (Tx) and Reception (Rx) KPIs.
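As a small illustration of the generic KPIs listed in this subsection, the following Python sketch computes the throughput, its inverse, and an SINR value. The function names and the choice of units (bits, seconds, milliwatts) are assumptions made for this example only; the paper does not specify how the AP reports these quantities.

```python
import math


def throughput(data_units: float, duration_s: float) -> float:
    """Throughput: transmitted data divided by the transmission duration."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return data_units / duration_s


def inverse_throughput(data_units: float, duration_s: float) -> float:
    """Inverse of the throughput: service time of one data unit."""
    if data_units <= 0:
        raise ValueError("data_units must be positive")
    return duration_s / data_units


def sinr_db(signal_mw: float, interference_mw: float, noise_mw: float) -> float:
    """SINR in dB from linear powers (assumed here to be given in mW)."""
    return 10.0 * math.log10(signal_mw / (interference_mw + noise_mw))
```

The inverse of the throughput grows when the service degrades, which is why a degradation shows up as a shift towards higher values of this symptom.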
The following are key performance indicators regarding the reception of packets at the access point:
• FCS errors: number of received messages or message parts that contained an erroneous value and had to be deleted.
• Buffer not available: number of times an incoming message could not be received due to a shortage of receive buffers on the AP. A non-zero value identifies heavy data traffic for an AP; for example, when the AP is receiving large amounts of data.
• Message in message fragments: number of times messages were received while another transmission was in progress.
• Message in bad message fragments: number of times messages were received while a transmission elsewhere in the wireless network was in progress.

TABLE I
EXTENDED KPIS FROM PRISM II CHIPSET BASED APS

| Parameter | Value | Indicates |
|---|---|---|
| RxFCSErrors | High | Interference or hardware malfunction |
| RxDiscardsNoBuffer | Non zero | High data load |
| RxMessageInMsgFrags | Non zero | Multiple users simultaneously using the wireless medium |
| RxMessageInBadMsgFragments | Non zero | Heavily loaded |
| TxDiscardsWrongSA | Non zero | Error in communication between protocol stack and driver |
| TxDeferredTx | High | Multiple users simultaneously using the wireless medium |
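The indications of Table I can be read as simple rules on the tally values. The Python sketch below is one possible encoding, not the authors' implementation: the rule list is transcribed from Table I, while the function name `indications` and the numeric limits used for the "High" condition are hypothetical, since the paper does not give them.

```python
from typing import Dict, List

# Rules transcribed from Table I: (tally name, condition, indication).
TABLE_I_RULES = [
    ("RxFCSErrors", "high", "Interference or hardware malfunction"),
    ("RxDiscardsNoBuffer", "nonzero", "High data load"),
    ("RxMessageInMsgFrags", "nonzero",
     "Multiple users simultaneously using the wireless medium"),
    ("RxMessageInBadMsgFragments", "nonzero", "Heavily loaded"),
    ("TxDiscardsWrongSA", "nonzero",
     "Error in communication between protocol stack and driver"),
    ("TxDeferredTx", "high",
     "Multiple users simultaneously using the wireless medium"),
]


def indications(tallies: Dict[str, int],
                high_limits: Dict[str, int]) -> List[str]:
    """Return the Table I indications triggered by a set of tallies.

    high_limits holds the (hypothetical) value above which a tally
    counts as "High"; missing tallies are treated as zero.
    """
    triggered = []
    for name, condition, meaning in TABLE_I_RULES:
        value = tallies.get(name, 0)
        if condition == "nonzero":
            hit = value != 0
        else:  # "high"
            hit = value > high_limits[name]
        if hit:
            triggered.append(f"{name}: {meaning}")
    return triggered


# Example with made-up tallies and made-up "High" limits.
example = indications(
    {"RxFCSErrors": 5, "RxDiscardsNoBuffer": 3, "TxDeferredTx": 1000},
    {"RxFCSErrors": 100, "TxDeferredTx": 500},
)
```

Such rules only flag individual indications; relating several indications to a single cause is the role of the diagnosis model of Section II.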

The following are key performance indicators regarding the transmission of packets at the access point:
• Wrong station address on transmission: number of times a message transmission was not carried out because a wrong MAC address was used by the protocol stack.
• Deferred transmissions: the number of MSDUs (MAC Service Data Units) for which one or more transmission attempts were deferred to avoid a collision.
• Single Retry Frames Transmitted: the number of MSDUs successfully transmitted after one (and only one) retransmission (on the total of all associated fragments).
• Transmit Retry Limit Exceeded: the number of times an MSDU was not transmitted successfully because the retry limit (either the ShortRetryLimit or the LongRetryLimit) was reached, because no acknowledgment or CTS was received.
• Discarded transmissions: the number of transmit requests that were discarded to free up buffer space on the NIC.

Table I summarizes the indications of some KPIs. When the WLAN system supports radio resource management (RRM) mechanisms such as admission control and load control algorithms, the following KPIs are available for the diagnosis system:
• Blocking rate: used to monitor the results of the execution of the admission control algorithm.
• Dropping rate: used to monitor the results of the execution of the load control algorithm.

B. Faults encountered in WLAN

Among the faults that could occur in WLAN networks, we list the following:
• Bad coverage: after a WLAN network has been installed, obstacles may appear that were not there before. These obstacles attenuate the signal, which decreases the performance of the network.
• High noise: this reduces the SINR. The result is a low throughput per user. (Remedy: reduce RF interference or increase the transmission power in the affected region.)
• High load: this occurs when there is a large number of active wireless users. The result is a high collision rate, which leads to a low throughput per user.
• Broken antenna or bad tilt: a hardware fault which leads to poor performance.
• MAC buffer problem: an inappropriate configuration of the MAC buffer size at the AP yields bad performance.

When RRM algorithms are supported, a fault could also be a bad RRM parameter value which leads to poor network performance.

C. Symptom/fault relationships

A simple example of fault/symptom relationships, based on the expertise of WLAN designers, is given in Table II. Such a table is useful for small networks and manual diagnosis. However, for auto-diagnosis in a large-scale network, the symptom/fault relationships, i.e. the conditional probability of each KPI given each fault, could be determined based on statistics from a live WLAN network. In this study, a WLAN simulator coupled with the expertise of human troubleshooters is used to estimate the required parameters as well as the thresholds. Simulations have the advantage of generating a large amount of data at low cost.

The probability distributions can be derived from histograms of the KPIs in faulty and normal conditions. For each KPI, the probability distribution is first determined in the absence of faults, i.e. in normal conditions. Then, a fault is introduced in the network and the probabilities are learned from the observed values calculated by the simulator. A fault and a KPI are considered to be associated if the fault triggers the KPI to enter an abnormal state. Consequently, the conditional distribution of each KPI given each associated fault models the strength of the fault-KPI relation, as detailed in the next section.

TABLE II
EXAMPLE OF FAULTS/SYMPTOMS RELATIONSHIPS

| KPI | Broken antenna (tilt) | High load | Bad coverage | High noise |
|---|---|---|---|---|
| RSSI | Low | Normal | Low | Normal |
| Noise level | Low | Normal | Normal | High |
| Number of users | Low | High | Normal | Normal |
| Throughput | Low | Low | Low | Low |
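The estimation procedure described in this subsection and the classification rule (3) can be sketched together in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the fault-free case is treated as one of the classes, the equal priors are made up, the denominator $P(E)$ is obtained by normalising over the listed causes, and the three intervals of Table III are assumed to correspond to the states normal, high and very high. The likelihood values in the example are the ones reported in Table III.

```python
from bisect import bisect_right
from collections import Counter
from math import prod

STATES = ("normal", "high", "very high")


def discretize(value, thresholds):
    """Map a continuous KPI value to one of the three states.

    thresholds = (normal/high boundary, high/very-high boundary).
    """
    return STATES[bisect_right(thresholds, value)]


def estimate_likelihood(samples, thresholds):
    """Estimate P(S_j = state | C_i) as relative frequencies (histogram)."""
    counts = Counter(discretize(v, thresholds) for v in samples)
    return {state: counts[state] / len(samples) for state in STATES}


def posterior(priors, likelihoods, observed):
    """Naive Bayes posterior of Eq. (3).

    priors:      {cause: P(C_i)}
    likelihoods: {cause: {symptom: {state: P(S_j = state | C_i)}}}
    observed:    {symptom: observed state}
    """
    scores = {
        cause: priors[cause]
        * prod(likelihoods[cause][s][state] for s, state in observed.items())
        for cause in priors
    }
    total = sum(scores.values())  # plays the role of P(E)
    if total == 0:
        raise ValueError("observed symptoms have zero probability under all causes")
    return {cause: score / total for cause, score in scores.items()}


# Likelihoods of the inverse-throughput symptom, taken from Table III.
likelihoods = {
    "normal": {"inverse_throughput":
               {"normal": 0.94, "high": 0.04, "very high": 0.02}},
    "antenna_tilt": {"inverse_throughput":
                     {"normal": 0.0933, "high": 0.08, "very high": 0.8267}},
}
priors = {"normal": 0.5, "antenna_tilt": 0.5}  # hypothetical equal priors

result = posterior(priors, likelihoods, {"inverse_throughput": "very high"})
```

With these inputs, observing a very high inverse throughput makes the antenna tilt fault far more probable than the fault-free case. With several symptoms, each additional symptom contributes one more factor to the product, which is what makes the naive assumption convenient: every $P(S_j \mid C_i)$ can be estimated separately from its own histogram.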

IV. WLAN NETWORK STATISTICS FOR AUTOMATED DIAGNOSIS

The aim of the diagnosis is to identify the causes of dysfunctions and poor performance due to hardware problems and/or bad parameter settings, using a set of symptoms (KPIs). Most symptoms are continuous. Hence, in order to use discrete variables, they are discretized according to predefined thresholds [1]. We have generally used a three-state discretization of all symptoms, where the states are labelled:

- Normal for what is considered the normal range of that symptom
- High for the range where the symptom may indicate a problem
- Very high for the more extreme range where we are definitely outside the normal range

The estimated threshold separating normal and high values has been set to the 85th percentile of the distribution of the symptom values. Similarly, the threshold separating the high and very high states has been set to the 90th percentile. The prior probabilities of the faults, $P(C_i)$, depend on how often the faults were inserted into the simulator. The remaining probabilities in Eq. (3) were learned from the observed values calculated by the simulator when faults were introduced.

A. Scenario Setup

The NS2 simulator [9] with the TeNS extension is used to create a network of IEEE 802.11 access points with directional antennas. The IEEE 802.11 stations are grouped in clusters of 4 and positioned on a flat, open, free-space area. Each access point is represented by APNum(x,y), where Num is an integer from 0 to 39 and (x,y) are coordinates in meters. We consider an example of 4 access points, AP0(10,10), AP1(10,600), AP2(600,600) and AP3(600,2300), which form a multihop network. AP0 and AP3 are equipped with one directional antenna while AP1 and AP2 have two. A Constant Bit Rate (CBR) traffic flow between AP0 and AP3 is started with a packet size of 1420 bytes and an interval of 0.005 s. In the simulation, a fault is introduced at AP2 by altering the tilt of the directional antenna pointing towards AP3. The directional antenna is the Hyperlink Technologies HG2414P, with the following specification:
• Gain: 14 dBi
• Horizontal Beam Width: 30 Degrees
• Vertical Beam Width: 30 Degrees
• Frequency Range: 2400-2500 MHz

In the first step, normal statistics are calculated for the network without introducing faults (normal conditions). One simulation is carried out, and the computed KPIs are stored for each AP. In the second step, the tilt fault is introduced and a sufficient number of simulations (about 80) are run to give statistics that are used by the BN model to perform statistical inference. Each symptom is discretized into three states: normal, high and very high. For example, Figure 1 compares the histograms of the inverse of the throughput for the normal and faulty APs in the case of an antenna tilt problem. One can notice a shift to the right of the histogram for the faulty case, indicating quality degradation due to the increase of the service time of a data unit. The resulting statistics for the inverse of the throughput symptom are summarized in Table III. They are used by the BN to perform statistical clustering. Once the likelihood distribution of each symptom is available, the learning phase of the BN model can be carried out.

[Figure 1: histograms (PDF versus inverse of throughput, axis scaled by 10^-3) for the normal case and the faulty case]

Fig. 1. Histograms of the inverse of the throughput for normal and antenna tilt fault

TABLE III
INVERSE OF THE THROUGHPUT STATISTICS FOR NORMAL AND FAULTY CASES FOR ANTENNA TILT FAULT

| Inverse of throughput | Normal case | Antenna tilt fault |
|---|---|---|
| 0 - 0.0054 | 0.94 | 0.0933 |
| 0.0054 - 0.0056 | 0.04 | 0.08 |
| 0.0056 - 1 | 0.02 | 0.8267 |

V. CONCLUSION

In this paper, we have presented the necessary adaptations of a Bayesian Networks model to perform automatic diagnosis

for WLAN networks. We have listed a number of symptoms and faults, and we have shown an example of how to build the statistics database required to relate faults to symptoms in order to learn the auto-diagnosis model. Future work will use statistics from a real WLAN network in the learning phase.

REFERENCES

[1] R. Barco, V. Wille, and L. Diez, System for automatic diagnosis in cellular networks based on performance indicators, European Trans. Telecommunications, vol. 16, no. 5, pp. 399-409, Oct. 2005.
[2] R. Barco, L. Nielsen, R. Guerrero, G. Hylander, S. Patel, Automated troubleshooting of a mobile communication network using Bayesian networks, in Proc. IEEE International Workshop on Mobile and Wireless Communications Networks (MWCN02), Stockholm, Sweden, Sept. 2002.
[3] R. Barco et al., Comparison of probabilistic models used for diagnosis in cellular networks, in Proc. IEEE 63rd Vehicular Technology Conference, Melbourne, Australia, May 2006.
[4] R. Khanafer, L. Moltsen, H. Dubreil, Z. Altman, R. Barco, A Bayesian approach for automated troubleshooting for UMTS networks, in Proc. IEEE Intern. Symp. PIMRC06, Helsinki, 11-14 Sept. 2006.
[5] S. Felis, J. Quittek, L. Eggert, Measurement-Based Wireless LAN Troubleshooting, First Workshop on Wireless Network Measurements (WinMee 2005), Trentino, Italy.
[6] F. Jensen, Bayesian Networks and decision graphs, New York, USA: Springer-Verlag, 2001.
[7] Intersil, Intersil Prism Programmers Guidebook, 2001.
[8] R. Skehill, D. Picovici and K. Mc Grath, Wireless MANET Testbed for Reproducible Voice over IP Evaluation, International Communication Sciences and Technology Association MeshNes, Budapest, July 2005.
[9] The Network Simulator-ns2 (2006), ns-2.30, Available at http://www.isi.edu/nsnam/ns.
