Fault Management for VoIP Applications over Wireless and Wired NGN Networks: An Operational Perspective
Luca Monacelli and Roberto Francescangeli
Dipartimento di Ingegneria Informatica ed Elettronica - DIEI, Università degli Studi di Perugia, Perugia, Italy
[email protected],
[email protected]
Abstract— This paper addresses the practical aspects to be considered when facing the Fault Management problem in a network deploying VoIP services for both wireless and wired access. The real network analyzed includes all the entities necessary for implementing a plain VoIP service, according to the NGN architecture. We provide the KPI definitions, together with a detailed description based on the practical experience of a real nationwide network, and describe their usage within a Fault Management framework.
Index Terms—NGN, IMS, VoIP, Fault Management
I. INTRODUCTION
Telecommunication networks are complex systems of interconnected hardware and software entities coordinated by communication protocols. Network and application services are becoming extremely complex and are subject to QoS constraints [1]. Each component of a network might undergo different problems, such as faults, misconfigurations, and overwhelming usage. A service provider or a network operator should face or, even better, avoid such problems, possibly by using specific management and control tools. In 1997 the ISO, building on a previous ITU-T model referred to as TMN (Telecommunication Management Network), defined a management framework for telecommunication networks, known as FCAPS, which stands for Fault, Configuration, Accounting, Performance and Security, corresponding to five relevant management categories [2]. Performance Management includes quantifying, measuring, reporting, analysing and controlling the performance of network components; Fault Management means detecting, isolating, and counteracting faults in the network; Configuration Management is used to obtain and control configuration parameters of the network entities; Accounting Management includes usage statistics, allocation of costs, and the relevant pricing; Security Management consists of monitoring accesses to the network resources according to pre-defined policies. In NGN networks, the management types have been extended by including subscriber and roaming management, software management, fraud management, user and equipment management, QoS management [3][4], which is in turn strictly related to pricing management [5][6][7], and equipment trace management [8]. This paper focuses on the high-level fault management requirements for VoIP services deployed within an IP NGN network. Fault management has been widely investigated from a theoretical perspective. In this paper we highlight the practical aspects to be considered in order to handle Fault Management in the signaling domain. The network used to collect the performance figures shown in what follows deploys VoIP services. The fault management framework illustrated in this paper aims at achieving, for VoIP services, a signaling reliability level similar to that of traditional telephone networks.
II. NETWORK ARCHITECTURE
The network used can support a plethora of advanced multimedia services. Nevertheless, due to space limitations, this paper focuses only on the elements of the network essential for delivering VoIP services. Figure 1 illustrates the general structure of the network, which includes an IMS-based core [2][9]. The interested reader can find a detailed description of the elements of this architecture in [9]. We refer to a practical aggregation of these entities, defined according to the management requirements. In particular, we identify the following fundamental components: the Session Director (SD), the SIP Server (SS), and the Universal Subscriber Database (UDB). This essential architecture allows users to access plain VoIP services only. A complete model, usable for accessing other advanced services as well, such as video communications and interworking with PSTN/ISDN networks, should also include the other entities of the NGN architecture shown in Figure 1. The three aggregated entities considered are the heart of the architecture and are involved in all types of services. Thus, the analysis shown below may be regarded as representative of the general operation of the network, although limited to the common functions. In this paper we assume that the reader is familiar with the most common SIP entities and signaling messages, such as those used for session setup and the relevant responses. A description of this subject can be found in [11].
Figure 1. Reference network model. Legend: A-BGF – Access Border Gateway Function; AS – Application Server; BGCF – Breakout Gateway Control Function; CSCF – Call/Session Control Function; DB – Database; HSS – Home Subscriber Server; HTTP – Hypertext Transfer Protocol; I-BGF – Interconnect Border Gateway Function; IBCF – Interconnection Border Control Function; I-CSCF – Interrogating-CSCF; IWF – Interworking Function; LB – Load Balancer; MGCF – Media Gateway Control Function; MRFC – Multimedia Resource Function Controller; MRFP – Media Resource Function Processor; P-CSCF – Proxy-CSCF; RS – Real Server; S-CSCF – Serving-CSCF; SD – Session Director; SGF – Signaling Gateway Function; SIP – Session Initiation Protocol; SLF – Subscription Locator Function; T-MGF – Trunking Media Gateway Function; UDB – Universal Subscriber Database; WS – Web Server.
A. Description of the Network Entities
1) SESSION DIRECTOR
The Session Director (SD), which includes the P-CSCF and the A-BGF shown in Figure 1, is the first point of contact in the IMS signaling plane between the terminal and the network. From the SIP point of view the SD acts as an outbound/inbound SIP proxy server; this means that all the requests initiated by the IMS terminal or destined for the IMS terminal traverse the SD. The SD forwards SIP requests and responses in the appropriate direction (i.e., toward the IMS terminal or toward the IMS network). It may also support resource and admission control capabilities.
2) SIP SERVER
The SIP Server (SS), which realizes the S-CSCF functions, is made up of three functional sub-systems:
• Load Balancer: this element (LB) distributes the SIP requests sent to the SS among the Real Servers (RS); the typical approach is to select the least loaded RS, i.e., the one using the smallest number of threads. For packets coming out of the SS, the LB routes them towards the suitable network node (SD, AS, ...).
• Real Server: the Real Servers process the SIP messages coming from the LB and the HTTP responses coming from the UDB, by using, in our implementation, a set of specific Java-based threads. According to the type of the processed message, they can issue queries to the centralized database for obtaining and updating localization, authentication and authorization information. They are also in charge, in conjunction with the DB Server illustrated below, of creating the Call Detail Records (CDR) relevant to the SIP invite procedure for accounting and statistics purposes.
• DB Server: it is a local database residing within the SS. The RSs query and/or keep updated the DB Server according to the specific SIP message. The higher the number of invite messages, the higher the number of messages exchanged between the RSs and the DB Server. Queries between the RSs and the DB Server are issued by using SQL.
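As an illustration of the dispatching rule just described (selecting the least loaded RS, i.e. the one with the fewest busy threads), the following minimal sketch uses hypothetical names and attributes, since the actual LB implementation is not detailed in this paper.

```python
# Minimal sketch of the least-loaded dispatching rule attributed to the LB:
# among the Real Servers, pick the one currently using the fewest threads.
# Names and attributes are illustrative, not taken from the actual product.
from dataclasses import dataclass

@dataclass
class RealServer:
    name: str
    busy_threads: int   # threads currently processing SIP/HTTP messages

def select_real_server(real_servers):
    """Return the least loaded Real Server for the next SIP request."""
    return min(real_servers, key=lambda rs: rs.busy_threads)

if __name__ == "__main__":
    pool = [RealServer("RS1", 12), RealServer("RS2", 4), RealServer("RS3", 9)]
    print(select_real_server(pool).name)   # -> RS2
```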
As regards traffic delivery, our practical experience has shown that the Load Balancer is usually not a bottleneck, since its typical load is much lower than its capacity. Nevertheless, its proper operation must be checked, due to its importance. Differently, in the network analysed we have observed that the DB Server is typically a bottleneck for the delivery of the SIP invite messages.
3) UDB
The Universal Subscriber DataBase (UDB), which implements the HSS functions, is the central repository of user-related information. The UDB contains all the user subscription data required to handle multimedia sessions. These data include, among other items, location information, security information (including both authentication and authorization information), and user profile information (including the services that the user is subscribed to). The UDB architecture includes a load balancer, databases, and some Java-based Web servers, each capable of handling hundreds of sockets, used to serve the queries issued by the RSs.
As mentioned above, in our experiments we have used the real network of a national operator. This signaling network is composed of about a hundred Layer-7 nodes, mainly SIP (P-CSCF and S-CSCF) and HTTP (HSS) servers. Since it does not show criticalities due to traffic volume, the currently most important problems to be addressed are related to hardware and software faults and to possible configuration inconsistencies that prevent nodes from deploying all their capabilities. All these problems could lead to failures of network services, including call blocking. As shown in Figure 2, symptom propagation due to faults proceeds towards the leaf nodes of the network. For example, a fault occurring in the SS VR 1 shown in Figure 2 will affect the SD (AN), the SD (PD), and so on, whilst no symptoms are expected at the UDBs. Clearly, an SD fault on the called side may affect the SD and SS on that side. This does not hold for a generic problem: for example, problems due to an excessive signaling traffic volume sent by an SD, possibly due to retransmissions, do not necessarily affect the SS.
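The propagation rule just described can be sketched on a toy dependency graph; the node names and the topology below are illustrative assumptions, not the real layout of Figure 2.

```python
# Minimal sketch of symptom propagation towards the leaf nodes (the SDs):
# a fault produces symptoms only at the nodes that depend on the faulty one.
# Node names and dependencies are illustrative, not the real topology.
depends_on = {                      # child node -> parent node it relies on
    "SD (AN)": "SS VR 1",
    "SD (PD)": "SS VR 1",
    "SD (RM)": "SS VR 2",
    "SS VR 1": "UDB 1",
    "SS VR 2": "UDB 1",
}

def symptomatic_nodes(fault):
    """Nodes expected to show symptoms when `fault` occurs."""
    affected = set()
    changed = True
    while changed:
        changed = False
        for child, parent in depends_on.items():
            if (parent == fault or parent in affected) and child not in affected:
                affected.add(child)
                changed = True
    return affected

print(symptomatic_nodes("SS VR 1"))   # -> {'SD (AN)', 'SD (PD)'}; no UDB symptoms
```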
Figure 2. Scheme of the network analysed: (a) SDs are shown on the left hand side, SSs in the middle, and UDBs on the right hand side; the additional nodes, called "VS", correspond to the SLF in the ETSI IMS specifications; (b) geographical distribution of the nodes in Italy; arrows indicate instances of the relevant entities.
From the management perspective, the architecture to be considered is shown in Figure 3; its requirements are illustrated in the next section.
III. REQUIREMENTS
Service reliability is the most important aspect to be considered by Service Providers when defining performance requirements for VoIP and advanced telephony services. In fact, a high reliability level, which characterizes PSTN telephone networks, is the fundamental pre-requisite for inducing a massive user migration towards these new services. Nevertheless, a network for VoIP services shows novel network and service management requirements in comparison with plain TDM telephone networks, since the underlying technologies are different and the signaling complexity has increased due to the need of supporting applications distributed over different networks.
From the practical deployment perspective, two principal requirements emerge:
1) Monitoring of network and service performance. It consists of monitoring the signaling and control processes in order to guarantee a sufficiently high reliability of the deployed services. To this aim, it is necessary to identify the performance metrics of the end-to-end procedures (e.g. SIP registration, call set-up, etc.), network entities, and related protocols. For this reason, the relevant network model may be organized in three layers:
• User Services layer (not analysed in this work);
• Control layer;
• Network (transmission) and Node resources layer.
Figure 3. Equivalent network architecture for fault management purposes.
2) Alerting and notification of performance-related problems. The information collected through performance monitoring is typically used for fault/problem analysis and identification. Moreover, it should be properly presented to network administrators.
A. Monitoring Requirement
In order to monitor the health of an IMS-based signaling network we define a set of key performance indicators (KPI), according to the most important performance requirements. The control and resource related layers, defined above, are subject to different performance requirements, described below.
1) Control Layer requirements
• Stability: it refers to the stability of the different network segments, e.g. the access segment (including the SD) and the Core IMS segment (including the SS and the UDB). The stability metric is a function of the number of nodes operating in critical conditions.
• Availability: it indicates the availability of network services and control procedures.
Figure 4. Flow diagram of the monitoring and fault localization processes (network KPI monitoring → threshold crossed? → node KPI monitoring → fault localization → fault recovery).
• Success Rate: it refers to the ability to complete the control procedures successfully.
• Quickness of service provisioning: it is the time needed to activate a desired service or call flow.
• Abnormal traffic conditions: they are determined by comparing the traffic volume generated by a given procedure with the expected traffic volume in a given time slot.
2) Resource Layer requirements
• Resource consumption: it indicates the percentage of network nodes and links with over-threshold resource consumption.
B. Alerting Requirement
The Fault Management processes should be supported by alerting tools, driven by the crossing of thresholds on performance levels. In fact, the immediacy of the notification is fundamental for preventing serious inefficiencies. We identify two kinds of KPIs: "network KPIs" and "node KPIs". Whereas the former provide information on the network operation and may be associated with the status of the services offered to users (QoS), the "node KPIs" provide useful and detailed information for fault localization operations. Thus, in order to support the overall Fault Management process, a monitoring system handling both network and node alarms must be designed. It has to differentiate the most important KPIs for each performance metric (typically the network KPIs), which are helpful for prompt problem detection, from those used for troubleshooting (usually the node KPIs), which can be used to identify and analyse in more detail the root cause of the problem and to start the suitable recovery procedure. Figure 4 depicts the flow diagram of the monitoring and fault localization processes. KPI values are typically determined by processing a set of network measures, performed in the network section relevant to a specific KPI. Measurement tools commonly used within an IMS network are the application traffic probes (e.g. SIP, HTTP) and the log files of the network entities, which may count the input and output traffic volume. In particular, data taken from the log files provide the following information:
• Traffic volume exchanged by a node.
• Usage of the node resources (CPU, RAM, …).
• Peak and average response time.
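As an illustration, the log-derived measures listed above can be thought of as a per-node record produced at each collection interval; the field names below are hypothetical, since the paper does not specify a log format.

```python
# Hypothetical per-node record of the log-derived measures listed above,
# collected once per sampling interval (field names are illustrative).
from dataclasses import dataclass

@dataclass
class NodeLogSample:
    node: str                 # e.g. "SS-VR1", "SD-AN", "UDB-1"
    msgs_in: int              # signaling messages received in the interval
    msgs_out: int             # signaling messages sent in the interval
    cpu_pct: float            # CPU utilization (%)
    ram_pct: float            # RAM utilization (%)
    resp_time_avg_ms: float   # average response time
    resp_time_peak_ms: float  # peak response time

sample = NodeLogSample("SS-VR1", msgs_in=41230, msgs_out=40955,
                       cpu_pct=37.5, ram_pct=61.0,
                       resp_time_avg_ms=18.2, resp_time_peak_ms=240.0)
```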
The information about the traffic exchanged among these nodes is usually obtained by application traffic probes. The collection of the measures is usually performed by using the SNMP or SFTP management protocols.
C. Operational aspects
The monitoring and alerting requirements can be organized in an overall performance and fault monitoring system, as shown in Figure 5. The system is composed of three functional blocks:
1) Performance Monitoring, implementing the following functions:
• collection of network measurements;
• KPI computation.
2) Alarm generation and notification, implementing the following functions:
• threshold configuration for critical network KPIs;
• monitoring of threshold crossing;
• alarm generation triggered by threshold crossing;
• alarm filtering and operator notification.
3) Automatic fault localization, implementing alarm correlation and fault localization, as described below.
KPI threshold crossings may generate a large set of alarms which often do not give a clear indication of their relevant causes. These alarms typically need to be aggregated to make their visualization easy for a human operator. Moreover, an automatic fault identification module may be used in order to correlate the observed alarms and quickly identify their root cause, allowing network administrators or, even better, automatic recovery systems to intervene and fix the problem quickly. KPIs may also be classified, according to their information content, as follows:
1) Indicators of the signaling traffic volume. This category is used to reveal an abnormal increase or decrease of the signaling traffic volume from/to a network node or through a particular link.
2) Indicators of dysfunctional protocol operation. This KPI type highlights an unusual protocol behaviour. For example, a SIP or HTTP request sent to a server node produces, in case of acceptance, a "200 OK" response. In case of refusal, or missing response, it may be argued that something is going wrong in the downstream network path.
3) Indicators of time slot duration. This KPI type is important since the time needed for receiving responses from servers reflects the network status.
4) Indicators of resource utilization.
These KPIs reflect whether the resource utilization matches the values expected under the operating conditions established by the network designers.
D. KPI association with the layered architecture
1) Network KPIs
The KPIs illustrated in this section are relevant to the network QoS.
• Network stability: it may be organized into two broad categories:
- Geographic stability: the network nodes are grouped according to their type (SD, SS, UDB). For each group, the larger the number of alarmed nodes, the worse the network stability, since the network might be unable to complete the procedures involving the alarmed nodes. Thus, a possible group KPI is the percentage of alarmed SDs connected to the group.
- Node stability: it refers to the capacity of a node to make its functions available for the network procedures. For each node type (SD, SS, UDB) the KPI may consist of the percentage of currently active alarms in the network.
• Abnormal traffic volume exchanged: it is evaluated, for each critical procedure, by comparing the volume of traffic exchanged with the expected value in the same time slot. The traffic volume is captured at the SIP Server (in particular at the LB), since it is a transit node for all network traffic.
• Availability: it refers to the network service availability at the boundary between the Access and Core network sections. When a network service is invoked, such as a register service, the request is retransmitted if no response arrives within a given timeout. The number of retransmissions is assumed to be related to the network availability for processing the request: the larger the number of retransmissions, the lower the network availability.
• Success rate: it indicates the network capacity of delivering the supported services successfully. It is evaluated as the ratio of successful procedures to the total requests issued to the SIP Server.
• Service delivery time: it is the time needed to activate a given network service. It is evaluated at the SD level since, being the control node closest to the user, the SD can collect all the delay contributions introduced by the other nodes.
• Resource occupancy: it is the utilization factor of the network resources. A fundamental parameter, for each network element, is the CPU occupancy; the relevant KPI is typically the peak and the average utilization in a given time slot.
In order to obtain a quick snapshot of the whole network health status, the spatial average and the worst case of each of the previous indicators should be considered.
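As a sketch of this summarization, assuming illustrative per-node values of one of the indicators above (CPU occupancy), the spatial average and the worst case can be computed per node type as follows; node names and values are hypothetical.

```python
# Minimal sketch: summarize a per-node indicator (e.g. CPU occupancy, %)
# into a network-level snapshot as spatial average and worst case,
# computed separately for each node type (SD, SS, UDB). Values are illustrative.
from statistics import mean

cpu_occupancy = {
    "SD":  {"SD-AN": 22.0, "SD-PD": 35.0, "SD-RM": 81.0},
    "SS":  {"SS-VR1": 48.0, "SS-VR2": 51.0},
    "UDB": {"UDB-1": 30.0, "UDB-2": 33.0},
}

for node_type, per_node in cpu_occupancy.items():
    avg = mean(per_node.values())                                     # spatial average
    worst_node, worst = max(per_node.items(), key=lambda kv: kv[1])   # worst case
    print(f"{node_type}: avg={avg:.1f}%  worst={worst:.1f}% ({worst_node})")
```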
Figure 5. Monitoring and fault localization system (monitoring layer with probes and SNMP agents, measures collection, KPI calculation, alarm generation, alarm filtering and aggregation, automatic support to fault localization via a correlation engine, and GUIs presenting KPIs, alarms, and fault hypotheses to the network operator).
2) Node KPIs
Node KPIs may be organized as follows:
- Level 1: elementary measures. Elementary KPIs (L1) are those based on a single measurement. These measurements are meaningful in themselves, since they capture the desired phenomenon without any need of being combined with other measurements.
- Level 2: aggregate measures. The second level (L2) includes the KPIs obtained by combining, through mathematical functions, different elementary measures, such as the difference between the number of invite requests and the number of 200 OK responses traversing a SIP Server LB in a given time unit, which should be close to zero.
A detailed description of some possible L1 and L2 KPIs that might be considered in monitoring a signaling IMS network for VoIP applications is reported in Table 1. Due to space limitations, Table 1 reports only a representative subset. The choice of the KPIs to be monitored is generally tightly coupled with the knowledge of the most common criticalities in the specific managed network context. In order to show the importance of an appropriate tuning of KPIs to capture critical situations, Figure 6 shows the effects of a misconfiguration of the UDB load balancer on the CPU occupancy of one of the UDB servers. The fault first caused a decrease of the CPU work load. After the UDB load balancer is restarted, some periodic load peaks appear, due to the synchronization of the terminals, which refresh their registration every 50 minutes.
3) KPI sampling frequency
An important issue in Fault Management is the choice of a proper sampling frequency for the KPI collection. Since each measurement entails computational and bandwidth costs, it is desirable to keep the sampling frequency low. Nevertheless, it is important to guarantee a minimum sampling frequency such that the desired information about the monitored metric is preserved. In addition, a long sampling period could also produce an additional delay in detecting a critical situation and generating the relevant alarm. Thus, the final choice is a trade-off among all these aspects. As regards the detection delay, a maximum value of 15 minutes, including the waiting for the sampling time, is typically considered acceptable in operations. Unfortunately, the monitored metrics show a higher variation rate in critical situations than in normal conditions, that is, exactly when they need to be detected.
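To make the L2 KPI example above concrete (the difference between INVITE requests and 200 OK responses seen at the SIP Server LB), the following minimal sketch evaluates it over one sampling window; the counter names and the threshold are illustrative assumptions.

```python
# Minimal sketch of the L2 KPI discussed above: per sampling window, the
# difference between INVITE requests and 200 OK responses seen at the SS LB
# should stay close to zero. Counter names and the threshold are illustrative.
def invite_vs_200ok_kpi(invite_count, ok200_count, threshold=50):
    """Return (gap, alarm) for one sampling window (e.g. 5 minutes)."""
    gap = invite_count - ok200_count
    return gap, abs(gap) > threshold

gap, alarm = invite_vs_200ok_kpi(invite_count=10450, ok200_count=10310)
print(gap, alarm)   # 140, True -> possible call-setup failures downstream
```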
Figure 6. Daily CPU occupancy process in the UDB (the annotations in the figure mark the fault and the subsequent oscillating behavior).
A statistical frequency analysis may be helpful to determine the appropriate sampling period without oversampling. It is also important to distinguish between two critical phenomena:
• persistent phenomena, characterized by a long time duration;
• transient phenomena, which disappear in a short time.
In order to capture transient problems it is often necessary to use a short sampling period, as shown in Figure 7. This figure shows the number of TCP sockets established in the UDB between the LB and the WS4 shown in Figure 1. Figure 7 (a) and (b) are relevant to 30-second and 5-minute sampling periods, respectively. Nevertheless, since transient phenomena are usually less critical than persistent ones, practical experience suggests not using a sampling period shorter than 5 minutes, in order not to overload the nodes and/or the monitoring system itself.
Figure 7. Effects of undersampling a transient phenomenon. The quantity shown is the number of sockets open in the UDB between the LB and a WS; (a) 30-second sampling period; (b) 5-minute sampling period.
4) Alarm monitoring
This section illustrates the generation and aggregation requirements of the alarms. Alarms must be generated when the node KPIs cross given thresholds; the threshold level must allow detecting the problem while minimizing the false alarm probability. Each KPI is associated with an alarm. An alarm is in turn associated with an n-state logic (severity level), n = 2, 3, ...: a 2-state (binary) alarm consists of the possible states ON and OFF, a 3-state alarm may include the states NORMAL, WARNING, and CRITICAL, and so on. It is worth noting the difference between an alarm and a fault: whilst a fault is the cause of a network problem to be fixed, an alarm is an indicator, a symptom, of the negative effects of the fault. If we consider a topological perspective of the network, instead of a functional model, the number of KPIs to monitor, and of the relevant alarms, increases quickly. For this reason, in order to effectively present a critical situation to an operator, it is necessary to aggregate the individual alarms into higher level notifications. Clearly, the operator may then request more detailed information, down to the specific alarms.
5) Alarm and Fault Hypothesis presentation
As mentioned above, the alarm presentation aims to provide a user-friendly visualization of alarms. With reference to the network considered, the alarms associated with node KPIs at level one (L1) and two (L2) may be logically aggregated in order to compute higher level alarms. A level 3 (L3) alarm refers to the components of the network nodes. For example, with reference to the SIP Server, the components monitored are the LB, the RSs, and the DBs.
The level 4 (L4) alarms include only a few (four in our case) generic alarms for each network node (SD, SS, UDB). These alarms can capture different abnormal node behaviors, such as HW/internal resource criticalities and problems in the SIP signaling procedures (session registration, call set-up, etc.). Note that these aggregated alarms carry less information than those directly obtained from the node KPIs, since they provide a general malfunctioning notification, to be subsequently analyzed in more detail by using the elementary alarms. The correlation of L1 and L2 alarms may be used by a human operator, or an automatic system, for troubleshooting purposes, i.e. to identify their root cause. A complete overview of the most popular alarm correlation approaches may be found in [12]; moreover, practical applications of some automatic fault identification techniques to the network architecture presented in this paper are available in our previous works [14][15] and [16] on this specific topic.
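The threshold-based severity mapping and the alarm aggregation described above can be sketched as follows; the KPI names, the thresholds, and the worst-severity aggregation rule are illustrative assumptions, not the operator's actual configuration.

```python
# Sketch of the alarm logic described above: a node KPI is mapped onto a
# 3-state severity via two thresholds, and per-KPI severities are aggregated
# into a higher-level node alarm by taking the worst one.
# Thresholds, KPI names and values are illustrative.
SEVERITIES = ["NORMAL", "WARNING", "CRITICAL"]

def kpi_severity(value, warn, crit):
    if value >= crit:
        return "CRITICAL"
    if value >= warn:
        return "WARNING"
    return "NORMAL"

def aggregate(severities):
    """Higher-level (e.g. L3/L4) alarm: worst severity among the inputs."""
    return max(severities, key=SEVERITIES.index)

ss_kpis = {
    "lb_cpu_pct":        kpi_severity(72.0, warn=70, crit=90),
    "rs_busy_threads":   kpi_severity(180, warn=200, crit=250),
    "db_table_size_pct": kpi_severity(95.0, warn=80, crit=90),
}
print(ss_kpis)                      # per-KPI (L1/L2) severities
print(aggregate(ss_kpis.values()))  # aggregated SS alarm -> CRITICAL
```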
IV. FAULT LOCALIZATION ALGORITHMS
For the sake of completeness, we briefly illustrate the application of known algorithms using KPI alarms for alarm correlation and fault localization. Given a set of observed symptoms generated by KPI violations, it is necessary to identify the fault hypothesis that best explains them. An extensive overview of the main fault localization approaches may be found in [12]. We have used a symptom-fault map Fault Propagation Model (FPM), which represents the relationships between faults and symptoms through a bipartite graph. The codebook technique [13] is a typical example of deterministic symptom-fault map FPM: a set of faults F, a set of symptoms S and their relationships are represented through a bipartite graph, which can be encoded by an |S|x|F| binary matrix C, called codebook, where each element c_ij is 1 if an edge connects fault f_j and symptom s_i, and 0 otherwise. Each column represents the codeword relevant to a fault. Thus, given a binary vector corresponding to a set of observed symptoms, where the i-th bit is 1 if the i-th symptom is observed, decoding consists of identifying the codeword at the minimum distance from the observation vector. This distance is typically specified by using the Hamming metric. A comprehensive description of the performance achievable by this approach over the network under analysis may be found in [15], including its robustness to false or undetected alarms. The correlation matrix of the network shown in Figure 2 is relevant to 32 SDs, 18 SSs, and 2 HSSs. We have used SD faults, SS LB faults, DB Server faults, and HSS faults. Any more detailed extension including more fault types is straightforward: for example, a generic "HSS fault" may be refined as "WEB server fault", "database fault", "LB fault", and so on. Clearly, the larger the number of problems considered, the higher the number of symptoms needed to keep the fault identification success probability unchanged. In this way, the correlation matrix includes 772 symptoms and 70 faults, plus the null codeword, which means absence of problems. Thus, the length of each codeword is 772. The resulting codebook matrix is very sparse; its symmetric and regular structure reflects the hierarchical organization of the network. The main drawback of the codebook approach is its inability to handle nondeterministic relations between faults and symptoms, i.e., relations modeled by using the conditional probabilities p(s_i|f_j). For this reason, probabilistic techniques have been proposed; important examples are the Incremental Hypothesis Updating (IHU) algorithm [17] and the set-covering-based algorithm called RHC [18]. A further solution is illustrated in [16].
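A minimal sketch of the codebook decoding step follows, on a toy fault-symptom model (not the real 772×70 codebook of the network in Figure 2): each fault is a binary codeword over the symptoms, and the observed symptom vector is decoded to the codeword at minimum Hamming distance. Faults, symptoms and codewords below are illustrative.

```python
# Toy sketch of the codebook decoding step described above: faults are columns
# of a binary symptom-fault matrix (codewords); an observed symptom vector is
# decoded to the fault whose codeword is at minimum Hamming distance.
# The faults, the symptoms s1..s5 and the codewords are illustrative.
codebook = {
    "SD1 fault":     (1, 0, 0, 0, 0),
    "SS1 LB fault":  (1, 1, 1, 0, 0),
    "SS1 DBS fault": (1, 1, 0, 1, 0),
    "UDB1 fault":    (1, 1, 0, 0, 1),
    "no fault":      (0, 0, 0, 0, 0),
}

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def decode(observed):
    """Fault hypothesis whose codeword best explains the observed symptoms."""
    return min(codebook, key=lambda fault: hamming(codebook[fault], observed))

# Symptom s1 was lost (undetected alarm), yet decoding still identifies the fault.
print(decode((0, 1, 1, 0, 0)))   # -> "SS1 LB fault"
```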
V. CONCLUSION
In this paper we have illustrated, from a practical perspective, some aspects of the Fault Management problem for an NGN network that supports VoIP services. Although the network is compliant with the NGN IMS specifications, and can also deploy advanced services, we have focused on plain wireless and wired VoIP services. In this way, we have described in detail the high level monitoring requirements, the KPI definition for
all the network entities in the signaling layer, and some important operational aspects. We have also shown some fault examples taken from practical experience and sketched some algorithmic aspects of fault identification applied to a nationwide real NGN network.
REFERENCES
[1] N. Blefari-Melazzi, M. Femminella, "Stateful vs. Stateless Admission Control: which can be the gap in utilization efficiency?", IEEE GLOBECOM 2002, 17-21 November 2002, Taipei, Taiwan.
[2] International Telecommunication Union, ITU-T Recommendation M.3400, Series M: TMN and Network Maintenance: International Transmission Systems, Telephone Circuits, Telegraphy, Facsimile and Leased Circuits. Telecommunications management network, 02/2000.
[3] C. Casetti, M. Gerla, S. S. Lee, G. Reali, "Resource Allocation and Admission Control Styles in QoS DiffServ Networks", International Workshop on QoS in Multiservice IP Networks (QoS-IP 2001), Rome, Italy, January 2001.
[4] N. Blefari-Melazzi, J. N. Daigle, M. Femminella, "Stateless admission control for QoS provisioning for VoIP in a DiffServ domain", 18th International Teletraffic Congress (ITC-18), 31 August - 5 September 2003, Berlin, Germany.
[5] D. Di Sorte, G. Reali, "Pricing and brokering services over interconnected IP networks", Journal of Network and Computer Applications, vol. 28, 2005, pp. 249-283.
[6] N. Blefari-Melazzi, D. Di Sorte, G. Reali, "Usage-based Pricing Law to Charge IP Network Services with Performance Guarantees", IEEE International Conference on Communications (ICC) 2002, New York, USA, 2002.
[7] D. Di Sorte, M. Femminella, G. Reali, "A QoS Index for IP Services to Effectively Support Usage-based Charging", IEEE Communications Letters, vol. 8, no. 11, November 2004.
[8] N. Blum, P. Jacak, F. Schreiner, D. Vingarzan, P. Weik, "Towards Standardized and Automated Fault Management and Service Provisioning for NGNs", Journal of Network and Systems Management, vol. 16, 2008, pp. 63-91.
[9] G. Camarillo, M. A. Garcia-Martin, "The 3G IP Multimedia Subsystem (IMS)", John Wiley & Sons, 2nd edition, 2006.
[10] K. Knightson, N. Morita, T. Towle, "NGN Architecture: Generic Principles, Functional Architecture, and Implementation", IEEE Communications Magazine, October 2005.
[11] R. Martínez Perea, "Internet Multimedia Communications Using SIP", Morgan Kaufmann Publishers, Elsevier, 2008.
[12] M. Steinder, A. S. Sethi, "A survey of fault localization techniques in computer networks", Science of Computer Programming, vol. 53, 2004, pp. 165-194.
[13] S. A. Yemini, S. Kliger, E. Mozes, Y. Yemini, D. Ohsie, "High speed and robust event correlation", IEEE Communications Magazine, vol. 34, no. 5, 1996, pp. 82-90.
[14] G. Reali, L. Monacelli, "Fault localization in data networks", IEEE Communications Letters, vol. 13, no. 3, March 2009.
[15] G. Reali, L. Monacelli, "Definition and Performance Evaluation of a Fault Localization Technique for an NGN IMS Network", IEEE Transactions on Network and Service Management, vol. 6, no. 2, pp. 122-136, 2009.
[16] G. Reali, L. Monacelli, "Evolution of the Codebook Technique for Automatic Fault Localization", IEEE Communications Letters, vol. 15, no. 4, April 2011.
[17] M. Steinder, A. S. Sethi, "Probabilistic fault diagnosis in communication systems through incremental hypothesis updating", Computer Networks, vol. 45, no. 4, July 2004, pp. 537-562.
[18] Q. Zheng, W. Hu, Y. Qian, M. Yao, X. Wang, and Chen, "A Novel Approach for Network Event Correlation Based on Set Covering", Proceedings of the 2008 Fifth International Conference on Fuzzy Systems and Knowledge Discovery, vol. 3, 18-20 October 2008.
Table 1 – Examples of representative KPIs (L1, L2)

Signaling traffic volume
SD: (a) Number of errors arriving at an interface. (b) Number of SIP invite messages traversing the SD. (c) Number of retransmissions detected at an SD interface. (d) Traffic volume leaving an SD; if it gets below a given threshold, it is assumed to be caused by a fault within the SD. (e) Number of Register commands issued by all the terminals connected to an SD; this value can reflect the presence of terminals that flood bursts of requests, or unspecified problems in the nodes involved in the registration procedure.
SS: (l) Number of specific 5xx SIP responses indicating problems when the RSs access the DB Server (SIP 500) or the UDB (SIP 503/504); in the first case problems may be caused by DB Server table saturation, in the second case by a lack of available sockets or by UDB unreachability. (m) Number of Invite messages entering the LB of the SIP Server, which corresponds to the number of calls attempted by users and managed by the SIP Server. (n) An excessive SIP traffic volume detected could be due to a high number of SD retransmissions, thus to problems in receiving response messages; a small volume could be due to an SD fault.
UDB: (z) Anomalous HTTP requests; this may be due to problems in a node upstream of the UDB. (A) Connection problems between the UDB web servers and the connected databases. (B) Generic UDB errors.

Time slot duration
SD: (f) Time between the transmission of the SIP signalling and the reception of the relevant response; this quantity depends on the proper operating conditions of the nodes downstream of the SD.
SS: (o) Problems in receiving UDB responses and relevant data updates; the growth of the time elapsed between an HTTP request to the UDB and the relative HTTP response may indicate generic UDB problems. (p) Time needed for deleting data from the record tables relevant to calls that have already contributed to CDR generation; this time is typically lower than 1 s, thus a time increase could produce a relevant increase of both the size of the tables and the blocking probability due to saturation. (q) Time elapsed between an SQL request issued by an RS and the reception of the response from a DBS; an increase could be due to problems in that DBS.
UDB: (C) An excessive average UDB response time may be due to critical UDB problems.

Resource utilization
SD: (g) CPU utilization; a high utilization indicates the need of managing a large traffic volume. (h) Health, an aggregate of different parameters indicating the general state of the device (e.g. temperature, proper fan operation, reachability, etc.). (i) Amount of used RAM.
SS: (r) Size of the DB Server table; this is a critical resource, since an excessive size would imply an increase of the SQL response time experienced by the RSs; in addition, table space saturation is a system blocking event. (s) Number of active RS threads; if all threads are being used, new requests cannot be admitted by the RSs. (t) Number of open sockets per RS; the lack of available sockets results in communication problems between RS and UDB. (u) Occupancy of computing resources and RAM in the RSs. (w) Occupancy of the CPU in the LB.
UDB: (D) Number of sockets in the closing state; an excessive value might be due to UDB problems. (E) Number of open sockets in each UDB web server; a saturation of this resource would prevent a server from accepting new HTTP requests from the RSs, thus blocking all procedures. (F) Occupancy of the UDB CPU. (G) RAM used by a web server; in case of a memory leak, the RAM utilization would increase abnormally, thus causing a lack of memory space available for other applications and the blocking of web servers. (H) Number of idle threads, depending on the web server load; a lack of threads causes processing slowdowns and losses of HTTP requests.

Abnormal protocol operation
SD: (j) Failures of a specific protocol, estimated by the difference between the number of signaling messages entering an SD interface and the responses leaving the same interface; these KPIs are associated with an abnormal operation of a node downstream of the detection interface.
SS: (k) Criticalities in the register and invite procedures; these indicators can be obtained by evaluating the differences between the number of request messages and the number of expected positive answers, and nonzero values are a clear symptom of network problems.
UDB: (I) Number of users registered in the UDB. (L) Ratio of the error messages leaving the UDB to the total number of received queries. (M) Problems in query fulfilment.