Data Mining telecommunications network data for fault management and development testing

R. Sterritt, K. Adamson, C.M. Shapcott, E.P. Curran
Faculty of Informatics, University of Ulster, Northern Ireland.

Abstract
Applying Data Mining and Knowledge Discovery to complex industrial problems is an increasing trend. The authors have been involved in researching and applying data mining to the fault management and development testing of high-speed telecommunications systems. This paper discusses the strategies undertaken for data mining these applications using Telecommunications Management Network (TMN) data.

1. Introduction
This paper discusses strategies for data mining Telecommunications Management Network (TMN) data for both fault management and development testing purposes. The authors' collaborative experiences in this area with NITEC (Northern Ireland Telecommunications Engineering Centre), an R&D lab of Nortel Networks, are discussed. The research started in 1993 with an emphasis on simulation, which evolved into data mining in 1996. First the telecommunications domain is described, followed by an overview of previous research programmes. Next the data mining strategies for both fault management and development testing are discussed. Lastly, an evaluation and future plans are presented in the conclusion.

1.1 Telecommunications systems
Within the telecommunications industry the Synchronous Digital Hierarchy (SDH) is an international standard for broadband networks, offering increased bandwidth and sophisticated services (ITU [1]). This increased sophistication allows traditional voice, video on demand, ISDN data transfer and video conferencing to use the one network more efficiently and effectively. SDH is the leading infrastructure solution for internet backbone communications. Managing this level of sophistication becomes more difficult, particularly when a fault occurs in the network (ITU [2]). The SDH multiplexers themselves and the other network elements (NEs) have built-in recovery methods, and the behaviour of these NEs is highly specified by the ITU (formerly CCITT), ANSI and ETSI, and as such is deterministic. When a fault or multiple faults do occur, the operator is presented with events that represent the symptoms. Yet due to the nature of networks it is not as simple as "when x arrives followed by y, then z has occurred": y may arrive beforehand or not at all. When one takes into account that other network components up- or downstream of the fault also detect problems and issue alarm events, the set of symptoms grows, and fault identification becomes a difficult task. This occurs to such an extent that the behaviour of the multiplexer network has been described as effectively non-deterministic (Bouloutas [3]).

1.2 Previous work
The original aim of the research was to simulate the multiplexers on a parallel architecture to facilitate large-scale tests, the hypothesis being that, in industry, simulation is a cost-effective and rapid method for testing and designing new equipment and systems. Advances in the fields of mathematical modelling and parallel processing meant that it was then possible to address problems that had previously been unrepresentable due to constraints imposed by problem complexity and computational requirements. The initial project, NEST (1993 [4]), focused on investigating how to simulate in real time (emulate) an STM-1 multiplexer and network manager on a parallel processing environment. The NETEXTRACT project (1995 [5]) was then established to use evidential reasoning techniques (to ensure a tolerance for uncertainty) to validate and verify the emulation from its output data, securing as far as possible a correct model (Adamson 1994 [6], Moore 1996 [7]). The GARNET project (1996 [8]) was set up to build on the NEST investigations: to develop a multi-processor implementation of the network elements to achieve real-time speeds, and to develop an intelligent graphical user interface to map the desired network topology onto the parallel environment (Sterritt 1998-1 [9]).

The simulation approach had an inherent problem in that it required acquiring expert knowledge about a system that was itself under development - a moving target. Therefore this work, not unlike many other similar projects, suffered from the Knowledge Acquisition (KA) bottleneck. By the time the knowledge had been acquired from the experts and the parallel simulation/emulation models had been developed, the information would be out of date: the actual products being simulated had moved on to newer versions and development cycles.

The same advances in mathematical modelling and parallel processing also meant that it was now possible to search (mine) large amounts of data for hidden and useful patterns. The Knowledge Discovery (KD) approach initially avoids the KA bottleneck in that it mines the actual data (not the mind of the expert), and as such can keep abreast of developing versions as long as the data can be gathered and pre-processed. Yet to achieve true KD, and not just perform Data Mining (DM), requires expert interpretation and evaluation. We therefore note that the two approaches both depend on the expert, but at different ends of their respective processes. The mined results offer the advantage that there are solid findings to work with. The NETEXTRACT project had in reality always been about knowledge discovery, since it was extracting cause-and-effect networks from data - scale would be the new difference. Instead of extraction from all modelled systems data, the emphasis would shift to extraction from real network management data from the testing environment (Sterritt 1997-1 [10], 1997-2 [11], 1997-3 [12]). Under the second year of the GARNET project, automated testing (Sterritt 2000 [13]) was developed in NITEC, and as such the KD architecture from the NETEXTRACT project was further refined to provide an assurance level for the auto tests (Sterritt 1998-2 [14], 1998-3 [15]).

1.3 Data Mining telecommunications network data - the two faces
Two distinct applications of data mining TMN data have evolved from the authors' research with NITEC: (1) Fault Management, and (2) Manual and (from 1997) Automated Testing. These have common ground in that both can involve mining TMN data, yet with a different emphasis. In fault management the desire is to correlate the events to such a degree as to facilitate the prediction of the actual fault. In testing, the same mined correlations can be used to validate a set of test results, or mining can be used to spot anomalies in the test data. As a product nears release, the data from a test environment becomes more relevant to fault management (apart from the events caused by test configuration), since it will start to match the final behaviour that will be prevalent in an operational network.

2. Data Mining
Data mining deals with the discovery of hidden knowledge, unexpected patterns and new rules in large databases. It is now generally considered the discovery stage in a much larger process (Fayyad [16], Uthurusamy [17]): knowledge discovery in databases (KDD). Adriaans [18] presents a comprehensive introduction to undertaking data mining and KDD, in particular all the stages: data selection, cleaning, enrichment, coding, data mining, and reporting.
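To make these stages concrete, the following minimal sketch chains the six KDD stages over toy alarm-event records. It is our illustration only, not the pipeline used in the projects described here; the record fields, the incomplete-record cleaning rule, and the co-occurrence "mining" step are all assumptions.

from collections import Counter
from itertools import combinations

raw = [
    {"ne": "NE1", "type": "Comms fail", "severity": "critical", "state": "present"},
    {"ne": "NE1", "type": "Qecc-Comms_fail", "severity": "major", "state": "present"},
    {"ne": "NE2", "type": "TU-AIS", "severity": "minor", "state": None},  # incomplete
]

def select(records):                      # data selection: alarm events only
    return [r for r in records if r.get("type")]

def clean(records):                       # cleaning: drop incomplete records
    return [r for r in records if all(r.values())]

def enrich(records):                      # enrichment: add derived fields
    for r in records:
        r["is_alarm_raise"] = r["state"] == "present"
    return records

def code(records):                        # coding: map severities to ranks
    ranks = {"critical": 3, "major": 2, "minor": 1}
    for r in records:
        r["severity_rank"] = ranks[r["severity"]]
    return records

def mine(records):                        # mining: count alarm-type pairs per NE
    pairs, by_ne = Counter(), {}
    for r in records:
        by_ne.setdefault(r["ne"], set()).add(r["type"])
    for types in by_ne.values():
        pairs.update(combinations(sorted(types), 2))
    return pairs

def report(pairs):                        # reporting: print co-occurring types
    for (a, b), n in pairs.most_common():
        print(f"{a} co-occurs with {b} ({n} NE(s))")

report(mine(code(enrich(clean(select(raw))))))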

2.1 Data Mining for TMN fault management
2.1.1 Overview
Global telecommunication systems are built with extensive redundancy and complex management systems to ensure robustness. Fault identification and management at this level of complexity is an open research issue with which data mining can greatly assist.

2.1.2 Faults, Events and Masking
A fault is a malfunction that has occurred in either the hardware or the software on the network. It can be due to an external force, for example a digger cutting through the fibre cable, or an internal fault such as a card failure. An event is an occurrence on the network. Those events that relate to the management of the network are recorded by the Element Controller (EC; historically referred to as the Element Manager, EM). In older releases a recorded event equated to an alarm. This is no longer the case; other examples of events are user logins and user actions such as protection switching. There are numerous types of alarm event that may be generated within a Network Element (NE), typically around 100 types. An example of a critical alarm is a 'Comms fail' alarm. An alarm exists for a time period; thus under normal circumstances an alarm present event will be accompanied by an alarm clear event. Each alarm type is assigned a severity level of Critical, Major or Minor by the network management system, depending on the severity of the fault indicated by the alarm type. For example, the alarm type 'Comms fail' has a critical severity level, while other alarms such as 'Tributary Unit Alarm Indication Signal' (TU-AIS) have a minor severity level.

The occurrence of a fault can cause numerous alarm events to be raised from an individual NE, meaning that the alarms are often inter-related (hence the desire to correlate them). A fault may also trigger numerous similar and different alarms (and indeed alarm types) in different NEs up- or downstream on the network. For example the Comms fail alarm, raised by the management system if it cannot maintain a communications channel to the indicated NE, may be accompanied by other alarms such as RS-LOS, RS-LOF, Qecc-Comms_fail, MS-EXC or even laser alarms, depending on the fault and configuration. The Qecc-Comms_fail alarm indicates that the NE cannot communicate with the neighbouring NE via the Embedded Control Channel (ECC) of the indicated STM-N card.
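As a concrete illustration of the event data just described, a raised alarm and its matching clear might be recorded along the following lines. This is a minimal sketch: the field names, timestamps and the severity assigned to Qecc-Comms_fail are our assumptions, not the EC's actual schema.

from dataclasses import dataclass

SEVERITY = {  # severity level assigned per alarm type
    "Comms fail": "critical",
    "Qecc-Comms_fail": "major",   # assumed for illustration
    "TU-AIS": "minor",
}

@dataclass
class AlarmEvent:
    ne: str          # network element that raised the event
    alarm_type: str  # e.g. 'Comms fail', 'TU-AIS'
    state: str       # 'present' or 'clear'
    timestamp: float # seconds since start of capture

    @property
    def severity(self) -> str:
        return SEVERITY.get(self.alarm_type, "minor")

# Under normal circumstances a present event is later matched by a clear.
raise_evt = AlarmEvent("NE7", "Comms fail", "present", 12.4)
clear_evt = AlarmEvent("NE7", "Comms fail", "clear", 97.1)
print(raise_evt.severity)  # -> critical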

Alarms can be generated exponentially in different NEs throughout the network under certain fault conditions; the larger the network, the greater the number of alarms that will be generated. It is therefore essential for the NEs to provide some correlation of the different alarms that are generated, so that the EC is not flooded with alarms and only those with high priority are transmitted. This is handled in three sequential transformations: alarm monitoring, alarm filtering and alarm masking. These mean that if the raw state of an alarm instance changes, an alarm event is not necessarily generated.

Alarm monitoring takes the raw state of an alarm and produces a 'monitored' state. Alarm monitoring is enabled/disabled on a per alarm instance basis. If monitoring is enabled, the monitored state is the same as the raw state; if disabled, the monitored state is clear.

Alarm filtering is also enabled/disabled on a per alarm instance basis. An alarm may exist in any one of three states - Present, Intermittent or Clear - depending on how long the alarm is raised for. Alarm filtering consists of assigning these states by checking for the presence of an alarm within certain 'filtering' periods.

Alarm masking is designed to prevent the unnecessary reporting of alarms. A masked alarm is inhibited from generating reports if an instance of its superior alarm is active and fits the 'masking' periods. A 'masking hierarchy' determines the priority of each alarm type. Alarm masking is also enabled/disabled on a per alarm instance basis.

If an alarm changes state at any time, the network management system must be informed. The combination of alarm monitoring, masking and filtering makes alarm handling within the NEs quite complex. The simple example of inter-connected alarms above and these transformations should have illustrated that fault determination is not a straightforward process. The combinations of possible alarm events, and the times at which they are received at the EC, are numerous. Added to this complexity is the fact that individual alarms can be configured in different states, such as 'Masking Disabled' or 'Masking Enabled', and the network in different states, such as '1+1 protection' or 'unprotected'.
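The following minimal sketch shows how the three transformations compose, so that a change in an alarm's raw state does not necessarily produce a report. The state names follow the text above, but the flag names, filtering thresholds and masking hierarchy are our assumptions, not the NE implementation.

# Monitoring, then filtering, then masking, applied in sequence.
MASKING_HIERARCHY = {"Qecc-Comms_fail": "Comms fail"}  # masked -> superior (assumed)

def monitor(raw_state, monitoring_enabled):
    # Monitored state equals the raw state when enabled, else 'clear'.
    return raw_state if monitoring_enabled else "clear"

def filter_state(seconds_raised, filtering_enabled,
                 present_after=2.5, intermittent_after=0.5):
    # Assign Present/Intermittent/Clear from how long the alarm was raised
    # within the 'filtering' periods (thresholds assumed for illustration).
    if not filtering_enabled:
        return "present" if seconds_raised > 0 else "clear"
    if seconds_raised >= present_after:
        return "present"
    if seconds_raised >= intermittent_after:
        return "intermittent"
    return "clear"

def mask(alarm_type, state, active_alarms, masking_enabled):
    # Inhibit the report if a superior alarm in the hierarchy is active.
    superior = MASKING_HIERARCHY.get(alarm_type)
    if masking_enabled and superior in active_alarms:
        return None  # report suppressed
    return state

def report(alarm_type, raw_state, seconds_raised, active_alarms,
           monitoring=True, filtering=True, masking=True):
    state = monitor(raw_state, monitoring)
    state = filter_state(seconds_raised if state != "clear" else 0, filtering)
    return mask(alarm_type, state, active_alarms, masking)

# A Qecc-Comms_fail raise is suppressed while its superior Comms fail is active:
print(report("Qecc-Comms_fail", "present", 5.0, active_alarms={"Comms fail"}))  # None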

2.1.3 Event correlation
At the heart of alarm event correlation is the determination of the cause. The alarms represent the symptoms and as such, in the global scheme, are not of general interest once the failure is determined [19]. There are two real-world concerns: (1) the sheer volume of alarm event traffic when a fault occurs; (2) the cause, not the symptoms. The types of correlation described previously meet criterion (1), which is vital: they focus on reducing the volume of alarms. They do not necessarily meet criterion (2), determining the actual cause; this is left to the operator to work out from the reduced set of higher-priority alarms. Ideally, a technique that can tackle both these concerns would be best. Artificial Intelligence (A.I.) and Data Mining offer that potential, and this has been and still is an active area of research to assist in fault management.

2.1.4 Event correlation - the Bayesian network way
The authors' research [7] deals with both criteria (the volume of alarms, and the cause rather than the symptoms) using probabilistic reasoning techniques [10]. The cause-and-effect graph can be considered a complex form of alarm correlation. The alarms are connected by edges that indicate the probabilistic strength of correlation. Yet the cause-and-effect network can contain more than just alarms as variables; actual faults can be included as variables. Data mining is used to produce the probabilistic network by correlating offline alarm event data; the cause is then deduced from live alarm events using this probabilistic network.

2.1.5 Data Mining the Bayesian Network - Induction
In this case, as in many cases, the structure of the graphical model (the Bayesian net) is not known in advance, but there is a database of information concerning the frequencies of occurrence of combinations of different variable values (the alarms). In such a case the problem is that of induction: to induce the structure from the data. Heckerman gives a good description of the problem [20][21], and there has been much work in the literature in the area, including that of Cooper and Herskovits [22]. Unfortunately the general problem is NP-hard [23]. For a given number of variables there is a very large number of potential graphical structures that can be induced. To determine the best structure, in theory one should fit the data to each possible graphical structure, score the structure, and then select the structure with the best score. Consequently, once the number of variables gets to be of reasonable size, algorithms for learning networks from data are usually heuristic.

2.1.6 Data Mining for additional simple rules
In practice, when it comes to learning the cause-and-effect graph, the volume of event traffic and the correlation of alarms can be reduced by simple first-stage correlation (generally pattern matchers). The expert system approach (in this case, deduction from the probabilistic network) can then handle the remaining, more complex problems, taking advantage of the much reduced and enriched stream of events. As such, the authors have now designed and developed a simple first-stage event correlator. Rules for the system can be written from results mined with tools such as Clementine; for example, when a Comms fail alarm occurs it is likely that a Qecc-Comms_fail alarm will be injected into the network. It is envisaged that these additional rules could potentially be adapted to extend the existing correlation system in an element manager.
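A first-stage correlator of this kind can be little more than a pattern matcher over the event stream. The sketch below is our illustration of the idea, not the authors' implementation: the rule table, time window and event format are assumptions. It folds a mined rule such as "Comms fail is likely to be followed by Qecc-Comms_fail" into a single enriched event.

# Mined rule: within the window, the symptom is attributed to the antecedent.
RULES = {"Qecc-Comms_fail": ("Comms fail", 30.0)}  # symptom -> (antecedent, window s)

def correlate(events):
    """events: time-ordered list of (timestamp, ne, alarm_type)."""
    recent = {}   # alarm_type -> last timestamp seen
    out = []
    for ts, ne, alarm in events:
        rule = RULES.get(alarm)
        if rule and rule[0] in recent and ts - recent[rule[0]] <= rule[1]:
            # Symptom explained by a recent antecedent: enrich rather than
            # forward a separate raw event to the second-stage system.
            out.append((ts, ne, f"{rule[0]} (correlated: {alarm})"))
        else:
            out.append((ts, ne, alarm))
        recent[alarm] = ts
    return out

stream = [(0.0, "NE1", "Comms fail"), (4.2, "NE1", "Qecc-Comms_fail")]
for evt in correlate(stream):
    print(evt)
# (0.0, 'NE1', 'Comms fail')
# (4.2, 'NE1', 'Comms fail (correlated: Qecc-Comms_fail)')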

2.2 Data Mining for development testing
2.2.1 Overview
Within NITEC, high-capacity broadband transmission and switching equipment is designed and developed. This complex mix of hardware, software and firmware must conform to international standards to facilitate heterogeneous global networks.

During the development cycle for each release of a product, a significant proportion of the time, commonly estimated at 60%, is taken up with testing. As the product becomes larger and more complex, the ability to comprehensively test and verify its operation within the decreasing timeframe to market becomes increasingly difficult. Automation offers the potential to decrease this overhead. Traditional manual testing of telecommunications equipment was expensive in terms of the time spent and costs involved, and even de-motivated specialised engineers through the repetitiveness of the task. Automation offered a competitive advantage in terms of reduced cost, reduced time to market, enhanced quality and the "freeing-up" of specialised engineers for further investigation and solving of problem areas discovered by the testing. The disadvantage of automation is that the experimental approach to testing is lost: a test script will not spot anomalous behaviour that an engineer would have. Automation does, however, produce a rich data trail which can be utilised to compensate for the loss of live experimentation by the engineer. Hidden in that data should be indications of any anomalies that have occurred.

2.2.2 Mining for Test Assurance
The initial mining that took place under the NETEXTRACT project had limited results because in manual testing the user actions were not recorded. Automated test scripts use a command interface to inject 'user action' events into the network. These commands can be used to model potential faults in the network, for example disabling a card to model a card failure. These commands are now also available to mine in conjunction with the resultant alarm data. Since the data resulting from these commands is the TMN alarm event data discussed previously, the same mining approaches can be utilised in this testing environment to provide assurance in place of live experimentation by a test engineer. Each execution of an individual test leaves behind a statistical 'footprint' which can be presented graphically (e.g. as a Bayesian network) to assist in classifying a pass/fail. Rules can be written to define specific test-environment behaviour amongst the alarm events, for example a TU-AIS alarm being raised on 15 ports instead of 1 port due to the daisy-chain test environment configuration. Mining across a test script's results from different executions over a period of time can then look for other hidden behaviour, with the aim of finding any anomalies that may indicate a fail.
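As an illustration of such a test-environment rule, the sketch below flags a run in which TU-AIS is not raised on the 15 ports expected for the daisy-chain configuration described above. The event format and the rule table are assumptions for illustration only.

# A minimal test-assurance rule: in the daisy-chain configuration a TU-AIS
# alarm is expected on 15 ports rather than 1, so any other count is an anomaly.
EXPECTED_PORTS = {"TU-AIS": 15}  # per-alarm expectation for this test rig

def check_footprint(events):
    """events: list of (alarm_type, ne, port) raised during one test run."""
    ports_per_alarm = {}
    for alarm, ne, port in events:
        ports_per_alarm.setdefault(alarm, set()).add((ne, port))
    anomalies = []
    for alarm, expected in EXPECTED_PORTS.items():
        seen = len(ports_per_alarm.get(alarm, ()))
        if seen != expected:
            anomalies.append(f"{alarm}: raised on {seen} port(s), expected {expected}")
    return anomalies

run = [("TU-AIS", "NE1", p) for p in range(1, 16)]  # 15 ports, as expected
print(check_footprint(run))        # -> [] (pass)
print(check_footprint(run[:1]))    # -> ['TU-AIS: raised on 1 port(s), expected 15']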

2.2.3 Footprints
The assumption is that the footprints can be utilised for a wider-based identification (classification) of a pass or a fail of an individual test. In any case where a sufficiently large number of pass and fail tests is available, it should be possible to use classification techniques, such as a neural network, to generate a pass/fail classifier for automated testing. Yet a drawback of many classification techniques, including neural networks, is that they do not provide any explanation of the decision. Inducing a probabilistic network from the data provides a much more visual footprint. Probabilistic networks, in which relationships between variables are represented by the existence of links between them, have an intuitive appeal: they are easy to "read" when represented graphically, and they can summarise fairly complex relationships succinctly. The approach offers great promise. From initial experimentation it would appear that the nets can be used as a classification technique. They cover all events that have occurred during a test and therefore provide the means to make up for the lack of a test engineer at the scene monitoring for anomalous activity.
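One minimal way to realise this idea in code is sketched below. It is our simplification, not the authors' method: the footprint is reduced to the bare set of co-occurring alarm-type pairs (standing in for the induced probabilistic network), and a new run is labelled by its most similar historical footprint under an assumed Jaccard similarity.

from itertools import combinations

def footprint(alarm_types):
    """Footprint of one run: the set of alarm-type pairs that co-occurred."""
    return set(combinations(sorted(set(alarm_types)), 2))

def jaccard(a, b):
    # Similarity of two footprints; 1.0 when both are empty.
    return len(a & b) / len(a | b) if a | b else 1.0

def classify(run, labelled_runs):
    """labelled_runs: list of (alarm_types, 'pass' | 'fail')."""
    fp = footprint(run)
    best = max(labelled_runs, key=lambda lr: jaccard(fp, footprint(lr[0])))
    return best[1]

history = [
    (["Comms fail", "Qecc-Comms_fail"], "fail"),
    (["TU-AIS"], "pass"),
]
print(classify(["Comms fail", "Qecc-Comms_fail", "RS-LOS"], history))  # -> fail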

3. Conclusion
3.1 Evaluation
Over the years this research has produced a useful study of applying different data mining techniques to several problems in the telecommunications domain. Since 1997, several data sets, usually of a month's duration, have been gathered each year from the environments to assist in this work. The most promising areas are fault management and identification, and test assurance in the equipment's R&D lifecycle.

3.2 Future work
Under the new JIGSAW project, as part of a data warehouse strategy, databases are being established to store all the necessary data from this point forward. This will enable the data mining applications discussed here to be used on a day-to-day basis, for instance as part of a decision support system with data mining at its core (Schuster [24]).

Acknowledgments
We are greatly indebted to our industrial collaborators at the Northern Ireland Telecommunications Engineering Centre (NITEC), Nortel Networks, who have supported our research for many years. We would also like to thank the EU (Stride programme 1993-95), the EPSRC (AIKMS programme 1995-97) and the IRTU (Start programme 1996-99) for funding this work.

References

[1] ITU, Types and General Characteristics of SDH Multiplexing Equipment, ITU-T (previously CCITT) Recommendation G.782, 1990.
[2] ITU, SDH Management, ITU-T (previously CCITT) Recommendation G.784, 1990.
[3] Bouloutas, A.T., Calo, S. and Finkel, A., Alarm Correlation and Fault Identification in Communication Networks, IEEE Trans. Comms, Vol. 42, No. 2/3/4, Feb/Mar/Apr 1994.
[4] EU/STRIDE 'NEST' Project, Collaborators: University of Ulster, Queens University of Belfast and Northern Telecom, 1993-95.
[5] EPSRC & DTI/AIKMS 'NetExtract' Project, Collaborators: University of Ulster, Nortel (Northern Telecom) and Transtech Parallel Systems, 1995-97.
[6] Adamson, K., A Knowledge Based Approach to Real Time Systems Modelling, Proc. of the 12th Int. IASTED Conf. on Applied Informatics, pp1-3, 1994.
[7] Moore, P., Shao, J., Adamson, K., Hull, M.E.C., Bell, D.A., Shapcott, M., An Architecture for Modelling Non-Deterministic Systems Using Bayesian Belief Networks, Proc. of the 14th Int. IASTED Conf. on Applied Informatics, pp254-257, 1996.
[8] IRTU/START 'GARNET' Project, Collaborators: University of Ulster and Nortel, 1996-99.
[9] Sterritt, R., Curran, E.P., Adamson, K., Towards a Graphical and Real-Time Network Simulation Toolset, in Adey, R.A., Rzevski, G., Nolan, P. (eds), Applications of Artificial Intelligence in Engineering XIII, CMP: Southampton, CD-ROM pp210-227, 1998.
[10] Sterritt, R., Daly, M., Adamson, K., Shapcott, M., Bell, D.A., McErlean, F., NETEXTRACT: An Architecture for the Extraction of Cause and Effect Networks from Complex Systems, Proc. of the 15th Int. IASTED Conf. on Applied Informatics, pp55-57, 1997.
[11] Sterritt, R., Adamson, K., Shapcott, M., Bell, D.A., McErlean, F., Using A.I. for the Analysis of Complex Systems, Proc. of the Int. IASTED Conf. on AI and Soft Computing, pp105-108, 1997.
[12] Sterritt, R., Adamson, K., Shapcott, M., Wells, N., Bell, D.A., Lui, W., PCAEGA: A Parallel Genetic Algorithm for Cause and Effect Networks, Proc. Int. IASTED Conf. on AI and Soft Computing, pp105-108, 1997.

[13] Sterritt, R., Shapcott, C.M., Adamson, K., Curran, E.P., Calvert, W., Johnson, R., Designing and Implementing an Automated Testing Approach for the Development of High Speed Telecommunication Equipment, accepted for the 18th Int. IASTED Conf. on Applied Informatics, 2000.
[14] Sterritt, R., Adamson, K., Shapcott, C.M., Curran, E.P., Adapting an Architecture for Knowledge Discovery in Complex Telecommunication Systems for Testing Assurance, Proc. NIMES 98 Conf. on Complex Systems, Intelligent Systems and Interfaces, pp37-39, 1998.
[15] Sterritt, R., Curran, E.P., Adamson, K., Shapcott, C.M., Application of AI for Automated Testing in Complex Telecommunication Systems, Proc. EXPERSYS 98, 10th Int. Conf. on Artificial Intelligence Applications, pp97-102, 1998.
[16] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., From Data Mining to Knowledge Discovery: An Overview, in Advances in Knowledge Discovery and Data Mining, AAAI Press / The MIT Press: California, pp1-34, 1996.
[17] Uthurusamy, R., From Data Mining to Knowledge Discovery: Current Challenges and Future Directions, in Advances in Knowledge Discovery and Data Mining, AAAI Press / The MIT Press: California, pp561-569, 1996.
[18] Adriaans, P., Zantinge, D., Data Mining, Addison-Wesley: England, 1996.
[19] Harrison, K., A Novel Approach to Event Correlation, HP Labs, Intelligent Networked Computing Lab, Bristol, HP-94-68, pp1-10, July 1994.
[20] Heckerman, D., Bayesian Networks for Data Mining, Data Mining and Knowledge Discovery, 1, pp79-119, 1997.
[21] Heckerman, D., Bayesian Networks for Knowledge Discovery, in Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. and Uthurusamy, R. (eds), Advances in Knowledge Discovery and Data Mining, AAAI Press / The MIT Press, pp273-305, 1996.
[22] Cooper, G.F. and Herskovits, E., A Bayesian Method for the Induction of Probabilistic Networks from Data, Machine Learning, 9, pp309-347, 1992.
[23] Chickering, D.M. and Heckerman, D., Learning Bayesian Networks is NP-hard, Technical Report MSR-TR-94-17, Microsoft Research, Microsoft Corporation, 1994.
[24] Schuster, A., Sterritt, R., Adamson, K., Curran, E.P., Shapcott, C.M., Towards a Decision Support System for Automated Testing of Complex Telecommunication Networks, submitted to the IEEE Int. Conf. on Systems, Man and Cybernetics, 2000.