From: AAAI Technical Report WS-94-03. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved.

Proactive Network Maintenance Using Machine Learning

R Sasisekharan, V Seshadri, S M Weiss

AT&T Bell Laboratories, HR2K033, 480 Red Hill Road, Middletown, NJ 07748
[email protected]

ABSTRACT

A new approach to proactively maintain a massively interconnected communications network is described. This approach has been applied to the detection and prediction of chronic transmission faults in AT&T's digital communications network. A windowing technique was applied to large volumes of diagnostic data, and these data were analyzed by machine learning methods. A set of conditions has been found that is highly predictive of chronic circuit problems, that is, problems that are likely to continue in the immediate future without diagnosis and repair. In addition, a few conditions have been found that are predictive of problems that affect multiple circuits. Such analyses over the complete network are helpful in proactively maintaining the network and in spotting trends for circuit problems. Proactive maintenance of the network can help in greatly improving the quality and reliability of a network by identifying potentially serious problems before they occur.

KEYWORDS

communications network, rule induction, proactive maintenance, machine learning, database mining

KDD-94

AAAI-94 Workshop on Knowledge Discovery in Databases

Page 453

INTRODUCTION

With the increasing complexity of modern communications networks, there is a commensurate need for intelligent systems to help manage and maintain them. Most desirable are systems that can analyze and resolve problems in the network automatically, thus greatly improving the reliability and quality of the network. Artificial intelligence (AI) techniques have proven useful in building network operation systems [1]. An area in which such techniques can have significant impact is the proactive maintenance of networks.

Proactive maintenance can greatly improve the quality and reliability of a network by identifying potentially serious problems before they degrade network performance. This can be accomplished by monitoring network performance over time and spotting trends in problems. In addition, monitoring network performance can help in prioritizing network problems. The ability to prioritize network problems can significantly improve the quality of the network, since those problems that have the most significant impact on the operations of the network can be addressed first.

Monitoring network performance involves analyzing extremely large amounts of diagnostic data that vary with time. Looking for patterns of behavior in such large volumes of data can only be accomplished by computer analysis. In this paper, we present an approach to such an analysis using machine learning techniques. We illustrate the approach by specifically looking at detection and prediction of chronic transmission faults in AT&T's world-wide digital communications network. Even though this paper focuses on telecommunications networks, where many homogeneous network components exist, our work is applicable to other types of networks as well.

In the following sections, we describe the application domain, explain the approach that we used to enable the machine to learn from time-dependent problems in the network, apply alternative methods of machine learning to our problem, and report the results we have obtained.

PROBLEM DESCRIPTION

Network operation systems (NOS) exist in the network to support provisioning, maintenance, operation, administration and management functions for the network and for individual network components. A circuit can be considered as a path in the network, which contains network components and links. Transmission problems on a circuit are seen by several of the network components through which the circuit connects. In a large network, such as AT&T's communications network, the ratio of diagnostic data generated by various network components to the root problem that is responsible for them is very large.

The different types of problems in the network can be broadly categorized into two classes, transient and non-transient. Transient transmission problems are very common in the network, yet their behavior and causes are not completely understood. Part of the difficulty in understanding them is related to separating the wheat from the chaff, that is, in learning to ignore glitches that will not be repeated and focusing instead on those transient problems that will recur (chronics). Chronics not only affect the quality of communications while they recur but also indicate degradation and potential future failures in the network. Thus it is an important and challenging problem to identify these chronics and isolate their causes.

Diagnostic procedures that attempt to resolve transient problems must rely on large volumes of historical information and a more complex analysis of patterns of behavior. One novel approach to diagnosing transient faults is found in an AT&T system known as SCOUT [2]. Using historical and topological information, SCOUT finds specific related circuits that share common patterns of faulty behavior. Typically, these are difficult transient problems, multiple circuit problems, or even forms of chronic faulty behavior. In this paper, we consider a related form of analysis of chronic
behavior. We consider the performance of the complete AT&T network over time. Our objective is to determine whether there are patterns of behavior over the network such that the following predictions can be made:

• The faulty behavior will continue in the immediate future.
• The faulty behavior involves multiple circuits.

We do not attempt to solve specific problems. Instead, our objective is to determine whether there are any signature characteristics for those network problems that do not get fixed. We may not have enough data to determine the exact nature of the problem, but we may be able to predict accurately that the problem will not be fixed quickly during the normal course of operations. In order to achieve maximum reliability of the network, problems with these signatures must be considered, and those problems that are deemed critical must be addressed quickly.

From a computer science perspective, interest in this area centers around a number of issues related to mapping time-dependent events into a standard classification framework. These include the following:

• Defining classes to convey the concept of "chronic."
• Defining features that summarize historical events.

In addition, the sample size in this analysis numbers in the tens of thousands of cases. Looking for patterns of behavior in such large volumes of data can only be accomplished by computer analysis using machine learning techniques, possibly resulting in new information that cannot be obtained by typical human experience.

THE MACHINE LEARNING APPROACH

In order to identify chronics that have the potential to degrade the performance of the communications network, we adopted a computer-based approach of learning from historical data. The machine learning methodology is described in this section.

Describing the Goals and Measurements

The circuit-related questions that we have outlined in the previous sections need to be posed in a standard classification format, so that a number of interesting analytical techniques that are available can be applied. Prediction models that can be applied to a standard classification problem include decision trees, decision rules, statistical linear discriminants, neural nets, and nearest neighbor methods. In the standard classification format, samples of cases are obtained. For each case, identical measurements are taken, and at least one of these measures is the class label. Methods are applied which attempt to find patterns for one class that differ from other classes.

For our first problem, the class label is chronic failure on a circuit, a concept that has been defined in previous sections. The goal is to predict that current failures will continue to occur. We must also take into account that these failures are often transient, and that failures will likely not occur continuously in the future. Instead, a failure may occur in the future, but the occurrence may also be transient. Periods of time that are reasonably close to the current period are of interest.

The measurements that are used for prediction must summarize historical information. These measurements are recorded each time a fault occurs. Not all measurements are recorded for every fault, only those that directly measure the fault process. Faults are often transient, so the trends for a period of time must be measured. It is quite possible that many faults will occur for a short time, but these faults are not necessarily chronic. They can be fixed and do not reappear in the immediate future. Measurements must be specified that are useful in predicting the target concept, that is, future failures on the circuit.


This time-dependent problem was mapped into a standard classification format by the use of fixed time windows. Historical information for circuits was examined over a consecutive period of time, and this time period was divided into two windows, W_a and W_b. The objective was to use the measurements made in W_a to predict that problems will occur in W_b. We considered both our knowledge of the application and experimental data to arrive at reasonable sizes for each of the two windows. The windows were also divided into sub-units based on time, each of which we will refer to as a time unit. We will refer to the size of W_a as T_a time units, and the size of W_b as T_b time units.

There are many reasonable measurements that can be taken over time. Assuming a fault occurs, an alarm or exception is noted. Included in the possibilities of measurements are the number of times such an event occurs, the average number of times an event occurs during a time unit, or the number of time units during which the event occurs. We defined around 30 performance features for this problem, based on the variation of diagnostic data over time and variation over space.

In addition to the time periods for the windows, another factor must be considered in defining the conditions for each window. This is the degree of chronic failure. For the two windows, W_a and W_b, the conditions that were chosen were different, as follows:

• W_a: any fault during the time period T_a.
• W_b: faults during at least half the time units of the following time period T_b.

The rationale for these periods is as follows. We must consider a prior period sufficiently long that prediction of continuing chronic behavior is feasible. We considered all circuits where a fault has occurred during W_a. Because faults are often transient, we must specify a reasonably long period for W_b. A fault that recurs over at least half the time units during W_b would be of interest because it indicates a clearly continuing and unresolved problem.
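The windowing scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `build_case`, the fault-log representation (a list of `(time_unit, event_type)` pairs per circuit), and the window sizes in the usage note are all hypothetical. Only one of the feature families mentioned in the text is shown, namely the number of time units during which each event type occurs.

```python
from collections import defaultdict


def build_case(faults, t_a, t_b):
    """Map one circuit's fault log onto a single classification case.

    faults: iterable of (time_unit, event_type) pairs, with time units
            numbered 0 .. t_a + t_b - 1 (hypothetical representation).
    t_a:    size of the observation window W_a, in time units.
    t_b:    size of the prediction window W_b, in time units.

    Returns (features, is_chronic), or None when the circuit had no
    fault during W_a and is therefore not a case at all.
    """
    units_by_type = defaultdict(set)  # event type -> W_a time units with that event
    faulty_units_b = set()            # W_b time units with any fault

    for time_unit, event_type in faults:
        if 0 <= time_unit < t_a:
            units_by_type[event_type].add(time_unit)
        elif t_a <= time_unit < t_a + t_b:
            faulty_units_b.add(time_unit)

    if not units_by_type:
        return None  # W_a condition: any fault during T_a

    # One feature per event type: number of W_a time units with that event.
    features = {etype: len(units) for etype, units in units_by_type.items()}

    # W_b condition: faults during at least half of the T_b time units.
    is_chronic = 2 * len(faulty_units_b) >= t_b
    return features, is_chronic
```

For example, with `t_a = 4` and `t_b = 4`, a circuit with faults in three of the four W_b time units would be labeled chronic, while a circuit with a fault in only one of them would not. Counting distinct time units rather than raw events keeps a short burst of alarms from looking like a persistent problem.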
In our learning experiments, we examined all circuits with problems during predefined intervals. It must be emphasized that the predictions to be made are those for continuing chronic failures. These are not necessarily the most obvious or acute problems, which are often diagnosed and fixed quickly.

The second question that was examined is whether any patterns of measurements are indicative of multiple circuit involvement. We refer to failures occurring on multiple circuits due to a common cause as multiple circuit problems. We rely on SCOUT's analysis to determine if the problem on a circuit is a multiple circuit problem. The question that we hoped to answer is whether there are certain types of faults within the complete network that consistently suggest multiple circuit problems.

Learning from Data

Once sample data are obtained, several computer-based techniques can be used to make predictions. Among the more prominent techniques are:

• statistical linear discriminants [5],
• neural nets [6],
• decision rules or trees [7][8], and
• nearest neighbor methods [9].

Both neural nets and linear discriminants make predictions based on weighted functions. The linear discriminant uses a simple additive scoring function, for example

If a*X + b*Y > 3, choose Class 1. (1)

The neural net, on the other hand, can model more complex decision functions, typically
non-linear functions. Decision trees and decision rules pose solutions in the form of true or false conditions. For example,

If A > 10 AND B [...]

[...] n_i, where X_Feature_i is a performance feature based on the number of time units during which an event of type i occurs and n_i is an integer. We will refer to a set of five conditions of the type described above that were generated as Ruleset0. Another condition of a different type was also generated. It was of the form

• Y_Feature_a > m_a & Y_Feature_b ¯ m_b & Y_Feature_c ¯ m_c

where Y_Feature is a performance feature that is not based on the count of time units and m is an integer. This condition is weaker than the other conditions. We will refer to the rule set that includes all the conditions in Ruleset0 and this last condition as Ruleset1. While the predictive value of Ruleset0 is better than that of Ruleset1, Ruleset1 covers a larger portion of the chronic problems. This is illustrated in Figures 2 and 3.

Figure 2 plots the performance of these two rule sets over the course of several time periods. The rule set was induced during a time period when the prevalence of chronic problems was somewhat lower than during other time periods. Thus any solution induced for that time period necessarily would be highly predictive to overcome the odds of the larger class. Figure 3 plots the percentage of each time period's chronic problems covered by the rule sets. Figure 4 plots the change in predictive value versus the number of time units with faults for a representative time period. Specifically, as the definition of chronic problems that are to be detected is made more selective, the performance improves in that the number of false alarms is reduced. At the same time, of course, coverage decreases, meaning that the number of transient problems that are no longer classified as chronic but have the potential to result in performance degradation increases.

Multiple Circuit Detection

Using the same measurements, we consider the detection of multiple circuit problems. Whether a problem on a circuit is a multiple circuit problem or not is determined by means of data obtained from SCOUT. For this application, we trained on data from one time period and tested on data from another time period. The classes are close to equal in size with a 45%-55% split. Two conditions emerged as particularly strong predictors of multiple circuit involvement. They were of the type:

• X_Feature_i > n_i

Although they cover a relatively small percentage of all circuits with problems, when either of these conditions occurs the likelihood that multiple circuits are involved is estimated at 93%. We have identified certain conditions that are common to both chronic problem detection and multiple circuit detection. Thus, we have come up with a rule that predicts when a problem with a circuit is likely to be chronic as well as affect multiple circuits.

CONCLUSIONS

If all problems in a communication network were either transient or quickly repaired, it would not be necessary to detect chronic problems. However, chronic problems do occur. Identifying patterns of these problems should be helpful in characterizing problems that are not detected and repaired quickly. In our analysis, we found that the number of time units over which events occurred was critical in determining the likelihood that the problem would continue in the future.
These rules suggest a form of momentum or inertia for chronic fault problems. There are a number of rationales for the validity of this form of analysis:

• Not all faults have this momentum to the same degree. We have identified those measurements that are predictive along with the corresponding thresholds, that is, the number of time units of faults during a window for which they are predictive.
• Of particular importance, any circuit that exhibits this behavior will likely continue this behavior. Thus, if the goal is to maximize reliability, circuits exhibiting these characteristics should be given priority in diagnosis and repair.

The most immediate application of these results is their use in reordering trouble tickets. Beyond that, though, we have provided a methodology to proactively maintain and monitor the performance of a network. The determination of trends can demonstrate whether progress is made in reducing chronic problems or whether chronic problems are increasing in the network. The best results for network performance occur when no patterns emerge or when they cover a smaller percentage of the problems.

Clearly, we are limited by the types of recorded measurements. If the causes of circuit problems were eventually determined and recorded, it might be possible to explore hypotheses directly related to the cause and repair of a problem. Such information is not currently available, and because of the complexity of a network with transient problems, such records may never be fully
available.

The second issue that we have addressed is the prediction of multiple circuit involvement. We found that certain types of faults are good indicators of multiple circuit problems. Resolving multiple circuit problems is particularly useful in reducing the overall number of problems in the network.

We have considered a highly complex communications network, and analyzed its behavior over time. To facilitate proactive maintenance, we have developed measurements that are sampled for the complete network during regular time periods. While our results are bounded by the predictive capabilities of these measurements, this form of analysis did produce reasonable predictors. The analysis involves intensive computer processing of very large volumes of data. In both objectivity and pattern matching capability, such efforts are beyond the capabilities of human processing and experience.

REFERENCES

[1] P. A. Corn, R. Dube, A. F. McMichael and J. L. Tsay, "An Autonomous Distributed Expert System For Switched Network Maintenance", Proc. of IEEE GLOBECOM '88, pages 1530-1537, 1988.
[2] R. Sasisekharan, Y-K. Hsu, and D. Simen, "SCOUT: An Approach To Automate Diagnoses Of Faults In Large Scale Networks", to appear in Proc. of IEEE GLOBECOM '93, Houston, 1993.
[3] S. Weiss and N. Indurkhya, "Reduced Complexity Rule Induction", Proceedings of IJCAI-91, pages 678-684, Sydney, 1991.
[4] S. Weiss and C. Kulikowski, Computer Systems That Learn: Classification And Prediction Methods From Statistics, Neural Nets, Machine Learning, And Expert Systems, Morgan Kaufmann, 1991.
[5] M. James, Classification Algorithms, Wiley, 1985.
[6] J. McClelland and D. Rumelhart, Explorations in Parallel Distributed Processing, MIT Press, 1988.
[7] L. Breiman, J. Friedman, R. Olshen and C. Stone, Classification and Regression Trees, Wadsworth, 1984.
[8] J. Quinlan, "Induction of Decision Trees", Machine Learning, vol. 1, pages 81-106, 1986.
[9] D. Aha and D. Kibler, "Noise Tolerant Instance-Based Learning Algorithms", Proc. of IJCAI-89, pages 794-799, Detroit, 1989.


Figure 2: Predictive performance of rule sets. (Plot of predictive performance, 0 to 100, against time.)

Figure 3: Percentage of chronic problems covered by each rule set. (Plot of percentage of cases covered, 0 to 100, against time.)

Figure 4: Performance of Ruleset0. (Plot of predictive performance and percentage coverage, 0 to 100, against number of time units with faults.)
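The two quantities plotted in Figures 2 through 4 can be computed along the following lines. This is a hedged sketch, not the authors' code. It assumes that predictive value means the fraction of flagged cases that are truly chronic, that coverage means the fraction of chronic cases that are flagged, and that a rule set fires when any one of its conditions holds. Each condition is modeled as a conjunction of threshold tests of the form X_Feature_i > n_i. The names `rule_fires` and `evaluate_ruleset`, and the feature names in the usage note, are hypothetical.

```python
def rule_fires(condition, features):
    """Return True if every (feature_name, threshold) test in the condition holds.

    Each test is read as features[feature_name] > threshold; a feature
    missing from the case is treated as 0 (an assumption of this sketch).
    """
    return all(features.get(name, 0) > threshold for name, threshold in condition)


def evaluate_ruleset(ruleset, cases):
    """Return (predictive_value, coverage) of a rule set on labeled cases.

    ruleset: list of conditions; a case is flagged if any condition fires.
    cases:   list of (features, is_chronic) pairs.
    """
    # Labels of the cases that the rule set flags as chronic.
    flagged = [label for feats, label in cases
               if any(rule_fires(cond, feats) for cond in ruleset)]
    true_hits = sum(flagged)
    total_chronic = sum(label for _, label in cases)

    predictive_value = true_hits / len(flagged) if flagged else 0.0
    coverage = true_hits / total_chronic if total_chronic else 0.0
    return predictive_value, coverage
```

Calling `evaluate_ruleset([[("A", 2)]], cases)` with successively larger thresholds on a labeled sample traces out the kind of trade-off described for Figure 4: a more selective condition tends to produce fewer false alarms but covers fewer of the chronic cases.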
