An Application of Neural Network and Rule-Based System for Network Management: Application Level Problems

Nittida Nuansri, Tharam S. Dillon*, Samar Singh*
Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, Victoria, 3083
* ACRI (Applied Computing Research Institute)
E-mail: {noi,tharam,samar}@cs.latrobe.edu.au

Abstract

The more complex a network becomes, the more reliable and intelligent a network management system must be to consistently monitor the network and detect abnormal situations in a timely manner as they occur. Expert system techniques have been widely adopted for building network management systems. Although a number of network management systems exist, most of them deal only with problems at the lower layers of the network hierarchy (the data link or the network layer). The nature of problems at the application level differs significantly from that of problems at the lower levels. Lower layer problems are well understood, while problems at the application level are complex, application dependent, and distinct from one another. Consequently, a network management system, in particular a fault management system, used at this level should be able to cope with these difficulties and dependencies. We propose a hybrid system which consists of a neural network module and a rule-based system for monitoring and diagnosing problems that occur at the application level. The domain name system (DNS) was selected as a testbed application for the prototype system.

1. Expert systems and network management

The more complex a network becomes, the more reliable and intelligent a network management system must be to consistently monitor the network and detect abnormal situations in a timely manner as they occur. Expert system techniques have been widely accepted and applied to create intelligent network management systems. Currently, there are many network management systems available, most of which were implemented using two artificial intelligence (AI) techniques: expert systems and neural networks.

Expert system techniques, mostly knowledge-based and rule-based, are probably the very first AI techniques that were used to create an automatic, intelligent network management system. They have been widely accepted and used to implement network management systems for almost a decade [22; 3; 19; 4; 10; 5]. These systems are similar in that they consist of a knowledge base, a rule base, and a control procedure. A typical knowledge base for a network application contains a representation of the network characteristics, including topological and state information. The knowledge base is mostly built using knowledge extracted from human experts and the relevant network information obtained from the network itself. The rule base represents the operations to be performed when the network is in an undesirable state. Network problems might be learned of from user complaints or from monitoring systems that can detect abnormal network status. If the network enters an undesirable state, the control procedure selects those rules that are applicable to the current situation. A rule can test the network, query a database, invoke another expert system, and so on. Several variations of rule-based reasoning techniques were used in the implementations.

Although these approaches are widely used, they fit well only in a domain where problems have a well-defined model or structure. In the network management area, especially at the application level, it is hard to model a significant part of the set of problems that may occur. Some problems may never have occurred before. In addition, in some cases, not all of the problems have yet been solved. This may be because of the difficulty of modelling the reasoning relating to a collection of knowledge, or because of the structure of the problems to be solved. It is difficult to apply expert system techniques alone to these complex domains. This leads to an alternative technique in which neural networks are applied.

Proceedings of The Thirtieth Annual Hawaii International Conference on System Sciences ISBN 0-8186-7862-3/97 $17.00 © 1997 IEEE

1060-3425/97 $10.00 (c) 1997 IEEE

Neural network techniques have recently attracted attention because of their ability to learn complex, nonlinear functions, and they have been used in some network management systems. However, most of this work is focussed on a narrow set of problems, e.g. routing and traffic management [20; 7; 8], or error correction at the digital communication level [18].

2. Remaining network management problems

Although a number of network management systems exist, most of them deal only with problems at the lower layers of the network hierarchy (the data link or the network layer). Thus problems occurring at these layers can be solved easily, while those at the upper layers, in particular the application layer, are relatively difficult to solve. The nature of problems at the application level differs significantly from that of problems at the lower levels. Lower layer problems are well understood, while problems at the application layer are complex, application dependent, and distinct from one another. In addition, the behaviour of applications is sometimes unpredictable and might depend on other events that are hidden or unknown at the time. Some applications can be considered fundamental applications which are used by other applications. Problems that occur in these applications may therefore induce other problems in the applications using them. These types of dependencies have to be taken into account when considering problem solving mechanisms.

To solve most of the upper layer problems, the original problems have to be traced. This means that each application which causes a problem, or has a tendency to cause a problem, has to be investigated so that a problem solving method can be determined. We are interested in the development of problem solving techniques that will allow us to accurately diagnose network application problems. To accomplish this goal, at least one application is required as a testbed application for the research.
Initially, the domain name system (DNS) has been selected because it is currently probably the only service that is required by almost all network applications, for example electronic mail, file transfer (FTP), and information services such as the Wide Area Information Service (WAIS), archie, and gopher. These applications rely on DNS services to translate host names into IP addresses and vice versa so that they can establish the network connections they need to carry out their tasks.

3. Domain name system and its problems

The DNS is a distributed database. It provides a mechanism for naming resources in such a way that the names are usable in different hosts, networks, and protocol families. The DNS consists of three major components [15]: a domain name space and resource records; name servers; and resolvers.

The domain name space and resource records. By design, the DNS internal name space is a variable-depth, inverted tree structure. The domain name space is the specification for this tree. Node names (labels) are variable-length strings of 0 to 63 octets [16]. A zero-length label is reserved for the root, which is written as a dot (".") in text. Each node of the tree has an associated label and represents part of the domain name system, called a domain. A domain is called a subdomain if it is contained within another domain. This is similar to the directories and subdirectories of a UNIX file system tree. Resource records are data associated with the names in the domain name space. For each domain, there is a set of resource records which contains information for that domain. This information is distributed over the network and is used by name servers to provide services to their clients. A client is a general network application or a user program which requires names and IP addresses to be resolved.

Name servers. Name servers are the repositories of information that make up the domain database. Each name server has complete information about the part of the domain name space for which it is responsible. This part of the information is called a zone, and the name server has authority for that zone. The subtle difference between a domain and a zone is that a zone contains the domain names and data that a domain contains, except for domain names and data that are delegated elsewhere (see figure 1). Each zone is controlled by a specific organisation which is responsible for distributing current copies of the zone to multiple name servers.
This makes the zones available to clients throughout the Internet. Zone transfers are typically initiated by changes to the data in the zone.

Figure 1: Domain vs. zone

Resolvers. A resolver is a program, typically a system routine, that interfaces between name servers and user programs (its clients). It extracts information from name servers in response to a client request. To evoke a response from a name server process, a resolver sends a request message, called a query, containing a domain name or IP address and a query code. When queried, the name server process might respond by answering the question directly, referring the client to another set of name servers, or signalling an error condition. Internet queries, including zone transfers, are carried either in User Datagram Protocol (UDP) datagrams or over a TCP connection. More information on each component, e.g. guidelines on use and implementation, syntax, and specification, can be found in [15] and [16]. Some interesting historical information on the domain name system and its development can be found in [17].

The implementation of the DNS name server software used on most UNIX systems is the Berkeley Internet Name Domain (BIND) software. Under BIND, the name server is a process called named; hereafter, named is used interchangeably with "name server process". This process runs all the time on a host designated as a name server. Some hosts do not run a named process, but they can access domain information from the domain name space by sending queries to a nearby name server.

Like the Internet, the DNS is probably also a victim of its own success. Its popularity, as well as some of its originally designed protocols [13; 14], led to problems in overall network performance [17]. Most of these problems have been solved for the time being, either by revising the DNS protocols [15; 16] or by revising the implemented software. Nevertheless, there are still a number of problems, mainly operational problems caused by human error, that can occur in the DNS. These errors affect not only the DNS itself but also the applications using its services. The DNS is normally not used directly by general users; rather, it is used by other network applications. Thus errors occurring in the DNS directly affect those applications.
For instance, the ftp program uses the DNS service to translate host names into the associated IP addresses. When there is a DNS problem in a domain, especially when the name server cannot supply the required address to the ftp request within a certain interval, the end user might get a response, apparently from the ftp command, such as "connection timed out". This does not necessarily mean that the time-out originated in the ftp command itself; it may instead be the result of the address query that ftp made to the domain name system. To avoid such correlated problems, the DNS must be reliable at all times. Ideally it should be error free. However, this is almost impossible in the real world; what we can do is minimise the number of errors that occur in the DNS so that other applications do not suffer because of them.

Currently, there are many DNS problems, some of which are not too difficult to recognise and correct, while others are harder. Some problems are caused by a single error whereas others are caused by a combination of errors. The latter are difficult to trace and solve, and require extensive knowledge, especially from experienced domain administrators.

3.1. DNS error report mechanism and error format

The named process reports errors every time they are detected. These errors are reported in a text format and can be logged in a log file on the system running named. However, not all of the error messages are logged. A system process, such as syslog on the UNIX operating system, filters out repeated messages and logs only new occurrences of messages within a set time interval. These error messages are usually used by network managers or domain name owners to monitor the domain name system. Although these error messages are important, in practice they are usually ignored. This may be because the DNS protocol and architecture try their best to serve clients even when something is wrong.
Normally, when an error message is reported by the name server process, it means that a problem has occurred somewhere, or that something is incorrect or "broken" and should be fixed. However, the effect of this breakage is not as serious as that of a physical link problem. In addition, the DNS always tries to provide an answer to every query as far as possible by means of its distributed architecture. For example, when a particular name server cannot provide an answer to a client request due to a malfunction, the name server process will try to query other name servers.

The size of the error log file is another issue that discourages human attention. Along with the real error messages reported and logged in the log file, there are several non-error messages. Thus the log file can be very big if not truncated, and can consume all the available disk space of the file system in which it resides. The size of the file and its contents are generally beyond the ability of human network managers or zone managers to scan for diagnosis of DNS problems. Example error messages extracted from the log file of a machine running a name server are shown in figure 2.

Oct 1 03:49:35 munnari named[100]: Lame delegation to 'PT' from [128.103.1.1] (server for 'pt'?) on query on name 'asterix.inescn.pt'
Oct 1 05:21:06 munnari named[8849]: Err/TO getting serial# for "cp.COM.AU"
Oct 1 05:21:21 munnari named[8849]: Zone "205.250.128.IN-ADDR.ARPA" (class 1) SOA serial# (1994092101) rcvd from [128.250.209.2] is < ours (1994130501)

Figure 2: Messages from named

Generally, each message (each line) consists of the name of the machine that logged the message, which is also the machine running the name server; the cause of the error, e.g. "Lame delegation to.."; and relevant diagnostic information. Looking at these error messages, it is clear that the cause of a problem is not easy to determine without good knowledge of, and some experience in, this domain. Unfortunately, not everyone meets these criteria, nor can many people acquire them in a short time. However, because the format and the meaning of each field of a logged message are known, these difficulties can be overcome. With adequate knowledge to interpret the messages and match them with the right causes, we can apply existing tools to solve the problem.

The use of these error messages is not straightforward, as some problems produce more than one type of error message while others are reported by only one. This makes the problems difficult to analyse and diagnose. To make use of the error messages, methodologies for analysing them and finding the cause of each are required. These methods are presented in section 5.1.

4. Related work

There are several tools currently available to aid in DNS problem diagnosis. These tools (doc, ddt, dnswalk, addhost/rmhost, etc.) are similar in that they are used to verify that a domain is configured or working properly, although their functionality varies slightly. For example, doc [12] and ddt [6] are domain debugging tools that provide functions to verify that a domain is configured and functioning correctly. They make no attempt to validate the data inside the domain, only its structure. On the other hand, dnswalk [1], which is meant to be a DNS database debugger, checks the internal consistency and correctness of an individual zone database. Addhost/rmhost [11] provides a convenient method to maintain a zone database file but does not perform any data validation. A tool presented in [2] checks for lame delegation problems and notifies the hostmasters of the domain that caused the problem. That tool detects only lame delegation problems, a function that dnswalk provides as well.

The above tools provide a way to scan for errors in zone database files or zone configuration that can cause DNS problems. Their common purpose is to minimise DNS errors that arise from configuration mistakes. However, none of them attempts to analyse or diagnose DNS problems that have actually occurred. There is currently a project called "the checker project" [9], which is focussed more on network problems caused by the DNS. Its main purpose is to quantify name server traffic, in particular traffic categorised as unnecessary, so that the number of packets traversing the network can be reduced.

So far, there is no real system that provides all the features required to solve current DNS problems. Ideally, such a system should be able to study and analyse the categories of problems that have occurred in the DNS, and to diagnose and report the causes of those problems or any mistakes in the zone information. It should also be able to suggest useful or possible methods for eliminating the problems. This research investigates this kind of problem and describes a prototype system that provides the required features. The system is described in the following section.

5. A hybrid system for diagnosing DNS problems

The goal of this work is to build a system that can detect and analyse problems and send notification when they occur. The system must be able to report problems automatically to the relevant people without human assistance. It has to learn particular knowledge and retain it for later use. We propose a hybrid system that combines a neural network system, which can learn from past data, with a rule-based expert system, which makes use of the output of the learning process. A tool called BRAINNE [21] (Building Representations for AI using Neural Networks) and NEXPERT are used in the system.


BRAINNE is an automated knowledge acquisition tool based on neural network learning techniques. Unlike typical neural network applications, which provide weight matrices after the learning process, BRAINNE provides symbolic knowledge in the form of IF-THEN rules that can be used to create a knowledge base. Thus, in our system, BRAINNE is used as an automated knowledge acquisition tool to extract the essential knowledge, namely DNS faults (errors) and their causes, while the NEXPERT system is later used as a rule-based expert system for analysis and diagnosis. The system can then be used to monitor the domain name system in a real situation by scanning error messages from the named log files. While doing this, the system checks the log files from time to time in order to obtain new rules that have not been seen before. These new rules, if any, are added to the rule-based system so that it is always up to date. Thus the rule-based system will cover as many potential errors as possible.

5.1. Knowledge acquiring method

The learning module of the system is required to learn from past errors; the desired result of the learning process is a relationship between an error, or a group of errors, and the corresponding causes. A supervised learning module is used, so the input pattern consists of two components: error messages and the known causes (faults). The first component can easily be obtained from a named log file, while the second is not easy to obtain, because the name server software does not explicitly state the cause of any problem it reports. Consequently, we have to create mechanisms by which we can obtain these faults and match them with a given error message or group of messages. To do this, we propose two methods: extracting knowledge from experience and human experts, and forcing faults on a name server.

5.1.1. Extracting knowledge from experience and experts. This is a very simple method of creating an input pattern for our system. We studied errors that had occurred in the past in several log files, then created pairs consisting of an error, or a group of errors, and the fault causing them. In doing this we used our own knowledge of the DNS and its software. Other documentation was also consulted [15; 16]. Sometimes, for verification, the source code of the relevant software was also used. In addition, several human experts in this field, especially domain administrators, were consulted.
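The pairs described above, an error or group of errors together with the fault behind them, could be recorded in a structure like the following. This is only a sketch: the class name, the helper, and in particular the fault labels are illustrative assumptions, since the fault classes actually used are not listed at this point in the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Set


@dataclass(frozen=True)
class TrainingPair:
    """One supervised example: errors seen together and their fault."""
    errors: FrozenSet[str]  # error message types observed together
    fault: str              # fault judged to have caused them


def errors_by_fault(pairs: Iterable[TrainingPair]) -> Dict[str, Set[str]]:
    """Collect, for each fault, every error type that was paired with it."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for pair in pairs:
        grouped[pair.fault].update(pair.errors)
    return dict(grouped)


# Illustrative pairs only; the fault names are hypothetical labels.
pairs = [
    TrainingPair(frozenset({"Lame delegation"}), "wrong delegation"),
    TrainingPair(
        frozenset({"forwarding loop", "contains our address"}),
        "wrong delegation",
    ),
    TrainingPair(
        frozenset({"Err/TO getting serial#", "secondary zone expired"}),
        "primary unreachable",
    ),
]
```

Keeping the errors of one pair as a set reflects the point that a single fault may be reported through a group of messages rather than one.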

5.1.2. Fault forcing method. Although the first method is simple and can provide the desired information, it has a flaw: the information obtained is not guaranteed to be correct or complete. The alternative is to create faults in the domain name system deliberately and observe the corresponding errors, so that each individual fault can be matched with the error message or group of messages reported for that fault type. With this methodology, the fault forcing process can be repeated as desired so that the obtained information is confirmed. Once the result is confirmed, it is used as an input pattern for the training process.

However, some types of faults are difficult to obtain by this method. Some problems are caused by, affected by, or dependent on other events that are beyond our control. For instance, some error messages reported in a log file occur because of the corruption of UDP packets used by the DNS. This is not a real DNS problem. Some errors are believed to result from flaws (bugs) in the software being used, especially old versions of named. It is not easy to create these kinds of errors using this method. We suggest instead that software still in use should be upgraded to versions in which these errors are fixed. There are also some errors which, while not impossible to create, take a large amount of time to be reported, partly because they also depend on other parameters of the environment, e.g. the system configuration or the underlying network protocols. These types of errors are, at this stage, ignored by this method. However, we can obtain them by the first method if required.

5.2. Problem analysis and diagnostic method

In the monitoring and diagnosing process, the system has to diagnose problems corresponding to the errors that were logged.
The diagnostic method is based on the knowledge of DNS faults and errors obtained from the knowledge acquiring process. From the study of the DNS errors it was found that DNS problems can be grouped into two main categories: problems caused by errors in the configuration of DNS database files, and problems that occur as a result of general network problems. These groups of problems affect the DNS to different degrees, and the methods used to solve them also differ. The errors caused by network communication problems are not too difficult to deal with, as the problem itself is not too complicated, whereas those caused by configuration errors are more complex. Most of the latter problems propagate around the network, which sometimes makes it difficult to trace them back to the site where the problem originated. The following sections provide a brief discussion of these groups of problems.

5.2.1. Wrong delegation problems. Because of the distributed architecture and implementation of the DNS, information for a particular zone always exists in more than one place. When the information is not consistent among the name servers, which happens all the time, diagnosis is difficult because the incorrect information is propagated around the network along with the correct information. Besides this, some detected problems can be transient. This can occur when zone information has just been changed at one name server while other participating name servers have not yet updated their data. In this case, the error vanishes once the information has reached all relevant name servers and they hold consistent data. Information inconsistency is the major problem of the DNS and also leads to several wrong delegation problems.

It was found that the majority of the errors reported by named are caused by incorrect DNS data due to wrong configuration and delegation. The most common messages, which dominate a log file, are lame delegation, forwarding loop and contains our address. These error messages are reported in a log file as shown in figures 3 to 5.

Fig 3: Oct 24 21:51:49 munnari named[2718]: Lame delegation to 'MILLS.EDU' from [128.32.136.12] (server for 'MILLS.EDU'?) on query on name 'varese.mills.edu'
Fig 4: Oct 25 00:09:05 munnari named[2718]: ns_forw: query(2.239.2.203.in-addr.arpa) contains our address (munnari.oz.au:128.250.1.21)
Fig 5: Oct 24 22:39:30 munnari named[2718]: ns_forw: query(129.108.242.131.in-addr.arpa) forwarding loop (dpigw.ind.dpi.qld.gov.au:131.242.51.208)

These three error messages basically have the same meaning, namely "there is something wrong in the zone delegation information", but they are detected in different contexts relative to where the error actually occurred. For instance, in the error shown in figure 3, a problem was detected and reported at a name server machine called munnari. It indicates that there is a wrong delegation to MILLS.EDU from the name server at 128.32.136.12. The problem is not at the detecting name server but at some other name server. From this information alone, it is not possible to say conclusively at which name server the mistake occurred. Additional information is required to diagnose this problem.

Figure 4 indicates a problem that is related to the name server detecting it. In this case munnari received a query that was forwarded from some other name server in the expectation that munnari could answer it; in other words, munnari is expected to be a name server for the zone to which the query was sent. However, it is not currently acting as a name server for that zone.

Figure 5 is also related to the name server detecting the error. The problem occurs when munnari receives a query and cannot provide an authoritative answer to it. It then looks for an authoritative name server to which it should forward the query, but that server turns out to be the one that sent the query. Forwarding the query again would create an infinite loop, since that server is known to forward this query to munnari, so munnari reports this error and does not send the query to that name server.

5.2.2. Network related problems. This group of problems occurs when a secondary name server cannot update the zone data of a particular zone. It is normally caused by the inability of a secondary name server to contact the primary name server of that zone for a period of time. Normally a secondary name server has to refresh the zone data of each zone at the refresh interval specified in the SOA record of that zone. This process involves sending a query to the primary name server asking whether the zone information has changed. It expects the answer to this query to come back within a particular time; otherwise it will establish a connection to the primary name server in order to transfer a copy of the zone data. Unless the information is updated before its specified expiry time is reached, it expires and the name server stops answering any query related to that zone. There are several error messages reported by the named process in response to this problem. These are listed in figures 6 to 11.
Fig 6: Oct 24 21:58:46 munnari named[2718]: Err/TO getting serial# for "128.163.IN-ADDR.ARPA"
Fig 7: Oct 24 21:58:47 munnari named-xfer[1042]: bad response to SOA query from [163.128.2.8], zone 128.163.IN-ADDR.ARPA: rcode 0, aa 0, ancount 0, aucount 2
Fig 8: Oct 24 23:36:24 munnari named[2718]: zoneref: Masters for secondary zone maths.mu.OZ.AU unreachable
Fig 9: Oct 24 23:36:24 munnari named-xfer[3127]: connect(128.250.35.32) failed: Connection timed out
Fig 10: Oct 25 02:03:00 munnari named-xfer[7593]: connect(129.214.1.100) failed: Connection refused
Fig 11: Oct 25 08:00:09 munnari named[2718]: secondary zone "128.163.IN-ADDR.ARPA" expired
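The messages in figures 3 to 11 share one line layout (timestamp, machine, process and process id, message text), and each falls into one of the two problem groups of sections 5.2.1 and 5.2.2. A minimal sketch of splitting a line into fields and assigning a group by keyword is given below. The regular expression, the keyword table and the function names are assumptions made for illustration; they are not the N_transform function or the NEXPERT rules of the prototype.

```python
import re
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: str
    host: str      # machine that logged the message
    process: str   # e.g. "named" or "named-xfer"
    pid: int
    message: str


# Layout assumed from the figures: "Oct 24 21:58:46 munnari named[2718]: text"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+(?P<proc>[\w-]+)\[(?P<pid>\d+)\]:\s+(?P<msg>.*)$"
)

# Keyword -> problem group; keywords are taken from figures 3 to 11.
GROUPS = [
    ("Lame delegation", "wrong delegation"),
    ("contains our address", "wrong delegation"),
    ("forwarding loop", "wrong delegation"),
    ("Err/TO getting serial#", "network related"),
    ("bad response to SOA query", "network related"),
    ("unreachable", "network related"),
    ("Connection timed out", "network related"),
    ("Connection refused", "network related"),
    ("expired", "network related"),
]


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one log line into fields; return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return LogEntry(m.group("ts"), m.group("host"), m.group("proc"),
                    int(m.group("pid")), m.group("msg"))


def classify(entry: LogEntry) -> str:
    """Return the problem group of a message, or 'unclassified'."""
    for keyword, group in GROUPS:
        if keyword in entry.message:
            return group
    return "unclassified"
```

A keyword table of this kind only names the group a message belongs to; as the discussion of figure 3 shows, locating the server at fault needs information beyond the single line.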

Like the first group of problems, these problems can also be transient, especially the problem reported by the error message in figure 6 if it results from a time-out when a secondary name server sends a query to the primary name server. (The error message means either an error in the received packet containing the answer to the query, or a time-out before the query was answered.) If the secondary name server can contact the primary name server later, the error message is not reported again and it can be assumed that the problem was transient. However, when the problem persists, the named process reports the other related error messages shown in figures 6 to 11, which may indicate that real problems exist. Ideally, these problems should be solved before the affected name server expires its zone data. Once the zone information has expired, the name server will not answer any query related to that zone. Other name servers do not know this and still query it for the desired information as usual; they never get the desired answer and have to send the same query to other name servers for that zone, if there are any. If this happens repeatedly over a long period of time, it leads to unnecessary network traffic.

Apart from the above errors, which can lead to the expiration of the zone data, there is another type of problem that does not belong to the network connection group. This error is reported as shown in figure 12.

Fig 12: Oct 25 08:31:23 munnari named[2718]: Zone "cc.monash.EDU.AU" (class 1) SOA serial# (1994102405) rcvd from [130.194.1.99] is < ours (1994106404)
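The message in figure 12 carries both serial numbers, so the comparison it reports can be reproduced from the log line alone. The sketch below extracts the two values and applies a plain integer comparison; the pattern, the function name and the returned field names are assumptions for illustration, and the way named itself compares serial numbers may differ.

```python
import re
from typing import Optional

# Pattern assumed from the layout of the message in figure 12.
SERIAL_RE = re.compile(
    r'Zone "(?P<zone>[^"]+)" \(class \d+\) SOA serial# \((?P<rcvd>\d+)\) '
    r"rcvd from \[(?P<src>[\d.]+)\] is < ours \((?P<ours>\d+)\)"
)


def check_serial(message: str) -> Optional[dict]:
    """Extract both serials and decide whether an update would be accepted."""
    m = SERIAL_RE.search(message)
    if m is None:
        return None
    received = int(m.group("rcvd"))
    ours = int(m.group("ours"))
    return {
        "zone": m.group("zone"),
        "source": m.group("src"),
        "received": received,
        "ours": ours,
        # Assumption: a simple integer comparison; an update is accepted
        # only when the received serial is higher than the one held.
        "accept_update": received > ours,
    }
```

Applied to the message in figure 12, the received serial 1994102405 is lower than the held serial 1994106404, so no update is accepted.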

This error message is reported when the serial number of the SOA record received from the primary name server is lower than the one the secondary name server holds in its cache. The serial number in the SOA record is required to increase whenever the zone data changes [16]. Whenever a name server finds that the new serial number of a particular zone is lower than the current one in its cache, it refuses to update its zone information to match that of the primary name server. If it keeps detecting the same condition throughout the period after which it must expire the data held in its cache, it expires the data rather than updating to the new data.

6. System implementation

The system consists of several main components, depicted in figure 13: a data acquiring process, data transformation functions (B_transform and N_transform), BRAINNE, NEXPERT, and an interface. The data acquiring process acquires input patterns for the learning module and was described in section 5.1.

[Figure 13 diagram; box labels: data acquiring, logfile, data transformation (twice), training pattern, BRAINNE, NEXPERT KB, NEXPERT, interface, report (twice); grouped into a Learning Process and a Monitoring and Diagnosing Process.]

Figure 13: Hybrid system

6.1. Data transformation functions

Two transformation functions, B_transform and N_transform, convert the raw data obtained from the log file from its textual format into the formats required by BRAINNE and NEXPERT respectively.

B_transform

BRAINNE accepts two types of data, continuous and discrete. In our domain, the input data are discrete. They consist of a set of error messages as input classes and a set of corresponding faults as output classes for the supervised learning method. For a discrete data type, each data class is presented to the learning process as either 0 or 1, where
0 is the absence of data
1 is the presence of data
However, the raw data is in a textual format which is not understood by the learning module. B_transform transforms the obtained information into the required format.

N_transform

N_transform presents real problems reported by a particular name server to the expert system which monitors and diagnoses DNS problems. However, the messages logged in the log file are not all errors. Some of them are just log events, as shown in figure 14.
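The discrete presentation described for B_transform can be sketched as follows. This is a minimal illustration under stated assumptions: the class lists are a small hypothetical subset borrowed from the rule labels in figure 15, the function names are ours, and the actual input file layout expected by BRAINNE is not given in the text and is not reproduced here.

```python
# Hypothetical subsets of the input (error message) and output (fault)
# classes; the full class lists used for training are not given in the text.
ERROR_CLASSES = [
    "forwarding loop",
    "Err/TO getting serial #",
    "masters for secondary zone unreachable",
    "connection refused",
]
FAULT_CLASSES = [
    "configuration error",
    "no name server process running",
]


def encode(present, classes):
    """Return a 0/1 list: 1 where a class is present, 0 where it is absent."""
    present = set(present)
    unknown = present - set(classes)
    if unknown:
        raise ValueError("unknown classes: %s" % sorted(unknown))
    return [1 if name in present else 0 for name in classes]


def b_transform(observed_errors, fault):
    """Build one supervised training pattern as (input bits, output bits)."""
    return encode(observed_errors, ERROR_CLASSES), encode([fault], FAULT_CLASSES)


if __name__ == "__main__":
    pattern = b_transform(
        ["Err/TO getting serial #", "connection refused",
         "masters for secondary zone unreachable"],
        "no name server process running",
    )
    print(pattern)
```

Keeping the class order fixed is what makes each bit position meaningful across training patterns, which is why the sketch rejects class names it does not know instead of silently dropping them.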


.. Zone "193.IN-ADDR.ARPA" (class 1) xfer’d and loaded
.. approved AXFR from [137.172.248.100].4787 for "oz.au"
.. zone transfer of "oz.au" to [137.172.248.100] (pid 18063)

Figure 14: Non-error messages

These messages are not of interest and are not included as input to NEXPERT. The transform function therefore filters these irrelevant messages out before processing the remaining information.

In addition, the error messages in the log file concern many domains, although all of them were detected at the machine logging the errors. A name server normally interacts with more than one other name server, so it logs messages caused by errors at many servers. These appear in one log file in temporal order. When an error causes multiple messages, messages related to other errors are likely to be interspersed among them, which makes manual interpretation of the log file difficult.

In order to sort the errors and group them appropriately, host or zone identification is required. Unfortunately, named reports errors using any of host names, zone names, or IP addresses. Host names and IP addresses can be translated into each other using the gethostbyname and gethostbyaddr functions. However, there is no function to translate between zone names and server IP addresses or names, and most errors are reported using zone names. The information required to make this mapping is obtained by parsing the ‘‘named.boot’’ file, a configuration file used by the named process, on the host where the errors are logged.

6.2. BRAINNE and NEXPERT

After the learning process, BRAINNE provides a set of rules, part of which is shown in figure 15.

Rule 1 (covers 123 exs OK, 0 exs NOT_OK)
( (1) forwarding loop == yes )
==> (0) configuration error

Rule 13 (covers 221 exs OK, 0 exs NOT_OK)
( (7) Err/TO getting serial # == yes )
( (11) connection refused == yes )
( (8) masters for secondary zone unreachable == yes )
==> (3) no name server process running

Figure 15: Output from BRAINNE

These rules are then used to create a rule base, part of which is shown in figure 16.
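Each rule in figure 15 reads as a conjunction of conditions over the message types seen for one host or zone. As a rough illustration only (plain Python, neither BRAINNE output nor NEXPERT syntax), the sketch below groups interleaved log events by zone, as section 6.1 requires, and reports every rule whose conditions are all present. The event tuples, the function names, and the zone name example.zone are assumptions made for the example.

```python
from collections import defaultdict

# The two rules of figure 15 as (required message types, diagnosis) pairs.
RULES = [
    (frozenset({"forwarding loop"}), "configuration error"),
    (frozenset({"Err/TO getting serial #",
                "connection refused",
                "masters for secondary zone unreachable"}),
     "no name server process running"),
]


def group_by_zone(events):
    """Collect the message types seen per zone from (zone, type) pairs."""
    grouped = defaultdict(set)
    for zone, message_type in events:
        grouped[zone].add(message_type)
    return grouped


def diagnose(events):
    """Return {zone: [diagnosis, ...]} for rules whose conditions all hold."""
    result = {}
    for zone, seen in group_by_zone(events).items():
        fired = [diagnosis for conditions, diagnosis in RULES if conditions <= seen]
        if fired:
            result[zone] = fired
    return result


if __name__ == "__main__":
    # Events for two zones interleaved in log order (hypothetical data).
    events = [
        ("maths.mu.OZ.AU", "Err/TO getting serial #"),
        ("example.zone", "forwarding loop"),
        ("maths.mu.OZ.AU", "masters for secondary zone unreachable"),
        ("maths.mu.OZ.AU", "connection refused"),
    ]
    print(diagnose(events))
```

Grouping before matching is the point of the sketch: without it, the forwarding loop event logged in the middle would separate the three messages that together satisfy Rule 13.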

(@RULE= R1
    (@LHS= (Yes (forwarding_loop)))
    (@HYPO= configuration_error)
    (@RHS= (Execute ("forward_loop")))
)
(@RULE= R13
    (@LHS=
        (Yes (Err/TO_getting_serial_#))
        (Yes (connection_refused))
        (Yes (masters_for_secondary_zone_unreachable)))
    (@HYPO= name_server_problem)
    (@RHS= (Execute ("check_ns")))
)

Figure 16: NEXPERT rule base

6.3. Interface function

This function reads the error messages provided by the N_transform function and passes them to NEXPERT in an appropriate format so that the relevant function is activated. The input to the function is an error message; the output is the value required to activate the atom type corresponding to that error. The function also extracts a machine name or zone name from the error message and passes it to the activated function, which uses this information to send a report to the zone administrator.

7. Result

After the rule base was created from the rule sets produced by the learning process, the diagnostic system was tested with real data from named log files. During testing, several new error messages were found and corresponding new rules were added to the rule base. This happened because some errors occur rarely and were not encountered during the learning process, so they were not included in the early version of the rule base. Once the rule base was satisfactory, the system was used to diagnose a daily log file, or the named log file. The rule base was found to cover most of the error types, so the diagnostic system is able to diagnose almost all error messages. Only about 0.5% of the errors were unrecognised (unknown to, and rejected by, the diagnostic system). They were reported as ‘‘not enough memory’’, ‘‘network unreachable’’, ‘‘zone removed’’, and ‘‘file table overflow’’. However, three of them (‘‘not enough memory’’, ‘‘network unreachable’’, and ‘‘file table overflow’’) are not DNS problems; they were reported because of problems on the machine running the name server. Although the ‘‘zone removed’’ error message seems to be a normal operational occurrence, surprisingly it was found that it was also caused by a system problem.


This happened when the system had been in an unhealthy state for a long time: it had an I/O error and could not read the list of zones. It then decided that it should serve none, so all zones were removed. However, when a new error is found, a corresponding rule can easily be added to the rule base. This is the advantage of using a rule-based system as part of the diagnostic process. The system does not require retraining in order to add new rules. Retraining might nevertheless be necessary when the system is applied to applications that are not stable, or that are still in the development and testing phase. The errors found in this type of application seem to vary and change more frequently than those of applications which have been in use for a long period. Thus it might be better to perform the learning process again occasionally, so that new rules are found. It can be easier to re-create a rule base than to modify the existing one.

Interesting Statistics

This section presents some statistics obtained from running the diagnostic system. Figures 17 and 18 are derived from a log file of a name server called munnari.oz.au† for the period September 19 to October 5, 1995. In figure 17, error messages are grouped by the cause of the error. There are three subgroups: configuration errors, network related errors, and others. The figure lists the type of error, the number of occurrences, and the percentage of occurrences. In this figure, identical error messages are counted regardless of which name server or zone produced them. Figure 18 shows the number of zones involved in errors of each category. Note that the total number of messages logged is not the same as the number of error messages. This is because the log file also contains other types of messages reported by named, such as warnings or operational report messages.
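The percentage column of figure 17 is not defined in the text. The values appear consistent with each count divided by the number of messages logged (61390) and truncated, not rounded, to three decimals; this reading is inferred from the numbers and is an assumption. The short check below uses counts copied from figure 17; the function name is hypothetical.

```python
# Totals copied from figure 17.
TOTAL_ERRORS = 61384
TOTAL_LOGGED = 61390

# Group subtotals and a few individual counts, also from figure 17.
GROUP_TOTALS = {"configuration": 48737, "network related": 11537, "others": 1110}
SAMPLE_COUNTS = {"lame delegation": 42709, "timed out": 3569, "malformed response": 1026}


def percent_truncated(count, total=TOTAL_LOGGED):
    """Percentage of total, truncated to three decimals (assumed convention)."""
    thousandths = count * 100 * 1000 // total  # integer arithmetic avoids rounding
    return thousandths / 1000


if __name__ == "__main__":
    # The three group subtotals add up to the number of error messages.
    print(sum(GROUP_TOTALS.values()) == TOTAL_ERRORS)
    for name, count in SAMPLE_COUNTS.items():
        print(name, percent_truncated(count))
```

Under this reading, the six messages that separate the two totals are the non-error entries (warnings or operational reports) mentioned above.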
Error Message                            #         %
Configuration Errors
  lame delegation                    42709    69.569
  attempted update to auth zone       1751     2.852
  response from unexpected source     1705     2.777
  forwarding loop                     1418     2.309
  contains our address                 563     0.917
  wrong serial number                  528     0.860
  cname error                           51     0.083
  database format error                  8     0.013
  unknown type                           2     0.003
  outside zone                           2     0.003
  Total                              48737    79.38
Network Related Errors
  Err/TO                              5546     9.034
  timed out                           3569     5.813
  bad SOA                             1108     1.804
  masters zone unreachable             844     1.374
  connection refused                    38     0.061
  zone expired                         432     0.703
  Total                              11537    18.792
Others
  malformed response                  1026     1.671
  not enough memory                     80     0.130
  address reuse                          4     0.006
  Total                               1110     1.808
Number of error messages             61384
Number of messages logged            61390

Figure 17: Errors found from a sample log file

Error Type                                         # of zones
lame delegation                                          4071
Err/TO + masters for secondary zone unreachable           382
Response from unexpected source                           137
forwarding loop                                           106
contains our address                                       79
attempted update to auth zone                              62
Err/TO + connection refused                                11
Err/TO + bad SOA                                            8
wrong serial number                                         3
zone expired                                                3

Figure 18: Number of zones involved in each group of error

Output and Report

The output of each diagnostic function is a report of the cause of the problem and of the name server or zone from which the problem originated. It also provides other information that might be useful for correcting the problem, including the error message or messages that were reported. Suggestions for correcting some types of error are also given. Ideally, this output should be reported to the persons, normally hostmasters, who control or have authority over the zone that caused the problem. Thus the report should be sent to those people, perhaps by using e-mail. However, at this stage we decided not to notify them, but have kept a log of the output messages. Co-operation and understanding among the people involved are required before the reports are sent out. Otherwise the report messages might annoy instead of being a useful piece of information, as pointed out by [2].

† munnari.oz.au is a major name server, being either the primary or a secondary server for hundreds of zones.


8. Summary

Diagnosing problems that occur at the application layer in a complex and dynamic network is a difficult task. Traditional algorithmic problem solving techniques do not adequately solve current network diagnosis problems, nor does any single AI technique on its own. We make use of the strong points of two AI techniques (neural networks and expert systems) and combine them to develop a hybrid system that can learn from, monitor and diagnose problems that occur at the network application level. To apply this approach, we investigated the characteristics of the domain name system and the problems that have occurred in it, then used its errors as input patterns to the neural network module in order to extract the relationship between a particular error, or a group of errors, and its causes. The knowledge obtained from this step was used to create a rule-based system which is used to monitor and diagnose DNS problems. Future work will involve investigations to apply this method to other network applications.

9. Acknowledgement

We would like to acknowledge the permission of AKT Systems, PO Box 452, Caulfield East, VIC 3145, to use the BRAINNE functions in the AKAT software for this project.

10. References

1. D. Barr, “dnswalk - A DNS Database Debugger,” ftp.luth.se::/pub/unix/dns/tools (1994).
2. B. Beecher, “Dealing with Lame Delegation,” LISA VI (October 1992).
3. L. Bernstein and C. M. Yuhas, “Expert Systems in Network Management - The Second Revolution,” IEEE Journal on Selected Areas in Communications, 6, 5, pp. 784 - 787 (June 1988).
4. R. N. Cronk and P. H. Callahan, “Rule-Based Expert Systems for Network Management and Operations: An Introduction,” IEEE Network, pp. 7 - 21 (September 1988).
5. M. Feridun, M. Leib, M. Nodine, and J. Ong, “ANM: Automated Network Management System,” IEEE Network, 2 No. 2, pp. 13 - 19 (March 1988).
6. J. Frazao and A. Romao, “ddt: Domain Debugging Tool,” ftp.luth.se: /pub/unix/dns/tools (1995).
7. T. Fritsch and W. Mandel, Communication Network Routing using Neural Nets - Numerical Aspects and Alternative Approaches, pp. 752 - 757 (1993).
8. R. Fujii, M. F. Tenorio, and H. Zhu, “Use Of Neural Nets in Channel Routing,” IJCNN’89 International Joint Conference on Neural Networks, 1, pp. I-321 - 325.
9. R. M. Goodman, J. Miller, and P. Smyth, “Real Time Autonomous Expert Systems in Network Management,” Integrated Network Management: Proceedings of the IFIP TC 6/WG 6.6 Symposium on Integrated Network Management, pp. 599 - 624 (May 16-17 1989).
10. J. J. Hannan, “Network Solutions Employing Expert Systems,” IEEE Annual International Phoenix Conference on Computer and Communications, pp. 543 - 547 (1987).
11. J. C. Hardt, “addhost, rmhost,” ftp.luth.se::/pub/unix/dns/tools (1992).
12. S. Hotz, P. Mockapetris, and B. Knowles, “doc Diagnose Unhealthy DNS Domain,” ftp.luth.se::/pub/unix/dns/tools (1995).
13. P. Mockapetris, “Domain Names - Concepts and Facilities,” RFC 882 (November 1983).
14. P. Mockapetris, “Domain Names - Implementation and Specification,” RFC 883 (November 1983).
15. P. Mockapetris, “Domain Names - Concepts and Facilities,” RFC 1034 (November 1987).
16. P. Mockapetris, “Domain Names - Implementation and Specification,” RFC 1035 (November 1987).
17. P. V. Mockapetris and K. J. Dunlap, “Development of the Domain Name System,” SIGCOMM’88 Symposium Communications Architectures and Protocols, pp. 123-133 (August 1988).
18. A. Ortuno, M. Ortuno, and J. A. Delgado, Neural Networks as Error Correcting Systems in Digital Communications (1992).
19. C. Radcliffe, “An Expert System for Integrated Network Management,” Proceedings of the International Conference on Network Management (June 1988).
20. H. E. Rauch and T. Winaske, “Neural Networks for Routing Communication Traffic,” IEEE Control System Magazine, pp. 26 - 31 (April 1988).
21. S. Sestito and T. S. Dillon, Automated Knowledge Acquisition, Prentice Hall (1994).
22. T. M. Smith, “The Network Management Domain,” ICL Technical Journal, pp. 763 - 779 (November 1991).
