Evaluating Management Decisions via Delegation

German Goldszmidt and Yechiam Yemini
Distributed Computing and Communications Lab., Computer Science Department, Columbia University, New York, NY 10027, USA

(Published in Proceedings of the IFIP International Symposium on Network Management, April 1993)

Abstract
A central problem of network management is compressing vast amounts of real-time operational data to accomplish management decisions. A health function provides such efficient compression by combining managed data linearly into a single index of network state. This paper describes a computational theory of network behaviors, their observations, and threshold decisions via health functions. The model is used to examine the feasibility of various methods for computing health functions. Health functions cannot be included as part of a static MIB design, as they may vary from site to site and over time. Nor can they be usefully computed at centralized management platforms, since this can result in excessive polling rates, lead to errors due to perturbation introduced by polling, and miss the very goal of compressing data maximally at its source. Instead, we propose the use of the management by delegation paradigm to support flexible and effective evaluation of health functions and linear threshold decisions at devices.
Research supported by NSF contract # NCR-91-06127.
1 Introduction

A network management system needs to support effective management decisions based on vast amounts of real-time operational data. One may view such decision processes as methods to compress this vast management information into simpler decision information. Often, much of the compression needed to evaluate management decisions is accomplished through manual processes. Most network management systems are passive and offer little more than interfaces to raw or partly aggregated and/or correlated data in MIBs [MKB91]. Developing effective technologies to support compression of management information is a central problem of network management. Given the increasing plethora of manageable resources, the explosion of standard and vendor-specific private MIBs and managed information, and the increasing scale of networks, this task is extremely difficult.

One method to compress operational data is to compute functions of managed information, reducing a large number of observed operational variables to a single indicator of the network state. This is not unlike the use of different indexes to reflect the state of the securities market and the economy. Such indexing typically utilizes linear aggregation of a large number of variables, each providing a different microscopic measure of state. We call such a linear weighted function of managed variables a health function.

In the context of SNMP [CFSD90], observations of operational variables are typically accomplished via counters and gauges. A counter represents a cumulative sum (integral) of an operational variable. Typically, however, only the change in the counter, rather than its value, provides a useful indication of the network state. For example, the MIB-II [MR91] counter ifInOctets accounts for the total number of bytes received by an interface since device initialization. Only the rate at which this counter changes contributes an indication of the network state.
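As a minimal sketch of the point above, the rate of change of a counter can be approximated from two successive samples. The function name `counter_rate` and the sample values are illustrative assumptions rather than part of the paper, and the wrap handling assumes a 32-bit counter that wraps at most once between samples.

```python
COUNTER32_MODULUS = 2 ** 32  # MIB-II counters are 32-bit and wrap around


def counter_rate(prev_value, prev_time, curr_value, curr_time):
    """Approximate the rate of change of a wrapping 32-bit counter.

    Times are in seconds; the result is in counter units per second.
    Assumes at most one wrap occurred between the two samples.
    """
    if curr_time <= prev_time:
        raise ValueError("samples must be strictly ordered in time")
    # Modular subtraction keeps the difference correct across a wrap.
    delta = (curr_value - prev_value) % COUNTER32_MODULUS
    return delta / (curr_time - prev_time)


# Hypothetical ifInOctets samples taken 10 seconds apart.
rate = counter_rate(1_000_000, 0.0, 1_250_000, 10.0)
print(rate)  # octets per second
```

The shorter the sampling interval, the closer this difference quotient comes to the instantaneous rate, which is one reason to take the samples close to the device.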
Health functions will, typically, utilize a linear combination of such rates at which status indicators vary. It is often useful to combine MIB variables into more useful status indicators. For example, the utilization of an interface at time $t = \mathrm{sysUpTime}$ can be defined as

$$U(t) = \frac{(\mathrm{ifInOctets} + \mathrm{ifOutOctets}) \times 8}{\mathrm{ifSpeed} \times \mathrm{sysUpTime} \times 100}$$

where ifInOctets (ifOutOctets) gives the total number of bytes received (sent). This measure provides an average sense of utilization over a time
window since boot-time. A useful indication of the instantaneous network state is provided by the derivative $u(t) = U'(t)$, as per the discussion above. This derivative may be approximated by frequent sampling of the respective managed variables and computation of the changes in $U$. Similarly to utilization, one can establish measures of instantaneous error rates to capture additional network micro-state indication. For example, the ratio of input errors to packets delivered can be evaluated as

$$E(t) = \frac{\mathrm{ifInErrors}}{\mathrm{ifInUcastPkts} + \mathrm{ifInNUcastPkts}}.$$

Again, only the derivative $e(t) = E'(t)$ is of interest.

A health function can be used to linearly aggregate these micro-measures of local network state. For example, the state of the network seen by a hub, including multiple interfaces, can be indexed by a health function

$$H(\vec{e}, \vec{u}) = \vec{A} \cdot \vec{e} + \vec{B} \cdot \vec{u}$$

where $\vec{e}$ and $\vec{u}$ represent the evaluation of $e$ and $u$ at all managed entities and $\vec{A}$ and $\vec{B}$ are weight vectors. This health function can provide a useful aggregate measure of the network state as viewed by the hub. By compressing information at the hub, the tasks of management centers and operators could be significantly simplified and management could be scalable over a wide spectrum of network sizes and complexity. Contrast this with management which requires continuous polling of vast MIB information into centers, and its manual interpretation.

Unfortunately, current network management paradigms do not support the flexibility and decentralization required to compute health functions effectively. A health function could not be usefully incorporated as part of a static MIB design. The specific function used and its parameters may vary among installations, device configurations and even time of the day. Similarly, a health function could not be usefully incorporated as part of an OSI managed object.
While the OSI permits encapsulation of functions within managed objects and their remote invocation by managing entities, these functions must be statically bound to a managed object at its design time. Again, designers of managed objects may not usefully provide configuration/installation/site/time-independent health functions.

Nor can a health function be usefully computed by managers at centralized management platforms. First, the rates of polling required to aggregate the variables used may far exceed platform processing capability. In the example above, suppose a device is polled every 0.1 seconds. Suppose too,
that the total number of interfaces aggregated by health functions is $n$ (say $n = 200$). The aggregated polling rate is then $10n$ (e.g., 2000) SNMP requests per second. Second, the very goal of compression is to reduce data volume at the source. Third, polling through the network introduces random perturbations in approximating temporal derivatives of managed variables, leading to errors and potential hazards in decisions.

To resolve the difficulty, this paper proposes the use of the management by delegation paradigm [YGY91] to compute health functions. Delegating health functions to distributed agents enables compression of data at the source. It permits flexible changes in health functions to reflect specific network behaviors at different sites and times. It allows the computation of health functions to achieve great precision through direct, minimal-delay access to observations of operational values. It improves scalability of management by reducing the rates of polling needed and restricting polling mostly to times when problems are identified via aggregated health measures. It permits local evaluation of health functions and decisions by device agents when management platforms have difficulties in accessing devices (during critical stress times). A prototype implementation of a health function application over a delegation platform, MAD [GY91], is described.

The rest of this paper is organized as follows. In Section 2 we introduce a notation to describe the behavior of managed objects, their observations by manager applications, and management decisions. Section 3 describes problems related to derivation of MIBs, Section 4 provides an example of using observation operators for health functions, Section 5 describes a prototype health application, and we conclude the paper in Section 6.
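The health function $H(\vec{e}, \vec{u})$ introduced in this section can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the names `finite_difference` and `health`, the weights, and the sample rates are all hypothetical.

```python
def finite_difference(prev_value, prev_time, curr_value, curr_time):
    """Approximate a derivative such as u(t) = U'(t) from two samples."""
    return (curr_value - prev_value) / (curr_time - prev_time)


def health(error_rates, utilization_rates, a_weights, b_weights):
    """Evaluate H(e, u) = A . e + B . u over all managed interfaces."""
    if not (len(error_rates) == len(a_weights)
            and len(utilization_rates) == len(b_weights)):
        raise ValueError("each weight vector must match its rate vector")
    return (sum(a * e for a, e in zip(a_weights, error_rates))
            + sum(b * u for b, u in zip(b_weights, utilization_rates)))


# Hypothetical rates for a hub with two interfaces.
e = [0.5, 0.25]   # instantaneous error rates e(t)
u = [2.0, 4.0]    # instantaneous utilization rates u(t)
A = [4.0, 4.0]    # weights on error rates (assumed values)
B = [0.5, 0.25]   # weights on utilization rates (assumed values)
print(health(e, u, A, B))
```

Evaluated at the device, this reduces the whole set of per-interface variables to one number, so a manager only needs to retrieve the index rather than poll every variable.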
2 Behaviors and observations

Let $x$ denote a managed entity (variable), $t_x(n)$ denote the time at which the $n$-th change in the state of $x$ occurs through a sample computation/communication involving $x$, and $x(n)$ denote the value of $x$ after the $n$-th change. A sample behavior of $x$ is a sequence $X = \{[t_x(n), x(n)]\}_n$. To associate behavior values with a given time, define $n_x(t) = \max\{n \mid t_x(n) \le t\}$ as the most recent event occurrence prior to time $t$. For example, $x$ may denote an attribute of a PDU (Protocol Data Unit) received at an interface, and $n$ counts PDU arrival events. In the case of IP
frames, for instance, $X_{IP}(n)$ can be the number of data octets in the $n$-th frame, while $n_x(\tau)$ is the number of IP frames arriving by time $\tau$.

Observations of sample behaviors compute some function over the history of the behavior. For example, the total number of data octets arriving via IP integrates the sample behavior $X_{IP}$ up to a given point in time. Let us denote the initial history of a sample behavior $X$ up to time $t$ as $X^t = \{[t_x(n), x(n)]\}_{t_x(n) \le t}$. An observation of a managed variable is a computable functional $F[X^t, t] = y$, where $y$ is the value observed. An observation process is given by a mapping

$$F : \{[t_x(n), x(n)]\} \mapsto \{[t_y(n), y(n)]\}, \qquad \text{where } y(n) = F[X^{t_y(n)}, t_y(n)].$$

Sample behaviors are often observed via counters. A counter of $X$ may be defined as

$$C[X^t, t] = \sum_{j \le n_x(t)} x(j).$$

The MIB-II variable ipInUnknownProtos, for instance, counts (modulo $2^{32}$) "the number of locally-addressed datagrams received successfully but discarded because of an unknown or unsupported protocol" [MR91]. Thus, if $u(x(n))$ is defined as 1 when the protocol is unsupported or unknown, and 0 otherwise, then

$$\mathrm{ipInUnknownProtos} = \sum_{n} u(x(n)).$$
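The counter observation $C[X^t, t]$ can be sketched directly from its definition. The representation of a behavior as a list of (time, value) pairs, the function names, and the protocol numbers used below are assumptions made for illustration only.

```python
COUNTER_MODULUS = 2 ** 32  # MIB-II counters wrap modulo 2^32


def is_unknown_protocol(protocol, supported):
    """Indicator u(x(n)): 1 if the protocol is unknown or unsupported, else 0."""
    return 0 if protocol in supported else 1


def counter(behavior, t, indicator=lambda value: value):
    """Observe C[X^t, t]: sum indicator(x(j)) over all events with t_x(j) <= t.

    `behavior` is a list of (time, value) pairs ordered by time.
    """
    total = sum(indicator(value) for time, value in behavior if time <= t)
    return total % COUNTER_MODULUS


# Hypothetical behavior: (arrival time, protocol number) per datagram.
datagrams = [(0.1, 6), (0.4, 99), (0.9, 17), (1.5, 200)]
supported = {1, 6, 17}
unknown_by_1s = counter(
    datagrams, 1.0, lambda p: is_unknown_protocol(p, supported))
print(unknown_by_1s)
```

With the default identity indicator the same function sums the raw values, which corresponds to an octet counter when each value is a frame length.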
Alternatively, define $Y = CX$ recursively via

$$t_y(n) = t_x(n), \qquad y(n+1) = y(n) + x(n+1).$$

The first part means that the counter is updated (observed) at times when $x$ changes, and the second part defines the counter values recursively. In a similar manner, derivatives of behaviors may be defined recursively. As an example, let us define a moving average observation over a window (number of events) of length $k$:
$Y = E_k X$ is defined by

$$t_y(n) = t_x(n), \qquad y(n) = \frac{1}{k} \sum_{n-k < j \le n} x(j).$$

$$H = \begin{cases} h_1(x_1, x_2) & e_1 \le x_1 < e_2 \\ h_2(x_1, x_2) & e_2 \le x_1 < e_3 \\ h_3(x_1, x_2) & e_3 \le x_1 \end{cases}$$

Error rates lower than $e_1$ are considered insignificant. An error rate in the range defined by $e_1$ and $e_2$, for example, is considered significant if the utilization rate is lower than $h_1(x_1, x_2)$. The domain where faults are indicated is bounded by 3 linear functions as depicted in Figure 1. The fault indication region is defined by the intersection of the half-planes decided by each of these health functions. In other words, if $h \equiv (h_1, h_2, h_3)$, the fault indication domain is $h^+ = \{\vec{x} \mid h(\vec{x}) \ge \vec{0}\}$.

Sometimes, management actions need to be invoked only when unhealthy behavior is sustained for a while, or when it is repeated intermittently. For example, when operational mechanisms provide temporary relief from sustained problems, intermittent problem indications may arise. To avoid spurious alerts, threshold excess must be sustained over a sufficiently long time window. For example, a hysteresis mechanism should be implemented to limit the generation of alarms [Wal91]. If the observed behavior fluctuates, an alert should not be generated. These problem indicators may be captured by appropriate observation operators applied to the output of the $H$ observation. Sustained problems, such as unhealthy behavior for a period of duration $\tau$, may be detected by $Z = PY$, where

$$t_z(n) = t_y(n), \qquad z(n) = \begin{cases} 1 & \text{if } y(j) = 1 \text{ for } n_y(t_y(n) - \tau) \le j \le n \\ 0 & \text{otherwise.} \end{cases}$$

Applying then $Z = PHX$ will provide observations of sustained unhealthy behaviors. Intermittent health problems may similarly be detected.
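The moving average and sustained-problem operators can be sketched as plain functions over event sequences. This is a hedged illustration, not the paper's implementation: the function names are hypothetical, the moving average assumes that events before the first one count as zero, and the sustained operator assumes $z(n) = 0$ until a full period of duration $\tau$ has been observed (the case in which $n_y(t_y(n) - \tau)$ is undefined).

```python
def moving_average(values, k):
    """E_k: y(n) = (1/k) * sum of x(j) for n-k < j <= n.

    Assumption: events before the first one count as zero, so the first
    k-1 outputs average over a partially filled window.
    """
    if k <= 0:
        raise ValueError("window length k must be positive")
    result = []
    window_sum = 0.0
    for n, value in enumerate(values):
        window_sum += value
        if n >= k:
            window_sum -= values[n - k]  # drop the event leaving the window
        result.append(window_sum / k)
    return result


def sustained(indications, tau):
    """P: z(n) = 1 if y(j) = 1 for every j with n_y(t_y(n) - tau) <= j <= n.

    `indications` is a list of (time, y) pairs ordered by time, y in {0, 1}.
    Assumption: z(n) = 0 while less than tau time units have been observed.
    """
    result = []
    for n, (t_n, _) in enumerate(indications):
        # n_y(t_n - tau): most recent event at or before time t_n - tau.
        start = None
        for j in range(n + 1):
            if indications[j][0] <= t_n - tau:
                start = j
        if start is None:
            z = 0
        else:
            z = int(all(y == 1 for _, y in indications[start:n + 1]))
        result.append((t_n, z))
    return result


# Hypothetical unhealthy-indication sequence sampled once per time unit.
y = [(0, 1), (1, 1), (2, 1), (3, 0), (4, 1), (5, 1), (6, 1)]
print(sustained(y, tau=2))
```

A single healthy observation resets the sustained indication, so a fluctuating behavior does not raise an alert; the quadratic scan is kept for clarity and could be replaced by a sliding index.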
Standard network management approaches require a priori knowledge of what algorithms are mapped into statically defined objects. For example, the OSI Workload Monitoring Function [ISO91] specifies metric objects for determining resource performance and utilization. The parameters that define what constitutes a healthy network depend on the particular installation configuration, usage, and administrative policies. These parameters vary among different networks, and during different times within the same network. Therefore, health functions cannot be statically defined, but should be dynamically bound to agents when needed. Centralized, platform-based management is unsuitable for computing health functions due to its performance limitations and the inaccuracies introduced by polling. Thus, the evaluation of health functions must be distributed. Consequently, we propose the use of the management by delegation paradigm [YGY91] to compute health functions. The application of this paradigm for the definition and evaluation of health functions is described in the following section.
5 Health of a network

Delegating health functions to distributed agents enables direct observation of network behaviors at sufficient precision. Health functions may be dynamically changed to reflect varying behavior patterns at different times. By maintaining health indicators locally, vast amounts of real-time data can be significantly compressed.

To study the application of management by delegation to evaluate management decisions, a simple health application for an Ethernet has been prototyped [AG92]. Figure 2 depicts the relationships between the components of the application. It consists of manager processes, a health process, and a collection of observers. Manager processes can dynamically reconfigure the distributed application. They receive evaluation reports and present them on a graphical user interface to the operator. The observers compute observations such as operational state, interface utilization, rate of collisions and interface errors. They also perform corrective actions. The health process receives reports from the observers and evaluates a higher level abstraction ($H$) of the state of the network. A manager process can specify (and later modify) a list of observers with
which health will communicate, and the relative weights of their evaluations in the overall score or index. The generic health object performs a scalar product using a vector of weights $\vec{w}$ and a vector of observation functions $f_i(\vec{x})$. Fine-tuning $\vec{w}$ and the $f_i$ is an interactive process that takes advantage of the dynamic reconfiguration capability of MAD [GY91]. Health may report its evaluated index to managers by answering an explicit request, by setting a private MIB variable, or as event reports.

Figure 2: The Components of the Health application. [Figure labels: Manager, Delegation Protocol, MAD Agent, Health, Observers, SNMP, SNMP agents, OS, Hosts.]

Delegation enables managers to replace the code of any object, to instantiate and kill process instances, and to configure their communication. Diagnostic procedures and corrective actions can be dynamically delegated to be executed when necessary, without manager intervention. For example, if the level of Ethernet utilization becomes too high, an observer process may automatically (and temporarily) disconnect from the network the device which is the source of the largest number of packets.

Applying dynamic delegation of health functions resulted in several gains. Management decisions, such as to temporarily disconnect a device, are executed efficiently, without the need for manager platform intervention. Real-time operational data is effectively compressed at the delegation agent probe. Managers can define observation operators tailored to their changing needs.
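The generic health object described in this section, a scalar product of weights and observation functions that a manager can reconfigure at run time, can be sketched as follows. The class `Health`, its method names, and the observation values are hypothetical and do not reflect the actual MAD interfaces.

```python
class Health:
    """Weighted index over a reconfigurable set of observers (a sketch)."""

    def __init__(self):
        self._observers = {}  # name -> (weight, observation function)

    def set_observer(self, name, weight, function):
        """Add an observer, or replace its weight and function at run time."""
        self._observers[name] = (weight, function)

    def remove_observer(self, name):
        """Stop including the named observer in the index."""
        self._observers.pop(name, None)

    def evaluate(self, x):
        """Return the scalar product sum_i w_i * f_i(x)."""
        return sum(w * f(x) for w, f in self._observers.values())


# Hypothetical observations reported for an Ethernet segment.
x = {"utilization": 0.5, "collisions": 0.25}

health = Health()
health.set_observer("utilization", 2.0, lambda obs: obs["utilization"])
health.set_observer("collisions", 4.0, lambda obs: obs["collisions"])
print(health.evaluate(x))

# A manager later re-weights collisions without restarting anything.
health.set_observer("collisions", 8.0, lambda obs: obs["collisions"])
print(health.evaluate(x))
```

Keeping weights and functions in a table that can be changed while the process runs mirrors the interactive tuning described above; in the prototype that role is played by delegation rather than by local method calls.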
6 Conclusions

Network management applications must assist managers in making effective decisions. Health functions provide an efficient method to overcome the volume and complexity of data which characterize large heterogeneous distributed systems. Management decisions can be obtained by defining families of linear functions via observation operators. The definition of what constitutes a healthy network cannot be standardized or fixed for all networks, since it is installation and time dependent. Thus, it is not sufficient to provide fixed definitions as part of specific programs or MIB variables.

Current standards define observation and collection of MIB data independently of its use. Since the need for MIB data cannot be predicted, many observations are collected and stored but never used. This approach wastes management resources. In contrast, management by delegation supports a dynamic approach to defining observations. This dynamicity permits applications to configure observation processes to flexibly monitor the information of interest to them.
References

[AG92] Cristina Aurrecoechea and German Goldszmidt. Evaluating the Health of a Distributed Environment. Unpublished Report, Columbia University, 1992.

[CFSD90] Jeffrey D. Case, Mark S. Fedor, Martin L. Schoffstall, and James R. Davin. A Simple Network Management Protocol (SNMP). RFC 1157, May 1990. DDN Network Information Center, SRI International.

[DAR88] DARPA. Neural Network Study. AFCEA International Press, Fairfax, Virginia, November 1988.

[DH73] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, New York, 1973.

[GY91] German Goldszmidt and Yechiam Yemini. The Design of a Management Delegation Engine. In Proceedings of the IFIP/IEEE International Workshop on Distributed Systems: Operations and Management, Santa Barbara, CA, October 1991.

[ISO91] International Standards Organization - ISO. Information Processing - Open System Interconnection - Systems Management - Part 11: Workload Monitoring Function. Sydney, Australia, December 1991.

[LF93] Allan Leinwand and Karen Fang. Network Management: A Practical Perspective. Addison-Wesley, 1993.

[MKB91] B. N. Meandzija, K. W. Kappel, and P. J. Brusil. Integrated Network Management and The International Symposia. In The Second International Symposium on Integrated Network Management, Washington, DC, April 1991.

[MR91] K. McCloghrie and M. Rose. Management Information Base for Network Management of TCP/IP-based internets: MIB-II. RFC 1213, March 1991.

[Ros91] Marshall T. Rose. The Simple Book: An Introduction to Management of TCP/IP-based Internets. Prentice Hall, 1991.

[Wal91] S. Waldbusser. Remote Network Monitoring Management Information Base. RFC 1271, November 1991.

[YGY91] Yechiam Yemini, German Goldszmidt, and Shaula Yemini. Network Management by Delegation. In The Second International Symposium on Integrated Network Management, Washington, DC, April 1991.