Shedding Light on Enterprise Network Failures using Spotlight

Dipu John, Pawan Prakash, Ramana Rao Kompella
Purdue University, West Lafayette, IN 47907

Ranveer Chandra
Microsoft Research, Redmond, WA 98052
Abstract—Fault localization in enterprise networks is extremely challenging. A recent approach called Sherlock makes some headway into this problem by using an inference algorithm over a multi-tier probabilistic dependency graph that relates fault symptoms with possible root causes (e.g., routers, servers). A key limitation of Sherlock is its scalability, because it uses complicated inference algorithms based on Bayesian networks. We present a fault localization system called Spotlight that rests on two basic ideas. First, it compresses a multi-tier dependency graph into a bipartite graph with direct probabilistic edges between root causes and symptoms. Second, it runs a novel weighted greedy minimum set cover algorithm to provide fast inference. Through extensive simulations with real service dependency graphs and enterprise network topologies reported previously in the literature, we show that Spotlight is about 100× faster than Sherlock in typical settings, with comparable diagnosis accuracy.

Index Terms—fault localization; enterprise networks; dependency graphs;
I. INTRODUCTION

Today's enterprise network services are quite complex. A simple-seeming activity such as an employee accessing a Web portal often involves complex interactions between several individual components: a Web server, an SQL server for database queries, a DNS server for resolving different domain names, and so on. Further, these interactions are carried over a network substrate that comprises several routers, switches, links, and other network elements. Given the large number of components involved and the complexity of the interactions among them, today's network services are often prone to failures and performance degradations. Enterprise IT staff are tasked with managing the performance and minimizing the downtime of these critical services that affect employee productivity.

Typically, failures in enterprise networks are detected by monitoring the response times of various services: a service taking too long to respond indicates a possible failure. Alternatively, users accessing a particular service may notify the IT administrators when they observe unacceptable response times. The IT administrators are then tasked with the hard problem of isolating the root cause of the failure and taking the corresponding repair actions. While some amount of "self-healing" is automatically enabled using specialized controllers that detect failures and switch to redundant servers, there exist several failure modes that are difficult to detect automatically. In such cases, the administrators resort
to manual techniques, poring through logs or other debug information that individual components output, in order to isolate the root causes. Such a task is both costly (maintaining a large staff incurs severe operating expenses) and time consuming; hence the need for automated tools for diagnosing failures in enterprise networks.

Given the importance of the problem, several commercial tools exist for managing enterprise networks (e.g., SMARTS [1], HP OpenView [2], IBM Tivoli [3], Microsoft Operations Manager [4]). While they help IT administrators to some extent, these tools monitor each server and router as a black box, which makes it hard to isolate end-to-end failures that may not be triggered as SNMP alerts. A recent system called Sherlock [5] addresses this problem with the help of service-level dependency graphs (SLDGs) and a Bayesian inference algorithm called Ferret for fault diagnosis. It creates a multi-tier inference graph from the failure reports of individual clients accessing various services, the SLDG of each service describing the set of servers it (probabilistically) depends on, and the network topology modeling the set of routers and links used to exchange messages between these servers. While Ferret achieves good accuracy for many failure conditions, it relies on computing a score for a large number of failure assignments and thus does not scale to large graphs.

In this paper, we propose a new algorithm called Spotlight that achieves accuracy similar to Ferret's while consuming significantly less time to identify the root causes of failures. Spotlight relies on two basic steps. First, it compresses a multi-tier dependency graph into a bipartite graph with probabilistic relationships between root causes (servers, routers, etc.) and symptoms (client reports). Second, based on the set of symptoms observed, Spotlight runs a weighted minimum set cover on the bipartite graph to output a hypothesis set of root causes. Minimum set cover is NP-hard in the general case, so Spotlight uses a novel weighted greedy approximation to output the hypothesis set.

We evaluate the accuracy and scalability of Spotlight using simulations on realistic enterprise network topologies and real SLDGs described in prior work [5], [6]. Our results indicate that Spotlight is extremely accurate (98% accuracy for one failure) and outputs a diagnosis extremely fast (less than 1 second for graphs of 5,000 nodes). In comparison, Ferret
achieves similar accuracy but takes a significant amount of time (45 minutes for the same 5,000-node graph). Even for graphs of 100,000 nodes, Spotlight outputs a diagnosis in as little as 10 minutes, while Ferret takes at least four orders of magnitude longer. Thus, Spotlight significantly increases the scalability of Sherlock-like systems that rely on service-level dependency graphs and network topologies for fault localization.

The rest of the paper is organized as follows. First, we review the background and related work in Section II. In Section III, we give an overview of Spotlight, including its inputs and how the hypothesis is generated from them. Section IV details the evaluation methodology, including the graphs and topologies we used in our simulations, along with the evaluation results.

II. BACKGROUND AND RELATED WORK
Modern enterprise services accessed by end users depend on numerous other services, potentially spanning many different servers and network components. Thus, a failure in any of these components, whether servers or network elements, may degrade user performance and seriously affect employee productivity. Given client-observed performance degradation, the essential problem of fault localization is to locate the root cause of the problem. For the majority of failures, fault localization is the most time-consuming task for IT staff. Typically, once the location of the problem is narrowed down, downtime can be reduced easily, for instance by removing the failed component, or by rebooting or replacing the server.

While fault localization is a well-studied research topic, it is typically domain-dependent. Depending on the level of intrusiveness required, several different types of fault localization and detection systems have been proposed in the literature. Sophisticated commercial tools, such as EMC's SMARTS [1], HP OpenView [2], IBM Tivoli [3], and MAM [7], provide powerful generic frameworks for correlating alarms from standardized alerts generated by individual servers through SNMP [8]. Techniques such as Magpie [9], FUSE [10], and Pinpoint [11] track requests that flow through the system by instrumenting middleware on end hosts; the components with failed requests are then correlated to diagnose the source of failure. X-Trace is a cross-layer, cross-application framework for tracing the network operations resulting from a particular task [12]. It assumes a common instrumented platform for applications and hence is not applicable in enterprise networks with many different kinds of applications, potentially developed by different vendors. The difficulty with many of these systems is that they require special instrumentation in the application itself, which makes them less general.

In other domains, too, specific solutions have been proposed for fault localization. For instance, SCORE [13] and Shrink [14] diagnose cross-layer failures in the context of IP link failures resulting from optical component failures. SCORE represents the dependencies that exist between root causes and symptoms in the form of a bipartite
graph and models the inference problem as finding the minimum set cover of this graph. Our setting differs from SCORE's in two ways. First, we require multi-tier dependency graphs to represent the dependencies that arise naturally in enterprise networks. Second, the dependencies in our setting are probabilistic, whereas in SCORE the dependencies are deterministic (i.e., the failure of a particular entity will always cause the symptom to be observed). Shrink is also designed to operate on bipartite graphs, but assumes different components have different failure probabilities. Shrink's inference algorithm falls under a general class of Bayesian inference algorithms that includes [15], [16]. The main problem with these approaches is that they do not scale well to large inference graphs such as those we consider in this paper. Besides, the failure rates of individual components may not be known a priori.

In the context of enterprise services, the two state-of-the-art systems proposed in the literature are NetMedic [17] and Sherlock [5]. NetMedic, the more recent of the two, requires instrumenting and polling statistics at a per-process level and thus can provide detailed diagnosis capabilities; it is, therefore, more suitable for small enterprise or home networks. Sherlock, on the other hand, does not require per-process statistics and needs minimal instrumentation at servers. While Sherlock scales better than NetMedic, its use of a Bayesian-network-based inference algorithm called Ferret makes it hard to scale to the large enterprise network services we target in this paper. Our solution, Spotlight, operates in a similar setting to Sherlock and follows its general approach, but addresses Sherlock's scalability issues with the help of novel inference algorithms. We describe Spotlight in greater detail next.

III. SPOTLIGHT DESIGN

Spotlight consists of three major components for fault localization, as shown in Figure 1. First, it uses service-level dependency graphs (SLDGs) that codify the high-level relationships between the different servers that together provide a service. Coupled with the network topology, it creates an inference graph: a probabilistic dependency graph that connects root causes to observable symptoms, much more detailed than the SLDGs because it also takes the underlying network topology into account. This component is exactly the same as that used in Sherlock [5]. Second, it uses an inference algorithm that first compresses the multi-tier inference graph into a bipartite graph to simplify the multiple layers of dependencies. Third, it applies a novel weighted greedy min-set cover algorithm (WGMSC) to quickly output a set of candidate root causes that best explains the observed symptoms. We explain these components next.

A. Multi-tier inference graph

Spotlight creates probabilistic multi-tier inference graphs from SLDGs and the underlying network topology to represent the dependencies that exist between root causes and observable
Fig. 2. Service Level Dependency Graph for clients accessing the Web portal.

[Figure 1 shows the Spotlight pipeline: the Service Dependency Graphs and Network Topology form a Multi-tier Inference Graph; Graph Compression turns it into a Bi-partite Graph, which, together with the Observed Symptoms, is fed to the Weighted Greedy Min Set Cover (Spotlight) algorithm to produce the Hypothesis Set.]
Fig. 1. Different components in Spotlight. The components shown enclosed within the dashed box represent the novel components employed by Spotlight for scalable fault localization. We assume the same approach used in Sherlock to form the multi-tier inference graph which is input to Spotlight.
symptoms. In Spotlight, we use the same technique as Sherlock for constructing this inference graph. We provide a brief overview to introduce the terminology; for a more thorough description, please refer to [5].

Service Level Dependency Graphs. One service is said to be dependent on another if the operation of the former relies on the proper operation of the latter. The dependencies between the services supporting an application can be represented graphically as shown in Figure 2. We highlight two important characteristics of these dependency graphs. First, the dependencies are inherently multi-level. In Figure 2, a client accessing a Web portal is dependent on the portal, which in turn is dependent on the search indexer and the search server, and the search indexer is dependent on the server backend. Second, all the services are only 'loosely' dependent on each other: overprovisioning in the network, load balancing, caching, etc., make these dependencies highly probabilistic. For example, a client trying to access a Web server may skip the DNS lookup if the IP address of the Web server is present in its cache. The weight on an edge in the graph represents the probability that one service depends on the other.

SLDGs are computed in Sherlock as follows [5]. If accessing a service B depends on a service A, then a request to service B is likely to involve an exchange of packets with both A and B.
Sherlock leverages this observation and computes the dependency probability of a host on service A when accessing service B as the fraction of times in the packet trace that an access to service B is preceded by an access to service A within a small time interval (the dependency interval). It aggregates this information collected across the network to construct a service-level dependency graph. A relation A → B denotes that service B depends on service A with the probability shown on the edge connecting the services. For example, in the dependency graph of Figure 2, if the DNS server fails, then a client trying to access the Web portal would experience a failure with probability 0.05. Automated ways of obtaining SLDGs have been proposed in [6].
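To make the co-occurrence computation concrete, the sketch below estimates a dependency probability from a time-sorted packet trace. The trace format, the service names, and the 10 ms dependency interval are illustrative assumptions, not values from the paper:

```python
# Hedged sketch: estimate the weight of edge A -> B as the fraction of
# accesses to service B preceded by an access to service A within the
# dependency interval.

def dependency_probability(trace, a, b, interval=0.010):
    """trace: time-sorted list of (timestamp_seconds, service) events
    seen at one host. Returns the fraction of accesses to service b
    that were preceded by an access to service a within `interval`."""
    b_accesses = 0
    preceded_by_a = 0
    for i, (t, svc) in enumerate(trace):
        if svc != b:
            continue
        b_accesses += 1
        j = i - 1
        # Scan backwards while still inside the dependency interval.
        while j >= 0 and t - trace[j][0] <= interval:
            if trace[j][1] == a:
                preceded_by_a += 1
                break
            j -= 1
    return preceded_by_a / b_accesses if b_accesses else 0.0

# Hypothetical trace: the first Web portal access does a DNS lookup,
# the second hits the local DNS cache, giving an estimated weight of 0.5.
trace = [(0.000, "DNS"), (0.003, "WebPortal"), (0.950, "WebPortal")]
print(dependency_probability(trace, "DNS", "WebPortal"))  # 0.5
```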
Fig. 3. Inference graph showing a client accessing the file server.

Inference graph. The inference graph is a labeled directed graph representing dependencies among the components of a network, where the components include both services and hardware. The service dependencies (from the SLDGs) and the network topology are combined into a unified, comprehensive inference graph that can be used for fault localization. There are three kinds of nodes in the inference graph:
• Root-cause nodes correspond to the physical components in the network whose failure can affect application performance, resulting in either performance degradation or a failure. A root-cause node can represent a host (a machine with an IP address), a service, a router, or an IP link.
• Observation nodes model user experience. The perception of application performance, i.e., whether an application is functioning normally or experiencing performance degradation, is represented using observation nodes. An observation node corresponds to a (client, service) pair, so a client accessing two different services, or the same service accessed from two different clients, contributes two different observation nodes to the inference graph.
• Action nodes form the bulk of the inference graph and act as glue between the root-cause nodes and the observation nodes. They represent the actions that services in the network must take to satisfy a user's request.
An inference graph is constructed as follows. Suppose a client C accesses a service S. An observation node representing the client report and a root-cause node representing the host on which the service runs are added to the inference graph.
Then, an action node representing the action of client C accessing service S is added to the inference graph. One more action node is added to represent the path between the client C and the server hosting service S; it is added as a parent of the action node representing the client accessing the server, meaning that the action (client accessing server) depends not only on the server but also on the path between the client and the server. All the routers and links lying on this path are added as root-cause nodes and connected as parents of the path action node, meaning that the path action node depends on all the routers and links on the path. We then examine the SLDG corresponding to S and find the set of services D_S (e.g., DNS, proxy) on which C depends while accessing service S. For each of these services, we create root-cause nodes representing the hosts on which those services run and the corresponding action nodes connecting these root-cause nodes to the observation node. New action nodes representing the paths, and root-cause nodes representing routers and links, are added as described above. The algorithm then recurses, expanding each service in D_S and identifying the new set of root-cause nodes and action nodes. The same technique is applied to all (client, service) pairs to build a comprehensive inference graph for the enterprise network. Every server, router, and IP link is represented by a unique root-cause node in the inference graph.

Figure 3 shows a portion of the inference graph when a client tries to access the file server. The experience of the user accessing the file server is depicted as an observation node, which has 8 action nodes as its parents. The action nodes represent the different operations that must be performed to access the file server; they have root causes, corresponding to hosts running specific servers such as DNS and WINS, as parents. Also note the path action nodes characterizing the routes between the various hosts. The graph makes clear how the failure of any root-cause node can percolate through the action nodes to affect the file-server application.
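As a rough illustration of this recursive construction, the sketch below builds the edge set of an inference graph from (client, service) pairs, per-service SLDGs, and a routing lookup. The tuple-based node encodings, the data shapes, and the `path` helper are hypothetical; this is a simplified reading of the procedure above, not Sherlock's implementation:

```python
# Hedged sketch of the recursive inference-graph construction.

def build_inference_graph(pairs, sldg, path):
    """pairs: iterable of (client, service); sldg: service ->
    [(dependency_service, probability)]; path(c, s): routers/links on
    the route between client c and the host of s (assumed helper).
    Returns a list of (parent, child, probability) inference-graph edges."""
    edges = []

    def expand(client, service, obs, prob, visited):
        if service in visited:          # SLDGs are DAGs; guard anyway
            return
        visited.add(service)
        access = ("action", client, "access", service)
        edges.append((("host", service), access, 1.0))  # server root cause
        edges.append((access, obs, prob))               # action feeds the report
        route = ("action", client, "path", service)     # path action node
        edges.append((route, access, 1.0))
        for element in path(client, service):           # router/link root causes
            edges.append((("netelem", element), route, 1.0))
        for dep, p in sldg.get(service, []):            # recurse over the SLDG
            expand(client, dep, obs, p, visited)

    for client, service in pairs:
        expand(client, service, ("obs", client, service), 1.0, set())
    return edges
```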
B. Graph compression

The inference graph computed above can be quite large for common enterprise networks, especially when hosts are situated several hops away from data centers and services grow more complex, increasing the number of intermediate nodes in the multi-tier inference graph. The size of the inference graph directly influences the time needed to output a hypothesis, which is a serious limitation of existing systems such as Sherlock. However, since the dependencies are naturally represented in a multi-tier fashion, it is not easy to construct a simpler dependency graph directly. In Spotlight, we start with a multi-tier dependency graph to capture the dependencies, but compress it into a bipartite graph with direct probabilistic edges between root causes and observed symptoms. This simpler representation, as we shall see, makes it easy to apply the minimum set cover algorithm described in Section III-C, which scales better than direct Bayesian inference over the multi-tier dependency graph.
Since our multi-tier inference graph is a directed acyclic graph with edge weights representing a probability distribution, we could in principle leverage the properties of Bayesian inference to compress the multi-tier inference graph into a bipartite graph. While a Bayesian approach to compression preserves all the dependency information in the original multi-tier graph, it is computationally very demanding: in our evaluation, we found the compression time for an inference graph with just 50 nodes to be on the order of a few minutes. The time taken for multi-tier graph compression can thus become the bottleneck in the fault-diagnosis process. If the compressed graph could be reused across inferences, the compression cost would be a small overhead amortized over several actual inferences. Unfortunately, the observation set differs across inference instances, since the actual failures may differ. This means that the compression cost is not easily amortized over multiple inferences, and the cost of performing the compression is almost equivalent to directly performing Bayesian inference over the original graph (similar to that used by Sherlock). We therefore need a more scalable compression technique that retains the properties of the original multi-tier graph as well as possible.

Algorithm 1 Graph Compression Algorithm
⊲ I: input inference graph
⊲ R: set of root causes
⊲ O: set of all observations (symptoms)
⊲ A[m][n]: output edge-strength matrix with m root causes and n observations (symptoms) in the bipartite graph; initially A[i][j] = 0 for all i, j
⊲ wt(a, b): dependency probability on the edge from node a to node b in the inference graph
⊲ child(a): set of children of node a in the inference graph
procedure GRAPHCOMPRESSION
  for r ∈ R do
    for c ∈ child(r) do
      FINDOBSERVATION(c, r, wt(r, c))
    end for
  end for
end procedure
procedure FINDOBSERVATION(a, r, x)
  for b ∈ child(a) do
    if b ∈ O then
      A[r][b] = max{A[r][b], wt(a, b) × x}
    else
      FINDOBSERVATION(b, r, x × wt(a, b))
    end if
  end for
end procedure
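A direct rendering of Algorithm 1 in Python may help; the dictionary-based graph encoding and the intermediate node names are ours, not the paper's:

```python
# Hedged sketch of Algorithm 1: depth-first search from each root cause,
# keeping for each observation the maximum path-probability product.

def compress(children, root_causes, observations):
    """children: node -> list of (child, edge_probability). Returns the
    edge-strength matrix A as A[r][o] = max, over all r -> o paths, of
    the product of edge probabilities along the path."""
    A = {r: {} for r in root_causes}

    def find_observation(node, r, x):
        # x is the probability product accumulated along the current path.
        for child, p in children.get(node, []):
            if child in observations:
                A[r][child] = max(A[r].get(child, 0.0), x * p)
            else:
                find_observation(child, r, x * p)

    for r in root_causes:
        find_observation(r, r, 1.0)
    return A

# The two-path case of Figure 4 (intermediate node names are made up):
# R1 reaches O2 via 0.3*0.08 = 0.024 or 0.1*0.2 = 0.02; the max wins.
children = {
    "R1": [("A1", 0.2), ("A2", 0.3), ("A3", 0.1)],
    "A1": [("O1", 0.12)],
    "A2": [("O2", 0.08)],
    "A3": [("O2", 0.2)],
}
print(compress(children, {"R1"}, {"O1", "O2"}))
# -> {'R1': {'O1': ~0.024, 'O2': ~0.024}} (up to float rounding)
```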
Fig. 4. Example showing the compression of a multi-tier dependency graph to a bi-partite one.
Approximate compression. We propose the approximate compression algorithm shown in Algorithm 1, which works as follows. It starts a depth-first search from each root-cause node toward the observation nodes in the inference graph. The probabilistic dependence between a root cause and an observation is computed as the product of the probability values of the edges on the path from the root cause to the observation node. In case of multiple paths, we take the maximum of all the computed values. Intuitively, this makes more sense than, say, adding the individual path probabilities: each path represents one possible way for a root cause to influence an observation, so choosing the most probable path is a sufficient approximation. This heuristic, as we shall show in our evaluation, appears to work well.

Figure 4 illustrates this technique with an example. The dependency probability between R1 and O1 is the product 0.2 × 0.12 = 0.024. Note that R1 has two different paths to O2, with products 0.3 × 0.08 = 0.024 and 0.1 × 0.2 = 0.02; we select the higher value, 0.024.

The run-time complexity of this compression algorithm is of the order of the number of edges in the graph, as the algorithm replaces each path from a root cause to an observation node with a direct edge between them. Let V be the set of nodes in the inference graph and E the set of edges. For a complete graph, |E| = O(|V|^2), but the inference graphs we consider are relatively sparse, with |E| = O(|V|). So for most practical graphs, the time taken for compression is almost linear in the number of nodes in the multi-tier graph. We evaluate the compression time empirically in Section IV.

C. Weighted greedy minimum set cover (WGMSC)

Compression of the multi-tier inference graph yields a bipartite graph consisting of root causes R and observation nodes O. Let S denote the set of symptoms actually observed (S ⊆ O), i.e., the observation nodes that failed. The problem is to identify the most probable hypothesis H ⊆ R that explains S, i.e., such that every member of S is explained by (covered by) at least one member of H. A symptom is explained, or covered, by a root cause if there exists an edge with non-zero strength (dependence probability) between the root cause and the symptom. Finding H can be modeled as finding a set cover for S, as each element of H can be thought of as the set of symptoms that would be observed if that element fails.
Given a set of symptoms, the problem is then to find the set of elements that completely explains all the symptoms. Any set cover of the observed symptoms is a candidate hypothesis that completely explains them, but it may not be the most probable one. Finding the most probable set cover is equivalent to finding the smallest set cover if all elements are equally likely to fail, since the probability of a larger number of simultaneous failures is significantly smaller than that of a smaller number. In our enterprise setting, it is difficult to model the exact failure probabilities of different types of root causes, so we assume that each failure is equally likely to occur. Our inference problem then reduces to finding the minimum set cover. Unfortunately, finding the smallest set cover is known to be NP-complete in general.

In Spotlight, we use a novel weighted greedy approach to identify the smallest possible set of root causes H that explains the symptom set S. Our algorithm is similar in spirit to SCORE [13]; however, we need to factor in the probabilistic dependencies on the edges, which SCORE's setting did not involve.

Algorithm 2 Weighted Greedy Minimum Set Cover
unexplained = S
hypothesis H = {}
while unexplained ≠ ∅ do
  for Ri ∈ R do
    compute hit ratio(Ri)
    compute cover ratio(Ri)
    metric(Ri) = cover ratio(Ri) × hit ratio(Ri)
  end for
  find Ri such that metric(Ri) ≥ metric(Rj) for all j
  H = H ∪ {Ri}
  unexplained = unexplained − Si
end while

Our algorithm, shown in Algorithm 2, makes use of two key metrics (see the sketch following their definitions):
• The hit ratio of a root cause Ri is the fraction of the observation nodes associated with Ri that appear as symptoms, i.e., that experience service failures. We use the dependence probability values in calculating it: let pj be the probability on the j-th edge of root cause Ri; if the observation node at the other end of the j-th edge is a symptom, we say j ∈ Si. Let Oi be the set of observation nodes connected to Ri (i.e., the set of symptoms that will appear if Ri fails). Then

    hit ratio = (Σ_{j ∈ Si} pj) / (Σ_{j ∈ Oi} pj)
• The cover ratio of a root cause Ri is the fraction of the total symptoms that can be explained by Ri. It is defined as

    cover ratio = (total # of symptoms in Si) / (total # of symptoms in S)
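The following sketch implements Algorithm 2 over the compressed matrix produced by the compression sketch. The paper leaves implicit whether the ratios are recomputed against the full symptom set or the remaining unexplained symptoms on each iteration; this version uses the unexplained set, which guarantees progress. Tie-breaking is arbitrary:

```python
# Hedged sketch of WGMSC: greedily pick the root cause maximizing
# hit_ratio * cover_ratio until every observed symptom is explained.

def wgmsc(A, S):
    """A: root_cause -> {observation: edge_strength} from compression;
    S: set of observed symptoms. Returns the hypothesis list H."""
    S = set(S)
    unexplained = set(S)
    H = []
    while unexplained:
        best, best_metric, best_new = None, -1.0, set()
        for r, edges in A.items():
            S_i = set(edges) & S              # symptoms r can explain
            new = S_i & unexplained           # symptoms r would newly explain
            if not new:
                continue
            # hit ratio: failed share of the probability mass on r's edges
            hit = sum(edges[o] for o in S_i) / sum(edges.values())
            # cover ratio: share of still-unexplained symptoms r covers
            cover = len(new) / len(unexplained)
            metric = hit * cover
            if metric > best_metric:
                best, best_metric, best_new = r, metric, new
        if best is None:                      # leftover symptoms unexplainable
            break
        H.append(best)
        unexplained -= best_new
    return H

# Example on the compressed graph from the previous sketch:
# wgmsc({"R1": {"O1": 0.024, "O2": 0.024}}, {"O1", "O2"})  -> ['R1']
```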
The hit ratio captures the fraction of symptoms observed for a given root cause. If all symptoms associated with a given element appear, then with good probability the element has indeed failed. Sometimes, however, not all symptoms appear, whether due to erroneous reporting or loss of collected symptoms; if the loss rate is not too high, a significant fraction should still reach the location where the inference algorithm runs. The cover ratio captures the fraction of the total symptoms associated with a given root cause.

Our algorithm, described in Algorithm 2, works as follows. It starts with the unexplained set of symptoms equal to the full symptom set. It proceeds by selecting the best root cause in each iteration, removing the symptoms that root cause explains, and repeating the process. Our goal is to select root causes that have both high cover ratio and high hit ratio: choosing root causes with high cover ratio reduces the hypothesis set size, while choosing those with high hit ratio eliminates the natural bias toward popular root causes that are simply associated with many symptoms (minimizing error). In each iteration, the algorithm therefore computes the product of the cover and hit ratios for each root cause and picks the root cause with the largest product. The product works better than the sum because it favors root causes with high values of both ratios simultaneously. Once a likely root cause is selected, the symptoms it explains are removed from the unexplained set, and the process is repeated after recomputing the hit and cover ratios. When all the symptoms have been explained, we output the hypothesis set.

IV. EVALUATION

Our main evaluation goal is to show the scalability of Spotlight in comparison with existing approaches such as Sherlock. While scalability is important, diagnosis accuracy is equally important, so our evaluation also compares the accuracy of Spotlight in diagnosing real failures. Before presenting these results, we first describe our evaluation methodology.

A. Evaluation methodology

To analyze the performance of Spotlight, we simulate an enterprise network hosting 8 distinct, frequently used applications (enumerated in Table I). The network consists of 21 unique servers (including DNS, authentication, file, and Web portal servers), 34 routers connecting the various servers in the data center, and 2 LANs with a total of 73 IP links. We simulate 1,500 clients spread across the two LANs accessing the various applications. Service requests from clients to servers are routed through the network along paths computed using OSPF. Our topology represents a real enterprise network setting used previously by the authors of [5]. To generate a multi-tier inference graph, we also need the SLDGs for the different applications in addition to the network topology.
TABLE I
LIST OF APPLICATIONS ACCESSED BY CLIENTS IN OUR SIMULATION
1. Accessing File Server
2. Accessing Web Portal
3. Accessing Sales-site
4. Office Communications (IM, VoIP)
5. Email Application
6. Web-based collaboration (e.g., SharePoint)
7. Distributed File System
8. Source Depot (version control system)
We used the SLDGs outlined in [5], [6] for the applications listed in Table I.

Apart from the real enterprise topology described above, we also evaluate Spotlight on randomly generated multi-tier graphs. Our random graph generator takes the number of root causes and the number of observations as input and creates a random multi-tier graph with random probability values. Random graphs give us the flexibility to test the system on many different kinds of inference graphs and play an important role in the scalability evaluation of Spotlight. We performed our evaluations using a prototype implementation of Spotlight (written in C) on a 2.00 GHz Intel Xeon Linux machine (a shared resource) with 8 GB RAM. All timing results are averages over 25 runs, and all accuracy results are averages over 150 runs of the algorithm.

Fault injection. We evaluate the performance of Spotlight using artificially generated faults in our simulated environment. In each run, a fault is injected by randomly selecting a root-cause node (or multiple nodes) and marking it as down. After injecting the fault, we perform a probabilistic walk through the multi-tier graph: we examine the children of the faulty root-cause nodes and propagate the fault according to the dependency probability between each child node and its faulty parent. For example, a node connected to a faulty root-cause node with a dependency probability of 0.5 has a 50% chance of being affected by the fault in the parent node. We repeat this process recursively for all the faulty nodes in the graph, propagating the fault all the way to the leaf (observation) nodes, which form the symptoms. For the same fault injection, different runs may produce different symptom sets because of the probabilistic nature of the dependencies between nodes. This fault-injection technique imitates real-life user experience during failures in enterprise networks.
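A small sketch of this probabilistic fault propagation, reusing the adjacency encoding from the compression sketch; the RNG and the example graph are illustrative:

```python
# Hedged sketch of fault injection: fail chosen root causes and walk the
# inference graph, passing the failure across each edge with the edge's
# dependency probability. Affected leaves form the symptom set.

import random

def inject_fault(children, failed_roots, observations, rng=random):
    symptoms = set()
    affected = set(failed_roots)
    frontier = list(failed_roots)
    while frontier:
        node = frontier.pop()
        for child, p in children.get(node, []):
            # Child is affected with probability equal to the edge weight.
            if child not in affected and rng.random() < p:
                affected.add(child)
                if child in observations:
                    symptoms.add(child)       # a failed client report
                else:
                    frontier.append(child)
    return symptoms

# Repeated runs can yield different symptom sets for the same fault:
# inject_fault(children, {"R1"}, {"O1", "O2"})
```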
Fig. 5. Size of inference graph vs. number of observation nodes.

B. Size of the inference graph

How big are real inference graphs? To answer this, observe that since the number of root causes in the enterprise network topology is fixed (the total number of servers, routers, and links), the size of the inference graph depends solely on the cardinality of the observation set, i.e., the number of clients accessing each application and the set of services a given client accesses. In our simulation, we can vary the number of
observation nodes from 0 (when no client is using any service) to 1500 × 8 (when all 8 services are being accessed by all the clients). While we restricted ourselves to the 8 services for which we had access to SLDGs, a real enterprise network could have a few tens of services. Still, the dominant factor determining the number of observation nodes is the number of clients itself. Figure 5 shows the dependence of the inference graph size on the number of observation nodes. Every observation node potentially introduces new action nodes into the inference graph; the number of newly added action nodes depends on the SLDG of the application the client is trying to access. The sizes of the SLDGs for different applications do not differ much (most applications must do a DNS lookup, some authentication, and so on) [6], so accessing any of the 8 applications adds roughly the same number of new action nodes. Hence the inference graph grows linearly with the number of observation nodes, as shown in Figure 5. The main implication of this result is that inference graphs can be huge for enterprises with a large number of clients (say, a few tens of thousands), illustrating the importance of scalability in these settings.

C. Runtime performance comparison

In this section, we compare the performance and accuracy of Spotlight with Ferret, the inference algorithm used in the Sherlock system [5]. Ferret operates directly on the computed multi-tier inference graph, while Spotlight incurs the cost of first compressing the multi-tier graph into a bipartite graph and then running the WGMSC algorithm on the result. Before presenting the comparison between Ferret and Spotlight, we first evaluate how long it takes to compress multi-tier graphs into bipartite graphs relative to the overall diagnosis time.

Compression time. Figure 6 shows the time taken to compress multi-tier graphs of different sizes. For this experiment, we used random inference graphs so that we could simulate a much larger variety of graphs.
Fig. 6. Relationship between compression time, time taken by WGMSC, and the total inference time for Spotlight.
(Inference graphs generated on enterprise topologies take a similar amount of time to compress.) We can make two main observations from the figure. First, our compression algorithm is highly efficient: even for a graph with 500 nodes, the compression time is quite small (0.72 seconds). Second, as discussed in Section III-B, the compression time is almost linear in the number of nodes in the graph because the number of edges is only slightly super-linear. Even for large multi-tier inference graphs, our compression algorithm requires minimal time. WGMSC takes around 8 milliseconds to output the hypothesis set for the compressed bipartite graph of a random multi-tier inference graph with 128 root-cause nodes and 300 observation nodes. The multi-tier inference graph computed from the real enterprise topology consists of around 5,000 total nodes, for which the compression time is around 0.3 seconds. Since the compression time is much higher than the time taken by the inference algorithm, the total inference time is dominated by compression (as shown in Figure 6). The total inference time for Spotlight is the sum of the compression time and the time taken by the WGMSC algorithm to output a hypothesis set. For the same enterprise topology, Ferret takes significantly more time (45 minutes) than Spotlight.

Inference time comparison. Figure 7 compares the inference time of Spotlight and Ferret as we vary the number of root-cause nodes and observation nodes over random inference graph topologies. We fix the number of observation nodes at 20 and vary the number of root-cause nodes in Figure 7(a). Note that we report the total inference time (compression plus inference) for Spotlight for a fair comparison with Ferret. We can observe from the figure that Ferret takes much more time (four orders of magnitude higher) than Spotlight at around 150 root-cause nodes. Spotlight's inference time also grows super-linearly as the number of root-cause nodes increases, but it scales much better than Ferret. A similar trend can be observed when we fix the number of root-cause nodes (at 50) and vary the number of observation nodes in Figure 7(b). To compare the two on really large inference graphs, we vary the size of the inference graph from 100 all the way to 100,000 nodes. The total time for Spotlight to compress the inference graph into a bipartite graph and localize the fault was around 10 minutes. We could not run Ferret on an inference graph of 100,000 nodes because it takes a really long time; the point for 100,000 nodes is a projection from the other empirically measured data points (we could measure Ferret's run time up to a few thousand nodes). The projected run time for Ferret is about 10^8 seconds, which is 10^5 times slower than Spotlight, clearly indicating the need for Spotlight in the context of large enterprise networks. Even for smaller graphs, Ferret takes significantly more time (100× more) than Spotlight.

D. Accuracy comparison
While the above experiments clearly illustrate the scalability benefits of Spotlight over Ferret, such scalability must come at no expense of accuracy; otherwise, the scalability benefits are not worth it. Thus, in the next set of experiments, we compare the accuracy of Spotlight and Ferret in localizing failures under different failure conditions. Since accuracy is a metric much more closely tied to the actual topology in use, all subsequent experiments use only the real enterprise inference graph. Most common failure scenarios do not involve more than three simultaneous failures at any given point in time [5]. We test each failure scenario multiple times (150 runs per scenario). The primary metric we use is accuracy, defined as the fraction of runs in which all failures are localized correctly:

    accuracy = (# of runs with all faults localized) / (# of runs)
Figure 8(a) shows the accuracy of Spotlight as we vary the number of observation nodes. As can be seen from the figure, single-failure cases are accurately diagnosed by both Spotlight and Ferret, although Spotlight's accuracy is slightly higher (almost 100% irrespective of the number of observation nodes, compared to around 90% for Ferret with 80 observation nodes). It may appear surprising that Spotlight, which uses an approximate inference algorithm, outperforms Ferret, but recall that Ferret is itself only an approximation of exact Bayesian inference (which would likely take much more time than Ferret). The accuracy of Spotlight drops to around 55% when there are 2 simultaneous root-cause failures (as shown in Figure 8(b)) and to around 30% in the case of 3 simultaneous failures (as shown in Figure 8(c)).
Fig. 7. Inference time comparison between Ferret and Spotlight: (a) vary # of root-cause nodes; (b) vary # of observation nodes; (c) vary total # of nodes. Note that we could not run Ferret for an inference graph of size 10^5, and therefore the point for x = 10^5 in Subfigure (c) is only a projection and not a real data point.
Fig. 8. Accuracy comparison of Ferret and Weighted Greedy Min-Set-Cover (Spotlight): (a) accuracy with 1 failure; (b) accuracy with 2 failures; (c) accuracy with 3 failures.
Our detailed investigations reveal that, when there are 2 or more simultaneous root-cause failures, there is significant overlap in the symptoms generated by the failed root causes. In many cases, the symptoms generated by one failed root cause are a subset of the symptoms generated by the other failed root cause(s). For example, if an IP link used by only a small fraction of the client machines fails simultaneously with a very popular server, it is very likely that the symptoms generated by the failed IP link are a subset of those generated by the failed server. In this case, our algorithm outputs only one root cause in the hypothesis, since the basic philosophy of our approach is to select the best hypothesis (i.e., the one with the fewest root causes) that can explain all the symptoms. These cases reduce the overall accuracy of our system compared to the single-failure case. But from Figure 8, we can also observe that our accuracy remains comparable to that of Ferret, indicating that Ferret suffers from the same problem in such cases. Although the accuracy of Spotlight appears low for multiple simultaneous failures, we found that most of the time Spotlight outputs at least one of the real root causes in its hypothesis. Specifically, for 2 simultaneous failures, Spotlight outputs at least one correct root cause around 90% of the time, while for 3 simultaneous failures it does so 60% of the time. Thus, even when the system localizes only one of the multiple failures, once the identified faulty node is fixed, further iterations of
running Spotlight can be performed to uncover the remaining faults.

Imperfect data. In the inference graph, client reports constitute the observation nodes, and the clients that experienced a failure constitute the symptom set. In a realistic enterprise network environment, some data loss may occur and may not be completely avoidable. Since the inference graph in our model is built dynamically, every client needs to send its report (whether the service request succeeded or not) to the inference engine; unless the inference engine receives a client's report, that client-service pair does not become part of the dynamically created inference graph. We evaluate the impact of imperfect data on the performance of Spotlight using an inference graph with 128 root-cause nodes and 100 observation nodes. We simulate imperfect data by reporting a wrong observation for a fraction of the client-service pairs. Figure 9 plots accuracy as a function of the percentage of erroneous client reports. Clearly, as expected, as the percentage of errors in client reports increases, the overall accuracy decreases for both Ferret and Spotlight. For error rates up to 20%, our simulation results indicate that Spotlight maintains an accuracy of more than 90%. Spotlight's accuracy, however, degrades more than Ferret's (by about 10-20%) once errors in the reports exceed 30%.
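One plausible way to simulate such report errors is to flip the reports of a randomly chosen fraction of observation nodes before running inference; the exact corruption model used in the paper is not spelled out, so this is an assumption:

```python
# Hedged sketch: corrupt a fraction of client reports by flipping them
# (failed -> OK and OK -> failed) before handing the symptom set to WGMSC.

import random

def corrupt_reports(symptoms, observations, error_fraction, rng=random):
    reported = set(symptoms)
    n_flips = int(error_fraction * len(observations))
    for node in rng.sample(sorted(observations), n_flips):
        reported ^= {node}   # flip membership: wrong report for this node
    return reported
```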
Fig. 9. Accuracy of Spotlight as the fraction of imperfect data increases (single failure).
V. CONCLUSION

Fault localization in large enterprise networks is a challenging problem today. Typical applications used in large enterprises involve complicated probabilistic dependencies across different servers, leading to complicated failure scenarios that IT staff need to localize quickly. Existing fault localization systems such as Sherlock work by constructing an inference graph, represented as a multi-tier dependency graph that connects multiple root causes with observation nodes, and running Bayesian inference algorithms to output the most likely hypothesis. Unfortunately, these algorithms seriously limit the scalability of Sherlock, especially on really large inference graphs. As enterprises grow to support several tens of thousands of clients, each running several different applications, scalability becomes a big challenge. In this paper, we propose Spotlight as a scalable alternative to Sherlock. Spotlight constructs multi-tier dependency graphs similar to Sherlock's, but uses a novel graph compression algorithm that simplifies the multi-tier graphs into probabilistic bipartite graphs, over which it runs a weighted greedy set cover algorithm for fast diagnosis. Using real topologies and service dependency graphs, we show that Spotlight outputs a diagnosis 100× faster, while being just as accurate as Sherlock.

ACKNOWLEDGMENTS

The authors are indebted to the anonymous reviewers for comments on previous versions of this manuscript. This work was supported in part by NSF Award CNS 0831647.

REFERENCES
[1] EMC SMARTS, http://www.smarts.com.
[2] HP OpenView, http://www.openview.hp.com.
[3] IBM Tivoli, http://www.tivoli.com.
[4] Microsoft Operations Manager, http://www.microsoft.com/mom.
[5] P. Bahl, R. Chandra, A. Greenberg, S. Kandula, D. A. Maltz, and M. Zhang, "Towards highly reliable enterprise network services via inference of multi-level dependencies," in ACM SIGCOMM, Aug. 2007.
[6] X. Chen, M. Zhang, Z. M. Mao, and P. Bahl, "Automating network application dependency discovery: Experiences, limitations, and new solutions," in OSDI, 2008, pp. 117–130.
[7] Mercury Application Mapping, http://www.mercury.com/us/products/business-availability-center/application-mapping.
[8] J. Case, M. Fedor, M. Schoffstall, and J. Davin, "A simple network management protocol (SNMP)," IETF RFC 1157, May 1990.
[9] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier, "Using Magpie for request extraction and workload modelling," in OSDI, Dec. 2004.
[10] J. Dunagan, N. J. A. Harvey, M. B. Jones, D. Kostic, M. Theimer, and A. Wolman, "FUSE: Lightweight guaranteed distributed failure notification," in OSDI, 2004.
[11] M. Y. Chen, A. Accardi, E. Kiciman, D. A. Patterson, A. Fox, and E. A. Brewer, "Path-based failure and evolution management," in NSDI, 2004, pp. 309–322.
[12] R. Fonseca, G. Porter, R. H. Katz, S. Shenker, and I. Stoica, "X-Trace: A pervasive network tracing framework," in NSDI, 2007.
[13] R. Kompella, J. Yates, A. Greenberg, and A. C. Snoeren, "IP fault localization via risk modeling," in NSDI, May 2005.
[14] S. Kandula, D. Katabi, and J. P. Vasseur, "Shrink: A tool for failure diagnosis in IP networks," in Proc. ACM SIGCOMM MineNet Workshop, Aug. 2005.
[15] M. Steinder and A. Sethi, "End-to-end service failure diagnosis using belief networks," in Network Operation and Management Symposium, Florence, Italy, Apr. 2002.
[16] M. Steinder and A. Sethi, "Increasing robustness of fault localization through analysis of lost, spurious and positive symptoms," in IEEE INFOCOM, 2002.
[17] S. Kandula, R. Mahajan, P. Verkaik, S. Agarwal, J. Padhye, and P. Bahl, "Detailed diagnosis in enterprise networks," in ACM SIGCOMM, 2009, pp. 243–254.