INFORMS Journal on Computing
Downloaded from informs.org by [130.115.80.128] on 05 December 2014, at 10:32 . For personal use only, all rights reserved.
Vol. 27, No. 1, Winter 2015, pp. 103–117 ISSN 1091-9856 (print) | ISSN 1526-5528 (online)
http://dx.doi.org/10.1287/ijoc.2014.0613 © 2015 INFORMS
Maintaining Secure and Reliable Distributed Control Systems

Andrei Sleptchenko
Department of Mechanical and Industrial Engineering, College of Engineering, Qatar University, Doha 2713, Qatar,
[email protected]
M. Eric Johnson
Owen Graduate School of Management, Vanderbilt University, Nashville, Tennessee 37203
[email protected]
We consider the role of security in the maintenance of an automated system, controlled by a network of sensors and simple computing devices. Such systems are widely used in transportation, utilities, healthcare, and manufacturing. Devices in the network are subject to traditional failures that can lead to a larger system failure if not repaired. However, the devices are also subject to security breaches that can also lead to catastrophic system failure. These security breaches could result from either cyber attacks (such as viruses, hackers, or terrorists) or physical tampering. We formulate a stochastic model of the system to examine the repair policies for both real and suspected failures. We develop a linear programming-based model for optimizing repair priorities. We show that, given the state of the system, the optimal repair policy follows a unique threshold indicator (either work on the real failures or the suspected ones). We examine the behavior of the optimal policy under different failure rates and threat levels. Finally, we examine the robustness of our model to violations in the underlying assumptions and find the model remains useful over a range of operating assumptions.

Keywords: reliability, maintenance-repairs; queues, priority; queues, optimization; probability, stochastic model applications; probability, Markov processes

History: Accepted by Winfried Grassmann, Area Editor for Computational Probability and Analysis; received October 2013; revised February 2014; accepted May 2014. Published online in Articles in Advance October 28, 2014.
1. Introduction
Distributed control systems form the information and control backbone of most large automated systems like factories, hospitals, utilities, and mass transit systems. Sometimes called supervisory control and data acquisition (SCADA) systems or process control systems (PCS), these networks of sensors, simple controlling devices, routers, and computers were once isolated networks uniquely designed for their intended purpose. With the rapid rise in enterprise software, many of these control systems have been integrated into wider corporate networks and further exposed to the Internet. Bringing such systems onto larger networks has many advantages, including ease of data movement, remote control, and better integration of related business processes. For example, nurse-patient communication systems allow continuous patient monitoring and improved care. Likewise, remote patient monitoring in home settings allows patients to be treated at home, avoiding costly hospital stays (Greg 2013). In process industries, SCADA systems facilitate detailed metrics on the utilization of capital equipment that can be rolled up and presented in broader management dashboards. In discrete manufacturing environments, information tracked on the progress of a customer’s order on the shop floor can be used to advise customers on shipping dates and tracked in a customer relationship management system. Moreover, wireless technologies have made networking factory devices even easier and less expensive, further accelerating this network convergence. For example, General Motors (GM) used to manage many separate control networks but has integrated those onto an Ethernet backbone using wireless technology (Hochmuth 2005). However, opening these control systems to wider access also brings new security risks. For example, hospitals are increasingly reliant on sophisticated networks of wireless sensors and mobile devices to monitor patients. Such patient monitoring systems have been shown to have significant security weaknesses (Gold 2013) that could lead to failure resulting in patient harm or death. Since many SCADA and PCS applications were designed prior to the rise of Internet security fears, they were not engineered to incorporate sophisticated security provisions. In fact, many were designed without any security or implemented with security features effectively disabled (for example, passwords never changed from their default settings). Moreover,
these isolated control systems in many cases were not updated or patched against known vulnerabilities. When integrated into the wider networks, these control systems can become susceptible to everyday viruses, worms, and hackers. Devices that become infected could fail or continue operating in a less reliable state. Moreover, they could harbor malware that lies dormant until some later time when activation leads to immediate failure. A more insidious risk is that devices that appear to operate acceptably may in fact have been compromised. For example, a sensor may continue to report normal operation of the machine it is monitoring when in fact the machine is running out of control. Sometimes, remote diagnoses can detect abnormalities without showing conclusive failure. For example, an Australian man attacked the Maroochy Shire’s wastewater SCADA system that controlled the flow of waste through the complex system. Using a wireless device, he was able to open release valves at a sewage treatment plant, dumping foul-smelling sludge into local parks and rivers. Before being caught, he had successfully pirated control 45 times and dumped 264,000 gallons of sewage, according to the Government Accounting Office (Antunes 2005). More recently, in 2011 an Illinois water utility reported that its pumping systems had been hacked. Attackers reportedly enabled and disabled a pump repeatedly, eventually damaging it (Bradley 2011). Other recent examples of similar security failures in SCADA systems have occurred in the production of electricity and the distribution grid itself (Antunes 2005). In 2001, hackers were able to exploit a known weakness in Sun’s Solaris server systems and successfully installed malware that would have allowed them to control 75% of California’s power grid. Thankfully, in the 17 days before discovery, they caused little trouble. In 2003, the control network of the Davis-Besse Nuclear Power Station in Ohio crashed for five hours as a result of the “Slammer” worm.
Although the failure of the nuclear power plant’s control network generated much concern, no harmful releases occurred during the episode. The possibility of such attacks made world headlines when the Stuxnet computer worm, believed to be developed by the United States and Israel, successfully destroyed Iranian nuclear centrifuges in 2009–2010. Recent reports show the escalation of such military use of SCADA attacks (Gellman and Nakashima 2013). Despite the fact that many firms have increased their investment in security technologies, the FBI Computer Crime Surveys show that enterprises continue to suffer myriad attacks with many incidents going unreported (Brenner 2006). Security breaches create new sources of maintenance burdens on repair resources (people and diagnostic computing devices) and open new questions on how best to allocate limited maintenance resources.
Cyber attacks often result in physical equipment failures that look like normal failures; for example, causing a device to repeatedly cycle through different speeds or states to cause fatigue and failure. Even if small security failures do not immediately translate into large-scale system failure, learning from those “near misses” has been shown to be an important element in avoiding larger system failures (Phimister et al. 2003). Of course, traditional maintenance activities have long been an important area for ensuring the productivity and continuity of control systems. GM alone spends more than $1 billion on the maintenance of its factory systems (Hochmuth 2005). Maintenance management in private and public sectors has motivated research on a wide variety of related problems. However, our focus is to develop models that help prioritize a response to possible security failures of SCADA systems. A Department of Homeland Security report (USDHS 2006) identified such modeling tools and processes as important elements to enhance the speed of response to a cascading failure. Recent industrial research has focused on continuous monitoring of SCADA systems to profile predictable behavior and identify potential failures (Higgins 2013). Our work resides at the intersection of several substantial streams of work. Operations researchers have been interested in repair problems for more than 50 years (cf. Barlow and Proschan 1996). Problems more closely related to ours are found in the literature on maintaining deteriorating systems (Klein 1962, McCall 1963), where the state of the system is not directly observable or is costly to observe (Eckles 1968). Much of this work is focused on single-unit systems; see Wang (2002) for a detailed survey. Another type of problem similar to ours is the machine interference problem (MIP). Models of this type consider a finite set of (different) operating machines maintained by a group of (specialized) repairmen.
To the best of our knowledge, none of the models of this type takes into account the deterioration process (see Haque and Armstrong 2007 for a detailed survey of existing models). For our multi-item system, structural dependence prevents simple reductions to the single-item case (as in most of the works on deteriorating systems). On the other hand, in our system (as will be seen in the discussion of our objective functions) failure of a single component does not necessarily imply system failure. In fact, several devices may fail and the overall system may continue without incident. However, it is often not clear to engineers how many devices can fail without causing a major incident. For example, a failure on a pipe-valve controller in an oil refinery typically results in the valve moving to a failsafe mode (open or closed, depending on the situation) and other redundancies are often in place. Of course, failures must be quickly repaired but the
system may be safely operated for some time after such a failure. However, if several controllers have failed, the overall system will reach a failure state. This leads us to focus on objectives like the probability of exceeding k-out-of-N failed devices. De Smidt-Destombes et al. (2006) consider the inventory policy of a k-out-of-N system with wear-out. In their work, devices can be in one of three states: working, deteriorated, or failed. Like previous research on deteriorating systems, our focus is on providing reliable operation of the total system by minimizing the expected number of failed components or the probability of k failed components. However, unlike the previous research on deteriorating systems, we consider devices that are also subject to suspected security failures that could be real failures, later result in real failures, or be false alarms with no real danger. Understanding how to manage maintenance resources in this new world of security concerns is an important new area of study. In our system, devices can (and must) be restored, typically through software patches or physical repairs. Thus, inventory, as in De Smidt-Destombes et al. (2006), is not the key issue in this paper. Rather, our focus is on optimal use of the repair resources themselves, since their capacity is the key limitation. From a modeling point of view, our problem is also similar to some of the work on restless bandit problems. These problems have been widely studied in recent years (cf. Gittins et al. 2011; Bertsimas and Niño-Mora 1996, 2000; Glazebrook and Mitchell 2002; Glazebrook et al. 2005; Niño-Mora 2008, 2011). The most similar, from our point of view, are the problems considered in Glazebrook and Mitchell (2002) and Glazebrook et al. (2005). In those papers, the authors consider a repair model for a group of continuously improving/deteriorating machines, and the question is how to schedule repairmen such that a certain cost function is minimized.
They present a scheduling heuristic based on Whittle’s indices (Whittle 1988) and develop theoretical bounds on heuristic performance. Another similar model can be found in Tiemessen and van Houtum (2013), where the authors analyze different policies for dynamic repair priorities using a Markov decision process (MDP) formulation and simulations. Here, we treat the system as a two-class priority queue where the server chooses jobs from different queues such that a certain objective is optimized. In this respect, our research is also related to the large stream of work in multiclass queues with service priorities (cf. Cobham 1954, Davis 1966, Jaiswal 1968). Recent developments in that area are focused, however, on measurements of system performance rather than on optimization (cf. Wagner 1996, 1998; Sleptchenko et al. 2005; and Harchol-Balter et al. 2005), or require work conservation laws that would allow optimization
of processing priorities using a polyhedral approach (cf. Bertsimas and Niño-Mora 1996). In the model considered in this paper, conservation laws do not hold since jobs can change their type and service time (real failure after suspected). In our multiclass, closed system the optimal priority policy shifts dynamically with the state of the system. We develop a stochastic model for a maintenance system, where the server (or repairman) is faced with two types of failures: “real” and “suspected.” Each failure type can be resolved by the server at a different rate. In addition, devices in a suspected failure state may transition to a “real” failure, since the devices with suspected failures are still operating. Given an appropriate (linear) objective function related to the overall system integrity, we develop the optimal repair policy. We present an exact linear programming (LP)-based optimization technique for determining a time-independent, state-dependent scheduling policy. Such an optimization technique is tractable because the closed system results in a modest state space (two-dimensional and finite). Our approach using standard LP techniques allows us to optimize discrete priority rules. A similar property was shown for systems where conservation laws hold (cf. Bertsimas and Niño-Mora 1996). However, for our problem the conservation laws do not hold. Therefore, optimal processing priorities are state-dependent and require an approach based on MDPs. Using techniques from queueing theory, we show how to reduce the state and action space. For example, direct use of MDP theory would require working on a 2 × 3^N-dimensional problem (N machines with three states each and two actions in each state), whereas using our approach we can formulate the problem as an N(N − 1)/2-dimensional optimization problem. We begin by formulating the model (§2).
We show that a relaxed formulation where the repair resource chooses the next device to be repaired with some probability results in a deterministic optimal policy (§3). We derive generalized priority assignment rules for some partial cases (§4) and examine the behavior of the optimal policy in other generic settings (§5). Finally, we examine the robustness of our model to violations in the underlying assumptions (§6).
2. Model Formulation

2.1. Optimization Problem
We consider a system with N operating devices that are subject to two failure types: “real” and “suspected.” The “real” failures occur at rate λ1 for each device, and we refer to these as “real” because they are instantly identified with certainty. The devices are also subject to “suspected” failures at rate λ2 for each device. Suspected failures may be real failures (with some probability g) or may lead to real failures.
Unfortunately, the true condition of devices suffering from suspected failures cannot be known with certainty unless they transition to a real failure mode (with rate λ12) or are inspected by the repair server. We assume that the interfailure times are exponentially distributed. Thus, the repair system has two flows of arrivals that occur according to nonstationary Poisson processes, with rates that depend on the number of working devices. That is, given that n1 is the number of devices with “real” failures and n2 is the number with “suspected” failures, the system’s arrival rates will be:

  (N − (n1 + n2))λ1 — arrival rate of “real” failures,
  (N − (n1 + n2))λ2 — arrival rate of “suspected” failures.

In addition, the devices that are already in a “suspected” failure state can have a real failure. Thus we have an additional stream of “real” failures with system rate:

  n2 λ12 — rate of “real” failures occurring in devices with a “suspected” failure.

We consider a system with a single repair server that processes failures of both types with rates μ1 and μ2 for real and suspected failures, respectively (service times are exponentially distributed). Since the considered system is Markovian (i.e., interarrival and service times are memoryless), it is logical to assume that the optimal dynamic processing priorities do not depend on the current state of the interarrival and service processes or on the processing history. That is, they depend only on the current numbers of failures in the system. We solve for the optimal state-dependent processing policy (dynamic processing priorities). Namely, the server chooses to process “real” failures with parameter α_{n1,n2}, and correspondingly “suspected” failures with parameter 1 − α_{n1,n2}. We relax the optimization problem by allowing the parameters α_{n1,n2} to be continuous (α_{n1,n2} ∈ [0, 1]).
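In code, the state-dependent rates just described can be collected into a single transition function. The sketch below is our own (the function and parameter names are not from the paper); it hard-codes the boundary rules α = 1 when n2 = 0 and α = 0 when n1 = 0:

```python
# Transition rates out of state (n1, n2) for the closed system described above.
# lam1, lam2, lam12, mu1, mu2, alpha mirror the paper's lambda_1, lambda_2,
# lambda_12, mu_1, mu_2, alpha_{n1,n2}; the function name is ours.
def transition_rates(n1, n2, N, lam1, lam2, lam12, mu1, mu2, alpha):
    """Return a dict mapping successor states to transition rates."""
    if n2 == 0:          # only real failures queued: server must work on them
        alpha = 1.0
    if n1 == 0:          # only suspected failures queued (or system empty)
        alpha = 0.0
    working = N - n1 - n2
    rates = {}
    if working > 0:
        rates[(n1 + 1, n2)] = working * lam1        # new real failure
        rates[(n1, n2 + 1)] = working * lam2        # new suspected failure
    if n2 > 0:
        rates[(n1 + 1, n2 - 1)] = n2 * lam12        # suspected turns real
        if alpha < 1.0:
            rates[(n1, n2 - 1)] = (1 - alpha) * mu2 # repair of a suspected failure
    if n1 > 0 and alpha > 0.0:
        rates[(n1 - 1, n2)] = alpha * mu1           # repair of a real failure
    return rates
```

Note that in the empty state (0, 0) both boundary rules fire and only the two arrival streams remain, matching the balance equation for p_{0,0} below.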
For this relaxation, α_{n1,n2} can be interpreted as the state-dependent probability that the server chooses to process “real” or “suspected” failures. Later we will show that, in cases with linear objective functions, the optimal parameters α_{n1,n2} obtained from the relaxed problem are always integer (i.e., α_{n1,n2} ∈ {0, 1}). Finally, we assume that the server follows a preemptive priority discipline, which means that the server will postpone processing of the device currently in service if the current scheduling rule (parameter α_{n1,n2}) dictates that the server process a device with the other type of failure. That is, preemption might occur after each failure event. However, our approach can be directly applied to systems with a nonpreemptive discipline at the expense of a larger state space. In our analysis, we consider two different objective functions. The choice of objective function to apply in any real situation depends on the nature of the real and suspected failures and their relationship to the overall system integrity.
Objective 1 (O1). Minimize the probability of the event that n1 + g n2 > d.
This objective (O1) is most appropriate in cases where exceeding a certain level of failures (real and suspected) may lead to catastrophic system failure, so we wish to minimize the probability of such an outcome. Notice that a special case of this objective is where we only measure real failures (g = 0).

Objective 2 (O2). Minimize the weighted sum of expected “real” and “suspected” failures, aE[n1] + bE[n2].
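Both objectives are simple functionals of the stationary distribution. A minimal sketch (helper names are ours; `p` is assumed to map states (n1, n2) to probabilities):

```python
# Evaluating the two objective functions for a given stationary distribution.
def objective_O1(p, g, d):
    """P(n1 + g*n2 > d): probability of exceeding the failure threshold d."""
    return sum(prob for (n1, n2), prob in p.items() if n1 + g * n2 > d)

def objective_O2(p, a, b):
    """a*E[n1] + b*E[n2]: weighted expected numbers of failures."""
    e_n1 = sum(n1 * prob for (n1, n2), prob in p.items())
    e_n2 = sum(n2 * prob for (n1, n2), prob in p.items())
    return a * e_n1 + b * e_n2
```

Setting g = 0 in O1, or a = 1, b = 0 in O2, recovers the real-failures-only special cases mentioned in the text.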
This objective is more appropriate in cases where the overall system performance degrades linearly in the number of real and suspected failures, where a and b represent the relative impact of a real or suspected failure. If g is interpreted as the probability that the suspected failure is actually a real one (independent of the system state), then setting a = 1, b = g gives the expected number of failures. If the objective is to minimize the expected number of real failures, set a = 1 and b = 0 in objective O2.

2.2. Solving for a Given Priority Assignment
The multiclass closed system with preemptive priority behaves like a Markov process with a finite state space. The states of the system can be described by the 2-tuple (n1, n2), which represents the number of real and suspected failures, respectively, with transition rates as illustrated in Figure 1. Given these transition rates, the equilibrium equations for the system-state probabilities can be obtained by equating the flow out of state (n1, n2) and the flow into state (n1, n2):

  [(N − n1 − n2)λ1 + (N − n1 − n2)λ2 + n2 λ12 + α_{n1,n2} μ1 + (1 − α_{n1,n2})μ2] p_{n1,n2}
    = (N − n1 − n2 + 1)λ1 p_{n1−1,n2} + (N − n1 − n2 + 1)λ2 p_{n1,n2−1}
      + (n2 + 1)λ12 p_{n1−1,n2+1} + α_{n1+1,n2} μ1 p_{n1+1,n2} + (1 − α_{n1,n2+1})μ2 p_{n1,n2+1},
        n1, n2 ≥ 1, n1 + n2 < N,

  [(N − n2)λ1 + (N − n2)λ2 + n2 λ12 + μ2] p_{0,n2}
    = (N − n2 + 1)λ2 p_{0,n2−1} + α_{1,n2} μ1 p_{1,n2} + μ2 p_{0,n2+1},
        n1 = 0, 1 ≤ n2 < N,

  [(N − n1)λ1 + (N − n1)λ2 + μ1] p_{n1,0}
    = (N − n1 + 1)λ1 p_{n1−1,0} + λ12 p_{n1−1,1} + μ1 p_{n1+1,0} + (1 − α_{n1,1})μ2 p_{n1,1},
        1 ≤ n1 < N, n2 = 0,

  [n2 λ12 + α_{n1,n2} μ1 + (1 − α_{n1,n2})μ2] p_{n1,n2}
    = λ1 p_{n1−1,n2} + λ2 p_{n1,n2−1} + (n2 + 1)λ12 p_{n1−1,n2+1},
        n1 + n2 = N,

  (Nλ1 + Nλ2) p_{0,0} = μ1 p_{1,0} + μ2 p_{0,1},
        n1 = n2 = 0.
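For a fixed priority rule, these balance equations can be solved numerically by assembling the generator and replacing one balance equation with the normalization condition. The following is a sketch under our own naming (not the authors' code):

```python
import numpy as np

# Solve the global balance equations p Q = 0, p e^T = 1 for a *fixed*
# priority rule alpha(n1, n2); names and construction are ours.
def stationary_distribution(N, lam1, lam2, lam12, mu1, mu2, alpha):
    states = [(n1, n2) for n1 in range(N + 1) for n2 in range(N + 1 - n1)]
    idx = {s: k for k, s in enumerate(states)}
    Q = np.zeros((len(states), len(states)))
    for (n1, n2) in states:
        a = 1.0 if n2 == 0 else (0.0 if n1 == 0 else alpha(n1, n2))
        w = N - n1 - n2
        moves = []
        if w > 0:
            moves += [((n1 + 1, n2), w * lam1), ((n1, n2 + 1), w * lam2)]
        if n2 > 0:
            moves += [((n1 + 1, n2 - 1), n2 * lam12),
                      ((n1, n2 - 1), (1 - a) * mu2)]
        if n1 > 0:
            moves += [((n1 - 1, n2), a * mu1)]
        i = idx[(n1, n2)]
        for s, r in moves:
            Q[i, idx[s]] += r
            Q[i, i] -= r
    # replace one balance equation by the normalization condition p e^T = 1
    A = Q.T.copy()
    A[-1, :] = 1.0
    b = np.zeros(len(states)); b[-1] = 1.0
    p = np.linalg.solve(A, b)
    return {s: p[idx[s]] for s in states}
```

As a sanity check, for N = 1 with λ2 = λ12 = 0 the system reduces to a single machine alternating between working and failed, so p_{0,0} = μ1/(λ1 + μ1).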
Figure 1: Flow Diagrams for Different System States
[Figure: two panels (A and B) of state-transition diagrams over the axes n1 (“real” failures) and n2 (“suspected” failures). The labeled rates are the failure arrivals (N − n1 − n2)λ1 and (N − n1 − n2)λ2 (reducing to (N − n1)λ1, (N − n2)λ2, Nλ1, Nλ2 on the boundaries), the suspected-to-real transitions n2 λ12, and the service completions α_{n1,n2} μ1 and (1 − α_{n1,n2})μ2.]
Note that α_{n1,n2} = 1 when n2 = 0, and α_{n1,n2} = 0 when n1 = 0. Although the system-state probabilities p_{n1,n2} have double indexation, we can assume that they are elements of a column vector p, ordered in, for example, a lexicographical manner; i.e., p = (p_{0,0}, p_{0,1}, p_{1,0}, p_{2,0}, p_{1,1}, p_{0,2}, …, p_{0,N}, …, p_{N,0}). Then we can write this system of equilibrium equations as a system of linear equations of size (N + 2)(N + 1)/2 × (N + 2)(N + 1)/2 plus the normalization equation:

  pQ = 0,    p e^T = 1.
For certain orders of the probabilities p_{n1,n2} in the vector p, the matrix Q will have a block tridiagonal form (see the appendix for more details), and it is easy to see that we have a finite nonhomogeneous two-dimensional quasi-birth-death (QBD) process. We can solve this system exactly for all the state probabilities, given specific system parameters (α_{n1,n2}, λ1, λ2, etc.), applying known techniques (cf. Latouche and Ramaswami 1999, Neuts 1981). However, our focus here is the development of an optimization procedure to determine the parameters α_{n1,n2}. Note that our optimization procedure can be easily adapted for finding the steady-state probabilities given certain α’s by introducing additional constraints.
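The block tridiagonal (QBD) structure is easy to verify numerically: if states are grouped level by level (level n = n1 + n2), no transition skips a level, because arrivals raise the level by one, repairs lower it by one, and the λ12 transitions stay within a level. A small check (our own construction; the fixed α = 0.5 is just an arbitrary feasible policy):

```python
import numpy as np

# Build the generator with states ordered by level n = n1 + n2 and check
# that it is block tridiagonal; names are ours.
def generator_levels(N, lam1, lam2, lam12, mu1, mu2):
    states = [(n1, n - n1) for n in range(N + 1) for n1 in range(n + 1)]
    idx = {s: k for k, s in enumerate(states)}
    Q = np.zeros((len(states), len(states)))
    for (n1, n2) in states:
        a = 1.0 if n2 == 0 else 0.0 if n1 == 0 else 0.5  # any fixed policy
        w = N - n1 - n2
        moves = []
        if w > 0:
            moves += [((n1 + 1, n2), w * lam1), ((n1, n2 + 1), w * lam2)]
        if n2 > 0:
            moves += [((n1 + 1, n2 - 1), n2 * lam12),
                      ((n1, n2 - 1), (1 - a) * mu2)]
        if n1 > 0:
            moves += [((n1 - 1, n2), a * mu1)]
        i = idx[(n1, n2)]
        for s, r in moves:
            Q[i, idx[s]] += r
            Q[i, i] -= r
    return Q, states

Q, states = generator_levels(4, 1.0, 1.0, 2.0, 5.0, 3.0)
for i, (a1, a2) in enumerate(states):
    for j, (b1, b2) in enumerate(states):
        if abs((a1 + a2) - (b1 + b2)) >= 2:
            assert Q[i, j] == 0.0  # no transition skips a level
```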
3. Optimization of the Priority Assignment (for a Linear Objective Function)
Our goal is to find probabilities p_{n1,n2} and parameters α_{n1,n2} such that a linear objective function is minimized and the equilibrium equations are satisfied:

  cp^T → min
  s.t. pQ(α) = 0,  p e^T = 1,  p ≥ 0,  0 ≤ α ≤ 1,        (1)

where the vector α is defined in the same way as the vector p. Although the equation pQ(α) = 0 is nonlinear, we can use the structure of the generator Q(α) and formulate a linear optimization problem. For this we introduce variables y_{n1,n2} = p_{n1,n2} α_{n1,n2} that allow us to rewrite the constraints (1) of the optimization as:

  cp^T → min
  s.t. pL + yM = 0,  p e^T = 1,  p − y ≥ 0,  y ≥ 0,        (2)

where the matrix L contains all the terms of pQ(α) without the probabilities α_{n1,n2} and the matrix M contains all the terms corresponding to the probabilities α_{n1,n2} (see the appendix for more details). Thus we obtain a linear program with (N + 2)(N + 1)/2 + N(N − 1)/2 variables (taking into account that α_{n1,n2} are fixed for n1 or n2 equal to 0), with (N + 2)(N + 1)/2 + N(N − 1)/2 + 1 constraints and nonnegative variables. Once the optimal vectors p and y are known, the optimal priorities α can be uniquely recovered (shown
in Lemma 2). Moreover, we also show that the optimal α’s are always integer.

Lemma 1. For any feasible vector of priority parameters 0 ≤ α ≤ 1, the finite-state Markov chain describing our maintenance system has a unique stationary distribution with nonzero probabilities for each state.

Proof. This lemma follows immediately from the fact that the system is finite and irreducible. The first is obvious. The second fact means that if the system at a certain moment is in a specific state (any state), then there is a positive probability that it will move to any other state in finite time. This follows from the fact that any state can be reached from the state with zero machines in repair, and the zero state can be reached from any other because the repair server is always working unless there are no failures in the system. □

Lemma 2. There is a one-to-one correspondence between the pairs (p, α) and (p, y) that correspond to the optimal solutions of the initial optimization problem (1) and of the LP problem (2).

Proof. As shown in Lemma 1, for each priority vector α there is a unique vector p. The uniqueness of the vector y for each vector α then follows from its definition (y_{n1,n2} = p_{n1,n2} α_{n1,n2}). Assume now that the vector y is known. Using standard results from the theory of finite-state Markov chains (cf. Latouche and Ramaswami 1999) and linear algebra, it is possible to show that the rank of the system of linear equations from problem (2):

  pL = −yM,    p e^T = 1

is equal to the dimension of the state space (dim(p)). This means that for each vector y we will have a unique vector p. If both of these vectors satisfy the inequalities of problem (2):

  p − y ≥ 0,    y ≥ 0,

then the pair (p, y) gives us a unique vector α satisfying the constraints of problem (1). □

Theorem 1. The optimal parameter α_{n1,n2} always takes integer values; i.e., α*_{n1,n2} ∈ {0, 1}, for all n1, n2.
Proof. It is known from optimization theory (cf. Bertsimas and Tsitsiklis 1997) that the optimal solution of an LP problem normally lies on one of the vertices of the feasible polytope. In other words, a necessary condition for a feasible point to be the optimal solution is that the number of active constraints at this point is greater than or equal to the number of variables (the problem dimension). As in Lemma 2, it is possible to show that the rank of the system of linear equations

  pL + yM = 0,    p e^T = 1

is equal to the dimension of the state space (dim(p)). This means that in the optimal solution the number of active inequalities from the group

  p − y ≥ 0,    y ≥ 0

must be greater than or equal to the number of y-variables (dim(y)). Assume now that there is an optimal solution of system (1) with at least one variable 0 < α_{n′1,n′2} < 1 (for certain n′1, n′2). Then the corresponding inequalities (y_{n′1,n′2} < p_{n′1,n′2} and y_{n′1,n′2} > 0) will be nonactive. This means that, to keep the number of active inequalities from the set p − y ≥ 0, y ≥ 0 greater than or equal to dim(y), there must exist at least one pair (n″1, n″2) for which both inequalities are active (y_{n″1,n″2} = p_{n″1,n″2} and y_{n″1,n″2} = 0). The last fact implies that our Markov problem would have states with zero probability, which contradicts Lemma 1. That is, the optimal priority parameters obtained from the optimal solution of the LP problem (2) always take integer values. □
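Problem (2) can be set up directly with an off-the-shelf LP solver. The sketch below is our own construction (for Objective O2 with weights a and b; all function and variable names are ours): it builds the L- and M-type coefficients row by row from the balance equations, fixes y = p on the n2 = 0 axis and y = 0 on the n1 = 0 axis, and recovers α = y/p from the optimal solution:

```python
import numpy as np
from scipy.optimize import linprog

# LP (2): variable vector x = [p, y] with y[i] = p[i] * alpha[i].
def optimal_policy_lp(N, lam1, lam2, lam12, mu1, mu2, a=1.0, b=0.0):
    states = [(n1, n2) for n1 in range(N + 1) for n2 in range(N + 1 - n1)]
    n = len(states)
    idx = {s: k for k, s in enumerate(states)}
    Ap = np.zeros((n, n))  # coefficients of p in the balance equations
    Ay = np.zeros((n, n))  # coefficients of y in the balance equations
    for (n1, n2) in states:
        i = idx[(n1, n2)]
        w = N - n1 - n2
        if w > 0:  # failure arrivals: coefficients on p_i
            for s, r in [((n1 + 1, n2), w * lam1), ((n1, n2 + 1), w * lam2)]:
                Ap[idx[s], i] += r; Ap[i, i] -= r
        if n2 > 0:
            r = n2 * lam12
            Ap[idx[(n1 + 1, n2 - 1)], i] += r; Ap[i, i] -= r
            # mu2 service carries weight (p_i - y_i)
            Ap[idx[(n1, n2 - 1)], i] += mu2; Ap[i, i] -= mu2
            Ay[idx[(n1, n2 - 1)], i] -= mu2; Ay[i, i] += mu2
        if n1 > 0:  # mu1 service carries weight y_i
            Ay[idx[(n1 - 1, n2)], i] += mu1; Ay[i, i] -= mu1
    # balance equations (one dropped as redundant) + normalization
    A_eq = np.hstack([Ap[:-1], Ay[:-1]])
    A_eq = np.vstack([A_eq, np.concatenate([np.ones(n), np.zeros(n)])])
    b_eq = np.zeros(n); b_eq[-1] = 1.0
    # fixed priorities on the axes: y = p when n2 = 0 and n1 > 0
    extra = []
    for (n1, n2) in states:
        if n2 == 0 and n1 > 0:
            row = np.zeros(2 * n)
            row[idx[(n1, 0)]] = 1.0; row[n + idx[(n1, 0)]] = -1.0
            extra.append(row)
    A_eq = np.vstack([A_eq, np.array(extra)])
    b_eq = np.concatenate([b_eq, np.zeros(len(extra))])
    # y = 0 when n1 = 0 (via bounds); y <= p (via A_ub)
    bounds = [(0, None)] * n + [(0, 0) if s[0] == 0 else (0, None) for s in states]
    A_ub = np.hstack([-np.eye(n), np.eye(n)])
    c = np.concatenate([[a * s[0] + b * s[1] for s in states], np.zeros(n)])
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n),
                  A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    if res.status != 0:
        raise RuntimeError(res.message)
    p, y = res.x[:n], res.x[n:]
    alpha = {s: (y[idx[s]] / p[idx[s]] if p[idx[s]] > 1e-12 else None)
             for s in states}
    return {s: p[idx[s]] for s in states}, alpha, res.fun
```

By Theorem 1, the recovered α’s at a vertex solution should be (numerically close to) 0 or 1 in every interior state.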
4. Generalizable Priority Assignments
The structure of the optimal repair priorities depends on the objective function and the other system parameters (failure and service rates). Moreover, the conservation laws do not hold in our problem, since the arrival (failure) rates depend on the number of working devices and thus depend on the processing rules. For example, setting high processing priority to the “suspected” failures will increase the number in queue with “real” failures (and thus, the number of working devices will be lower). So, the available results that employ conservation laws are not applicable to our system. However, in some conditions we can find generalizable priority rules, i.e., rules where one job type always gets priority over the other, regardless of the state of the system. The following lemmas present conditions for such priority rules.

Lemma 3. If the objective function requires minimization of the probabilities of system states with high n1 (e.g., Objectives O1, O2 with g or b equal to zero) and the “real” failure rate of the devices with an already discovered “suspected” failure is not higher than the failure rate of the “new” ones (λ12 − λ1 ≤ 0), then the optimal repair policy is to always repair the devices with “real” failures.

Proof. The probability P^1_{n1} of having n1 “real” failures in the system can be expressed as (by summation of the equilibrium equations):

  P^1_{n1} = Σ_{n2=0}^{N−n1} (1 − α_{n1,n2}) p_{n1,n2} + (N − n1 + 1)(λ1/μ1) P^1_{n1−1} + ((λ12 − λ1)/μ1) E[n2 | n1 − 1].
Note that the minimization of P^1_{n1} requires the maximization of α_{n1,n2}. Maximizing α_{n1,n2} will cause E[n2 | n1 − 1] to increase, since the queue with “suspected” failures is not processed. However, in cases with λ12 − λ1 ≤ 0, an increase of E[n2 | n1 − 1] produces even smaller probabilities P^1_{n1}. That is, in cases with λ12 − λ1 ≤ 0 the “real” failures should be processed first. □
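Lemma 3 can be illustrated numerically by comparing the two static policies. The sketch below (our own code, not the authors') solves the chain under "always repair real first" and "always repair suspected first" and compares E[n1] when λ12 ≤ λ1:

```python
import numpy as np

# E[n1] and E[n2] under a static priority policy; names are ours.
def expected_failures(N, lam1, lam2, lam12, mu1, mu2, prefer_real):
    states = [(n1, n2) for n1 in range(N + 1) for n2 in range(N + 1 - n1)]
    idx = {s: k for k, s in enumerate(states)}
    Q = np.zeros((len(states), len(states)))
    for (n1, n2) in states:
        a = 1.0 if n2 == 0 else (0.0 if n1 == 0 else (1.0 if prefer_real else 0.0))
        w = N - n1 - n2
        moves = []
        if w > 0:
            moves += [((n1 + 1, n2), w * lam1), ((n1, n2 + 1), w * lam2)]
        if n2 > 0:
            moves += [((n1 + 1, n2 - 1), n2 * lam12),
                      ((n1, n2 - 1), (1 - a) * mu2)]
        if n1 > 0:
            moves += [((n1 - 1, n2), a * mu1)]
        i = idx[(n1, n2)]
        for s, r in moves:
            Q[i, idx[s]] += r
            Q[i, i] -= r
    A = Q.T.copy(); A[-1, :] = 1.0
    b = np.zeros(len(states)); b[-1] = 1.0
    p = np.linalg.solve(A, b)
    return (sum(n1 * p[idx[(n1, n2)]] for (n1, n2) in states),
            sum(n2 * p[idx[(n1, n2)]] for (n1, n2) in states))
```

With, say, λ12 = 0.5 ≤ λ1 = 1, the real-priority policy should give an E[n1] no larger than the suspected-priority policy, since by Lemma 3 the former is optimal for this objective.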
Thus in cases where we are only concerned with real failures and λ12 − λ1 ≤ 0, we want to keep devices in the “suspected” state, since the devices in the “suspected” state have a lower “real” failure rate, and thus the total “real” failure rate is lower.

Lemma 4. The total number of failures can be minimized by setting higher priority to the jobs with the shortest expected processing time.

Proof. When the objective is based on the total number of jobs in the queue, we can sum the equilibrium equations along n1 + n2 = n and obtain the recursive relations for the total probability P^T_n as:

  Σ_{n1=0}^{n} (α_{n1,n−n1} μ1 + (1 − α_{n1,n−n1})μ2) p_{n1,n−n1} = P^T_{n−1} (N − n + 1)(λ1 + λ2),

  P^T_n = P^T_{n−1} (N − n + 1)(λ1 + λ2)/μ2 − Σ_{n1=0}^{n} α_{n1,n−n1} ((μ1 − μ2)/μ2) p_{n1,n−n1}.

Then, to minimize the probability P^T_n we have to maximize α_{n1,n2} if μ1 − μ2 > 0, or minimize α_{n1,n2} if μ1 − μ2 < 0. That is, the optimal priorities depend only on the processing rates and not on the system state. □

The last lemma is similar in some way to the well-known cμ-rule (cf. Buzacott and Shanthikumar 1993). However, in our case the cμ-rule does not work for nonequal c’s, which can be seen from the examples presented in §5 (Figures 7 and 8).

5. Developing a Plan Using Computational Results

For other combinations of the input parameters, the optimal priority assignment will not always give absolute priority to one type of failure over another. To study such cases we performed a large number of experiments with different parameters and highlight only the most interesting combinations. Namely, we fixed the following system parameters: N = 10, λ1 = 1, λ2 = 1, λ12 = 3, μ1 = 5, and calculated the priority assignments for different combinations of the remaining system parameters (μ2, g, d, a, and b). In the case of the “k-out-of-N” problem (Objective O1, g = 0, d = 4) the optimal priorities are as shown in Figure 2. This figure clearly indicates that when the service rate of the “suspected” failures increases, it becomes more important to give high priority in certain system states to the “suspected” failures. This is because by keeping the number of suspected failures low, we lower the chance of real failures (note here that λ12 > λ1; otherwise we get the case described by Lemma 3). However, along the border of the maximized area (n1 = d), it is optimal to assign higher priority to the “real” failures, such that the system moves more quickly into the “safe” area. Somewhat similar behavior of the optimal priorities can be observed for the “generalized” Objective O1
(with g > 0). In this case, the "safe" area with functioning systems lies below the diagonal (or stepwise-diagonal) line, as shown in Figures 3–5. Minimizing the "weighted" probability P(n1 + g·n2 > d) with d = 6, different values of g (0.5, 0.8, 1.3), and different service rates μ2 of the "suspected" failures gives the priority assignments shown in Figures 3–5. Here, as in the case g = 0, the "suspected" failures receive higher priority more often when they have a higher service rate μ2. These experiments also indicate that, for a higher weight g, the "suspected" failures start receiving higher priority at lower service rates μ2. This conclusion is confirmed by further experiments with the weight g (Figure 6), where all system parameters (except g) are fixed to the values indicated in the first paragraph of this section and in the caption of Figure 6. These sets of experiments (Figures 3–6) also clearly indicate that it is quite difficult to derive rules of thumb for optimal priority assignments; the optimization method presented in §3 is needed.

Figure 2 Optimal Priority Assignments for a "k-Out-of-N" System (Objective O1, g = 0, d = 4) for Different Service Rates [panels for μ2 = 7, 9.5, 13, 20, 31.2, and 32; each panel marks, for every state (n1, n2), whether "real" (R) or "suspected" (s) failures get repair priority]

Figure 3 Optimal Priority Assignments for the Objective O1 with g = 0.5 and d = 6 [panels for different service rates μ2]

Figure 4 Optimal Priority Assignments for the Objective O1 with g = 0.8 and d = 6 [panels for different service rates μ2]

Figure 5 Optimal Priority Assignments for the Objective O1 with g = 1.3 and d = 6 [panels for different service rates μ2]

When the repair priorities are optimized under Objective O2, there is no clearly defined zone of probabilities that have to be maximized (or minimized). Therefore, it is even more difficult to define rules of thumb for the optimal priority assignment (except for the cases described in Lemmas 3 and 4). That is, minimization of the expected number of "real" failures E[n1] (Objective O2, a = 1, b = 0) when λ12 − λ1 ≤ 0 still falls under Lemma 3, whereas
optimization of the repair priorities for λ12 − λ1 > 0 produces the priority assignments shown in Figure 7. Here again, the "suspected" failures start receiving higher priority in order to decrease the probability of real failures in the system. In cases with Objective O2 and a nonzero weight b, we can use the results of the lemmas only if the weight b
is equal to a. Otherwise, optimization is necessary, and the priority assignment behaves as shown in Figure 8. The last experiment also clearly shows that the priority assignments are quite sensitive to the service rates, even when the differences between the weights are relatively large (Figure 8).
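State-dependent assignments like those shown in the figure panels can be reproduced in miniature without the LP of §3. The sketch below is a hypothetical small instance, not the paper's implementation: under our reading of the model, working devices fail "real" at rate λ1 and "suspected" at rate λ2, a suspected device turns into a real failure at rate λ12, and one repair crew serves a single failure at rate μ1 (real) or μ2 (suspected). It minimizes the long-run average number of "real" failures E[n1] by relative value iteration on a uniformized chain; all parameter values and names are ours.

```python
# Hypothetical small instance of the repair-priority Markov decision
# problem, solved by relative value iteration (the paper uses an LP).
N, lam1, lam2, lam12, mu1, mu2 = 6, 1.0, 1.0, 3.0, 5.0, 9.5

states = [(n1, n2) for n1 in range(N + 1) for n2 in range(N + 1 - n1)]
LAM = N * (lam1 + lam2 + lam12) + max(mu1, mu2)   # uniformization rate

def bellman(h, policy=None):
    """One value-iteration sweep; returns updated values and, per state,
    the chosen repair action ('R' = real, 's' = suspected, None = idle)."""
    Th, act = {}, {}
    for (n1, n2) in states:
        w = N - n1 - n2                               # working devices
        base = [((n1 + 1, n2), w * lam1),             # new real failure
                ((n1, n2 + 1), w * lam2),             # new suspected failure
                ((n1 + 1, n2 - 1), n2 * lam12)]       # suspected turns real
        base = [(s, r) for s, r in base if r > 0]
        if policy is not None:
            acts = [policy[(n1, n2)]]
        else:
            acts = [a for a, q in (('R', n1), ('s', n2)) if q > 0] or [None]
        best = None
        for a in acts:
            trans = base + ([((n1 - 1, n2), mu1)] if a == 'R' else
                            [((n1, n2 - 1), mu2)] if a == 's' else [])
            tot = sum(r for _, r in trans)
            val = n1 + sum(r / LAM * h[s] for s, r in trans) \
                     + (1.0 - tot / LAM) * h[(n1, n2)]
            if best is None or val < best[0]:
                best = (val, a)
        Th[(n1, n2)], act[(n1, n2)] = best
    return Th, act

def average_cost(policy=None, iters=5000):
    """Relative value iteration; the returned g approximates E[n1]."""
    h = {s: 0.0 for s in states}
    for _ in range(iters):
        Th, act = bellman(h, policy)
        g = Th[(0, 0)]
        h = {s: Th[s] - g for s in states}
    return g, act

g_opt, act = average_cost()
g_real, _ = average_cost({(n1, n2): 'R' if n1 else ('s' if n2 else None)
                          for (n1, n2) in states})   # static "real first"
g_susp, _ = average_cost({(n1, n2): 's' if n2 else ('R' if n1 else None)
                          for (n1, n2) in states})   # static "suspected first"
```

Printing `act` over the (n1, n2) grid gives an R/s map in the style of the figure panels, and `g_opt` is never worse than either static priority rule.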
Figure 6 Optimal Priority Assignments for the Objective O1 for Different Values of g (μ1 = 15, μ2 = 8) [panels for g = 1.2, 1.3, 1.4, 1.6, 1.7, 1.9, 3, 4, and 7]
Figure 7 Optimal Priority Assignments in Minimization of "Real" Failures (Objective O2, a = 1, b = 0) [panels for service rates μ2 between 12.5 and 13]

Figure 8 Optimal Priority Assignments When the Weighted Sum of Expected Numbers of Failures aE[n1] + bE[n2] Is Minimized (a = 1, b = 0.5) [panels for service rates μ2 between 8 and 8.1]
6. Simulation

To examine the robustness of our optimal policies to violations of the underlying assumptions, we simulated systems under a wide range of operating conditions. In the experiments, we tested six groups of priority assignments corresponding to different objective functions (see Figure 9, left-hand side). All assignments were obtained by optimizing the corresponding objectives with the system parameters shown at the bottom of the figure (common parameters) and next to each priority assignment (specific to each priority assignment). For each priority assignment, different combinations of squared coefficients of variation (C_x^2 = 0.8, 1, 1.25) for the interfailure and service processes were tested. This yielded 3^5 = 243
Table 1 Example Simulation Output Illustrating Representative Performance

                        P(n1 ≤ N − k)  P(n1 + g·n2 ≤ d)  E[n1]    aE[n1] + bE[n2]
Sc. 1
  Average               0.9736         0.9864            1.4624   1.1462
  Standard deviation    0.0024         0.0016            0.0240   0.0234
  95% int. half-width   0.0005         0.0003            0.0048   0.0047
  Half-width/average    0.0095         0.0003            0.0033   0.0041
  Batch correlation     0.0167         −0.1406           0.0158   −0.1427
Sc. 82
  Average               0.9906         0.9953            1.7494   1.4356
  Standard deviation    0.0014         0.0009            0.0254   0.0244
  95% int. half-width   0.0003         0.0002            0.0051   0.0049
  Half-width/average    0.0003         0.0002            0.0029   0.0034
  Batch correlation     0.1284         −0.0919           0.0364   −0.1382
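The batch-means quantities reported in Table 1 (average, standard deviation, confidence-interval half-width, and lag-1 batch correlation) can be computed as in the following sketch. This is generic code, not the authors'; the synthetic batch means stand in for real simulation output.

```python
import math
import random

def batch_means_report(batch_avgs, confidence_z=1.96):
    """Summary statistics for a sequence of batch means, in the spirit of
    Table 1: average, standard deviation of the batch means, 95%
    confidence-interval half-width, half-width/average, and the lag-1
    (batch-to-batch) autocorrelation."""
    n = len(batch_avgs)
    mean = sum(batch_avgs) / n
    var = sum((x - mean) ** 2 for x in batch_avgs) / (n - 1)
    sd = math.sqrt(var)
    half = confidence_z * sd / math.sqrt(n)
    # Lag-1 autocorrelation between consecutive batch means.
    num = sum((batch_avgs[i] - mean) * (batch_avgs[i + 1] - mean)
              for i in range(n - 1))
    corr = num / ((n - 1) * var) if var > 0 else 0.0
    return {"average": mean, "std": sd, "halfwidth": half,
            "halfwidth/average": half / mean, "batch correlation": corr}

# Example with 100 synthetic batch means (as in the paper's 100 batches
# of 1,000 time units each):
random.seed(1)
batches = [1.45 + random.gauss(0.0, 0.02) for _ in range(100)]
report = batch_means_report(batches)
```

Note that `confidence_z = 1.96` is the normal quantile; with few batches a Student-t quantile would be the safer choice.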
Figure 9 Scenario Settings [18 priority-assignment grids in six groups, by objective: Max P(n1 ≤ N − k), Max P(n1 + g·n2 ≤ d), Min E[n1], and Min aE[n1] + bE[n2]; common parameters λ1 = 1, λ2 = 2, μ1 = 1, a = 1, b = 0.5, k = 6, N = 8, d = 6; each assignment has its own μ2 and g]
combinations for each priority assignment, for a total of 18 · 243 = 4,374 experiments. In each simulation, four performance estimators were measured: P(n1 ≤ N − k), P(n1 + g·n2 ≤ d), E[n1], and aE[n1] + bE[n2], with the parameters a, b, g, and d as indicated in Figure 9. In scenarios where we employed the exponential distribution, we implemented the simulation to exploit its unique properties. For cases where we simulated nonexponential processing and interfailure times, the distributions were generated by matching the first two moments with one of the following distributions: (a) a mixture of Erlang distributions (E_{k,k−1}) for 0 < c_x < 1; (b) a hyperexponential distribution for c_x > 1, where c_x = sqrt(Var[x])/E[x]. Each simulation was started in the state with all devices working and run for 104,000 time units, disregarding the first 4,000 time units to eliminate the effects of initial transient bias (Kelton and Law 2000). Using the
Table 2 Change of Performance Estimators Due to the Change in System Variability [for each of the 18 priority assignments and for C_x^2 = 0.8 and C_x^2 = 1.25, the relative change (average in the upper cell, standard deviation in the lower cell) of the four performance estimators P(n1 ≤ 4), P(n1 + g·n2 ≤ 6), E[n1], and E[n1 + 0.5·n2], under a change in the variability of each of five processes: interfailure time (real failure), interfailure time (suspected failure), interfailure time (real after suspected), repair time (real failure), and repair time (suspected failure)]
batch means approach, the average system objectives were recorded every 1,000 units of time. This yielded 100 batches of 1,000 time units each. The batch size (1,000) and the number of batches were chosen, after analysis of representative scenarios, to mitigate any correlation between batch means and to achieve suitably small 95% confidence-interval half-widths (less than 1% of the average estimate for each performance measure). For these representative scenarios, we also confirmed that the serial batch-mean correlation was small and that the distribution of batch means was approximately normal, to ensure good confidence-interval coverage (Johnson and Jackman 1996). Table 1 shows the detailed results for two representative systems, illustrating the low interbatch correlation and small confidence-interval half-widths.
In Table 2 we present the relative changes in system performance caused by changes in system variability. The table has five column groups, each corresponding to a change of variability in one system process (interfailure or service). For each type of system process, there are four separate columns corresponding to the different performance estimators. The rows of the table contain the relative changes (average in the upper cell, standard deviation in the lower cell) in the performance estimators caused by changing the variability of each system process from C_x^2 = 1 to C_x^2 = 0.8 or from C_x^2 = 1 to C_x^2 = 1.25. The relative changes are presented separately for each priority assignment; negative values indicate that the system performance improved. The results presented in this table allow us to draw the following conclusions:
1. High variability of the interfailure and service times (rows with C_x^2 = 1.25) has little influence on the performance estimators. This means that the presented model can readily be applied to systems with higher uncertainty (which is closer to real-life cases).
2. Decreasing the variability of the service times (the last two column groups, with C_x^2 = 0.8) improves system performance.
3. Systems with lower variability of the interfailure and service times (rows with C_x^2 = 0.8) are more sensitive, and thus a different model of the arrival and service processes (e.g., an Erlang-2 distribution) should be applied. This change would increase the system state space, but the optimization method could still be applied.
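The two-moment distribution fitting used in the simulation (a mixed Erlang distribution for c_x < 1, a hyperexponential for c_x > 1) can be sketched as follows. These are the standard mixed-Erlang and balanced-means H2 fits from the queueing literature, not the authors' code; the function names are ours.

```python
import math

def fit_two_moments(mean, scv):
    """Fit a nonnegative distribution to a given mean and squared
    coefficient of variation scv = Var[X]/E[X]^2, in the spirit of the
    simulation in Section 6: a mixture of Erlang(k-1) and Erlang(k)
    for scv <= 1, a balanced-means hyperexponential (H2) for scv > 1."""
    if scv <= 1.0:
        k = math.ceil(1.0 / scv)              # smallest k with 1/k <= scv
        # With probability p use k-1 exponential phases, else k phases.
        p = (k * scv - math.sqrt(k * (1 + scv) - k * k * scv)) / (1 + scv)
        mu = (k - p) / mean                   # common phase rate
        return ("erlang-mixture", k, p, mu)
    # H2 with balanced means: branch i chosen with prob. p_i, rate r_i.
    p1 = 0.5 * (1.0 + math.sqrt((scv - 1.0) / (scv + 1.0)))
    return ("h2", p1, 2.0 * p1 / mean, 2.0 * (1.0 - p1) / mean)

def moments(fit):
    """Mean and scv implied by a fitted distribution (for checking)."""
    if fit[0] == "erlang-mixture":
        _, k, p, mu = fit
        m1 = (k - p) / mu
        m2 = (p * (k - 1) * k + (1 - p) * k * (k + 1)) / mu ** 2
    else:
        _, p1, r1, r2 = fit
        m1 = p1 / r1 + (1 - p1) / r2
        m2 = 2.0 * (p1 / r1 ** 2 + (1 - p1) / r2 ** 2)
    return m1, m2 / m1 ** 2 - 1.0   # (mean, scv)
```

For C_x^2 = 0.8 this yields a two-phase Erlang mixture, and for C_x^2 = 1.25 a hyperexponential; in both cases the fitted first two moments match the targets exactly.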
7. Conclusion
Security failures of industrial systems create new maintenance challenges and open new questions on how best to allocate limited maintenance resources. This paper presents an optimization model for repair priorities in a maintenance system with two types of failures (suspected and real security failures) and uncertain interfailure and repair times. We showed that, given the state of the system, the optimal repair policy follows a unique threshold indicator (either work on the real
failures or the suspected ones) and that the repair priorities can be found using a compact LP formulation. In addition, we examined the behavior of the optimal policy under different failure rates and objectives and presented a number of rules that can provide priority assignments under certain conditions without using the LP model. Our models help prioritize the response to possible security failures of SCADA systems. We illustrated that it is often optimal to focus on suspected device failures to avoid catastrophic system failure. Finally, we examined the robustness of our model to violations of the underlying assumptions and found that the model remains useful in cases with higher variability in the interfailure and repair times. We also demonstrated that systems with lower variability in interfailure and repair times are more sensitive and require a different modeling approach for the stochastic processes and the corresponding optimization model. Further examination of such systems would present an interesting direction for future research.
Supplemental Material
Supplemental material to this paper is available at http://dx.doi.org/10.1287/ijoc.2014.0613.
Acknowledgments
This research was partially supported by the National Science Foundation [Award CNS-1329686].
References
Antunes G (2005) Hacking the grid. Red Herring Magazine (May 18), https://www.mail-archive.com/[email protected]/msg02231.html.
Barlow RE, Proschan F (1996) Mathematical Theory of Reliability, Vol. 17 (SIAM, Philadelphia).
Bertsimas D, Niño-Mora J (1996) Conservation laws, extended polymatroids and multiarmed bandit problems; a polyhedral approach to indexable systems. Math. Oper. Res. 21:257–306.
Bertsimas D, Niño-Mora J (2000) Restless bandits, linear programming relaxations, and a primal-dual index heuristic. Oper. Res. 48:80–90.
Bertsimas D, Tsitsiklis JN (1997) Introduction to Linear Optimization (Athena Scientific, Belmont, MA).
Bradley T (2011) Water utility hacked. Are critical systems at risk? Accessed September 12, 2014, http://www.pcworld.com/article/244359/water_utility_hacked_are_our_scada_systems_at_risk_.html.
Brenner B (2006) FBI says attacks succeeding despite security investments. Accessed September 12, 2014, http://searchsecurity.techtarget.com/news/1157706/FBI-says-attacks-succeeding-despite-security-investments.
Buzacott JA, Shanthikumar JG (1993) Stochastic Models of Manufacturing Systems, Vol. 4 (Prentice Hall, Englewood Cliffs, NJ).
Cobham A (1954) Priority assignment in waiting line problems. Oper. Res. 2:70–76.
Davis RH (1966) Waiting-time distribution of a multi-server, priority queuing system. Oper. Res. 14:133–136.
De Smidt-Destombes KS, Van der Heijden MC, Van Harten A (2006) On the interaction between maintenance, spare part inventories and repair capacity for a k-out-of-n system with wear-out. Eur. J. Oper. Res. 174:182–200.
Eckles JE (1968) Optimum maintenance with incomplete information. Oper. Res. 16:1058–1067.
Gellman B, Nakashima E (2013) U.S. spy agencies mounted 231 offensive cyber-operations in 2011. Accessed September 12, 2014, http://wapo.st/17sEENT.
Gittins J, Glazebrook K, Weber R (2011) Multi-armed Bandit Allocation Indices, 2nd ed. (Wiley, Chichester, UK).
Glazebrook KD, Mitchell HM (2002) An index policy for a stochastic scheduling model with improving/deteriorating jobs. Naval Res. Logist. 49:706–721.
Glazebrook KD, Mitchell HM, Ansell PS (2005) Index policies for the maintenance of a collection of machines by a set of repairmen. Eur. J. Oper. Res. 165:267–284.
Gold A (2013) Researchers find security holes in Philips health info management system. Accessed September 12, 2014, http://www.fiercehealthit.com/story/researchers-find-security-holes-philips-health-info-management-system/2013-01-22.
Greg H (2013) Developing a successful remote patient monitoring program. Becker's Hospital Review. Accessed September 12, 2014, http://www.beckershospitalreview.com/healthcare-information-technology/developing-a-successful-remote-patient-monitoring-program.html.
Haque L, Armstrong MJ (2007) A survey of the machine interference problem. Eur. J. Oper. Res. 179:469–482.
Harchol-Balter M, Osogami T, Scheller-Wolf A, Wierman A (2005) Multi-server queueing systems with multiple priority classes. Queueing Systems 51:331–360.
Higgins KJ (2013) Experiment simulated attacks on natural gas plant. Accessed September 12, 2014, http://www.darkreading.com/perimeter/experiment-simulated-attacks-on-natural/240157897.
Hochmuth P (2005) Risks rise as factory nets go wireless. Accessed September 12, 2014, http://www.networkworld.com/article/2319251/network-security/risks-rise-as-factory-nets-go-wireless.html.
Jaiswal NK (1968) Priority Queues, Vol. 50 (Academic Press, New York).
Johnson ME, Jackman J (1996) Interval coverage in multiclass queues using batch mean estimates. Management Sci. 42:1744–1752.
Kelton WD, Law AM (2000) Simulation Modeling and Analysis (McGraw Hill, Boston).
Klein M (1962) Inspection-maintenance-replacement schedules under Markovian deterioration. Management Sci. 9:25–32.
Latouche G, Ramaswami V (1999) Introduction to Matrix Analytic Methods in Stochastic Modeling, Vol. 5 (SIAM, Philadelphia).
McCall JJ (1963) Operating characteristics of opportunistic replacement and inspection policies. Management Sci. 10:85–97.
Neuts MF (1981) Matrix-Geometric Solutions in Stochastic Models: An Algorithmic Approach (Johns Hopkins University Press, Baltimore).
Niño-Mora J (2008) A faster index algorithm and a computational study for bandits with switching costs. INFORMS J. Comput. 20:255–269.
Niño-Mora J (2011) Computing a classic index for finite-horizon bandits. INFORMS J. Comput. 23:254–267.
Phimister JR, Oktem U, Kleindorfer PR, Kunreuther H (2003) Near-miss incident management in the chemical process industry. Risk Anal. 23:445–459.
Sleptchenko A, Van Harten A, Van der Heijden M (2005) An exact solution for the state probabilities of the multi-class, multi-server queue with preemptive priorities. Queueing Systems 50:81–107.
Tiemessen HGH, van Houtum GJ (2013) Reducing costs of repairable inventory supply systems via dynamic scheduling. Internat. J. Production Econom. 143:478–488.
USDHS (2006) Cyber storm exercise report. Accessed September 12, 2014, https://www.hsdl.org/?view&did=466697.
Wagner D (1996) Analysis of a finite capacity multiserver model with nonpreemptive priorities and nonrenewal input. Chakravarthy SR, Alfa AS, eds., Matrix-Analytic Methods in Stochastic Models, Lecture Notes in Pure and Applied Mathematics, Vol. 183 (Marcel Dekker, New York), 67–86.
Wagner D (1998) A finite capacity multi-server multi-queueing priority model with nonrenewal input. Ann. Oper. Res. 79:63–82.
Wang H (2002) A survey of maintenance policies of deteriorating systems. Eur. J. Oper. Res. 139:469–489.
Whittle P (1988) Restless bandits: Activity allocation in a changing world. J. Appl. Probab. 25:287–298.