INFORMS Journal on Computing


Vol. 27, No. 1, Winter 2015, pp. 103–117 | ISSN 1091-9856 (print) | ISSN 1526-5528 (online)

http://dx.doi.org/10.1287/ijoc.2014.0613 © 2015 INFORMS

Maintaining Secure and Reliable Distributed Control Systems

Andrei Sleptchenko

Department of Mechanical and Industrial Engineering, College of Engineering, Qatar University, Doha 2713, Qatar, [email protected]

M. Eric Johnson

Owen Graduate School of Management, Vanderbilt University, Nashville, Tennessee 37203, [email protected]

We consider the role of security in the maintenance of an automated system controlled by a network of sensors and simple computing devices. Such systems are widely used in transportation, utilities, healthcare, and manufacturing. Devices in the network are subject to traditional failures that can lead to a larger system failure if not repaired. However, the devices are also subject to security breaches that can likewise lead to catastrophic system failure. These security breaches could result from either cyber attacks (such as viruses, hackers, or terrorists) or physical tampering. We formulate a stochastic model of the system to examine the repair policies for both real and suspected failures. We develop a linear programming-based model for optimizing repair priorities. We show that, given the state of the system, the optimal repair policy follows a unique threshold indicator (either work on the real failures or the suspected ones). We examine the behavior of the optimal policy under different failure rates and threat levels. Finally, we examine the robustness of our model to violations of the underlying assumptions and find that the model remains useful over a range of operating assumptions.

Keywords: reliability, maintenance-repairs; queues, priority; queues, optimization; probability, stochastic model applications; probability, Markov processes

History: Accepted by Winfried Grassmann, Area Editor for Computational Probability and Analysis; received October 2013; revised February 2014; accepted May 2014. Published online in Articles in Advance October 28, 2014.

1. Introduction

Distributed control systems form the information and control backbone of most large automated systems like factories, hospitals, utilities, and mass transit systems. Sometimes called supervisory control and data acquisition (SCADA) systems or process control systems (PCS), these networks of sensors, simple controlling devices, routers, and computers were once isolated networks uniquely designed for their intended purpose. With the rapid rise of enterprise software, many of these control systems have been integrated into wider corporate networks and further exposed to the Internet. Bringing such systems onto larger networks has many advantages, including ease of data movement, remote control, and better integration of related business processes. For example, nurse-patient communication systems allow continuous patient monitoring and improved care. Likewise, remote patient monitoring in home settings allows patients to be treated at home, avoiding costly hospital stays (Greg 2013). In process industries, SCADA systems facilitate detailed metrics on the utilization of capital equipment that can be rolled up and presented in broader management dashboards. In discrete manufacturing environments, information tracked on the progress of a customer's order on the shop floor can be used to advise customers on shipping dates and tracked in a customer relationship management system. Moreover, wireless technologies have made networking factory devices even easier and less expensive, further accelerating this network convergence. For example, General Motors (GM) used to manage many separate control networks but has integrated those onto an Ethernet backbone using wireless technology (Hochmuth 2005). However, opening these control systems to wider access also brings new security risks. For example, hospitals are increasingly reliant on sophisticated networks of wireless sensors and mobile devices to monitor patients. Such patient monitoring systems have been shown to have significant security weaknesses (Gold 2013) that could lead to failure resulting in patient harm or death. Since many SCADA and PCS applications were designed prior to the rise of Internet security fears, they were not engineered to incorporate sophisticated security provisions. In fact, many were designed without any security or implemented with security features effectively disabled (for example, passwords never changed from their default settings). Moreover,


these isolated control systems in many cases were not updated or patched against known vulnerabilities. When integrated into wider networks, these control systems can become susceptible to everyday viruses, worms, and hackers. Devices that become infected could fail or continue operating in a less reliable state. Moreover, they could harbor malware that lies dormant until some later time when activation leads to immediate failure. A more insidious risk is that devices that appear to operate acceptably may in fact have been compromised: for example, a sensor may continue to report normal operation of the machine it is monitoring when in fact the machine is running out of control. Sometimes, remote diagnoses can detect abnormalities without showing conclusive failure. For example, an Australian man attacked the Maroochy Shire's wastewater SCADA system that controlled the flow of waste through the complex system. Using a wireless device, he was able to open release valves at a sewage treatment plant, dumping foul-smelling sludge into local parks and rivers. Before being caught, he had successfully pirated control 45 times and dumped 264,000 gallons of sewage, according to the Government Accounting Office (Antunes 2005). More recently, in 2011 an Illinois water utility reported that its pumping systems had been hacked. Attackers reportedly enabled and disabled a pump repeatedly, eventually damaging it (Bradley 2011). Other recent examples of similar security failures in SCADA systems have occurred in the production of electricity and in the distribution grid itself (Antunes 2005). In 2001, hackers exploited a known weakness in Sun's Solaris server systems and successfully installed malware that would allow them to control 75% of California's power grid. Thankfully, in the 17 days before discovery, they caused little trouble. In 2003, the control network of the Davis-Besse Nuclear Power Station in Ohio crashed for five hours as a result of the “Slammer” worm. Although the failure of the nuclear power plant's control network generated much concern, no harmful releases occurred during the episode. The possibility of such attacks made world headlines when the Stuxnet computer worm, believed to have been developed by the United States and Israel, successfully destroyed Iranian nuclear centrifuges in 2009–2010. Recent reports show the escalation of such military use of SCADA attacks (Gellman and Nakashima 2013). Despite the fact that many firms have increased their investment in security technologies, the FBI Computer Crime Surveys show that enterprises continue to suffer myriad attacks, with many incidents going unreported (Brenner 2006). Security breaches create new sources of maintenance burdens on repair resources (people and diagnostic computing devices) and open new questions on how best to allocate limited maintenance resources.


Cyber attacks often result in physical equipment failures that look like normal failures; for example, causing a device to repeatedly cycle through different speeds or states to cause fatigue and failure. Even if small security failures do not immediately translate into large-scale system failure, learning from those “near-misses” has been shown to be an important element in avoiding larger system failures (Phimister et al. 2003). Of course, traditional maintenance activities have long been an important area for ensuring the productivity and continuity of control systems. GM alone spends more than $1 billion on the maintenance of its factory systems (Hochmuth 2005). Maintenance management in the private and public sectors has motivated research on a wide variety of related problems. However, our focus is to develop models that help prioritize a response to possible security failures of SCADA systems. A Department of Homeland Security report (USDHS 2006) identified such modeling tools and processes as important elements to enhance the speed of response to a cascading failure. Recent industrial research has focused on continuous monitoring of SCADA systems to profile predictable behavior and identify potential failures (Higgins 2013). Our work resides at the intersection of several substantial streams of work. Operations researchers have been interested in repair problems for more than 50 years (cf. Barlow and Proschan 1996). Problems more closely related to ours are found in the literature on maintaining deteriorating systems (Klein 1962, McCall 1963), where the state of the system is not directly observable or is costly to observe (Eckles 1968). Much of this work focuses on single-unit systems; see Wang (2002) for a detailed survey. Another type of problem similar to ours is the machine interference problem (MIP). Models of this type consider a finite set of (different) operating machines maintained by a group of (specialized) repairmen. To the best of our knowledge, none of the models of this type takes the deterioration process into account (see Haque and Armstrong 2007 for a detailed survey of existing models). For our multi-item system, structural dependence prevents simple reductions to the single-item case (as in most of the work on deteriorating systems). On the other hand, in our system (as will be seen in the discussion of our objective functions) failure of a single component does not necessarily imply system failure. In fact, several devices may fail and the overall system may continue without incident. However, it is often not clear to engineers how many devices can fail without causing a major incident. For example, a failure of a pipe-valve controller in an oil refinery typically results in the valve moving to a fail-safe mode (open or closed, depending on the situation), and other redundancies are often in place. Of course, failures must be quickly repaired, but the


system may be safely operated for some time after such a failure. However, if several controllers have failed, the overall system will reach a failure state. This leads us to focus on objectives like the probability of exceeding k-out-of-N failed devices. De Smidt-Destombes et al. (2006) consider the inventory policy of a k-out-of-N system with wear-out. In their work, devices can be in one of three states: working, deteriorated, or failed. Like previous research on such deteriorating systems, our focus is on providing reliable operation of the total system by minimizing the expected number of failed components or the probability of k failed components. However, unlike the previous research on deteriorating systems, we consider devices that are also subject to suspected security failures, which could be real failures, later result in real failures, or be false alarms with no real danger. Understanding how to manage maintenance resources in this new world of security concerns is an important new area of study. In our system, devices can (and must) be restored, typically through software patches or physical repairs. Thus, inventory, as in De Smidt-Destombes et al. (2006), is not the key issue in this paper. Rather, our focus is on the optimal use of the repair resources themselves, as their capacity is the key limitation. From a modeling point of view, our problem is also similar to some of the work on restless bandit problems, which have been widely studied in recent years (cf. Gittins et al. 2011; Bertsimas and Niño-Mora 1996, 2000; Glazebrook and Mitchell 2002; Glazebrook et al. 2005; Niño-Mora 2008, 2011). The most similar, from our point of view, are the problems considered in Glazebrook and Mitchell (2002) and Glazebrook et al. (2005). In those papers, the authors consider a repair model for a group of continuously improving/deteriorating machines, and the question is how to schedule repairmen such that a certain cost function is minimized. They present a scheduling heuristic based on Whittle's indices (Whittle 1988) and develop theoretical bounds on heuristic performance. Another similar model can be found in Tiemessen and van Houtum (2013), where the authors analyze different policies for dynamic repair priorities using a Markov decision process (MDP) formulation and simulations. Here, we treat the system as a two-class priority queue where the server chooses jobs from different queues such that a certain objective is optimized. In this respect, our research is also related to the large stream of work on multiclass queues with service priorities (cf. Cobham 1954, Davis 1966, Jaiswal 1968). Recent developments in that area are focused, however, on measurements of system performance rather than on optimization (cf. Wagner 1996, 1998; Sleptchenko et al. 2005; Harchol-Balter et al. 2005), or require work conservation laws that would allow optimization


of processing priorities using a polyhedral approach (cf. Bertsimas and Niño-Mora 1996). In the model considered in this paper, conservation laws do not hold, since jobs can change their type and service time (a real failure following a suspected one). In our multiclass, closed system the optimal priority policy shifts dynamically with the state of the system. We develop a stochastic model for a maintenance system where the server (or repairman) is faced with two types of failures: “real” and “suspected.” Each failure type is resolved by the server at a different rate. In addition, devices in a suspected failure state may transition to a “real” failure, since the devices with suspected failures are still operating. Given an appropriate (linear) objective function related to the overall system integrity, we develop the optimal repair policy. We present an exact linear programming (LP)-based optimization technique for determining a time-independent, state-dependent scheduling policy. Such an optimization technique is tractable because the closed system results in a modest state space (two-dimensional and finite). Our approach using standard LP techniques allows us to optimize discrete priority rules. A similar property was shown for systems where conservation laws hold (cf. Bertsimas and Niño-Mora 1996). However, for our problem the conservation laws do not hold; therefore, optimal processing priorities are state-dependent and require an approach based on MDPs. Using techniques from queueing theory, we show how to reduce the state and action space. For example, direct use of MDP theory would require working on a 2 × 3^N-dimensional problem (N machines with three states each and two actions in each state), whereas using our approach we can formulate the problem as an N(N − 1)/2-dimensional optimization problem. We begin by formulating the model (§2). We show that a relaxed formulation, where the repair resource chooses the next device to be repaired with some probability, results in a deterministic optimal policy (§3). We derive generalized priority assignment rules for some partial cases (§4) and examine the behavior of the optimal policy in other generic settings (§5). Finally, we examine the robustness of our model to violations of the underlying assumptions (§6).
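For example (our arithmetic, using the counts above): with N = 10 devices,

$$2 \times 3^{10} = 118{,}098 \quad\text{dimensions}\qquad \text{versus} \qquad \frac{N(N-1)}{2} = \frac{10 \cdot 9}{2} = 45 \ \text{priority variables}.$$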

2. Model Formulation

2.1. Optimization Problem

We consider a system with N operating devices that are subject to two failure types: “real” and “suspected.” The “real” failures occur at rate λ1 for each device, and we refer to these as “real” because they are instantly identified with certainty. The devices are also subject to “suspected” failures at rate λ2 for each device. Suspected failures may be real failures (for example, with some probability g ≥ 0) or may lead to real failures.


Unfortunately, the true condition of devices suffering from suspected failures cannot be known with certainty unless they transition to a real failure mode (at rate λ12) or are inspected by the repair server. We assume that the interfailure times are exponentially distributed. Thus, the repair system has two flows of arrivals that occur according to nonstationary Poisson processes, with rates that depend on the number of working devices. That is, given that n1 is the number of devices with “real” failures and n2 is the number with “suspected” failures, the system's arrival rates are:

(N − (n1 + n2))λ1: arrival rate of “real” failures;
(N − (n1 + n2))λ2: arrival rate of “suspected” failures.

In addition, the devices that are already in a “suspected” failure state can have a real failure. Thus we have an additional stream of “real” failures with system rate:

n2λ12: rate of “real” failures occurring in devices with a “suspected” failure.

We consider a system with a single repair server that processes failures of both types at rates µ1 and µ2 for real and suspected failures, respectively (service times are exponentially distributed). Since the considered system is Markovian (i.e., interarrival and service times are memoryless), it is logical to assume that the optimal dynamic processing priorities do not depend on the current state of the interarrival and service processes or on the processing history; that is, they depend only on the current numbers of failures in the system. We solve for the optimal state-dependent processing policy (dynamic processing priorities). Namely, the server chooses to process “real” failures with parameter α_{n1,n2} and, correspondingly, “suspected” failures with parameter 1 − α_{n1,n2}. We relax the optimization problem by allowing the parameters α_{n1,n2} to be continuous (α_{n1,n2} ∈ [0, 1]). In this relaxation, α_{n1,n2} can be interpreted as the state-dependent probability that the server chooses to process “real” or “suspected” failures. Later we will show that, in cases with linear objective functions, the optimal parameters α_{n1,n2} obtained from the relaxed problem are always integer (i.e., α_{n1,n2} ∈ {0, 1}). Finally, we assume that the server follows a preemptive priority discipline, which means that the server will postpone processing of the device currently in service if the current scheduling rule (parameter α_{n1,n2}) dictates that the server process a device with the other type of failure. That is, preemption might occur after each failure event. However, our approach can be directly applied to systems with a nonpreemptive discipline at the expense of a larger state space.

In our analysis, we consider two different objective functions. The choice of objective function to apply in any real situation depends on the nature of the real and suspected failures and their relationship to the overall system integrity.


Objective 1 (O1). Minimize the probability of the event that n1 + gn2 > d.

This objective function (O1) is most appropriate in cases where exceeding a certain level of failures (real and suspected) may lead to catastrophic system failure, so we wish to minimize the probability of such an outcome. Notice that a special case of this objective is where we measure only real failures (g = 0).

Objective 2 (O2). Minimize the weighted sum of expected “real” and “suspected” failures, aE[n1] + bE[n2].

This objective is more appropriate in cases where the overall system performance degrades linearly in the number of real and suspected failures, where a and b represent the relative impact of a real or suspected failure. If g ≥ 0 is interpreted as the probability that a suspected failure is actually a real one (independent of the system state), then setting a = 1, b = g gives the expected number of failures. If the objective is to minimize the expected number of real failures, set a = 1 and b = 0 in Objective O2.

2.2. Solving for a Given Priority Assignment

The multiclass closed system with preemptive priority behaves like a Markov process with a finite state space. The states of the system can be described by the 2-tuple (n1, n2), which represents the numbers of real and suspected failures, respectively, with transition rates as illustrated in Figure 1. Given these transition rates, the equilibrium equations for the system-state probabilities can be obtained by equating the flow out of state (n1, n2) and the flow into state (n1, n2):

$$
\begin{aligned}
\bigl[(N-n_1-n_2)\lambda_1 &+ (N-n_1-n_2)\lambda_2 + n_2\lambda_{12} + \alpha_{n_1,n_2}\mu_1 + (1-\alpha_{n_1,n_2})\mu_2\bigr]\, p_{n_1,n_2}\\
={}& (N-n_1-n_2+1)\lambda_1\, p_{n_1-1,n_2} + (N-n_1-n_2+1)\lambda_2\, p_{n_1,n_2-1} + (n_2+1)\lambda_{12}\, p_{n_1-1,n_2+1}\\
&+ \alpha_{n_1+1,n_2}\mu_1\, p_{n_1+1,n_2} + (1-\alpha_{n_1,n_2+1})\mu_2\, p_{n_1,n_2+1}, \qquad 1 < n_1+n_2 < N,\\[6pt]
\bigl[(N-n_2)\lambda_1 &+ (N-n_2)\lambda_2 + n_2\lambda_{12} + \mu_2\bigr]\, p_{0,n_2}\\
={}& (N-n_2+1)\lambda_2\, p_{0,n_2-1} + \alpha_{1,n_2}\mu_1\, p_{1,n_2} + \mu_2\, p_{0,n_2+1}, \qquad n_1 = 0,\ 1 < n_2 < N,\\[6pt]
\bigl[(N-n_1)\lambda_1 &+ (N-n_1)\lambda_2 + \mu_1\bigr]\, p_{n_1,0}\\
={}& (N-n_1+1)\lambda_1\, p_{n_1-1,0} + \lambda_{12}\, p_{n_1-1,1} + \mu_1\, p_{n_1+1,0} + (1-\alpha_{n_1,1})\mu_2\, p_{n_1,1}, \qquad 1 < n_1 < N,\ n_2 = 0,\\[6pt]
\bigl[n_2\lambda_{12} &+ \alpha_{n_1,n_2}\mu_1 + (1-\alpha_{n_1,n_2})\mu_2\bigr]\, p_{n_1,n_2}\\
={}& \lambda_1\, p_{n_1-1,n_2} + \lambda_2\, p_{n_1,n_2-1} + (n_2+1)\lambda_{12}\, p_{n_1-1,n_2+1}, \qquad n_1+n_2 = N,\\[6pt]
(N\lambda_1 + N\lambda_2)\, p_{0,0} ={}& \mu_1\, p_{1,0} + \mu_2\, p_{0,1}, \qquad n_1 = n_2 = 0.
\end{aligned}
$$

[Figure 1: Flow diagrams for different system states (panels A and B). For a state (n1, n2), new “real” failures arrive at rate (N − n1 − n2)λ1, new “suspected” failures at rate (N − n1 − n2)λ2, suspected devices fail for real at rate n2λ12, and the server completes repairs at rates α_{n1,n2}µ1 (real) and (1 − α_{n1,n2})µ2 (suspected). The axes are n1 (“real” failures) and n2 (“suspected” failures), with correspondingly reduced rates along the boundaries.]

Note that α_{n1,n2} = 1 when n2 = 0, and α_{n1,n2} = 0 when n1 = 0. Although the system-state probabilities p_{n1,n2} carry double indices, we can assume that they are elements of a row vector p, ordered in, for example, a lexicographical manner; i.e., p = (p_{0,0}, p_{0,1}, p_{1,0}, p_{2,0}, p_{1,1}, p_{0,2}, ..., p_{0,N}, ..., p_{N,0}). Then we can write this system of equilibrium equations as a linear system of size (N + 2)(N + 1)/2 × (N + 2)(N + 1)/2 plus the normalization equation:

$$pQ = 0, \qquad pe^{\mathsf{T}} = 1.$$

For certain orderings of the probabilities p_{n1,n2} in the vector p, the matrix Q has a tridiagonal block form (see the appendix for more details), and it is easy to see that we have a finite, nonhomogeneous, two-dimensional quasi-birth-death (QBD) process. We can solve this system exactly for all the state probabilities, given specific system parameters (α_{n1,n2}, λ1, λ2, etc.), applying known techniques (cf. Latouche and Ramaswami 1999, Neuts 1981). However, our focus here is the development of an optimization procedure to determine the parameters α_{n1,n2}. Note that our optimization procedure can easily be adapted to finding the steady-state probabilities for given αs by introducing additional constraints.
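As a concrete illustration of solving for a given priority assignment, the following minimal sketch (ours, not the authors' code) builds the generator Q for a given priority vector α and solves pQ = 0, pe^T = 1 by dense linear algebra; all names are ours.

```python
import numpy as np

def steady_state(N, lam1, lam2, lam12, mu1, mu2, alpha):
    """Solve p Q = 0, p e^T = 1 for the closed two-class repair system.
    alpha[(n1, n2)] = 1 repairs a "real" failure in state (n1, n2),
    0 repairs a "suspected" one (only needed for n1 >= 1, n2 >= 1)."""
    states = [(n1, n2) for n1 in range(N + 1) for n2 in range(N + 1 - n1)]
    idx = {s: k for k, s in enumerate(states)}
    Q = np.zeros((len(states), len(states)))

    def add(src, dst, rate):              # transition rate from src to dst
        if rate > 0.0:
            Q[idx[src], idx[dst]] += rate
            Q[idx[src], idx[src]] -= rate

    for (n1, n2) in states:
        w = N - n1 - n2                   # number of working devices
        add((n1, n2), (n1 + 1, n2), w * lam1)          # new "real" failure
        add((n1, n2), (n1, n2 + 1), w * lam2)          # new "suspected" failure
        if n2 > 0:                                     # suspected turns real
            add((n1, n2), (n1 + 1, n2 - 1), n2 * lam12)
        a = 1.0 if n2 == 0 else (0.0 if n1 == 0 else alpha[(n1, n2)])
        if n1 > 0:
            add((n1, n2), (n1 - 1, n2), a * mu1)           # repair real
        if n2 > 0:
            add((n1, n2), (n1, n2 - 1), (1.0 - a) * mu2)   # repair suspected

    A = Q.T.copy()                        # p Q = 0  <=>  Q^T p^T = 0
    A[-1, :] = 1.0                        # replace one equation by p e^T = 1
    b = np.zeros(len(states)); b[-1] = 1.0
    p = np.linalg.solve(A, b)
    return {s: p[idx[s]] for s in states}

# Example: always repair "real" failures first, then evaluate Objective O2.
# pr = steady_state(10, 1, 1, 3, 5, 7,
#                   {(i, j): 1 for i in range(1, 11) for j in range(1, 11 - i)})
# obj = sum((n1 + 0.5 * n2) * q for (n1, n2), q in pr.items())
```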

3. Optimization of the Priority Assignment (for a Linear Objective Function)

Our goal is to find probabilities p_{n1,n2} and parameters α_{n1,n2} such that a linear objective function is minimized and the equilibrium equations are satisfied:

$$
\begin{aligned}
& cp^{\mathsf{T}} \to \min\\
\text{s.t.}\quad & pQ(\alpha) = 0, \qquad pe^{\mathsf{T}} = 1,\\
& p \ge 0, \qquad 0 \le \alpha \le 1,
\end{aligned}
\tag{1}
$$

where the vector α is defined in the same way as the vector p. Although the equation pQ(α) = 0 is nonlinear, we can use the structure of the generator Q(α) and formulate a linear optimization problem. For this we introduce the variables y_{n1,n2} = p_{n1,n2} α_{n1,n2}, which allow us to rewrite the constraints of (1) as:

$$
\begin{aligned}
& cp^{\mathsf{T}} \to \min\\
\text{s.t.}\quad & pL + yM = 0, \qquad pe^{\mathsf{T}} = 1,\\
& p - y \ge 0, \qquad y \ge 0,
\end{aligned}
\tag{2}
$$

where the matrix L contains all the terms of pQ(α) that do not involve the probabilities α_{n1,n2}, and the matrix M contains all the terms corresponding to the probabilities α_{n1,n2} (see the appendix for more details). Thus we obtain a linear program with (N + 2)(N + 1)/2 + N(N − 1)/2 variables (taking into account that α_{n1,n2} is fixed when n1 or n2 equals 0), with (N + 2)(N + 1)/2 + N(N − 1)/2 + 1 constraints and nonnegative variables. Once the optimal vectors p and y are known, the optimal priorities α can be uniquely recovered (shown


in Lemma 2). Moreover, we also show that the optimal αs are always integer.

Lemma 1. For any feasible vector of priority parameters 0 ≤ α ≤ 1, the finite-state Markov chain describing our maintenance system has a unique stationary distribution with nonzero probability for each state.

Proof. This lemma follows immediately from the facts that the system is finite and irreducible. The first is obvious. The second means that if the system at a certain moment is in a specific state (any state), then there is a positive probability that it will move to any other state in finite time. This follows from the fact that any state can be reached from the state with zero machines in repair, and the zero state can be reached from any other state because the repair server is always working unless there are no failures in the system. ∎

Lemma 2. There is a one-to-one correspondence between the pairs (p, α) and (p, y) that correspond to the optimal solutions of the initial optimization problem (1) and of the LP problem (2).

Proof. As shown in Lemma 1, for each priority vector α there is a unique vector p. The uniqueness of the vector y for each vector α then follows from its definition (y_{n1,n2} = p_{n1,n2} α_{n1,n2}). Assume now that the vector y is known. Using standard results from the theory of finite-state Markov chains (cf. Latouche and Ramaswami 1999) and linear algebra, it is possible to show that the rank of the system of linear equations from problem (2),

$$pL = -yM, \qquad pe^{\mathsf{T}} = 1,$$

is equal to the dimension of the state space (dim(p)). This means that for each vector y we have a unique vector p. If both of these vectors satisfy the inequalities of problem (2),

$$p - y \ge 0, \qquad y \ge 0,$$

then the pair (p, y) gives us a unique vector α satisfying the constraints of problem (1). ∎

Theorem 1. The optimal parameters α_{n1,n2} always take integer values; i.e., α*_{n1,n2} ∈ {0, 1} for all n1, n2.

Proof. It is known from optimization theory (cf. Bertsimas and Tsitsiklis 1997) that the optimal solution of an LP problem normally lies at one of the vertices of the feasible polytope. In other words, a necessary condition for a feasible point to be an optimal solution is that the number of active constraints at this point is greater than or equal to the number of variables (the problem dimension). As in Lemma 2, it is possible to show that the rank of the system of linear equations

$$pL + yM = 0, \qquad pe^{\mathsf{T}} = 1$$

is equal to the dimension of the state space (dim(p)). This means that in the optimal solution the number of active inequalities from the group

$$p - y \ge 0, \qquad y \ge 0$$

must be greater than or equal to the number of y-variables (dim(y)). Assume now that there is an optimal solution of system (1) with at least one variable 0 < α_{n1′,n2′} < 1 (for certain n1′, n2′). Then the corresponding inequalities (y_{n1′,n2′} < p_{n1′,n2′} and y_{n1′,n2′} > 0) are nonactive. This means that, to keep the number of active inequalities from the set p − y ≥ 0, y ≥ 0 greater than or equal to dim(y), there must exist at least one pair (n1″, n2″) for which both inequalities are active (y_{n1″,n2″} = p_{n1″,n2″} and y_{n1″,n2″} = 0). The latter implies that our Markov problem has states with zero probability, which contradicts Lemma 1. That is, the optimal priority parameters obtained from the optimal solution of the LP problem (2) always take integer values. ∎
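For intuition, problem (2) can be set up directly with an off-the-shelf LP solver. The sketch below is our illustration, not the authors' implementation, and all names are hypothetical: the variables are the state probabilities p and the products y = pα for interior states (boundary states have α fixed), and the α recovered from the solution comes out 0/1, as Theorem 1 predicts.

```python
import numpy as np
from scipy.optimize import linprog

def optimal_priorities(N, lam1, lam2, lam12, mu1, mu2, cost):
    """Solve LP (2): minimize sum(cost[s] * p[s]) over p and y = p * alpha."""
    states = [(n1, n2) for n1 in range(N + 1) for n2 in range(N + 1 - n1)]
    sidx = {s: k for k, s in enumerate(states)}
    interior = [s for s in states if s[0] >= 1 and s[1] >= 1]
    yidx = {s: len(states) + k for k, s in enumerate(interior)}
    nvar = len(states) + len(interior)

    A_eq = np.zeros((len(states) + 1, nvar))
    def flow(src, dst, col, rate):
        # rate * x[col] leaves the balance row of `src`, enters that of `dst`
        A_eq[sidx[src], col] -= rate
        A_eq[sidx[dst], col] += rate

    for s in states:
        n1, n2 = s
        w = N - n1 - n2
        if w > 0:
            flow(s, (n1 + 1, n2), sidx[s], w * lam1)   # new real failure
            flow(s, (n1, n2 + 1), sidx[s], w * lam2)   # new suspected failure
        if n2 > 0:
            flow(s, (n1 + 1, n2 - 1), sidx[s], n2 * lam12)
        if n1 >= 1 and n2 >= 1:            # alpha free: split via y
            flow(s, (n1 - 1, n2), yidx[s], mu1)        # y * mu1
            flow(s, (n1, n2 - 1), sidx[s], mu2)        # (p - y) * mu2 ...
            flow(s, (n1, n2 - 1), yidx[s], -mu2)       # ... minus the y part
        elif n1 >= 1:                      # n2 == 0: alpha = 1
            flow(s, (n1 - 1, 0), sidx[s], mu1)
        elif n2 >= 1:                      # n1 == 0: alpha = 0
            flow(s, (0, n2 - 1), sidx[s], mu2)
    A_eq[-1, :len(states)] = 1.0           # normalization p e^T = 1
    b_eq = np.zeros(len(states) + 1); b_eq[-1] = 1.0

    # Inequalities y <= p (i.e., p - y >= 0) for interior states.
    A_ub = np.zeros((len(interior), nvar)); b_ub = np.zeros(len(interior))
    for k, s in enumerate(interior):
        A_ub[k, yidx[s]] = 1.0
        A_ub[k, sidx[s]] = -1.0

    c = np.zeros(nvar)
    for s in states:
        c[sidx[s]] = cost[s]

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * nvar)
    p = res.x[:len(states)]
    return {s: res.x[yidx[s]] / p[sidx[s]] for s in interior}

# Example (minimize E[n1] for N = 5):
# alpha = optimal_priorities(5, 1.0, 1.0, 3.0, 5.0, 7.0,
#                            cost={(n1, n2): n1 for n1 in range(6)
#                                  for n2 in range(6 - n1)})
```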

4. Generalizable Priority Assignments

The structure of the optimal repair priorities depends on the objective function and the other system parameters (failure and service rates). Moreover, the conservation laws do not hold in our problem, since the arrival (failure) rates depend on the number of working devices and thus depend on the processing rules. For example, giving high processing priority to the “suspected” failures will increase the number in the queue with “real” failures (and thus the number of working devices will be lower). So, the available results that employ conservation laws are not applicable to our system. However, under some conditions we can find generalizable priority rules, i.e., rules in which one job type always gets priority over the other, regardless of the state of the system. The following lemmas present conditions for such priority rules.

Lemma 3. If the objective function requires minimization of the probabilities of system states with high n1 (e.g., Objectives O1 and O2 with g or b equal to zero) and the “real” failure rate of the devices with an already discovered “suspected” failure is not higher than the failure rate of the “new” ones (λ12 − λ1 ≤ 0), then the optimal repair policy is to always repair the devices with “real” failures.

Proof. The probability P¹_{n1} of having n1 “real” failures in the system can be expressed as (by summation of the equilibrium equations):

$$
P^1_{n_1} = \sum_{n_2=0}^{N-n_1} (1-\alpha_{n_1,n_2})\, p_{n_1,n_2} \;+\; (N-n_1+1)\,\frac{\lambda_1}{\mu_1}\, P^1_{n_1-1} \;+\; \frac{\lambda_{12}-\lambda_1}{\mu_1}\, E[n_2 \mid n_1-1].
$$


Note that minimization of P¹_{n1} requires maximization of α_{n1,n2}. Maximizing α_{n1,n2} causes E[n2 | n1 − 1] to increase, since the queue with “suspected” failures is not processed. However, in cases with λ12 − λ1 ≤ 0, an increase of E[n2 | n1 − 1] produces even smaller probabilities P¹_{n1}. That is, in cases with λ12 − λ1 ≤ 0 the “real” failures should be processed first. ∎

Thus, in cases where we are concerned only with real failures and λ12 − λ1 ≤ 0, we want to keep devices in the “suspected” state, since devices in the “suspected” state have a lower “real” failure rate and thus the total “real” failure rate will be lower.

Lemma 4. The total number of failures can be minimized by giving higher priority to the jobs with the shortest expected processing time.

Proof. When the objective is based on the total number of jobs in the queue, we can sum the equilibrium equations along n1 + n2 = n and obtain recursive relations for the total probability P^T_n:

$$
\sum_{n_1=0}^{n} \bigl(\alpha_{n_1,n-n_1}\mu_1 + (1-\alpha_{n_1,n-n_1})\mu_2\bigr)\, p_{n_1,n-n_1} = P^T_{n-1}\,(N-n+1)(\lambda_1+\lambda_2),
$$

$$
P^T_n = P^T_{n-1}\,(N-n+1)\,\frac{\lambda_1+\lambda_2}{\mu_2} \;-\; \frac{\mu_1-\mu_2}{\mu_2} \sum_{n_1=0}^{n} \alpha_{n_1,n-n_1}\, p_{n_1,n-n_1}.
$$

Then, to minimize the probability P^T_n, we have to maximize α_{n1,n2} if µ1 − µ2 > 0, or minimize α_{n1,n2} if µ1 − µ2 < 0. That is, the optimal priorities depend only on the processing rates and not on the system state. ∎

The last lemma is similar in some ways to the well-known cµ-rule (cf. Buzacott and Shanthikumar 1993). However, in our case the cµ-rule does not work for unequal c's, as can be seen from the examples presented in §5 (Figures 7 and 8).

5. Developing a Plan Using Computational Results

For other combinations of the input parameters, the optimal priority assignment will not always give absolute priority to one type of failure over another. To study such cases we performed a large number of experiments with different parameters; we highlight only the most interesting combinations. Namely, we fixed the system parameters N = 10, λ1 = 1, λ2 = 1, λ12 = 3, µ1 = 5, and calculated the optimal priority assignments for different combinations of the remaining system parameters (µ2, g, d, a, and b). In the case of the “k-out-of-N” problem (Objective O1, g = 0, d = 4), the optimal priorities are as shown in Figure 2. This figure clearly indicates that when the service rate of the “suspected” failures increases, it becomes more important to give high priority in certain system states to the “suspected” failures. This is because keeping the number of suspected failures low lowers the chance of real failures (note that here λ12 > λ1; otherwise we get the case described by Lemma 3). However, along the border of the maximized area (n1 = d), it is optimal to assign higher priority to the “real” failures, so that the system moves more quickly into the “safe” area.

[Figure 2: Optimal priority assignments for a “k-out-of-N” system (Objective O1, g = 0, d = 4) for different service rates µ2 = 7, 9.5, 13, 20, 31.2, and 32. Each panel shows, for every state (n1, n2), whether the server repairs “real” (R) or “suspected” (s) failures first.]
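Priority maps like the panels of Figures 2–8 can be printed from the LP solution with a small helper (our sketch; alpha is the dictionary returned by the hypothetical optimal_priorities function sketched in §3):

```python
def print_priority_map(alpha, N):
    """Render a priority grid like Figure 2: 'R' = repair real first,
    's' = repair suspected first, '-' = empty system."""
    for n2 in range(N, -1, -1):               # n2 decreasing, as in the figures
        cells = []
        for n1 in range(0, N - n2 + 1):
            if n1 == 0 and n2 == 0:
                cells.append('-')             # no failures present
            elif n1 == 0:
                cells.append('s')             # only suspected failures present
            elif n2 == 0:
                cells.append('R')             # only real failures present
            else:
                cells.append('R' if alpha[(n1, n2)] > 0.5 else 's')
        print(f"n2={n2:2d}  " + ' '.join(cells))
```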


Somewhat similar behavior of the optimal priorities can be observed for the “generalized” Objective O1 (with g > 0). In this case, the “safe” area with functioning systems is below the diagonal (or stepwise diagonal) line, as shown in Figures 3–5. Minimization of the “weighted” probability P(n1 + gn2 > d) with d = 6, different values of g (0.5, 0.8, 1.3), and different service rates µ2 of the “suspected” failures gives the priority assignments shown in Figures 3–5. Here, as in the case with g = 0, the “suspected” failures get higher priority more often if they have a higher service rate µ2. These experiments also indicate that, for a higher weight g, the “suspected” failures start getting higher priority at lower service rates µ2. This conclusion is confirmed by further experiments with the weight g (Figure 6), where all system parameters (except for g) are fixed to the values indicated in the first paragraph of this section and in the caption of Figure 6.

[Figure 3: Optimal priority assignments for Objective O1 with g = 0.5 and d = 6, for different service rates µ2.]

[Figure 4: Optimal priority assignments for Objective O1 with g = 0.8 and d = 6, for different service rates µ2.]

[Figure 5: Optimal priority assignments for Objective O1 with g = 1.3 and d = 6, for different service rates µ2.]

These sets of experiments (Figures 3–6) also clearly indicate that it is quite difficult to derive rules of thumb for optimal priority assignments, and that the optimization method presented in §3 is needed. When optimizing the repair priorities under Objective O2, there is no clearly defined zone of probabilities that have to be maximized (or minimized); therefore, it is even more difficult to define rules of thumb for optimal priority assignment (except for the cases described in Lemmas 3 and 4). That is, minimization of the expected number of “real” failures E[n1] (Objective O2, a = 1, b = 0) when λ12 − λ1 ≤ 0 still falls under Lemma 3, whereas optimization of the repair priorities for λ12 − λ1 > 0 produces the priority assignments shown in Figure 7. Here again, the “suspected” failures start receiving higher priority in order to decrease the probability of real failures in the system.

[Figure 6: Optimal priority assignments for Objective O1 for different values of g (µ1 = 15, µ2 = 8).]

[Figure 7: Optimal priority assignments in minimization of “real” failures (Objective O2, a = 1, b = 0), for different service rates µ2.]

In cases with Objective O2 and a nonzero weight b, we can use the results of the lemmas only if the weight b is equal to a. Otherwise, optimization is necessary, and the priority assignment behaves as shown in Figure 8. It can also be clearly seen in this last experiment that the priority assignments are quite sensitive to the service rates, even in the case of relatively large differences between the weights (Figure 8).

[Figure 8: Optimal priority assignments when the weighted sum of expected numbers of failures aE[n1] + bE[n2] is minimized (a = 1, b = 0.5), for different service rates µ2.]

6. Simulation

To examine the robustness of our optimal policies to violations of the underlying assumptions, we simulated systems under a wide range of operating conditions. In the experiments, we tested six groups of priority assignments corresponding to different objective functions (see Figure 9, left-hand side). All assignments were obtained by optimizing the corresponding objectives with the system parameters shown at the bottom of the figure (common parameters) and next to each priority assignment (specific to each priority assignment). For each priority assignment, different combinations of squared coefficients of variation (C_x^2 = 0.8, 1, 1.25) for the interfailure and service processes were tested.

Table 1   Example Simulation Output Illustrating Representative Performance

                          P(n1 ≤ N−k)   P(n1+gn2 ≤ d)    E[n1]    aE[n1]+bE[n2]
  Sc. 1
    Average                  0.9736         0.9864       1.4624       1.7494
    Standard deviation       0.0024         0.0016       0.0240       0.0254
    95% int. halfwidth       0.0005         0.0003       0.0048       0.0051
    Halfwidth/average        0.0005         0.0003       0.0033       0.0029
    Batch correlation        0.0167        −0.1406       0.0158       0.0364
  Sc. 82
    Average                  0.9906         0.9953       1.1462       1.4356
    Standard deviation       0.0014         0.0009       0.0234       0.0244
    95% int. halfwidth       0.0003         0.0002       0.0047       0.0049
    Halfwidth/average        0.0003         0.0002       0.0041       0.0034
    Batch correlation        0.1284        −0.0919      −0.1427      −0.1382
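The halfwidths and batch correlations reported in Table 1 are consistent with a batch-means analysis; the following is a minimal sketch of such an estimator (our illustration, assuming equal-size batches and a normal approximation, not the authors' code):

```python
import math

def batch_means(samples, n_batches=30):
    """Batch-means point estimate, 95% CI halfwidth, and lag-1
    correlation between batch means (cf. the rows of Table 1)."""
    m = len(samples) // n_batches
    means = [sum(samples[i * m:(i + 1) * m]) / m for i in range(n_batches)]
    avg = sum(means) / n_batches
    var = sum((x - avg) ** 2 for x in means) / (n_batches - 1)
    half = 1.96 * math.sqrt(var / n_batches)      # normal approximation
    num = sum((means[i] - avg) * (means[i + 1] - avg)
              for i in range(n_batches - 1))
    lag1 = num / ((n_batches - 1) * var) if var > 0 else 0.0
    return avg, half, lag1
```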

[Figure 9: Scenario settings. Eighteen optimal priority assignments, in six groups by objective: max P(n1 ≤ N − k); max P(n1 + gn2 ≤ d) for g = 0.5, 0.8, and 1.3; min E[n1]; and min aE[n1] + bE[n2]. Each assignment is an (n1, n2) grid of R/s priorities with its own service rate µ2 and weight g. Common parameters: N = 8, λ1 = 1, λ2 = 2, a = 1, b = 0.5, k = 6, d = 6 (remaining rates as given in the figure).]

This yielded 3^5 = 243 combinations for each priority assignment, for a total of 18 · 3^5 = 4,374 experiments. In each simulation, four performance estimators were measured: P(n1 ≤ N − k), P(n1 + gn2 ≤ d), E[n1], and aE[n1] + bE[n2], with the parameters a, b, g, and d as indicated in Figure 9. In scenarios where we employed the exponential distribution, we implemented the simulation to exploit its unique properties. For cases where we simulated nonexponential processing and interfailure times, distributions were generated by matching the first two moments with one of the following distributions: (a) mixed Erlang (E_{k,k−1}) for 0 < c_x < 1, and (b) hyperexponential for c_x > 1, where c_x = √(Var[x]) / E[x]. Each simulation was started in the state with all devices working and run for 104,000 time units, disregarding the first 4,000 time units to eliminate the effects of initial transient bias (Kelton and Law 2000).
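A two-moment sampler of this kind can be sketched as follows (our illustration; the mixed-Erlang and balanced-hyperexponential fits used here are the standard two-moment matches, and the function name is hypothetical):

```python
import math
import random

def sample_moment_matched(mean, c2, rng=random):
    """Draw one value with the given mean and squared coefficient of
    variation c2 = Var[x]/E[x]^2, via mixed Erlang E_{k,k-1} (c2 < 1)
    or a balanced two-phase hyperexponential (c2 > 1)."""
    if abs(c2 - 1.0) < 1e-12:                    # exponential case
        return rng.expovariate(1.0 / mean)
    if c2 < 1.0:
        # Mixed Erlang: with prob. q use k-1 phases, else k phases,
        # where k is chosen so that 1/k <= c2 <= 1/(k-1).
        k = math.ceil(1.0 / c2)
        q = (k * c2 - math.sqrt(k * (1.0 + c2) - k * k * c2)) / (1.0 + c2)
        phases = k - 1 if rng.random() < q else k
        rate = (k - q) / mean                    # matches the target mean
        return sum(rng.expovariate(rate) for _ in range(phases))
    # Hyperexponential with balanced means: two exponential branches.
    p = 0.5 * (1.0 + math.sqrt((c2 - 1.0) / (c2 + 1.0)))
    rate = 2.0 * p / mean if rng.random() < p else 2.0 * (1.0 - p) / mean
    return rng.expovariate(rate)
```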

[Table 2. Change of Performance Estimators Due to the Change in System Variability. For each priority assignment (1-18) and for each change in variability (from C_x^2 = 1 to C_x^2 = 0.8, and from C_x^2 = 1 to C_x^2 = 1.25), the table reports the relative change (average in the upper cell, standard deviation in the lower cell) in the four performance estimators P(n_1 >= 4), P(n_1 + g n_2 >= 6), E[n_1], and E[n_1 + 0.5 n_2], separately for the variability of each system process: interfailure time for real, suspected, and real-after-suspected failures, and repair time for real and suspected failures.]
Using the batch means approach, the average system objectives were recorded every 1,000 units of time, yielding 100 batches of 1,000 time units each. The batch size (1,000) and number of batches were chosen, after analysis of representative scenarios, to mitigate correlation between batch means and to achieve suitably small 95% confidence interval half-widths (less than 1% of the average estimate for each performance measure). For these representative scenarios, we also confirmed that serial batch-mean correlation was small and that the distribution of batch means was approximately normal, ensuring good confidence interval coverage (Johnson and Jackman 1996). Table 1 shows the detailed results of two representative systems, illustrating the low interbatch correlation and small confidence interval half-widths. (A minimal sketch of this batch-means computation appears at the end of this section.)

In Table 2 we present the relative changes in system performance caused by the changes in system variability. The table has five groups of columns, each corresponding to a change of variability in one system process (interfailure or service). For each system process, four separate columns correspond to the different performance estimators. The rows contain the relative changes (average in the upper cell, standard deviation in the lower cell) in the performance estimators caused by changing the variability of each system process from C_x^2 = 1 to C_x^2 = 0.8 or from C_x^2 = 1 to C_x^2 = 1.25. The relative changes are presented separately for each priority assignment; negative values indicate that the system performance improved. The results in this table support the following conclusions:
1. High variability of the interfailure and service times (rows with C_x^2 = 1.25) has little influence on the system performance estimators. This means the presented model can readily be applied to systems with higher uncertainty, which is closer to real-life cases.
2. Decreasing the variability of the service times (the last two groups of columns with C_x^2 = 0.8) improves system performance.
3. Systems with lower variability of the interfailure and service times (rows with C_x^2 = 0.8) are more sensitive, so a different model of the arrival and service processes (e.g., an Erlang-2 distribution) should be used. This change would enlarge the system state space, but the optimization method could still be applied.
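To make the batch-means procedure concrete, here is a minimal Python sketch (NumPy/SciPy) of the interval computation and the lag-1 correlation check described above; the helper name and the stand-in data are ours, not output from the paper's experiments.

    import numpy as np
    from scipy import stats

    def batch_means_ci(batch_means, conf=0.95):
        # Confidence interval from batch means, plus the lag-1
        # autocorrelation as a check that batches are nearly independent.
        b = np.asarray(batch_means, dtype=float)
        n = b.size
        half = stats.t.ppf(0.5 + conf / 2.0, df=n - 1) * b.std(ddof=1) / np.sqrt(n)
        lag1 = np.corrcoef(b[:-1], b[1:])[0, 1]
        return b.mean(), half, lag1

    # 100 batch means, one per 1,000 time units (illustrative stand-in
    # data, not results from the paper)
    means = np.random.default_rng(0).normal(10.0, 0.4, size=100)
    m, h, r1 = batch_means_ci(means)
    print(f"mean={m:.3f}  half-width={h:.3f} "
          f"({100 * h / m:.2f}% of mean)  lag-1 corr={r1:.3f}")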

7. Conclusion

Security failures of industrial systems create new maintenance challenges and open new questions about how best to allocate limited maintenance resources. This paper presents an optimization model for setting repair priorities in a maintenance system with two types of failures (suspected and real security failures) and uncertain interfailure and repair times. We showed that, given the state of the system, the optimal repair policy follows a unique threshold indicator (either work on the real failures or the suspected ones) and that the repair priorities can be found using a compact LP formulation. In addition, we examined the behavior of the optimal policy under different failure rates and objectives and presented a number of rules that can assign priorities under certain conditions without solving the LP model. Our models help prioritize the response to possible security failures of SCADA systems; we illustrated that it is often optimal to focus on suspected device failures to avoid catastrophic system failure. Finally, we examined the robustness of our model to violations of the underlying assumptions and found that the model remains useful in cases with higher variability in interfailure and repair times. We also demonstrated that systems with lower variability in interfailure and repair times are more sensitive and require a different modeling approach for the stochastic processes and the corresponding optimization model. Further examination of such systems would be an interesting direction for future research.

Supplemental Material

Supplemental material to this paper is available at http://dx.doi.org/10.1287/ijoc.2014.0613.

Acknowledgments

This research was partially supported by the National Science Foundation [Award CNS-1329686].

References
Antunes G (2005) Hacking the grid. Red Herring Magazine (May 18), https://www.mail-archive.com/[email protected]/msg02231.html.
Barlow RE, Proschan F (1996) Mathematical Theory of Reliability, Vol. 17 (SIAM, Philadelphia).
Bertsimas D, Niño-Mora J (1996) Conservation laws, extended polymatroids and multiarmed bandit problems; a polyhedral approach to indexable systems. Math. Oper. Res. 21:257–306.
Bertsimas D, Niño-Mora J (2000) Restless bandits, linear programming relaxations, and a primal-dual index heuristic. Oper. Res. 48:80–90.
Bertsimas D, Tsitsiklis JN (1997) Introduction to Linear Optimization (Athena Scientific, Belmont, CA).
Bradley T (2011) Water utility hacked. Are critical systems at risk? Accessed September 12, 2014, http://www.pcworld.com/article/244359/water_utility_hacked_are_our_scada_systems_at_risk_.html.
Brenner B (2006) FBI says attacks succeeding despite security investments. Accessed September 12, 2014, http://searchsecurity.techtarget.com/news/1157706/FBI-says-attacks-succeeding-despite-security-investments.
Buzacott JA, Shanthikumar JG (1993) Stochastic Models of Manufacturing Systems, Vol. 4 (Prentice Hall, Englewood Cliffs, NJ).
Cobham A (1954) Priority assignment in waiting line problems. Oper. Res. 2:70–76.
Davis RH (1966) Waiting-time distribution of a multi-server, priority queuing system. Oper. Res. 14:133–136.
De Smidt-Destombes KS, Van der Heijden MC, Van Harten A (2006) On the interaction between maintenance, spare part inventories and repair capacity for a k-out-of-n system with wear-out. Eur. J. Oper. Res. 174:182–200.
Eckles JE (1968) Optimum maintenance with incomplete information. Oper. Res. 16:1058–1067.
Gellman B, Nakashima E (2013) U.S. spy agencies mounted 231 offensive cyber-operations in 2011. Accessed September 12, 2014, http://wapo.st/17sEENT.
Gittins J, Glazebrook K, Weber R (2011) Multi-armed Bandit Allocation Indices, 2nd ed. (Wiley, Chichester, UK).
Glazebrook KD, Mitchell HM (2002) An index policy for a stochastic scheduling model with improving/deteriorating jobs. Naval Res. Logist. 49:706–721.
Glazebrook KD, Mitchell HM, Ansell PS (2005) Index policies for the maintenance of a collection of machines by a set of repairmen. Eur. J. Oper. Res. 165:267–284.
Gold A (2013) Researchers find security holes in Philips health info management system. Accessed September 12, 2014, http://www.fiercehealthit.com/story/researchers-find-security-holes-philips-health-info-management-system/2013-01-22.
Greg H (2013) Developing a successful remote patient monitoring program. Becker's Hospital Review. Accessed September 12, 2014, http://www.beckershospitalreview.com/healthcare-information-technology/developing-a-successful-remote-patient-monitoring-program.html.
Haque L, Armstrong MJ (2007) A survey of the machine interference problem. Eur. J. Oper. Res. 179:469–482.
Harchol-Balter M, Osogami T, Scheller-Wolf A, Wierman A (2005) Multi-server queueing systems with multiple priority classes. Queueing Systems 51:331–360.
Higgins KJ (2013) Experiment simulated attacks on natural gas plant. Accessed September 12, 2014, http://www.darkreading.com/perimeter/experiment-simulated-attacks-on-natural/240157897.
Hochmuth P (2005) Risks rise as factory nets go wireless. Accessed September 12, 2014, http://www.networkworld.com/article/2319251/network-security/risks-rise-as-factory-nets-go-wireless.html.
Jaiswal NK (1968) Priority Queues, Vol. 50 (Academic Press, New York).
Johnson ME, Jackman J (1996) Interval coverage in multiclass queues using batch mean estimates. Management Sci. 42:1744–1752.
Kelton WD, Law AM (2000) Simulation Modeling and Analysis (McGraw Hill, Boston).
Klein M (1962) Inspection-maintenance-replacement schedules under Markovian deterioration. Management Sci. 9:25–32.
Latouche G, Ramaswami V (1999) Introduction to Matrix Analytic Methods in Stochastic Modeling, Vol. 5 (SIAM, Philadelphia).
McCall JJ (1963) Operating characteristics of opportunistic replacement and inspection policies. Management Sci. 10:85–97.
Neuts MF (1981) Matrix-Geometric Solutions in Stochastic Models: An Algorithmic Approach (Johns Hopkins University Press, Baltimore).
Niño-Mora J (2008) A faster index algorithm and a computational study for bandits with switching costs. INFORMS J. Comput. 20:255–269.
Niño-Mora J (2011) Computing a classic index for finite-horizon bandits. INFORMS J. Comput. 23:254–267.
Phimister JR, Oktem U, Kleindorfer PR, Kunreuther H (2003) Near-miss incident management in the chemical process industry. Risk Anal. 23:445–459.
Sleptchenko A, Van Harten A, Van der Heijden M (2005) An exact solution for the state probabilities of the multi-class, multi-server queue with preemptive priorities. Queueing Systems 50:81–107.
Tiemessen HGH, van Houtum GJ (2013) Reducing costs of repairable inventory supply systems via dynamic scheduling. Internat. J. Production Econom. 143:478–488.
USDHS (2006) Cyber storm exercise report. Accessed September 12, 2014, https://www.hsdl.org/?view&did=466697.
Wagner D (1996) Analysis of a finite capacity multiserver model with nonpreemptive priorities and nonrenewal input. Chakravarthy SR, Alfa AS, eds., Matrix-Analytic Methods in Stochastic Models, Lecture Notes in Pure and Applied Mathematics, Vol. 183 (Marcel Dekker, New York), 67–86.
Wagner D (1998) A finite capacity multi-server multi-queueing priority model with nonrenewal input. Ann. Oper. Res. 79:63–82.
Wang H (2002) A survey of maintenance policies of deteriorating systems. Eur. J. Oper. Res. 139:469–489.
Whittle P (1988) Restless bandits: Activity allocation in a changing world. J. Appl. Probab. 25:287–298.
