
Power Systems and Communications Systems Infrastructures for the Future, Beijing, September 2002

DEPENDABLE DISTRIBUTED AUTOMATION SYSTEMS WITHIN AN OPEN COMMUNICATION INFRASTRUCTURE Geert Deconinck, Vincenzo De Florio, Ronnie Belmans K.U.Leuven, Dept. Electrical Engineering (ESAT), Kasteelpark Arenberg 10, B-3001 Leuven, BELGIUM phone +32 16 32 11 26, fax +32 16 32 18 13, [email protected]

1. Introduction

Industrial distributed embedded systems – such as those found in the control and automation of electrical energy and telecom infrastructures – rely on hardware and software off-the-shelf components to ensure cost-efficient exploitation. In this race for cost reductions, open communication systems, such as TCP/IP and the Internet, are considered as alternatives to dedicated communication lines when these embedded systems are interconnected for distributed control purposes or data exchange. This results in additional vulnerabilities that can hamper the efficient exploitation of the services. As a result, mechanisms need to be introduced to ensure that the local automation systems continue to function autonomously when their interconnection breaks down, and that they reintegrate when this interconnection is up again. In other words, they have to deal with dynamic environments.

This paper presents a set of techniques at middleware level that can be deployed to ensure that automation systems are both dependable and flexible, via a framework approach providing fault tolerance capabilities to embedded systems by exploiting the systems’ distributed hardware and by separating the functional behaviour from the recovery strategy (i.e., the set of actions to be executed when an error is detected). This conceptual framework consists of the following three entities [1, 2, 3, 4].

• A library of basic fault tolerance tools provides basic elements for error detection, localisation, containment, recovery and fault masking. The tools are software-based implementations of well-known fault tolerance mechanisms, grouped in a library of adaptable, parametric functions. These basic tools can be used on their own, or as co-operating entities attached to the control backbone (see below). Examples include watchdogs, voting units, support for acceptance tests and replicated memory.

• The control backbone is a distributed application extracting information on the application’s topology, its progress and status. It stores this information in a replicated database and coordinates the fault tolerance actions at run-time via the interpretation of user-defined recovery strategies. The backbone is hierarchically structured to maintain a consistent system view and contains self-testing and self-healing mechanisms.

• The high-level configuration-and-recovery language is used to configure the basic fault tolerance tools from the library and to express recovery strategies. The application developer specifies these configurations and recovery actions via a language called ARIEL. For configuration purposes, ARIEL is able to set parameters and properties of the basic tools. For expressing recovery strategies – i.e., indicating fault tolerance strategies by detailing the localisation, containment and recovery actions to be executed when an error is detected – ARIEL allows building database queries and attaching actions to such queries. These actions allow interaction with the elements from the library, or can start, terminate, inform or isolate an entity (a node, a task or a set of tasks). As such, it is possible to start a standby task, to reset a node or link, to generate synchronization signals for reconfiguration, etc.

Following this framework approach, increasing the dependability of an application implies configuring and integrating basic fault tolerance tools from the library into the application and writing the recovery strategy in ARIEL, i.e. a script describing the recovery actions to be executed when an error is detected. This script is translated into a compact code executed at run-time by the backbone when an error is detected. It matches well a number of coarse-grained local and distributed fault tolerance mechanisms: the different ARIEL templates support standby sparing, recovery blocks, N-modular redundancy, etc. The power of ARIEL is its ability to describe local, distributed or system-wide recovery and reconfiguration strategies. The backbone takes care of passing the necessary information to other nodes in the system and of initiating the recovery actions at the nodes.

Using ARIEL and the framework approach lets the developer separately address the (non-functional) aspects of application recovery (written in ARIEL) from those pertaining to the


(functional) behaviour that the application should have in the absence of faults (written in C or more dedicated programming languages). This allows, for instance, modifying the recovery strategy with only a limited impact on the application, and vice versa. It results in a higher flexibility and a better maintainability of the application (assuming a reliable interface and an orthogonal division of application functionality from fault tolerance strategies).

The innovative aspects of this approach do not come from the implementation of the library of well-known fault tolerance tools, but rather from their combination with the backbone executing user-defined recovery actions when an error is detected. It is important to note that ARIEL on its own does not provide a complete fault tolerance solution. The ARIEL recovery scripts have to be triggered by the error detection mechanisms from the library of basic tools, the application or the platform. This implies that the coverage of the fault tolerance strategy driven by the ARIEL scripts cannot be higher than the coverage of the error detection tools triggering their execution. Furthermore, it is the task of the developer to provide the ARIEL configuration parameters and recovery scripts appropriate for a given application on a given platform. The developer also needs to assess whether the application-specific timing constraints are met under the worst-case execution times of the recovery strategies. Software-implemented fault tolerance may need to be complemented by other approaches or techniques at lower levels (hardware or operating system), for instance to be able to meet hard real-time requirements, and/or by application-specific mechanisms.

The target systems are distributed embedded automation applications, having a statically defined set of tasks executing on a predefined number of nodes, interconnected via a real-time network.

This framework approach has been partially implemented as middleware on top of the operating systems Windows CE, VxWorks, and the POSIX-compliant real-time operating system QNX. This three-tiered approach (library, backbone, ARIEL) conforms to current trends in research and implementations. For instance, the suitability of libraries of software-implemented fault tolerance solutions to improve the dependability of distributed applications has been shown in [5, 6]. Besides, the middleware approach towards fault tolerance has gained much support recently [7, 8, 9, 10], especially as middleware provides a certain level of compatibility among platforms and applications. The main innovation is the adoption of the configuration-and-recovery language ARIEL.
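The separation described above – error detection by library tools, recovery actions interpreted by the backbone – can be illustrated with a small sketch. The mini-language, class and action names below are hypothetical (ARIEL's concrete syntax is not reproduced in this paper); the sketch only shows the principle of attaching user-defined recovery actions to detected error conditions:

```python
# Illustrative sketch only: ARIEL's real syntax and runtime are not shown
# here, so the script format and the class/action names are assumptions.

# A recovery strategy maps error conditions, as signalled by the basic
# error detection tools, to ordered lists of recovery actions.
RECOVERY_SCRIPT = {
    "WATCHDOG_EXPIRED":       [("isolate", "node2"), ("start", "spare_task")],
    "ACCEPTANCE_TEST_FAILED": [("restart", "task_ctrl")],
}

class Backbone:
    """Toy stand-in for the control backbone: it receives error
    notifications and interprets the compiled recovery script."""
    def __init__(self, script):
        self.script = script
        self.log = []          # recovery actions actually carried out

    def on_error(self, condition):
        # Execute the user-defined actions for this condition, if any.
        for action, target in self.script.get(condition, []):
            self.log.append(f"{action}({target})")
        return self.log

bb = Backbone(RECOVERY_SCRIPT)
bb.on_error("WATCHDOG_EXPIRED")
print(bb.log)   # the two actions attached to the watchdog condition
```

A watchdog expiry from the library would simply trigger `on_error("WATCHDOG_EXPIRED")`; the application code itself contains none of this recovery logic, which is the separation the framework aims for.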

2. Pilot: primary substation automation system

This framework approach has been integrated in a Primary Substation Automation System (PSAS), i.e. the embedded hardware and software in a substation for electricity distribution. The PSAS requires protection, control, monitoring and supervision capabilities and is representative of many applications with dependability requirements in the energy field [11]. In an ongoing renewal of the infrastructure, the company decided to replace the dedicated hardware-based fault tolerance solutions by commercial, distributed platforms – industrial computers and dedicated processing boards – running real-time operating systems. This decision was motivated by the need for more functionality, where the development of new, dedicated (hardware-based) solutions was considered too expensive and not flexible enough to keep up with the evolving requirements in the deregulated electricity market.

The required dependability level has to be reached by using hardware redundancy in the distributed platform, combined with software-implemented fault tolerance solutions at middleware level. The need for adaptability to new situations and maintainability of the software is met using the configuration-and-recovery language ARIEL. Although software-based fault tolerance may have less coverage than hardware-based solutions, this was not considered prohibitive, because the physical (electrical, non-programmable) protection in the plant continues to act, as a last resort, as a safeguard for non-covered faults. Besides, a high-quality level of software engineering and on-site testing remains important so as not to introduce software design faults that could hamper mission-critical functionality. The major source of faults in the system is electromagnetic interference (EMI) [11] caused by the process itself (opening and closing of HV connections), in spite of the attention paid to design for electromagnetic compatibility. This results in errors in communication, computation and memory subsystems.

As an example of the presented approach, a software module has been designed implementing so-called stabilizing memory [12], a mechanism combining physical with temporal redundancy (and with several protocols) to recover from transient faults affecting memory or computation, and preventing incorrect output to the field. With respect to a previous solution relying exclusively on dedicated hardware boards, this software implementation of the stable memory module better supports maintainability. For instance, it is possible to set parameters for the size of the stabilizing memory, the number of physically redundant copies, the number of temporally redundant copies, etc. The developer can also modify the allocation of the physically distributed copies to the available resources. The configuration-and-recovery language ARIEL sets all configuration parameters as well as the recovery actions to be taken in case an error is detected. The additional flexibility offered by these recovery strategies allows for instance for


reconfiguration by re-allocating the distributed copies to non-failed components in case a fault occurs. The recovery strategies are not hard-coded in the application code, but are specified in ARIEL as an ancillary application layer and executed by the backbone interacting with the module. As the interface to the dedicated board and to the software module is identical, the complexity for the developer is equivalent in both implementations. The framework-based implementation of the stabilizing memory module meets the real-time requirements of the application, which has a cycle time of 100 ms.

This framework approach, applied to this single module as well as to the entire PSAS application, was confirmed to improve the maintainability of the application: in the traditional methodology with dedicated hardware solutions, every change in the environment (larger system, additional functionality) resulted in a different implementation of the dedicated solution, or in adaptations of the existing controller hardware. By using ARIEL and the framework approach, the configuration of the framework elements and the recovery strategies themselves can be adapted without major modifications to the application code. Analogously, if the functional aspects of the application need to be modified, this does not necessarily interfere with the fault tolerance strategies. For the company, integrating the framework approach in the PSAS met the primary objectives of increased flexibility and maintainability, while the application continued to fulfil its requirements in terms of functionality, timing and dependability.

Figure 1: Architecture of the target system. The application A runs on 3 sites (X, Y, Z) as a distributed real-time application. The interconnection among the different parts of the application A happens via a non-real-time network (Internet-alike) that is also used by other applications B and C.
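The stabilizing-memory mechanism used in this pilot combines physical redundancy (multiple copies) with temporal redundancy (repeated accesses). A minimal sketch of the principle follows; it does not reproduce the actual protocols of [12], and the class name, parameters and majority-voting read are assumptions:

```python
# Hedged sketch of a stabilizing memory: several physical copies, each read
# several times, with majority voting to mask transient faults. The real
# module of [12] uses dedicated protocols not reproduced here.
from collections import Counter

class StabilizingMemory:
    def __init__(self, size, n_copies=3, n_reads=3):
        self.copies = [[0] * size for _ in range(n_copies)]  # physical redundancy
        self.n_reads = n_reads                               # temporal redundancy

    def write(self, addr, value):
        # Write-through to every physical copy.
        for copy in self.copies:
            copy[addr] = value

    def read(self, addr):
        # Read every copy several times and return the majority value,
        # masking transient corruption of an individual copy.
        votes = Counter()
        for _ in range(self.n_reads):
            for copy in self.copies:
                votes[copy[addr]] += 1
        return votes.most_common(1)[0][0]

mem = StabilizingMemory(size=8)
mem.write(3, 42)
mem.copies[1][3] = 99          # inject a transient fault into one copy
print(mem.read(3))             # majority voting still yields 42
```

Note how the size, the number of physical copies and the number of temporal reads are plain parameters, which is exactly the kind of configuration the paper delegates to ARIEL.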

3. Extension to interconnected systems

Distributed embedded automation systems become more and more interconnected via non-dedicated networks. For instance, all PSASs in a region are interconnected to allow load balancing and orchestrated reactions in case of a partial breakdown of the electricity distribution or in case of local overloads (distributed control). In this context, one can take advantage of modelling these automation systems at two levels (Figure 1) [13].

• The intra-site level corresponds to the distributed embedded application, whose nodes are connected via a local area network or via dedicated point-to-point connections. This intra-site network is only used by this application, and the application has complete control over it. This network also provides real-time support.

• The inter-site level interconnects the sites via a non-dedicated, open network (for instance, the Internet) that is shared with other applications, and hence not under the control of the particular application. This inter-site network is mainly used for non-real-time communication. However, near-future industrial demands impose quality-of-service or (soft) real-time requirements on this inter-site communication.

This inter-site network allows introducing cost-saving features into the applications, such as:
• Remote diagnosis of local sites, for instance by step-by-step execution of a process guided via visual feedback over the inter-site network.
• Remote maintenance of embedded systems, e.g., software updates or upgrades of system modules without shutting down the entire local distributed system and while still providing partial services.
• Remote control of embedded systems over non-dedicated inter-site connections, if a certain quality-of-service can be guaranteed by the inter-site communication system.

These interconnected distributed embedded automation applications are not only subject to classical (physical) hardware faults affecting parts of the intra-site system or the inter-site connection system, leading to unavailability of computing nodes or of parts of the network. They can also suffer from external, malicious faults (attacks,


intrusions) affecting the inter-site connections; these may cause the unavailability of the inter-site network (for instance through denial-of-service attacks) or endanger the integrity of the data. Furthermore, the presence of other applications making use of the same inter-site network results in a dynamic environment, leading to bandwidth reduction or non-determinism.

In this context, the configuration-and-recovery language ARIEL and the framework approach provide a powerful way to deal with these interconnected distributed applications, due to the separation of the functional aspects of the application from the recovery strategies that are to be executed in case an anomaly is detected, at intra-site or at inter-site level. Specifically for the dynamic environment, one can modify the recovery strategies dynamically, by providing different recovery scripts corresponding to different situations. For instance, the security of the inter-site communication can be increased in case of attacks, by selecting a different encryption scheme, at the cost of a higher performance overhead.
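Such situation-dependent switching of recovery strategies can be sketched as follows; the script contents, cipher names and event names are purely illustrative assumptions, not taken from the paper:

```python
# Illustrative sketch: different (hypothetical) recovery scripts are
# prepared off-line for different situations; at run-time the backbone
# switches to the script matching the detected context.
SCRIPTS = {
    "normal":       {"cipher": "AES-128", "replicas": 2},
    "under_attack": {"cipher": "AES-256", "replicas": 4},  # more overhead
}

class GatewayPolicy:
    def __init__(self):
        self.active = SCRIPTS["normal"]

    def on_event(self, event):
        # Harden the inter-site communication when an attack is detected,
        # accepting the higher performance overhead; relax it afterwards.
        if event == "intrusion_detected":
            self.active = SCRIPTS["under_attack"]
        elif event == "all_clear":
            self.active = SCRIPTS["normal"]
        return self.active

policy = GatewayPolicy()
policy.on_event("intrusion_detected")
print(policy.active["cipher"])   # the stronger scheme selected under attack
```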

4. Providing quality-of-service in open, inter-site communication

The off-the-shelf Internet is based on best-effort mechanisms for communication and does not support quality-of-service (QoS) by default. There are several schemes, supported by specific routers, that are able to manage QoS, e.g., DiffServ and IntServ [14]. The former is based on traffic-class labels and priority queuing mechanisms, and maintains no per-flow state in the routers; this leads to QoS on a per-aggregate basis. The latter is based on resource reservation procedures and provides QoS on a per-flow basis, but scales less well, as every router along the path must implement the reservation mechanism and maintain per-flow state. However, the deployment of these techniques is currently still limited [14].

An investigation of the communication traffic for the target embedded applications shows that not all data transfer among the sites is equally critical: control- and alarm-related messages require a higher QoS than data for management purposes [15]. Furthermore, the relative amount of inter-site traffic for a given application is low with respect to the intra-site communication on the one hand and to the Internet load on the other. With this observation in mind, and taking into account that few Internet routers actually support the DiffServ and IntServ strategies, a different approach has to be followed.

The main idea is to obtain soft QoS by sending several replicas of a single packet along different routes. This increases the probability that a packet is delivered to its destination without the need for retransmissions. However, these extra messages cause the network traffic load to increase, in turn reducing the delivery probability. This observation underlines the trade-off to be considered in order to maximise this probability. Source routing (i.e. the indication of the nodes to pass by on the route to the destination node) is deployed to ensure different paths that have as few common sections as possible.

From an architectural point of view, the different sites of the distributed embedded application are interconnected by means of a gateway to the open inter-site communication network. The initial locations of the gateways of the different sites of the application have to be known statically. The main goal of this gateway is to support QoS via encrypted and redundant communication over open Internet connections. Such a gateway is considered as a single unit, but it may consist of several entities, each connected to a different Internet Service Provider, for improved availability or load balancing. The functionality of the gateway is threefold.

• Map discovery. Each gateway discovers a portion of the Internet topology, located around the main path interconnecting this gateway to the gateways of the other sites. This is obtained by means of periodic source-routed packet probes with a random target. A topology database is built by each gateway, based on the information obtained by these probes. The database contents are periodically exchanged among the gateways so that an aggregated view is obtained. The implemented system is able to work with both IP version 4 and IP version 6, collecting topology information regarding both networks.

• Packet analysis. When a packet arrives at the gateway for inter-site transmission, it is analysed
  – to identify the required QoS, on a scale from 0 (lowest) to 4 (highest);
  – to identify whether the communication requires encryption; and
  – to authenticate whether the sender is allowed to transmit inter-site packets, by comparing the source IP address against an access list configured on the gateway.
The 5 QoS levels are encoded in 3 bits and the request for encryption in 1 bit. As this traffic is proprietary, 4 bits of the Type-of-Service field in the IP header can be used. The application developer sets all parameters.

• Multi-path transmission. The number of paths along which the packet is sent is determined on the one hand by the requested QoS, and on the other hand by a model of the inter-site network, fuelled by network load parameters (e.g. round-trip time) collected by the discovery probes. Simulations provide evidence that 2 to 5 replicas are optimal in congested networks to obtain the highest QoS level. The paths themselves are determined from the Internet topology database, with the constraint that first the number of common nodes is minimised, followed by the number of hops for each path. If the encryption flag is set, the packet is encapsulated and


encrypted, so as to allow a higher confidentiality (including authentication) and an additional integrity check. The packets are then tunnelled from one gateway to the others by replicating the packet onto the selected paths.

The PSAS application from section 2 additionally possesses a wireless, GPRS-based dedicated line, to ensure that the most critical inter-site communications can continue in case the Internet connections are unavailable (e.g., during a denial-of-service attack). This more expensive solution, based on diversity, is only used when the connection to the open communication infrastructure is not able to provide the required QoS.
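The packet-analysis encoding described above (5 QoS levels in 3 bits, plus 1 bit for the encryption request, within 4 bits of the Type-of-Service field) can be sketched as follows; the exact bit layout is an assumption, as the paper does not specify it:

```python
# Sketch of the 4-bit marking carried in the IP Type-of-Service field:
# 3 bits for the QoS level (0..4) and 1 bit for the encryption request.
# The bit positions chosen here are an assumption for illustration only.
def encode_tos(qos, encrypt):
    assert 0 <= qos <= 4, "five QoS levels: 0 (lowest) .. 4 (highest)"
    return (qos << 1) | (1 if encrypt else 0)   # 3 bits QoS, 1 bit encryption

def decode_tos(tos):
    # Inverse mapping performed by the receiving gateway.
    return tos >> 1, bool(tos & 1)

tos = encode_tos(qos=4, encrypt=True)
print(tos, decode_tos(tos))    # round-trips to (4, True)
```

Since the inter-site traffic is proprietary and tunnelled between the application's own gateways, such a private reuse of ToS bits does not have to interoperate with standard DiffServ markings.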

5. Outlook on interdependencies

Currently, our research concentrates on collecting dependability data from the deployed configuration and on performance measurements of the open, inter-site communications. It also includes modelling fault propagation through the different involved infrastructures (telecommunication, electricity, and automation systems) in order to identify how fault and failure models need to be adapted. This will make it possible to investigate vulnerabilities in detail, and to quantify their effect on performance, QoS, real-time behaviour, dependability and costs, or to trade off one against another [16, 17]. There are many open questions: the level of modelling that is useful; the determination of a set of relevant parameters that is neither so small as to be unrealistic, nor so large as to be unmanageable; etc.

If a failure in one infrastructure causes faults in a dependent system, these failures obviously cannot be treated as independent, as is the typical assumption in reliability and performability analysis. For interdependent systems, this becomes even more complicated, as a single fault can lead both to direct errors and failures, and to secondary effects after propagation through other infrastructures. For instance, the failure of a computing node of an embedded automation system (such as a PSAS) can bring down part of the electricity network, which can in turn lead to communication failures, which may prevent triggering appropriate recovery actions inside the embedded automation system. One of the open questions to be solved concerns the representation of these interdependency loops and feedback paths [18, 19]. These feedback loops are a source of non-linearity, and ultimately of emergent behaviour that cannot easily be foreseen from the functionality of an individual system, producing a lack of determinism in the system behaviour (at least from the point of view of predictive timing behaviour). This means that single failures may propagate, causing new disturbances, inside the infrastructure as well as in other infrastructures. Feedback loops hinder the prediction of the dynamics of faults, and of interactions and cascading effects. Such complex systems can only be studied by appropriate simulations.

The best way to cope with these uncertainties is to consider end-to-end properties from the application (or service) point of view – bandwidth, latency, availability, etc. – which have to be determined by focusing on the interconnection of the different subsystems [19]. Focusing on integration would imply a systematic approach, namely complete knowledge of the effects of interconnecting components in a given manner; this cannot accommodate the inherently incomplete knowledge of the impact of the interconnected components on the behaviour of the resulting system. Focusing on interconnection, instead, starts from the hypothesis that the cloud of the Internet is not really a cloud, but a set of physical and logical connections, which can (at least stochastically) be characterised for the relevant properties, and which may also be able to deal with the fact that only qualitative or semi-quantitative information is available. A maximum of autonomy and fault tolerance at intra-site level is thereby essential.
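The cascading behaviour discussed above can be made concrete with a minimal propagation model over a dependency graph containing a feedback loop; the graph and its edges are illustrative assumptions, not a validated interdependency model:

```python
# Hedged sketch: a worklist-based cascade over a dependency graph, showing
# why interdependent infrastructures cannot be analysed under the usual
# independent-failure assumption. An edge X -> Y means "a failure of X can
# cause faults in Y"; the cycle models a feedback loop.
DEPENDS = {
    "psas_node":   ["electricity"],   # node failure can trip part of the grid
    "electricity": ["telecom"],       # grid outage brings down telecom links
    "telecom":     ["psas_node"],     # loss of telecom blocks remote recovery
}

def propagate(initial_failure, deps):
    failed, frontier = set(), [initial_failure]
    while frontier:                   # worklist cascade until a fixed point
        f = frontier.pop()
        if f not in failed:
            failed.add(f)
            frontier.extend(deps.get(f, []))
    return failed

print(sorted(propagate("psas_node", DEPENDS)))  # all three infrastructures fail
```

Even this toy model shows the qualitative point: a single initial failure reaches every infrastructure in the loop, so per-infrastructure reliability figures cannot simply be combined independently.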

Acknowledgements

This project has been partially supported by the European project IST-2000-25434 (DepAuDE), the K.U.Leuven project GOA/2001/04, and by the Fund for Scientific Research - Flanders (Belgium, F.W.O.) through the Postdoctoral Fellowship for Geert Deconinck and “Krediet aan Navorsers 1.5.148.02”.

References

1. G. Deconinck, V. De Florio, O. Botti: “Separating Recovery Strategies from Application Functionality: Experiences with a Framework Approach,” Proc. 2001 Ann. Reliability & Maintainability Symp. (RAMS2001) (IEEE, Piscataway, NJ), Philadelphia, PA, USA, Jan. 22-25, 2001, pp. 246-251.
2. G. Deconinck, V. De Florio, R. Lauwereins, R. Belmans: “A Software Library, a Control Backbone and User-Specified Recovery Strategies to Enhance the Dependability of Embedded Systems,” Proc. 25th EUROMICRO Conf. (EuroMicro’99) (IEEE Comp. Soc. Press, Los Alamitos, CA), Workshop on Dependable Computing Systems, Milan, Italy, Sep. 8-10, 1999, Vol. II, pp. 98-104.
3. V. De Florio, G. Deconinck, R. Lauwereins: “Software Tool Combining Fault Masking with User-Defined Recovery Strategies,” IEE Proc. Software, Special issue on dependable computing systems (IEE, London, UK), Vol. 145, No. 6, Dec. 1998, pp. 203-211.
4. G. Deconinck, V. De Florio, O. Botti: “Software Implemented Fault Tolerance and Separate Recovery Strategies Enhance Maintainability,” accepted for IEEE Trans. Reliability (TR-2001-054, scheduled to appear in Jun. 2002).


5. Y. Huang, C.M.R. Kintala: “Software Fault Tolerance in the Application Layer,” in “Software Fault Tolerance,” M. Lyu (Ed.), John Wiley & Sons, Mar. 1995.
6. M.R. Lyu (Ed.): “Handbook of Software Reliability Engineering,” McGraw-Hill, New York, 1995.
7. Z.T. Kalbarczyk, R.K. Iyer, S. Bagchi, K. Whisnant: “Chameleon: A Software Infrastructure for Adaptive Fault Tolerance,” IEEE Trans. on Parallel and Distributed Systems, Vol. 10, No. 6, Jun. 1999, pp. 560-579.
8. K.H. Kim: “ROAFTS: A Middleware Architecture for Real-time Object-oriented Adaptive Fault Tolerance Support,” Proc. High-Assurance Systems Engineering Symp. (HASE'98) (IEEE CS), Washington, D.C., Nov. 1998, pp. 50-57.
9. M. Cukier, J. Ren, C. Sabnis, D. Henke, J. Pistole, W. Sanders, D. Bakken, M. Berman, D. Karr, R. Schantz: “AQuA: An Adaptive Architecture That Provides Dependable Distributed Objects,” Proc. 17th Symp. on Reliable and Distributed Systems (SRDS-17), West Lafayette, IN, Oct. 1998, pp. 245-253.
10. D. Bakken, Z. Zhan, C. Jones, D. Karr: “Middleware Support for Voting and Data Fusion,” Proc. Int. Conf. on Dependable Systems and Networks (DSN-2001), IEEE/IFIP, Göteborg, Sweden, July 1-4, 2001, pp. 453-462.
11. R. Gargiuli, P.G. Mirandola, et al.: “ENEL Approach to Computer Supervisory Remote Control of Electric Power Distribution Network,” Proc. 6th IEE Int. Conf. on Electricity Distribution (CIRED’81), Brighton (UK), 1981, pp. 187-192.
12. G. Deconinck, O. Botti, F. Cassinari, V. De Florio, R. Lauwereins: “Stable Memory in Substation Automation: a Case Study,” Digest of Papers of 28th Annual Int. Symp. on Fault-Tolerant Computing (FTCS-28) (IEEE Comp. Soc. Press, Los Alamitos, CA), Munich, Germany, Jun. 1998, pp. 452-457.

13. G. Deconinck, V. De Florio, G. Dondossola, H. Kuefner, G. Mazzini, S. Calella, S. Donatelli: “Distributed embedded automation systems: dynamic environments and dependability,” Supplement of the Int. Conf. on Dependable Systems and Networks (DSN-2001, Special Track: European Dependability Initiative), Gothenburg, Sweden, Jul. 1-4, 2001, pp. D16-D19.
14. C. Huitema: “Routing in the Internet,” 2nd Ed., Prentice Hall, 2000, 450 pages.
15. DepAuDE consortium: “Collection of dependability requirements for embedded distributed automation systems in dynamic environments,” D1.1 project deliverable, Jul. 2001, 79 pages, http://www.depaude.org
16. M. Masera: “Mastering the vulnerabilities of information and interdependent infrastructures,” Report from the 1st Meeting of the European Working Group on Information Infrastructure Interdependencies and Vulnerabilities, Milan, Italy, Nov. 2001. http://deppy.jrc.it
17. H.A.M. Luiijf, M.H.A. Klaver: “In Bits and Pieces: Vulnerability of the Netherlands ICT infrastructure and consequences for the information society,” English version of “BITBREUK, de kwetsbaarheid van de ICT-infrastructuur en de gevolgen voor de informatiemaatschappij,” workshop “Vulnerabilities of ICT-networks,” Amsterdam, The Netherlands, Mar. 2000. http://www.infodrome.nl
18. N. Kyriakopoulos, M. Wilikens: “Dependability of complex open systems: A unifying concept for understanding Internet-related issues,” Proc. 3rd Information Survivability Workshop (ISW2000) (IEEE Comp. Soc. Press, Los Alamitos, CA), Boston/Cambridge, MA, USA, Oct. 24-26, 2000, 4 pages.
19. N. Kyriakopoulos, M. Wilikens: “Dependability and Complexity: Exploring ideas for studying open systems. Full report,” Joint Research Centre, Ispra, Italy, Dec. 2000, 31 pages. http://deppy.jrc.it
