Systems Oriented Approach to Production Support: A Viable Framework
Veerendra K Rai
Systems Research Laboratory, Tata Consultancy Services
54 B Hadapsar Industrial Estate, Pune - 411013, India
Abstract: A production support environment gives the impression that it is all about managing discrete events. A deeper inquiry reveals that production support must be modeled as a closed loop feedback monitoring and control system rather than as an open loop system, which can neither handle exceptions nor learn from experience. This paper adapts production support to the Viable System Model framework. It identifies and develops the five subsystems of the Viable System Model in the production management environment. The paper identifies how complexity can be unfolded and how variety attenuation and amplification can be implemented in the production support environment to handle that complexity. The paper integrates the System Dynamics approach with the Viable System Model to implement three levels of adaptation (homeostatic, anticipatory and structural) for production support.
Keywords: production support; viable system model; adaptation; system dynamics.
I. INTRODUCTION
Production support, IT service management or IT service delivery, whichever name is most appropriate, is the industrial practice of supporting the IT systems and applications that end users currently use. End users raise various kinds of incidents and requests. A production support person or team is responsible for receiving incidents and requests from end users, analyzing them, and either responding to the end user with a solution or escalating them to other IT teams. These teams may include developers, business analysts, system engineers and database administrators. The team is required to deliver and manage services to customers at agreed service levels. Production support processes include incident management, event management, problem management, request fulfillment and access management. Some or all of these processes are implemented across three major functional areas: technical management (servers, network, storage etc.), application management, and operations management (batch jobs, monitoring and control). The primary aim of production support is to ensure stability of operations of IT services, applications and infrastructure within defined Service Level Agreements (SLAs) while optimizing costs and resource utilization [1]. Incidents could
broadly be of two types: user-experienced incidents and technical incidents. User-experienced incidents can occur either at the application level (e.g. services not available) or at the hardware level (e.g. the system is down or a printer is not working). User-experienced incidents are manifest, while technical incidents are sub-clinical. A technical incident may involve a gradual decline that users do not notice until it becomes manifest. Technicians who proactively monitor the IT infrastructure can diagnose these incidents. Technical incidents can have a larger impact if not resolved in time. For example, a disk that is nearly full can be noticed during monitoring, but it does not amount to a user incident because it is not yet completely full [2]. Incident management and problem management are the two major components of production support. The objective of the former is to resolve the incident and restore operations for users as soon as possible. Underneath incidents are problems. A problem is an unknown underlying cause of one or more incidents. Once the incidents are resolved, operation moves to problem management. Problem management consists of problem control, which finds the root cause, and error control, which fixes the problem. The feedback loop from problem management to incident management is at the core of the closed loop feedback monitoring system for production support. Fixing the problem may involve changing the application or software system, which may range from changing the code to changing the architecture itself.
II. THE VIABLE FRAMEWORK
A. Introducing the Viable System Model
Stafford Beer’s Viable System Model (VSM) [3], [4], [5] has been used for decades to diagnose whether a given organization's structure and communications meet the necessary and sufficient conditions of viability [6]. Viability is a systemic property: a viable system is able to perpetuate its identity in its environment while engaging in purposeful behavior and meeting stakeholders’ expectations. VSM has a set of five cooperating subsystems conveniently named Systems 1-5. System 1 is operations, the work an organization performs, which defines its identity as, for instance, a manufacturer or a service provider. The operations may be spread across product or service lines, and therefore mechanisms are needed to coordinate the activities of System 1. This is referred to as System 2, coordination or regulation. System 3 is monitoring, control and synergy between System 1 units, where synergy is defined as the whole being more than the sum of its parts [5]. In the VSM framework only System 1 and System 4 interact with the environment. System 1 interacts with the current environment for functional purposes, while System 4 looks for future trends in the environment for planning and strategy. System 5 denotes the policy function and puts all components together to constitute the personality of the organization as a whole.
[Figure 1. A VSM based framework for production support. Blocks shown: Policy (structural adaptation: policy design and meta rules; structural changes and evolution in the production support model); Planning (anticipatory adaptation: predicting trends and planning; changing internal control algorithms; model of production support; simulation); Monitoring (dashboard monitoring; SLA compliance; investigation of deviation; audit); Operational control (homeostatic adaptation; cost and resource optimization; performance measures and target setting; synergy between levels; internal algorithms to control production support); Regulation (production support plan; escalation rules; coordination between levels and teams; ticket types and staffing rules; scheduling and shift planning); Production support (level 1, 2 and 3 support, each with its own management); System dynamics model; current, future and total environment.]
B. Adopting VSM for Production Support
Please refer to Fig. 1 for the overall VSM based framework for production support. System 1 consists of three levels of support for activities such as incident, problem, and change
management; event management, request fulfillment and access management. These are the core operational activities of production support. System 1 also includes the management of these core activities. System 2 is about regulation and coordination; it regulates the activities of System 1 with the help of the production support plan, escalation rules, coordination between levels and teams, incident classification, staffing rules and overall scheduling of activities. System 3, the monitoring and control system, performs homeostatic adaptation with respect to a target. SLA compliance, synergy between levels, publishing of the metrics dashboard and overall optimization along cost, effort,
and service availability are part of System 3 in the production support environment. This study has constructed a system dynamics (SD) model to augment the Viable System Model. System dynamics is a methodology for understanding the behavior of complex systems over time [7], [8]. It is based on the premise that the behavior of a system is a manifestation of its structure, and that structure is made of feedback loops. SD is also used for policy analysis and policy design. Structure is captured in a causal loop diagram (CLD) [7], [8] by connecting the parameters that describe the problem domain with causal relationships: there is a link between parameter A and parameter B if A causes B or vice versa. Based on the causal loop diagram, a stock and flow diagram (the SD model) is constructed, which consists of level variables, rate variables, auxiliary variables and constants. Finally, mathematical relationships between the parameters are established and the SD model is simulated to observe its behavior over time. An SD model is said to be valid when its behavior in simulation space coincides with the actual behavior of the phenomenon under consideration.
III. ADAPTATIONS
This paper discusses three levels of adaptation for the production support environment: homeostatic, anticipatory and structural. Each is discussed separately in the following subsections.
A. Homeostatic adaptation
Homeostatic adaptation works as shown in Fig. 2. The state of the activities and operations of production support is reported to the dashboard in the form of metrics of interest, which the controlling unit monitors. If and when targets are missed, a policy review and analysis is carried out to see whether existing policies need to be changed. Once a policy change is determined, the corresponding parameters and their values are reset. The system dynamics model is re-simulated with the new parameter values to see whether the targets are achieved. All this happens in simulation space; intervention in the real space is based on the simulation result and on confidence in the validity of the model. By definition, homeostatic adaptation is about maintaining equilibrium in steady state. This is precisely what System 3 of VSM does, and it is what production support requires in steady state. It is extremely important to understand that the sole purpose of production support is to ensure ‘service availability’ to end users. Everything else, such as reduced cost of production support, increased operational process capacity and productivity gains, is an objective. Service availability cannot be compromised for any of these objectives. From this point of view the sole purpose of System 3 (monitoring and control) is to meet SLAs and ensure service availability. In other words, homeostatic adaptation is all about meeting SLAs, and equilibrium in steady state means compliance with SLAs. Adaptations for control systems are mentioned in [9], which refers to homeostatic, morphostatic and morphogenetic adaptations. It does not, however, discuss how these
adaptations could be implemented. This study takes a system dynamics approach to implement these adaptations.
[Figure 2. Homeostatic adaptation. Elements shown: Monitoring & control; Target dashboard; Policy analysis; Activity / operations; Adjusting parameter values; SD model simulation.]
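The loop in Fig. 2 can be illustrated with a deliberately small stock-and-flow model. The sketch below is not the model used in this study. It assumes a single level variable (open incidents), a constant arrival rate, and a resolution rate that drains the stock but is capped by team capacity; the mean time an incident stays open is estimated with Little's law. The function names, the staffing parameter chosen for adjustment, and all numbers are hypothetical.

```python
def simulate(arrival_rate, staff, per_staff_rate, handling_time=2.0, hours=200, dt=1.0):
    """Euler-integrate one stock (open incidents) with one inflow and one outflow.

    Returns an estimate of the mean time (hours) an incident stays open at the
    end of the run, using Little's law: open incidents / arrival rate.
    """
    open_incidents = 0.0  # level variable
    for _ in range(int(hours / dt)):
        capacity = staff * per_staff_rate  # auxiliary variable: incidents/hour
        # Rate variable: first-order drain of the stock, capped by team capacity.
        resolution_rate = min(capacity, open_incidents / handling_time)
        open_incidents += (arrival_rate - resolution_rate) * dt
    return open_incidents / arrival_rate


def homeostatic_adjust(target_hours, arrival_rate, staff, per_staff_rate, max_staff=50):
    """Reset the staffing parameter and re-simulate until the target is met.

    Everything here happens in simulation space; the returned value is only a
    candidate for intervention in the real system.
    """
    while staff <= max_staff:
        if simulate(arrival_rate, staff, per_staff_rate) <= target_hours:
            return staff
        staff += 1  # policy change: adjust the parameter value, then re-simulate
    raise ValueError("no staffing level up to max_staff meets the target")


if __name__ == "__main__":
    # Hypothetical target: incidents stay open no longer than 4 hours on average.
    print(homeostatic_adjust(target_hours=4.0, arrival_rate=10.0, staff=4, per_staff_rate=1.5))
```

With these assumed numbers the backlog grows without bound for 4, 5 and 6 staff and settles once capacity exceeds the arrival rate, so the loop stops at 7. A real model would adjust several rate variables, not staffing alone.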
In VSM terminology System 4 is intelligence, planning and strategy formulation. In the production support environment System 4 is mainly concerned with anticipatory adaptation: predicting trends, planning, and changing internal control algorithms or escalation rules. System 4 contains a model of production support. Internal control algorithms are about tuning the rate variables in the system dynamics model. Fig. 3 shows anticipatory adaptation. In the VSM framework System 4 is supposed to have a model of the production system. In this study that model is a system dynamics model and, as shown in Fig. 1, it connects to Systems 3, 4 and 5 for different reasons.
[Figure 3. Anticipatory adaptation. Elements shown: Scenario anticipation; Trend sensing; Policy analysis; Production environment scanning; Adjusting parameter values; SD model simulation.]
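The loop in Fig. 3 can be sketched as a comparison of candidate policies against an anticipated scenario. The sketch below is only an illustration under stated assumptions: the scenario is a shorter mean time between incidents, each policy is reduced to a staffing level and a KEDB hit ratio, and the policy chosen is the cheapest one that keeps the simulated backlog bounded. The selection criterion, the handling times, the costs and all names are hypothetical and are not taken from the model used in this study.

```python
from dataclasses import dataclass

T_KNOWN = 0.2  # assumed hours per incident when a known-error solution exists
T_NEW = 1.0    # assumed hours per incident otherwise


@dataclass(frozen=True)
class Policy:
    name: str
    staff: int
    kedb_hit_ratio: float  # share of incidents resolved from a known-error record
    cost: float            # hypothetical cost units per period


def capacity(policy):
    """Incidents per hour the team can resolve under this policy."""
    mean_hours = policy.kedb_hit_ratio * T_KNOWN + (1 - policy.kedb_hit_ratio) * T_NEW
    return policy.staff / mean_hours


def final_backlog(arrival_rate, policy, hours=100):
    """Simulate the open-incident stock hour by hour and return its final value."""
    backlog = 0.0
    for _ in range(hours):
        backlog = max(0.0, backlog + arrival_rate - capacity(policy))
    return backlog


def select_policy(mtbi_hours, policies, max_backlog=5.0):
    """Return the cheapest policy whose simulated backlog stays within max_backlog."""
    arrival_rate = 1.0 / mtbi_hours  # a shorter MTBI means more incidents per hour
    viable = [p for p in policies if final_backlog(arrival_rate, p) <= max_backlog]
    return min(viable, key=lambda p: p.cost) if viable else None


CANDIDATES = [
    Policy("baseline", staff=8, kedb_hit_ratio=0.3, cost=800),
    Policy("add staff", staff=12, kedb_hit_ratio=0.3, cost=1200),
    Policy("improve KEDB", staff=8, kedb_hit_ratio=0.6, cost=950),
    Policy("both", staff=10, kedb_hit_ratio=0.6, cost=1150),
]

if __name__ == "__main__":
    # Anticipated scenario: MTBI drops to 4 minutes (15 incidents per hour).
    print(select_policy(4 / 60, CANDIDATES))
```

Trying each policy or combination in simulation before committing to one keeps the cost of a wrong guess inside the model, not in the live support operation.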
B. Anticipatory adaptation
Anticipatory adaptation is about sensing the trends that give rise to scenarios or events, and changing the rules or internal control algorithms (rate variables) of the system dynamics model to deal with the situation as and when it arises. An example of a trend is a large number of users logging into the system, which gives rise to a relatively large number of incidents and thus reduces the Mean Time Between Incidents (MTBI). As shown in Fig. 3, policy analysis is carried out with respect to the likely scenario: the model parameters corresponding to the policies are adjusted and the model is simulated to see whether these policies will be effective in the likely
scenario. When MTBI decreases, incident resolution efficiency must increase in order to comply with SLAs. A gain in efficiency can be attained by changing staffing policies (e.g. deploying more resources), increasing operational process capacity, increasing the effectiveness of the Known Error Data Base (KEDB), etc. Each of these policies, or a combination thereof, is tried and the best-performing option is selected for implementation. In anticipatory adaptation minor structural adjustments to the model may be made, but no major structural changes are implemented. Major structural changes belong to structural adaptation, as discussed in the next subsection. It must be noted that these distinctions (homeostatic, anticipatory and structural adaptation) are conceptual in nature. Eventually there is only one system.
C. Structural adaptation
Fig. 4 explains structural adaptation. The modeler initially builds the model from his or her understanding of the domain. This model evolves for the following reasons: policy changes are initiated to meet objectives; the initial model is not adequate to handle certain scenarios; and learning and experience accumulate. The core difference between the system oriented and the event oriented approach to service management is that the latter has no way to capture learning, while in the former learning can be captured in structure. Without structure it is difficult to capture learning from experience [10]. Causal structure, for instance, can capture learning from experience, and that structure determines behavior is one of the basic premises of system dynamics. Hence, the causal model underlying the system dynamics model can be changed as a result of learning from experience, and this change is reflected in model behavior. Changes to the model are effected in the following ways: by deleting elements, by adding elements, and by changing the causal relationships among elements. The evolved SD model is used to update the domain knowledge.
Homeostatic, anticipatory and structural adaptations together provide full control over the system and help implement system oriented production support. Structural adaptation is called for when a major policy decision is taken. For instance, management may decide to add a problem management level to a preexisting incident management level, or to remove an existing level.
[Figure 4. Structural adaptation. Elements shown: Domain knowledge; Initial system dynamics model; triggers (policy change initiated to meet objectives; model not adequate to handle a trend or scenario; evolution through learning and experience); Deleting components; Adding components; Readjusting causal relationships; Evolved SD model.]
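The three kinds of model change named in Fig. 4 can be sketched on a causal model stored as a set of signed links. This is a minimal illustration, assuming a dictionary that maps (cause, effect) pairs to a polarity of +1 or -1. The variable names and the example of adding a problem management level are hypothetical and do not reproduce the causal model used in this study.

```python
def add_link(model, cause, effect, polarity):
    """Return a copy of the model with one signed causal link added or readjusted."""
    if polarity not in (1, -1):
        raise ValueError("polarity must be +1 or -1")
    updated = dict(model)
    updated[(cause, effect)] = polarity
    return updated


def delete_element(model, element):
    """Return a copy with the element and every link touching it removed."""
    return {link: p for link, p in model.items() if element not in link}


def elements(model):
    """All variable names that appear in at least one link."""
    return {name for link in model for name in link}


def loop_polarity(model, cycle):
    """Multiply link polarities around a closed path; -1 marks a balancing loop."""
    sign = 1
    for cause, effect in zip(cycle, cycle[1:] + cycle[:1]):
        sign *= model[(cause, effect)]
    return sign


# Initial model: incident management only (hypothetical variables).
INITIAL = {
    ("incident arrivals", "open incidents"): 1,
    ("open incidents", "resolution rate"): 1,
    ("resolution rate", "open incidents"): -1,
}

# Structural adaptation: add a problem management level that feeds back on arrivals.
EVOLVED = add_link(INITIAL, "open incidents", "problems investigated", 1)
EVOLVED = add_link(EVOLVED, "problems investigated", "root causes fixed", 1)
EVOLVED = add_link(EVOLVED, "root causes fixed", "incident arrivals", -1)

if __name__ == "__main__":
    loop = ["incident arrivals", "open incidents", "problems investigated", "root causes fixed"]
    print(loop_polarity(EVOLVED, loop))
```

Each operation returns a new model and leaves the earlier one untouched, so the initial and evolved models can be simulated side by side before the domain knowledge is updated.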
D. The system dynamics model
The model consists of the following process related views: incident generation and response, incident management, and problem management, which includes root cause analysis. Cutting across these views are some common modules that handle resource management, competency management, metrics computation, and cost computation and control data. In system dynamics, policy analysis and policy design happen around rate variables. Table I lists the major policy groups, policies and rate variables in the system dynamics model this study has used for production support. The policy groups considered are operating model, resource management and service delivery. The system dynamics model in this study implements a typical view of production management consisting of three levels. Level 1 consists of incident acknowledgement and response, level 2 consists of incident resolution, and level 3 consists of problem management and root cause analysis. Escalation takes place from level 1 to level 3, governed by Service Level Agreements (SLAs) and escalation rules. Three types of incidents have been considered in the model: commoditized incidents, functional incidents, and technical incidents. Commoditized incidents are commonplace generic incidents, which do not require any particular skill to resolve. Functional incidents are business process related queries, which can be solved by a business analyst who understands the business aspects of the software application. Technical incidents relate to the software aspect of the application and require technically trained staff to handle them.
E. Handling Complexity and Variety
If the entire complexity of production support were concentrated at one point or at one level, it would be impossible to handle. From a systems thinking point of view, development life cycles are created essentially to unfold and spread complexity in time and space.
The software development life cycle (SDLC) is a case in point. Imagine requirements engineering, design, development and testing
done simultaneously, all at once. That would be like starting to write code right away to create an application or an information system. Likewise, in production support complexity can be spread across applications, across levels, and across the separation of concerns among incident management, problem management and change management. Variety attenuation and amplification in the production support environment start from recognizing that the following inequality, which holds true for all organizations [11], also holds true for production support.
Variety (E) ≥ Variety (O) ≥ Variety (M)
This expression denotes the fact that the variety of the Environment is greater than or equal to the variety of Operations, and the variety of Operations is greater than or equal to the variety of Management. Since this holds true for production support too, the following variety amplification and attenuation strategies can be used. Variety amplification is used to increase the variety of management so that it can handle the greater variety of operations and environment, while variety attenuation is used to absorb the greater variety of operations and environment. As per the principle of requisite variety by Ross Ashby [3], [4], [5], the variety of the controller must be greater than or equal to the variety of the controlled for effective control and management. Therefore, management always endeavors to absorb incoming variety and enhance outgoing variety. In the production support environment variety amplification can be achieved with a Known Error Data Base (KEDB). The KEDB contains incident-solution pairs, which record how an incident was handled in the past. If a similar incident occurs again, the past solution can be readily used and the incident is resolved in much less time. This increases incident management capacity and operational process capacity, that is, the total number of incidents a team can resolve in a given time period. Standard Operating Procedures (SOPs) are another way to amplify variety. SOPs clearly spell out how an incident is to be managed and help users in ‘do it yourself’.
By enabling users to handle issues themselves, SOPs help reduce the ‘incidents’ that are escalated for support. A virtual service desk is yet another way of amplifying variety. With the help of appropriate tools and the internet, a virtual service desk [12] creates the illusion of a single centralized service desk even though support personnel may be located across different geographies and time zones. This arrangement results in increased productivity, efficient routing and improved control. Automation and tools for monitoring trends are two other ways to amplify the variety of management in incident management. Automation helps increase efficiency and reduce errors. Tools for monitoring trends help anticipate a trend and select strategies proactively. Variety attenuation means absorbing the variety that comes from the environment and operations to management, so as to ensure effective management. This study has identified the following ways to attenuate variety in production support: incident classification, creation of levels, staffing policies, separation of concerns, accountability, resource bargaining, interventions, and identification of patterns.
Incident classification helps in identifying the type of skill needed to handle an incident. For instance, the skill requirement differs for commoditized incidents, functional incidents and technical incidents, and thus incident classification influences staffing policies. Separation of concerns between incident management, problem management and change management helps teams focus on different aspects of production support simultaneously. The objective of incident management is to resolve the incident and restore operations as soon as possible. Problem management, on the other hand, is concerned with root cause analysis to identify the problem behind the incident. The objective of change management is to rectify the error so that incidents associated with that error do not occur again. Separation of concerns appears to be possible because these objectives are different. This study has also identified a set of invariants in production support and used them in the design of the production support system. Some of the invariants identified are as follows.
Incidents beyond level 1 support are indicative of problems.
Ascending technical niche across levels. The higher the level, the more skilled the staff that must be deployed to resolve the incident, perform root cause analysis or develop a workaround.
A large majority of incidents (80%) are resolved at the first level of support. Production support would not be viable and sustainable, and service availability would be seriously impacted, if the majority of incidents were not resolved at the first level of support.
There is a precedence order defined on production support activities. Incident management precedes problem management, and problem management precedes change management. This precedence order is similar to the SDLC.
Separation of monitoring and control from incident and problem management. As in case of VSM, monitoring and control (i.e. System 3) is separated from System 1 activities. In the context of production support monitoring and control is separated from incident and problem management. This can also be seen in Fig. 1 and Fig. 2.
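Two of the invariants listed above lend themselves to a simple check against a ticket log. The sketch below is an illustration only: it assumes that each incident records the level at which it was resolved and the ordered list of activities it passed through. The 80% threshold comes from the text; the function names and the sample data are hypothetical.

```python
from collections import Counter

# Precedence order stated in the invariants above.
PRECEDENCE = ["incident management", "problem management", "change management"]


def level1_share(resolved_levels):
    """Fraction of incidents resolved at level 1, given each incident's resolving level."""
    counts = Counter(resolved_levels)
    total = sum(counts.values())
    return counts[1] / total if total else 0.0


def follows_precedence(activities):
    """True if one incident's recorded activities never step backwards in PRECEDENCE."""
    ranks = [PRECEDENCE.index(a) for a in activities]
    return all(earlier <= later for earlier, later in zip(ranks, ranks[1:]))


def check_invariants(resolved_levels, histories, threshold=0.8):
    """Report whether the two invariants hold for the given log."""
    return {
        "majority_resolved_at_level_1": level1_share(resolved_levels) >= threshold,
        "precedence_order_respected": all(follows_precedence(h) for h in histories),
    }


if __name__ == "__main__":
    # Hypothetical log: 17 incidents closed at level 1, 2 at level 2, 1 at level 3.
    levels = [1] * 17 + [2] * 2 + [3]
    histories = [
        ["incident management"],
        ["incident management", "problem management", "change management"],
    ]
    print(check_invariants(levels, histories))
```

A check of this kind could feed the monitoring dashboard of System 3, since a falling level 1 share is an early sign that the support model is drifting away from viability.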
TABLE I. POLICY AND CRITICAL RATES FOR PRODUCTION SUPPORT IMPLEMENTATION
Policy group: Operating model; Resource management; Service delivery.
Policy: Dedicated team for incident response; Dedicated team for problem management; Creation of standard operating procedures; Right sourcing; Change management; Dedicated teams for commoditized, functional and technical incidents (or, teams segregated by incident types); Distributed teams to resolve incident types across support levels 2 and 3; Self-help portal for incident logging and categorization.
Critical rates affected (in table order): Incident response rate; Problem response rate; Problem investigation rate (Root cause analysis rate); Problem workaround rate; Standard operating procedures validation rate; Standard operating procedures validation rate; Incidents resolved at level 1 (Onsite & Offsite); Incidents resolved at level 2 (Onsite & Offsite); Incidents resolved at level 3 (Onsite & Offsite); Known error resolution rate; Problem workaround rate; Incidents escalated to problem management; Incident arrival rate; Incidents resolved at level 1 (across different incident types and using KEDB or not using KEDB); Incidents resolved at level 2 (across different incident types and using KEDB or not using KEDB); Incidents resolved at level 3 (across different incident types and using KEDB or not using KEDB); Incidents escalated to problem management; Problem response rate; Problem investigation rate (Root cause analysis rate); Incident response rate; Incident arrival rate.
CONCLUSION
This study puts together a viable framework for system oriented production support. It adapts the Viable System Model (VSM) for the purpose and augments it with system dynamics modeling and simulation to understand the complex dynamics of production support and to incorporate learning from experience through causal structures. The system dynamics approach has also helped in identifying a set of policies and controls in the form of rate variables in the system dynamics model. Knowledge of how to unfold complexity, of how to attenuate and amplify variety to handle complexity, and of the invariants in the domain has helped to construct the viable framework. This study has also discussed three levels of adaptation (homeostatic, anticipatory and structural), which have been implemented through the system dynamics approach.
REFERENCES
[1] ITIL Service Operations, Version 3, Chapters 4, 5 and 6. Office of Government Commerce (OGC), 2007.
[2] ITIL: A guide to incident management, UCISA. http://www.ucisa.ac.uk/~/media/Files/members/activities/ITIL/service_operation/incident_management/ITIL_a%20guide%20to%20incident%20management%20pdf.
[3] S. Beer, “The Heart of Enterprise”, Chichester: Wiley, 1979.
[4] S. Beer, “Brain of the Firm”, Chichester: Wiley, 1981.
[5] S. Beer, “Diagnosing the System for Organisations”, Chichester: Wiley, 1985.
[6] A. Leonard, “Viable System Model and Knowledge Management”, Kybernetes, Vol. 29 No. 5/6, 2000, pp. 710-715.
[7] J. D. Sterman, “System dynamics modelling: Tools for learning in a complex world”, California Management Review 43 (1): 8-25, 2001.
[8] P. K. J. Mohapatra, P. Mondal, M. C. Bora, “Introduction to System Dynamics”, Universities Press (India), 1994.
[9] C. Herring and S. Kaplan, “Viable System Model for Software”, 4th World Multiconference on Systemics, Cybernetics and Informatics (SCI’2000).
[10] ITIL Service Strategy, Version 3, Chapter 1. Office of Government Commerce (OGC), 2007.
[11] T. Hilder, “The Viable System Model”, Cavendish Software Ltd, Presentation 1.03, 1995.
[12] ITIL Service Operations, Version 3, Chapters 5 and 6. Office of Government Commerce (OGC), 2007.