Int. J. Human Factors and Ergonomics, Vol. 2, Nos. 2/3, 2013
Disruption management processes during emergencies on the railways
David Golightly*, Nastaran Dadashi and Sarah Sharples
Human Factors Research Group, University of Nottingham, University Park, NG7 2RD, UK
*Corresponding author
Meena Dasigi
Network Rail, No. 1 Eversholt Street, London, NW1 2DN, UK
Abstract: The following study presents an analysis of incident logs, recorded during major rail incidents (incurring more than 1,000 minutes total delay) to understand regular patterns in management and resolution across different incident types and across multiple roles. The analysis found that much effort goes into coordination of multiple actors and diagnosing both the cause and scale of the disrupting factor, as would be expected. Rather than events taking place merely in parallel, they are closely intertwined, with activities in one area (e.g., repairs to reopen track) being constrained by other activities (e.g., coordinating and mobilising appropriate people, arranging track access within a rescheduled service). These results question existing linear models of disruption management, both at an individual and organisational level, and have implications for decision-support for emergency management.
Keywords: rail; incident management; decision support; incident analysis.
Reference to this paper should be made as follows: Golightly, D., Dadashi, N., Sharples, S. and Dasigi, M. (2013) ‘Disruption management processes during emergencies on the railways’, Int. J. Human Factors and Ergonomics, Vol. 2, Nos. 2/3, pp.175–195.
Biographical notes: David Golightly is a Senior Research Fellow in the Human Factors Research Group, University of Nottingham. His background is in the study of learning, problem solving and expertise, particularly involving the use of ICT. He also investigates methods for capturing and integrating user requirements in user-centred design and implementation processes. He has worked on a number of rail-related projects including understanding situation awareness in rail traffic control, evaluating the impact of procedural change on signalling workload and awareness, and an evaluation of best practice for implementing rail research within the rail sector.
Copyright © 2013 Inderscience Enterprises Ltd.
Nastaran Dadashi has over five years of experience (including an industry-led PhD within the Human Factors Research Group, University of Nottingham) in guiding and implementing human factors of complex environments. Her PhD, ‘Human factors of future railway intelligent infrastructure’, focused on adapting a cognitive approach to the design and understanding of complex multi-agent systems within socio-technical environments. She has been involved in a wide range of projects, including alarm handling in railway electrical control, emergency management within railways and improving human interaction with railway control.
Sarah Sharples is a Professor of Human Factors at the University of Nottingham and the Head of the Human Factors Research Group. She has been a Researcher, Research Manager or grant holder on a number of industrial, government and EU funded projects, including a long term programme of research for Network Rail. She is a registered Ergonomist, and her main areas of interest and expertise are human-computer interaction, cognitive ergonomics and development of quantitative and qualitative research methodologies for examination of interaction with innovative technologies in complex systems.
Meena Dasigi is a Principal Engineer at Network Rail. She is part of a team working on traffic management systems and ERTMS, and is the Project Director of the EU FP7 ON-TIME project. She has a Doctorate in Mathematics with over 30 years’ experience in modelling, simulation and optimisation in both academia and industry. Besides her railway experience, she has worked in several industries including Airbus and the European Space Agency.
1 Introduction
Despite best efforts by rail stakeholders to maintain a punctual and reliable service, disruptions inevitably occur. Causes of disruption range from minor delays, such as passengers requiring longer than usual boarding times due to an icy platform, through to major events and emergencies, such as on-track fatalities or a failure of critical infrastructure. In 2006–7, 0.8 million incidents on the Great Britain rail network led to 14 million minutes delay at an estimated cost of £1 billion (NAO, 2008). From these, over 1,000 incidents each incurred over 1,000 minutes of total delay. These most severe delays have major implications not just in delayed passengers but also in terms of the amount of effort to coordinate people and equipment, issues with crowd control at stations, and spill-over into other transport modes as people seek travel alternatives. In the longer term, perceived unreliability and unpredictability can do much to undermine the perception of public transport as a viable option to the car (Friman and Garling, 2004) at a time when policy initiatives are trying to encourage more sustainable forms of travel (European Commission, 2011).
In addition to the need to prevent incidents in the first place, it is vital to have clear strategies to contain and mitigate emergencies, and then work to restore normal service as quickly as possible. A potential tool for rapid emergency management is decision-support and an area of particular interest is supporting the re-planning of the timetable. During incidents there are a number of re-planning activities to keep the train service running around the affected area and then to bring the affected area back into service after the incident has closed. Currently, support for re-planning is often limited, reactive and dependent on the dispatchers’ skills (Reinach, 2007; Golightly et al., 2011). Decision-support could improve the potential outcome of these planning actions by presenting viable options given the current disruption context, and then communicating the new plan to other rail stakeholders (Lenior et al., 2006; Kauppi et al., 2006). Any decision-support tools used during disruption should also recognise the impact of crew and rolling stock management (Freling et al., 2004; Jespersen-Groth et al., 2009), as even best efforts to keep a service running might be hampered by crew or trains ending up out of position due to the temporary timetable. Finally, emerging forms of ICT and social media mean that passengers have increased expectations for remaining informed during incidents. In the absence of accurate information, misinformation may circulate as to the cause or estimated impact of disruption (Houghton and Golightly, 2011), so being able to disseminate information to public quickly and accurately is a priority for rail operations (ATOC, 2012). In practice, emergency and disruption management is a complex process, and when disruption does occur, various aspects of the rail socio-technical system come to the fore. As Kohl et al. 
(2007) note from emergency management in aviation, when planning and scheduling a flight, individual elements and resources involved with the process (crew, aircraft and passengers) are reviewed independently and in some form of isolation. When things go wrong, however, the inter-relation between these elements plays an important role. So it is with the railways, with close linkages between independent roles becoming apparent. It is worth describing the organisational structure of the railways here (represented diagrammatically in Figure 1). While this description is based on GB, and vastly simplified, the same basic structure is found in many countries including Italy, Belgium, the Netherlands and France. In other countries, such as Sweden, roles may be merged (Golightly et al., 2013). There are two types of organisation – the infrastructure manager (IM) who builds, maintains and operates the track (Network Rail in Great Britain), and train operating companies [sometimes railway undertakers (RUs)] who operate the trains. There is usually only one IM, though there may be many RUs. A parallel can be drawn to aviation where there may be one organisation that controls traffic, but many airlines. Front-line control of traffic is by a signaller (a dispatcher elsewhere), who sets points and signals to give a train authority to proceed. A train driver, part of the RU, then drives the train in accordance with this authority. During an incident, the primary communication is between drivers in the affected area and the signaller. For example, a
driver will communicate that there is an obstruction on the line. Also, the signaller may see indications of technical failures, such as points failures, on their displays (for more see Golightly et al., 2011). When an incident may potentially generate delays that cannot be managed within the timetable, the issue is escalated to incident control roles in both the RU and the IM. Together, these roles will coordinate an alternative to the timetable and, depending on the type and severity of the incident, may call on additional resources. Key activities during incidents include tactical re-planning by the front-line controller (Kauppi et al., 2006), re-organisation of the timetable or re-allocation of assets by a strategic control function (Farrington-Darby et al., 2006), integration with emergency services (Smith and Dowell, 2000) and the dissemination of information to other rail stakeholders and the travelling public (ATOC, 2012). Also, in the case of technical failure, the appropriate maintenance team must be informed of the issue and dispatched to make repairs. These repairs, and other activities during emergencies, can involve access to the track that requires careful planning and confirmation of protection arrangements (Golightly et al., 2012). Triangulating multiple agents in distributed environments is challenging and coordination is highly dependent on understanding the constraints that all actors are working to (Woods and Branlat, 2010). Rail staff may lack confidence and knowledge during incidents when dealing with roles from outside of the railways, such as emergency services (Cheng and Tsai, 2011). In Great Britain, particular attention has been paid towards improving coordination and communication with police, paramedics and fire service (NAO, 2008). Responses to events are likely to be specific and strategic. First, operators involved in re-planning vary in terms of the strategies they use. 
While some prefer riskier solutions that may bring the system back to normal more quickly, others prefer solutions that are more robust, if slower (Kauppi et al., 2006; Cheung, 2011). Discussions with incident controllers, held as a precursor to the analysis presented in this paper, highlighted simplicity of plans as another important constraint. Given the large numbers of actors involved, it may be preferable to have a simpler solution that can be interpreted and implemented accurately by all, rather than a complex strategy that may be prone to error. Experience is also a factor that will influence how incidents are managed, both in terms of general expertise and length of service (Cheng and Tsai, 2011) and in terms of local knowledge (Davis Associates Ltd., 2003; Golightly et al., 2011). Another element of variance with incidents is that the circumstances of each event are unique (Aminoff et al., 2007). On the railways, a failure in the same type of equipment will have radically different implications depending on whether it is on a remote branch line or on an approach to a busy terminus. The contingencies that need to be compared for effective action mean that it may be impossible to fully automate decision-making. As an alternative, decision-support should present the potential costs and benefits of different courses of action, but leave the ultimate planning of action to dispatchers and incident controllers (Lenior et al., 2006). One implication is that quantification of the success of any given re-planning option is difficult. In GB there are roles dedicated to post-event analysis for organisational learning and assessment of the quality of response, but this is qualitative and on a case-by-case basis. The only metric of disruption is through delay attribution (discussed below) in terms of total minutes delay, duration of delay and trains affected.
Also, there is some evidence that existing rail automation, in the form of automatic route setting, for example, is at its least effective during emergency and disrupted conditions (Balfe et al., 2012). Being able to model different types and characteristics of disruption could therefore contribute to more effective routing algorithms.
Figure 1 Schematic of key roles and organisations involved in incident management (see online version for colours)
There are existing models of disruption and emergency management for the railways. Cheng and Tsai (2011) use an adaption of the COSMO (Kontogiannis, 1999) model for railways – emphasising three high level stages of
1 diagnosis and assessment
2 choices among alternative options or goals
3 scheduling tasks and minimising risks.
Ezzedine and Kolski (2007) derive Petri nets of disruption handling based on Rasmussen’s (1986) decision ladder. Belmonte et al. (2011) model disruptions using functional resonance accident model (Hollnagel, 2007) approach, basing their model around a process of
1 monitoring
2 incident detection
3 incident diagnostic
4 choice of recovery action
5 action.
What these models [and similar models from other domains, such as Kohl (2007)] have in common is that they are fundamentally underpinned by a sequential model of situation awareness, decision making and action. There are a number of potential limitations with this view, however. First, there is a risk of interpreting these models as being sequential and incremental, and that situation awareness or decision making are discrete steps to be completed before the next stage can begin. In practice, emergency management is an inherently dynamic environment, where all information is not necessarily known at the outset. New problems come to light and plans must adapt as conditions deteriorate or improve. It is unclear how readily actual incident management conforms therefore to a linear sequence. A second limitation of using these models to understand larger organisational incident management processes is that they tend to focus on the individual and their action or strategy. As highlighted previously, emergency management is a coordinated activity, but typically these models focus on one role. Those models that are more organisational tend to be normative (e.g., NAO, 2008), describing how the process should proceed, rather than the reality of how these processes are coordinated. Teams involved in activities such as electrical isolation (Stanton et al., 2009) or arranging track access (Golightly et al., 2012) work in cycles of consensus building, action and cross checking. Building a contextual understanding of the nature of any socio-technical system, and the constraints it works under in practice, is essential for the successful implementation of new emergency management technology. It is also vital to fully understand processes and constraints relevant to emergency management training (Kontogiannis, 1999). 
Research using simulation (Belmonte et al., 2011), subjective data (Cheng and Tsai, 2011) or real-time observation (Farrington-Darby et al., 2006) is useful for understanding individual perspectives, but there is a need for a whole-organisation perspective of incident management.
The following study comprises an exploratory analysis of incident logs of major rail incidents (classifying ‘major’ as any incident that incurs over 1,000 minutes of train delay). The source data is taken from the GB control centre incident log (CCIL). Each incident log is an ongoing record, completed during the incident by relevant roles, of major tasks, decisions, actors involved and events. The log is a record of all actors involved in incident management, and is therefore useful for understanding how organisations manage rail incidents. In addition, information is recorded in real time, which should increase the reliability of the information contained within. Every incident also has a quantification of delay in terms of time and cost. In GB, delays result in delay attribution penalties paid to all affected parties. For example, if a points failure occurs then the infrastructure manager may be liable to pay delay attribution costs to any affected train operating company. This requires a formal process for attributing delay, and a set of ‘delay attribution officers’ responsible for recording all delay in real time (Delay Attribution Board, 2013). This leads to accurate records of the number of trains affected by an incident, the delay each of those trains incurs, and therefore a calculation of total delay incurred for any incident.
The aim of the analysis was to understand the types of activities involved in major rail incident management, and how those activities were structured and related. One critical question was how similar events were, and how much the type of incident would influence the trajectory of incident management, or the types of activity involved.
A further aim of the analysis was to validate and extend some of the theoretical models of railway emergency management that have appeared to date in the literature. The approach taken was inductive and exploratory, in that there were no specific hypotheses
to test, but rather that a more descriptive approach to incident analysis should be used in order to contrast the patterns found with existing models, such as those in Cheng and Tsai (2011), or Belmonte et al. (2011). Finally, from the analysis it would also be possible to identify specific details relating to certain types of activity, for example, the different service options selected during the incident. This could determine the sets of options that were relevant to any given activity, for inclusion in tools such as decision support, or training. One of the underpinning assumptions of using this dataset was that the number of entries was representative of the volume of activity associated with an event. It was therefore also necessary to test whether the number of entries correlated with any of the measures of severity of the incident, such as number of trains affected or minutes delay. This was tested using correlations.
2 Method
2.1 Data source
The GB railway CCIL incident log is a record, completed in near real-time, of events during rail incident management. The logs are completed by incident managers based at Network Rail, the GB railway infrastructure provider, during the course of the incident. Each entry in the log gives a short description of the event at that point. A hypothetical example is presented in Table 1.

Table 1 Hypothetical example of log entry for overhead line dewirement

| Time | Entry | Code |
| --- | --- | --- |
| 15:22 | Driver of 1A23 (a standard GB train identifier code) called signaller Tyne IECC (a signalling control centre), to notify that overhead line has come down on the up fast line near Chester-Le-Street. | Diagnosis |
| 15:23 | All trains on East Coast Mainline (ECML) held at Durham and Newcastle. | Containment |
| 15:24 | Tyne mobile operations manager (MOM) notified and en-route ETA 15:35. | Mobilisation |
| 15:28 | East Coast Trains, Transpennine, Cross Country and DB (train and freight operating companies) notified. | Coordination |
| 15:30 | Following OIS message relayed to ECML, “due to problems with overhead line equipment between Newcastle and Durham, services on the East Coast Main Line are subject to severe delays. Engineers are heading to site to assess the problem and an update will be given when the problem has been fully assessed. We apologise for the disruption to your journey today.” | Information |
| 15:35 | MOM on scene and inspecting damage. | Diagnosis |
| 15:48 | Conference call with East Coast Trains, Transpennine and Cross Country (train operating companies) to arrange alternative service plan planned for 15:50. | Coordination |
| … | … | … |

Note: Activity codes are indicated in the third column.
Each incident log also includes data on attributed delay, including total attributed delay and number of trains affected, and start and end time of the incident from which it was possible to calculate the duration of the delay.
2.2 Incident selection
Discussion with seven subject matter experts from rail infrastructure managers derived a short-list of ten incident types that were perceived as being particularly disruptive and difficult to handle (Dadashi et al., 2013). While some of these incident types (e.g., passenger boarding issues) were seen to be minor but high frequency, several were more acute, with a single event or failure leading to potentially massive delay. These four incident types were
• Overhead line dewirement – the overhead power supply for electric traction breaks causing both a complete loss of power supply, and often requiring substantial repair to infrastructure.
• Fatality – cases where people on-track are killed by trains or power supply. This can be through events such as trespass, suicide or accidents at level crossings.
• Signalling failures – the signalling system is the underpinning safety-critical system of the railway. Signalling failures may involve issues with line side signals, or loss of track circuits which detect the presence of trains.
• Points failures – points allow trains to move from track to track. Not only do point failures mean that certain routes cannot be set, they also present a risk of derailment.
Events in CCIL were filtered to identify the ten most recent incidents of each of the above incident types with over 1,000 minutes of attributed delay, where attributed delay is the total number of minutes of delay incurred by all affected trains as a result of the incident. In GB, delay is calculated through the delay attribution process. This involves a specific role in most rail control centres on the GB rail network, whose function it is to attribute delay minutes to a specific cause. Total delay is calculated by determining the number of minutes delay attributed to each and every affected service, and therefore also supports a calculation of the number of trains affected by the delay. Incident duration is calculated from when the incident is first identified and logged in CCIL, to the point where delays are no longer attributed to that specific cause. Delay attribution is highly accurate, as the number calculated is the source of compensation paid between various stakeholders on the railways. For example, if delay to a number of trains is due to an infrastructure failure, the infrastructure manager (Network Rail, in GB) is mandated to reimburse the affected train operating companies.
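The three measures and the selection rule described above can be illustrated with a minimal sketch. The `Incident` record, its field names and the example values are hypothetical and are not the CCIL schema; the sketch only assumes that each incident carries a per-train record of attributed delay minutes.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    incident_type: str
    logged_at: datetime             # when the incident was first logged
    last_attribution_at: datetime   # when delay stopped being attributed to it
    delay_by_train: dict            # train identifier -> attributed delay (mins)

    @property
    def total_delay(self) -> int:
        # Total attributed delay is the sum over every affected service.
        return sum(self.delay_by_train.values())

    @property
    def trains_affected(self) -> int:
        return len(self.delay_by_train)

    @property
    def duration_minutes(self) -> float:
        elapsed = self.last_attribution_at - self.logged_at
        return elapsed.total_seconds() / 60


def select_major(incidents, incident_type, threshold=1000, limit=10):
    """Most recent `limit` incidents of one type with over `threshold` minutes delay."""
    major = [i for i in incidents
             if i.incident_type == incident_type and i.total_delay > threshold]
    return sorted(major, key=lambda i: i.logged_at, reverse=True)[:limit]


# Invented example values, for illustration only.
example = Incident(
    incident_type="points failure",
    logged_at=datetime(2012, 3, 1, 8, 0),
    last_attribution_at=datetime(2012, 3, 1, 11, 30),
    delay_by_train={"1A23": 640, "2B45": 410, "1C67": 95},
)
selected = select_major([example], "points failure")
```

In this invented case three trains share 1,145 minutes of attributed delay over a 210-minute incident, so the incident passes the 1,000-minute threshold.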
3 Analysis
The analysis proceeded in two steps. The first step involved performing correlations to determine whether linear relationships were present between factors such as delay attribution, delay duration and number of event logs. The second step involved coding log items for each of the incidents, in order to determine the activities involved. This process involved reviewing approximately half of the incident logs, and describing each of the entries in terms of their major activity. Authors DG and ND reviewed these comments and agreed a coding strategy. Each incident was then analysed by reading through the log in detail in Microsoft Excel, and assigning one of the agreed codes. A sample of the analysis was reviewed and discussed and, in some cases, codes were dropped or merged. The final coding system was applied as follows:
• access – any event involving planning, execution or handback of track or electrical isolation (e.g., setting signals to red so that track workers can access a failed point)
• alternative – setting up alternative service arrangements for trains and passengers during the incident (e.g., diverting trains, arranging bus replacements)
• containment – stopping trains in the immediate vicinity of the incident to prevent escalation (e.g., holding all trains at the preceding station while the exact nature of an incident is investigated)
• coordination – communication between stakeholders to determine a plan of action or to negotiate arrangements (e.g., contacting the British Transport Police and Train Operators to inform them of an emergency)
• diagnosis – activities related to determining the cause of an incident (e.g., receiving a call from a driver that the overhead line has come down)
• extraction – activities to get trapped trains and passengers out of the immediate vicinity of an incident (e.g., moving a diesel train into position to extract an electric locomotive trapped during an overhead line dewirement)
• follow-up – carrying out work after the event (e.g., carrying out investigative work to determine the exact cause of a signal failure, after working repairs have been made)
• information – communication to the public of cause and estimates related to major delays (e.g., putting information on screens at all affected station platforms)
• mobilisation – moving people or plant into place to perform other activities (e.g., getting staff on site to diagnose a fault)
• organisation – assigning people to specific roles for the duration of the incident [e.g., specifying the name and contact details of the temporary rail incident officer (RIO)]
• rectification – returning the infrastructure to a serviceable state (e.g., repairing a signal failure)
• restoration – opening a part of the infrastructure so that services can resume (e.g., re-opening a line affected by a points failure).
Examples of the application of these codes are given in the hypothetical example in Table 1.
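As an illustration of how coded entries translate into the frequency data reported later, the sketch below represents the twelve codes as an enumeration and tallies the codes assigned to the seven hypothetical entries in Table 1. The data structure and the function name `percentage_shares` are assumptions made for this example; the coding in the study itself was carried out by hand in a spreadsheet.

```python
from collections import Counter
from enum import Enum


class Activity(Enum):
    """The twelve activity codes of the final coding system."""
    ACCESS = "access"
    ALTERNATIVE = "alternative"
    CONTAINMENT = "containment"
    COORDINATION = "coordination"
    DIAGNOSIS = "diagnosis"
    EXTRACTION = "extraction"
    FOLLOW_UP = "follow-up"
    INFORMATION = "information"
    MOBILISATION = "mobilisation"
    ORGANISATION = "organisation"
    RECTIFICATION = "rectification"
    RESTORATION = "restoration"


# Codes assigned to the hypothetical log entries in Table 1, in time order.
table1_codes = [
    ("15:22", Activity.DIAGNOSIS),
    ("15:23", Activity.CONTAINMENT),
    ("15:24", Activity.MOBILISATION),
    ("15:28", Activity.COORDINATION),
    ("15:30", Activity.INFORMATION),
    ("15:35", Activity.DIAGNOSIS),
    ("15:48", Activity.COORDINATION),
]


def percentage_shares(coded_entries):
    """Percentage of an incident's log entries that fall under each code."""
    counts = Counter(code for _, code in coded_entries)
    total = sum(counts.values())
    return {code: 100 * n / total for code, n in counts.items()}


shares = percentage_shares(table1_codes)
for code, share in shares.items():
    print(f"{code.value}: {share:.1f}%")
```

Codes that never occur in an incident are simply absent from the result, which corresponds to a share of zero for that incident.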
4 Results
4.1 Incident demographics
The following data sources were used:
• incident duration (from delay attribution record in the incident log)
• total minutes delay (from delay attribution record in the incident log)
• number of trains affected (from delay attribution in the incident log)
• number of log entries (calculated from count of number of entries recorded for a given incident).
Descriptive data for the 40 incidents are presented in Table 2. Pearson’s correlations were carried out between the four variables, and are presented in Table 3. Significant correlations were found between trains affected and incident duration (r = 0.40, df = 38, p < 0.05), number of log entries and delay attribution (r = 0.59, df = 38, p < 0.01), trains affected and delay attribution (r = 0.89, df = 38, p < 0.01) and number of log entries and trains affected (r = 0.62, df = 38, p < 0.01).

Table 2 Means and standard deviations for the four incident types analysed

| Incident type | Incident duration (mins) | Attributed delay (mins) | Trains affected | Event log entries (no.) |
| --- | --- | --- | --- | --- |
| OHL (n = 10) | 3,011.5 (4,992.2) | 4,902.2 (5,065.0) | 340.7 (303.3) | 85.3 (98.0) |
| Fatality (n = 10) | 553.5 (494.8) | 3,062.3 (2,875.4) | 163.6 (120.3) | 52.2 (27.1) |
| Signal failure (n = 10) | 741.5 (674.0) | 3,421.0 (4,598.7) | 208.1 (198.1) | 26.6 (13.6) |
| Points failure (n = 10) | 4,231.1 (8,026.6) | 1,710.0 (562.3) | 202.9 (121.9) | 30.0 (19.6) |
| Total average (n = 40) | 2,134.4 (4,821.1) | 3,273.9 (3,755.7) | 228.1 (204.0) | 48.5 (55.5) |

Table 3 Correlation matrix for dependent variables for the incidents (n = 40 for all cells)

| | Duration | Attributed delay | Trains affected | Event log entries |
| --- | --- | --- | --- | --- |
| Duration | 1 | | | |
| Attributed delay | 0.05 | 1 | | |
| Trains affected | 0.40* | 0.55** | 1 | |
| Event log entries | 0.10 | 0.88** | 0.62** | 1 |

Note: * = p < 0.05; ** = p < 0.01
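The significance levels attached to the coefficients can be checked by converting each r to a t statistic with df = n − 2, using t = r·√(df / (1 − r²)). The sketch below does this for n = 40. The two critical values are approximate figures from standard two-tailed t tables for df = 38 and are an assumption of this example, not values reported in the study.

```python
import math

N = 40  # incidents in the sample, so df = N - 2 = 38

# Approximate two-tailed critical t values for df = 38 (standard t tables).
T_CRIT_05 = 2.024
T_CRIT_01 = 2.712


def t_statistic(r, n=N):
    """t statistic for testing a Pearson correlation against zero."""
    df = n - 2
    return r * math.sqrt(df / (1 - r * r))


def stars(r):
    """Significance marker in the style of Table 3 for a coefficient with n = 40."""
    t = abs(t_statistic(r))
    if t >= T_CRIT_01:
        return "**"
    if t >= T_CRIT_05:
        return "*"
    return ""


for r in (0.05, 0.10, 0.40, 0.55, 0.62, 0.88):
    print(f"r = {r:.2f}  t = {t_statistic(r):.2f}  {stars(r)}")
```

With 38 degrees of freedom, r = 0.40 gives t of roughly 2.69, which clears the 0.05 threshold but falls just short of the 0.01 threshold, in line with the single asterisk in Table 3.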
5 Content analysis
In total, 1,941 entries were coded. Table 4 shows the total frequency for each code, and the percentage share of that code. The share is also shown in Figure 2.

Table 4 Total frequencies and percentages for codes

| Code | Count | Percentage |
| --- | --- | --- |
| Access | 120 | 6.2 |
| Alternative | 288 | 10.9 |
| Containment | 35 | 1.5 |
| Coordination | 381 | 17.3 |
| Diagnosis | 285 | 17.9 |
| Extraction | 156 | 5.8 |
| Follow-up | 49 | 2.6 |
| Information | 20 | 0.9 |
| Mobilisation | 194 | 10.6 |
| Organisation | 86 | 3.8 |
| Rectification | 182 | 12.1 |
| Restoration | 145 | 10.5 |
Figure 2 Frequencies of coded activities for all incidents
Percentage shares were calculated for each incident. Table 5 shows the percentage means for each code by incident type. The percentage share is also shown in Figure 3. As each entry was time stamped, it was possible to present the activities as a time series for any given incident. Example time series are shown in Figure 4 for signal failure and Figure 5 for fatality. While some activities had specific start and end points (e.g., separate log entries for the granting and hand-back of protection for on-track workers), not all activities had logs specifying when that activity ended. Other activities, such as communication or extraction, were often discrete points in time. Therefore, it was not reliable or meaningful to calculate the length of time over which specific activities took place.

Table 5 Frequencies of code as % means by incident type

| Code | OHL | Fatality | Signal | Point |
| --- | --- | --- | --- | --- |
| Access | 11.1 | 2.4 | 1.5 | 9.8 |
| Alternative | 10.7 | 13.6 | 8.0 | 11.2 |
| Containment | 1.2 | 2.7 | 1.7 | 0.3 |
| Coordination | 14.1 | 21.6 | 14.5 | 19.0 |
| Diagnosis | 13.8 | 12.6 | 29.8 | 15.5 |
| Extraction | 12.6 | 8.4 | 1.2 | 1.1 |
| Follow-up | 1.6 | 5.8 | 2.2 | 0.9 |
| Information | 0.7 | 1.3 | 0.3 | 1.3 |
| Mobilisation | 7.4 | 14.9 | 7.8 | 12.4 |
| Organisation | 6.4 | 3.8 | 2.7 | 2.1 |
| Rectification | 14.1 | 8.0 | 10.5 | 15.6 |
| Restoration | 6.4 | 5.0 | 20.0 | 10.8 |
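The percentages in Table 4 appear to correspond to the unweighted mean of the four by-type means in Table 5, rather than to each code's count divided by the 1,941 coded entries (for ‘alternative’, 288/1,941 is about 14.8%, against 10.9% in Table 4). Because each incident type contributes ten incidents, the mean of the four type means equals the mean of the 40 per-incident shares. The sketch below checks this reading against the published figures; it is an observation about the tables, not a description of the procedure actually followed.

```python
# Table 5: mean percentage share per code for OHL, fatality, signal and
# points incidents (ten incidents of each type).
TABLE5 = {
    "access":        (11.1, 2.4, 1.5, 9.8),
    "alternative":   (10.7, 13.6, 8.0, 11.2),
    "containment":   (1.2, 2.7, 1.7, 0.3),
    "coordination":  (14.1, 21.6, 14.5, 19.0),
    "diagnosis":     (13.8, 12.6, 29.8, 15.5),
    "extraction":    (12.6, 8.4, 1.2, 1.1),
    "follow-up":     (1.6, 5.8, 2.2, 0.9),
    "information":   (0.7, 1.3, 0.3, 1.3),
    "mobilisation":  (7.4, 14.9, 7.8, 12.4),
    "organisation":  (6.4, 3.8, 2.7, 2.1),
    "rectification": (14.1, 8.0, 10.5, 15.6),
    "restoration":   (6.4, 5.0, 20.0, 10.8),
}

# Table 4: percentage reported for each code across all 40 incidents.
TABLE4 = {
    "access": 6.2, "alternative": 10.9, "containment": 1.5,
    "coordination": 17.3, "diagnosis": 17.9, "extraction": 5.8,
    "follow-up": 2.6, "information": 0.9, "mobilisation": 10.6,
    "organisation": 3.8, "rectification": 12.1, "restoration": 10.5,
}


def mean_of_type_means(shares):
    """Unweighted mean across incident types (valid because n is equal per type)."""
    return sum(shares) / len(shares)


gaps = {code: abs(mean_of_type_means(TABLE5[code]) - TABLE4[code]) for code in TABLE5}
max_gap = max(gaps.values())
print(f"largest gap between the two tables: {max_gap:.3f} percentage points")
```

Every code agrees to within the rounding of the published one-decimal values, which supports reading the Table 4 percentages as means of per-incident shares.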
Figure 3 Frequency of percentages of coded log entries by incident type
Figure 4 Time line analysis for signal failure 1 and 2
Figure 5 Time line analysis for fatality 1 and 2
6 Discussion
Both trains affected and delay attribution are significantly correlated with number of log entries. This gives confidence that the logs are in some way representative of the activities that are occurring during major disruption, and are therefore a useful record of incident events worthy of further analysis. There was no significant correlation, however, between overall duration and number of log entries or total delay, and only a weaker correlation with number of services affected. This suggests that it is not duration of the event that dictates severity, but other factors, most likely where and when the event occurs. A failure on a busy track section or at a busy time of day will generate many delay minutes, and require much incident handling, without necessarily taking longer to resolve. The lack of significant correlation between duration and incurred delay may also be because the analysis has already selected a subset of incidents, namely those with 1,000 minutes or more delay. A correlation across all incidents, including those with small delays, may have reached significance but this was not relevant to the current study.
Turning to the frequencies of codes, the major activities are diagnosis and coordination. Diagnosis (17.9%) involves identifying the cause of the incident, and forms much of the initial incident activity. From the content of the coded entries, much of the information associated with diagnosis relates to estimating the severity of the causal factor (for example, whether a fatality is suspicious, which involves a much longer closure of track for police investigation). The content of the coded entries indicates that the diagnostic process also includes finding the exact location of the fault, and determining the geographic scope of the incident (e.g., how much line is affected), as this has major implications for determining alternative service plans.
Contrary to the assumption of linear models that have situation awareness as a single, discrete step (e.g., Cheng and Tsai, 2011), diagnostic activities are not just the initial stage of the incident. Signalling failure 2 (Figure 4) is a case in point. Though the initial investigation gave some early indications as to the fault, it was only after extended investigation, some time into the incident, that the true fault was identified. Waiting until there was a full picture of events before deciding on service alternatives would have placed an intolerable delay on trains in the area. Instead, alternatives are often implemented very quickly, then adjusted as new information comes to light. Signalling incidents, and the complexity of signalling equipment, pose a particular challenge for diagnosis, as highlighted by the substantially higher share of logged activity devoted to diagnosis in that kind of incident (see Figure 3 for the frequency of ‘diagnosis’ as an activity, by incident type). There are also several incidents where secondary problems become apparent, leading to additional disruption and re-planning. For example, in several of the overhead line dewirements, it was only during the course of diagnosis and, in some cases, rectification that additional damage to track or rolling stock became apparent. Therefore, there is not one point early on where the extent of the damage is known, but rather this can be a dynamic process. The implication is that re-planning, whether by human operators or by decision support, cannot take place at one time, but that decisions must be sensitive to changes in the situation over the course of incident management.
Coordination (17.3%) is an activity that is present across the whole of the incident, and this is consistent with findings from prior work in emergency management (Smith and Dowell, 2000; Bigley and Roberts, 2001). It was noted from the analysis that coordination involves communication within Network Rail, between Network Rail and other train operating companies, and with external agencies such as British Transport Police and service providers. This is most apparent for fatalities, which involve close coordination between agencies to determine the cause of the incident, and to get the track back into service (see Table 3 and Figure 3). Within coordination there are specific technologies referenced in comments, such as Nexus Alpha Tyrell (http://www.nexusalpha.com/index.php?location=3.1.1), which support broadcasts of information to many rail stakeholders.
Also, there are distinct strategies to support coordination, through the arrangement of conference calls with key stakeholders and through designated levels of incident management, which denote steps where agents need to coordinate. For example, in overhead line dewirement 3, a conference call is held with infrastructure managers and train operators to agree the service plan, and to discuss estimates of how long rectification work will take. This call then allowed managers to disseminate information to signallers, crew on trains and at stations, and maintenance staff at the incident site. The nature of the railways means much incident management takes place in remote locations, where mobile phone signals can be unreliable, so structured attempts to coordinate can be hampered by technical issues.
Rectification (12.1%), which involves repairing the line, is closely linked to restoration (10.5%), which involves putting that part of the line back into service. Again, it is notable that this is not a single point that occurs towards the end, but rather that it is dispersed across the process (e.g., signalling failure 2). In several of the OHL dewirements and point failures, attempts are made to provide ‘running repairs’ or to reduce the geographic spread of the incident, so that some of the track can be brought back into service. At this point, some degraded form of restoration takes place, such as allowing electric trains to coast under a gap in the overhead wires until such a time as full repairs can be effected. There are also several examples where attempts at restoration are not successful and further rectification work is needed.
Service alternatives (10.9%) are applied across the incident management process, and constitute the plans to run the service around the incident, or sometimes through the incident but with some form of degraded service.
There is an array of different service alternatives, including transfer of passengers onto buses, additional station stops, and cancellations. In many cases, the proposed strategy, if infrastructure is available, is to run the service as near as possible to normal, and try to keep delays to a minimum. As noted, the use of alternatives will therefore change as more of the network becomes available. Also, alternative service plans will go on some time after the primary cause of the incident has been addressed, in order to re-establish a near normal timetable. There are, however, constraints due to rolling stock being out of position, or unavailability of drivers certified for the appropriate routes. This is further evidence that decision support for service re-planning must be closely coupled to decision support for crew management.
Mobilisation (10.6%) involves specific activities to move people or equipment into position, and is closely linked to coordination. This is a large share of activities, due in part to the large number of people required for many of the incidents and in part to the sometimes remote location of the incidents. Many of the comments relating to mobilisation cover estimates of when staff and resources are going to get to an incident scene, and one of the major issues is dealing with traffic. One of the reasons why an incident takes on high priority is that it takes place during a peak period, but this means roads are also congested, and most staff travel to incidents in cars or vans. As with coordination, the high number of actors means mobilisation is of particular importance during fatalities.
A number of other activities comprise a smaller part of the log, but may have a large bearing on incident management. Rectification of OHL and points failures is determined by track access (6.2%). Without safe access, staff are unable to assess damage or effect repairs, but during the day, access may be difficult to arrange as it must fit around an unfamiliar timetable.
Estimates of the time for successful restoration are ultimately determined by when access can be handed back to the dispatcher, and this in turn determines when alternative service plans can be implemented or normal service resumed.
For several incidents, extraction (5.8%) is a major task, especially when a train has been damaged or power has been lost through isolation of the electrical power supply. Typically, extraction does not take place until a fairly late stage of the incident (e.g., fatality 2 – Figure 5). The logistics of extraction often mean that a driver and a mechanic need to be mobilised, introducing further coordination and mobilisation constraints that can increase the time before the line can be cleared. Additionally, the need in some cases to get a ‘rescue’ locomotive into position requires both time and further re-planning of the train paths to move the locomotive through train traffic into position.
Organisation (3.8%) takes place at the early stages of the incident and involves assigning people to roles as part of a temporary incident management team. The majority of these roles are physically on site, though one difficulty with longer incidents is the need to manage crew changes as staff approach their working time limits.
Containment (1.5%) is often a single, but crucial, step. At this point, the dispatcher will make a decision at the outset of an incident to hold all traffic or re-route traffic away from the incident. How quickly the operator reacts to the initial alert of the incident, and the strategy they choose, will have implications for how many trains may be trapped in the vicinity. This is analogous to the notion that the dispatcher’s monitoring strategy and time to respond are critical in the early stages of incident management (Belmonte et al., 2011).
Information (1%) is only a small part of the activity logs, taking the form of critical broadcasts across a number of information channels. Once the initial protocol for incident management has been implemented, much further information is updated on the ground by staff or automatically, e.g., through electronic timetable displays in stations; however, the mechanism for this is not covered in the log entries.
One of the general characteristics of this dataset is the high level of variability between different incident types, and within the types of incident themselves. As noted above, certain activities are much more prevalent in some incident types than others. For example, the loss of power during overhead line dewirements makes extraction a major issue (12.6%) in comparison to point failures (1.1%), where trains can normally be re-routed. The time series diagrams also indicate both rapid switching between activities and bursts of activity. For example, in fatality 1 (Figure 5) there is a high number of activities to deal with the incident as it first occurs, then another group of activities towards the end as the incident is resolved and the line comes back into service. This indicates that workload is variable and distributed across the incident management process. Managing access for rectification activities may be particularly challenging for the signaller/dispatcher, who is responsible for arranging signalling protection while keeping a re-planned, and therefore unfamiliar, timetable running on other parts of the network (Golightly et al., 2012). This is in theory the time when automatic route setting could be valuable to keep other unaffected parts of the network running, though the evidence to date is that such automation is at its least reliable during disruption (Balfe et al., 2012).
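The percentages quoted in this section are shares of coded log entries, either overall or within an incident type. As a minimal sketch of how such shares can be derived, the example below tallies activity codes per incident type. The entries, the incident type labels and the function name `code_shares` are hypothetical and chosen only for illustration; they are not taken from the incident logs.

```python
from collections import Counter, defaultdict

# Hypothetical coded log entries: (incident_type, activity_code).
entries = [
    ("ohl_dewirement", "extraction"),
    ("ohl_dewirement", "diagnosis"),
    ("ohl_dewirement", "coordination"),
    ("ohl_dewirement", "extraction"),
    ("points_failure", "diagnosis"),
    ("points_failure", "service_alternatives"),
    ("points_failure", "coordination"),
    ("points_failure", "diagnosis"),
]


def code_shares(coded_entries):
    """Return {incident_type: {code: percentage of that type's entries}}."""
    counts = defaultdict(Counter)
    for incident_type, code in coded_entries:
        counts[incident_type][code] += 1

    shares = {}
    for incident_type, counter in counts.items():
        total = sum(counter.values())
        shares[incident_type] = {
            code: 100.0 * n / total for code, n in counter.items()
        }
    return shares


shares = code_shares(entries)
for incident_type, by_code in shares.items():
    for code, pct in sorted(by_code.items()):
        print(f"{incident_type:15s} {code:22s} {pct:5.1f}%")
```

Because the shares are computed within each incident type, they can be compared across types even when the number of log entries per incident differs.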
7 Conclusions
The analysis above has demonstrated that it is delay severity, rather than incident duration, that is the major factor driving the intensity of activity during rail disruption. Diagnosis and coordination are the most common incident activities, though many other activities play a key role. Importantly, the situation is highly dynamic – activity types tend not to be clustered at one point in the process, but are distributed across the event, reflecting a range of different processes and shifting goals.
The conclusion to be drawn is that models of major rail disruption based on a linear ‘understand-decide-act’ type process are too simplistic. The analysis presented here is at an organisational level, rather than the individual level previously studied (e.g., Ezzedine and Kolski, 2005; Cheng and Tsai, 2011), and it could be argued that while organisations may exhibit complex patterns of activity, individuals may act in something more akin to linear processing. It would appear, however, from the complexity of the timelines generated and the interlinked nature of activities that this is overly simplistic, as even individuals (e.g., the dispatcher, the person arranging track access, the incident officer coordinating people at a fatality site) must be constantly updating their plans and interpretations based on unfolding events. Overall, effective action by both the individual and the organisation could only be possible by taking a cyclical and iterative approach to incident management (Hollnagel and Woods, 2005).
From these results it is possible to draw some conclusions about decision-support. The interlinked nature of rail emergencies, as with aviation (Kohl et al., 2007), means that specific parts of the process cannot be treated individually. Activities such as rectification, extraction and restoration, and the ‘human’ processes behind these such as coordination and mobilisation, do not merely take place in parallel with alternative service planning.
They are constraints on alternative service plans and their successful execution makes alternative service, as well as the final return to service, possible. Similarly, alternative service plans (and the constraints on alternatives) are linked to crew availability, which also affects extraction. Extraction can be dependent on rectification of power, which is dependent on access, which is dependent on gaps in the service plan, and so on. Therefore, any element of automation or decision support must be sensitive to the performance-shaping factors of the whole emergency management system. Also, as demonstrated by the timeline analysis, activities are not conveniently clustered, so these constraints are also dynamic. Dynamic needs and interlinked activities imply iterative action and evaluation. As with newer forms of re-planning (Kauppi et al., 2006), it should be possible to trial plans to assess their robustness in the face of change, before committing to action.
One further benefit of this analysis has been the development of the coding scheme. It is hoped that further incidents can be codified and analysed in the future, to build up the database of scenarios analysed. Another implication of this method is that it is now possible to go through incidents, selecting entries related to a specific code to capture the relevant actions and decisions. For example, Table 6 presents a list of some of the alternative routing strategies used in the incidents, and the considerations and implications that actors needed to take into account when adopting them. Capturing these different decision options, and linking them with the various precursors that influenced their choice, can serve as the basis for informing the strategies embedded within a decision support tool. A similar approach could also be used for understanding the typical communications and coordination actions between actors, or the options and constraints that impinge upon extraction strategies.
One of the key limitations with this study is that it has looked at the organisation, rather than at the cognitive processes of individuals.
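The dependency chain described earlier in this section (extraction depending on rectification of power, which depends on access, which depends on gaps in the service plan) can be sketched as a small dependency graph. This is only an illustration of the kind of constraint a decision support tool would need to track, not a description of any existing tool: the activity names, the durations in minutes and the helper `earliest_finish` are all assumptions. It uses Python's standard `graphlib` module (available from Python 3.9) to order the activities.

```python
from graphlib import TopologicalSorter

# Each activity maps to the set of activities it depends on.
# Names are hypothetical labels for the chain described in the text.
DEPENDS_ON = {
    "service_plan_gap": set(),
    "track_access": {"service_plan_gap"},
    "power_rectification": {"track_access"},
    "extraction": {"power_rectification"},
}


def earliest_finish(depends_on, duration):
    """Return a valid activity order and each activity's earliest finish time.

    An activity can start only when everything it depends on has finished.
    """
    order = list(TopologicalSorter(depends_on).static_order())
    finish = {}
    for activity in order:
        start = max((finish[dep] for dep in depends_on[activity]), default=0)
        finish[activity] = start + duration[activity]
    return order, finish


# Assumed durations in minutes (illustrative only).
planned = {
    "service_plan_gap": 20,      # wait until a gap in the service plan opens
    "track_access": 15,          # arrange and confirm safe access
    "power_rectification": 60,   # repair and restore power
    "extraction": 45,            # move the stranded train clear
}

order, finish = earliest_finish(DEPENDS_ON, planned)
print(order)
print(finish["extraction"])

# Re-evaluate when the situation changes: the gap in the service plan slips.
revised = dict(planned, service_plan_gap=50)
_, revised_finish = earliest_finish(DEPENDS_ON, revised)
print(revised_finish["extraction"])
```

Re-running the calculation after each change is one simple way to reflect the iterative evaluation argued for above: a slip in one activity propagates to everything downstream of it.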
To complement the approach presented in this paper, there is a parallel stream of work using repertory grid and critical decision method to understand the cues and decisions used by specific actors in different incidents (Dadashi et al., 2013). The aspiration is that, together, these streams of research will build on prior work in re-planning (Kauppi et al., 2006) or incident control (Farrington-Darby et al., 2006) to build a more complete picture of incident management strategies.
Another limitation is the quality of the data. There is some variance in how different people elect to complete the log (e.g., some provide long textual descriptions, while others write in note form). Also, as noted earlier, some of the activities may persist over time and with better quality data it would be possible to calculate durations for those activities, rather than only the percentage of relevant log entries. To resolve quality issues, one aim is to take a much bigger dataset in the future, and to use statistical analysis to test predictions that have arisen from the analysis presented here. Another option is to augment this dataset with parallel data captured via observations in the control room (this work is already underway at Network Rail) or through communications analysis, which has been done in the past for engineering work (Golightly et al., 2012). Such data would also support analysis of additional metrics of complexity, such as the number of communications or the number of actors involved in an incident, which are not clearly specified in the current data.
One further avenue for future work is to use a modelling approach such as WESTT (Houghton et al., 2008), which has already been used for emergency management modelling, to build networks of the actors involved and the relationship between tasks and knowledge (as expressed through communication) to further understand the rail incident management process.
While the specific sequence of activities, and whether one was more likely to follow another, was out of scope for the current study, a combination of modelling enhanced with additional data sources such as those described above would be a valuable next phase.
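Although sequence analysis was out of scope for this study, a first step towards it could be as simple as counting how often one activity code immediately follows another in a coded log. The sketch below illustrates the idea under stated assumptions: the sequence of codes is invented for illustration, and `transition_counts` and `transition_probabilities` are hypothetical helper names, not part of the coding scheme.

```python
from collections import Counter


def transition_counts(codes):
    """Count how often each activity code is immediately followed by another."""
    return Counter(zip(codes, codes[1:]))


def transition_probabilities(codes):
    """Estimate P(next code | current code) from one coded sequence."""
    counts = transition_counts(codes)
    totals = Counter()
    for (current, _following), n in counts.items():
        totals[current] += n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}


# Hypothetical sequence of activity codes from a single incident log.
log_codes = [
    "containment",
    "diagnosis",
    "coordination",
    "mobilisation",
    "diagnosis",
    "coordination",
    "service_alternatives",
    "rectification",
    "restoration",
]

counts = transition_counts(log_codes)
probs = transition_probabilities(log_codes)
print(counts[("diagnosis", "coordination")])
print(probs[("coordination", "mobilisation")])
```

With a larger set of coded incidents, such counts could be pooled by incident type to test whether particular activities tend to follow one another.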
Table 6 Examples of service alternatives, and some of the considerations and implications, found in the incident logs
Alternative | Benefits | Considerations
Divert trains | Keeps basic service operational | All drivers need to be certified to drive alternative routes. Dependent on alternative infrastructure; diversion may add significant time to journey, and may not be suitable for all traction.
Cancel trains | Reduces traffic in and around an affected area | Passengers need alternatives, either with extra services or alternative transport; may be easier for freight than passengers.
Cancel stops | Keeps service operational but closer to timetable than running all trains | Passengers need to be notified; alternatives need to be arranged for passengers waiting at stations or expecting to alight at cancelled stations.
Additional stops | Allows service to be thinned while providing a service to all stations | Requires effective communication to signallers and train crew that this strategy is in place
Transport alternatives (allow passengers to travel on other routes, buses or metro systems) | Keeps passengers moving | Potential congestion on alternative services; coordination with other transport providers
Running on reduced infrastructure | Allows routes and services to remain approximately as planned | Difficult to arrange access if rectification work is taking place in the vicinity; potentially high workload for signaller operating in degraded conditions; slower service leading to congestion
One final limitation is that this data may give a GB-centric view of incident management. The results of this analysis will be taken forward and compared with other EU operating contexts, though previous work to compare rail operations across countries (Golightly et al., 2013) has found the context to be similar enough that the implications of the results in this paper can be successfully transferred elsewhere.
Acknowledgements
The work presented in this paper was funded under EU FP7 Project Optimal Networks for Train Integration Management Across Europe (ON-TIME), Grant No. 285243.
References
Aminoff, H., Johansson, B. and Trnka, J. (2007) ‘Understanding coordination in emergency response’, Proceedings of the EAM Europe Annual Conference on Human Decision-Making and Manual Control, Lyngby, DK.
Association of Train Operating Companies (ATOC) (2012) Approved Code of Practice – Passenger Information During Disruption [online] http://www.atoc.org/clientfiles/files/ACOP015v3%20-%20PIDD%20(2).pdf (accessed 15 July 2013).
Balfe, N., Wilson, J.R., Sharples, S. and Clarke, T. (2012) ‘Development of design principles for automated systems in transport control’, Ergonomics, Vol. 55, No. 1, pp.37–54.
Belmonte, F., Schön, W., Heurley, L. and Capel, R. (2011) ‘Interdisciplinary safety analysis of complex socio-technological systems based on the functional resonance accident model: an application to railway traffic supervision’, Reliability Engineering & System Safety, Vol. 96, No. 2, pp.237–249.
Bigley, G.A. and Roberts, K.H. (2001) ‘The incident command system: high reliability organizing for complex and volatile task environments’, Academy of Management, Vol. 44, No. 6, pp.1281–1300.
Cheng, Y.H. and Tsai, Y.C. (2011) ‘Railway-controller-perceived competence in incidents and accidents’, Ergonomics, Vol. 54, No. 12, pp.1130–1146.
Cheung, J. (2011) Report to Network Rail on Signaller Strategy, MEng Dissertation.
Dadashi, N., Golightly, D. and Sharples, S. (2013) ‘Requirements elicitation for disruption management support’, Proceedings of 4th Rail Human Factors Conference, April, London.
Davis Associates Ltd. (2003) Managing Large Events and Perturbations at Stations: Passenger Flow Modelling Technical Review, Report prepared for Rail Safety and Standards Board, Doc. No. RS021/R.03 [online] http://www.rssb.co.uk/allsearch.asp (accessed 8 October 2013).
Delay Attribution Board (2013) Delay Attribution Guide [online] http://www.delayattributionboard.co.uk/documents/dag_pdac/April%202013%20DAG.pdf (accessed 8 October 2013).
European Commission (2011) White Paper on Transport – Roadmap to a Single European transport Area – Towards a Competitive and Resource-efficient Transport System, Publications Office of the European Union, Luxembourg.
Ezzedine, H. and Kolski, C. (2005) ‘Modelling of cognitive activity during normal and abnormal situations using object Petri nets, application to a supervision system’, Cognition, Technology & Work, Vol. 7, No. 3, pp.167–181.
Farrington-Darby, T., Wilson, J.R., Norris, B.J. and Clarke, T. (2006) ‘A naturalistic study of railway controllers’, Ergonomics, Vol. 49, Nos. 12–13, pp.1370–1394.
Freling, R., Lentink, R.M. and Wagelmans, A.P. (2004) ‘A decision support system for crew planning in passenger transportation using a flexible branch-and-price algorithm’, Annals of Operations Research, Vol. 127, No. 1, pp.203–222.
Friman, M. and Garling, T. (2001) ‘Frequency of negative critical incidents and satisfaction with public transport services’, Journal of Retailing and Consumer Services, Vol. 8, No. 2, pp.105–114.
Golightly, D., Ryan, B., Dadashi, N., Pickup, L. and Wilson, J.R. (2012) ‘Use of scenarios and function analyses to understand the impact of situation awareness on safe and effective work on rail tracks’, Safety Science, July 2013, Vol. 56, pp.52–62.
Golightly, D., Sandblad, B., Dadashi, N., Arnesson, A.W., Tschirner, S. and Sharples, S. (2013) ‘A socio-technical comparison of rail traffic control between GB and Sweden’, Proceedings of 4th Rail Human Factors Conference, April, London.
Golightly, D., Wilson, J.R., Lowe, E. and Sharples, S. (2011) ‘The role of situation awareness for understanding signalling and control in rail operations’, Theoretical Issues in Ergonomics Science, Vol. 11, Nos. 1–2, pp.84–98.
Hollnagel, E. (2007) ‘Flight decks and free flight: where are the system boundaries?’, Applied Ergonomics, Vol. 38, No. 4, pp.409–416.
Hollnagel, E. and Woods, D.D. (2005) Joint Cognitive Systems: Foundations of Cognitive Systems Engineering, CRC, Boca Raton, FL.
Houghton, R.J. and Golightly, D. (2011) ‘Should a Signaller Look at Twitter?’ The Value of User Data to Transport Control [online] http://de2011.computing.dundee.ac.uk (accessed 8 October 2013).
Houghton, R.J., Baber, C., Cowton, M., Walker, G.H. and Stanton, N.A. (2008) ‘WESTT (workload, error, situational awareness, time and teamwork): an analytical prototyping system for command and control’, Cognition, Technology & Work, Vol. 10, No. 3, pp.199–207.
Jespersen-Groth, J., Potthoff, D., Clausen, J., Huisman, D., Kroon, L., Maróti, G. and Nielsen, M.N. (2009) ‘Disruption management in passenger railway transportation’, in R.K. Ahuja, R.H. Möhring and C.D. Zaroliagis (Eds.): Robust and Online Large-Scale Optimization, Lecture Notes in Computer Science, 5868, pp.399–342, Springer-Verlag, Berlin.
Kauppi, A., Wikström, J., Sandblad, B. and Andersson, A.W. (2006) ‘Future train traffic control: control by re-planning’, Cognition, Technology & Work, Vol. 8, No. 1, pp.50–56.
Kohl, N., Larsen, A., Larsen, J., Ross, A. and Tiourine, S. (2007) ‘Airline disruption management – perspectives, experiences and outlook’, Journal of Air Transport Management, Vol. 13, No. 3, pp.149–162.
Kontogiannis, T. (1999) ‘Training effective human performance in the management of stressful emergencies’, Cognition, Technology & Work, Vol. 1, No. 1, pp.7–24.
Lenior, D., Janssen, W., Neerincx, M. and Schreibers, K. (2006) ‘Human-factors engineering for smart transport: decision support for car drivers and train traffic controllers’, Applied Ergonomics, Vol. 37, No. 4, pp.479–490.
National Audit Office (NAO) (2008) Reducing Passenger Rail Delays by Better Management of Incidents: Report, Together with Formal Minutes, Oral and Written Evidence, Vol. 655, HMSO, London.
Rasmussen, J. (1986) Information Processing and Human-Machine Interaction: An Approach to Cognitive Engineering, North-Holland, New York.
Reinach, S.J. (2006) ‘Toward the development of a performance model of railroad dispatching’, Proceedings of the Human Factors and Ergonomics Society Annual Meeting, October, SAGE Publications, Vol. 50, No. 17, pp.2042–2046.
Smith, W. and Dowell, J. (2000) ‘A case study of co-ordinative decision-making in disaster management’, Ergonomics, Vol. 43, No. 8, pp.1153–1166.
Stanton, N.A., Salmon, P.M., Walker, G.H. and Jenkins, D. (2009) ‘Genotype and phenotype schemata and their role in distributed situation awareness in collaborative systems’, Theoretical Issues in Ergonomics Science, Vol. 10, No. 1, pp.43–68.
Woods, D.D. and Branlat, M. (2010) ‘Hollnagel’s test: being ‘in control’ of highly interdependent multi-layered networked systems’, Cognition, Technology & Work, Vol. 12, No. 2, pp.95–101.