From the SelectedWorks of Heather M Brotherton
May 2011
Data center recovery best practices: Before, during, and after disaster recovery execution
Available at: http://works.bepress.com/heatherbrotherton/4
Graduate School ETD Form 9 (Revised 12/07)
PURDUE UNIVERSITY GRADUATE SCHOOL
Thesis/Dissertation Acceptance
This is to certify that the thesis/dissertation prepared
By: Heather McCall Brotherton
Entitled: DATA CENTER RECOVERY BEST PRACTICES: BEFORE, DURING, AND AFTER DISASTER RECOVERY EXECUTION
For the degree of
Master of Science
Is approved by the final examining committee:
Gary Bertoline
J. Eric Dietz, Chair
W. Gerry McCartney
Jeffrey Sprankle
To the best of my knowledge and as understood by the student in the Research Integrity and Copyright Disclaimer (Graduate School Form 20), this thesis/dissertation adheres to the provisions of Purdue University’s “Policy on Integrity in Research” and the use of copyrighted material.
Approved by Major Professor(s): J. Eric Dietz
Approved by: Jeffrey L. Brewer, Head of the Graduate Program
Date: 04/04/2011
Graduate School Form 20 (Revised 9/10)
PURDUE UNIVERSITY GRADUATE SCHOOL
Research Integrity and Copyright Disclaimer
Title of Thesis/Dissertation: DATA CENTER RECOVERY BEST PRACTICES: BEFORE, DURING, AND AFTER DISASTER RECOVERY EXECUTION
For the degree of
Master of Science
I certify that in the preparation of this thesis, I have observed the provisions of Purdue University Executive Memorandum No. C-22, September 6, 1991, Policy on Integrity in Research.* Further, I certify that this work is free of plagiarism and all materials appearing in this thesis/dissertation have been properly quoted and attributed. I certify that all copyrighted material incorporated into this thesis/dissertation is in compliance with the United States’ copyright law and that I have received written permission from the copyright owners for my use of their work, which is beyond the scope of the law. I agree to indemnify and save harmless Purdue University from any and all claims that may be asserted or that may arise from any copyright violation.
Heather McCall Brotherton
Printed Name and Signature of Candidate
Date (month/day/year): 04/04/2011
*Located at http://www.purdue.edu/policies/pages/teach_res_outreach/c_22.html
DATA CENTER RECOVERY BEST PRACTICES: BEFORE, DURING, AND AFTER DISASTER RECOVERY EXECUTION
A Thesis Submitted to the Faculty of Purdue University
by
Heather M. Brotherton
In Partial Fulfillment of the Requirements for the Degree of
Master of Science
May 2011
Purdue University
West Lafayette, Indiana
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT
CHAPTER 1. INTRODUCTION
1.1. Statement of purpose
1.2. Research Question
1.3. Scope
1.4. Significance
1.5. Assumptions
1.6. Limitations
1.7. Delimitations
1.8. Summary
CHAPTER 2. LITERATURE REVIEW
2.1. Critical cyberinfrastructure vulnerability
2.2. Barriers to cyberinfrastructure resiliency
2.3. Mutual aid
2.3.1. Mutual Aid Association
2.4. Training
2.5. Testing
2.6. Summary
CHAPTER 3. FRAMEWORK AND METHODOLOGY
3.1. Framework
3.2. Researcher Bias
3.3. Methodology
3.4. Data Collection
3.5. Authorizations
3.6. Analysis
3.6.1. Triangulation
3.7. Summary
CHAPTER 4. CASE STUDIES
4.1. Commerzbank
4.1.1. Background
4.1.2. World Trade Center Attacks
4.1.3. Conclusion
4.2. FirstEnergy
4.2.1. Background
4.2.2. Northeast Blackout of 2003
4.2.3. Conclusion
4.3. Tulane
4.3.1. Background
4.3.2. Hurricane Katrina
4.3.3. Conclusion
4.4. Commonwealth of Virginia
4.4.1. Background
4.4.2. August 2010 outage
4.4.3. Conclusion
CHAPTER 5. ANALYSIS
5.1. Best Practice Triangulation
5.1.1. Before-Planning
5.1.2. During-Plan execution
5.1.3. After-Plan improvement
CHAPTER 6. CONCLUSION
CHAPTER 7. FUTURE RESEARCH
BIBLIOGRAPHY
APPENDICES
Appendix A
Appendix B
VITA
PUBLICATION: Disaster recovery and business continuity planning: Business justification
LIST OF TABLES
Table 5.1 Tolerance and objectives
Table 5.2 Aid relationship utilized during recovery
LIST OF FIGURES
Figure 5.1 Adherence to established procedures
Figure 5.2 Sample IT incident command structure
Figure 5.3 Reported average downtime revenue losses in billions
Figure 5.4 Reported critical application and data classifications
Figure 5.5 Components of a resilient system
LIST OF ABBREVIATIONS
CIO	Chief Information Officer
DMV	Department of Motor Vehicles
DR	disaster recovery
EMAC	Emergency Management Assistance Compact
EMS	Energy Management System
EOC	emergency operations center
FE	FirstEnergy
FEMA	Federal Emergency Management Agency
FERC	Federal Energy Regulatory Commission
HVAC	Heating, Ventilating, and Air Conditioning
IT	information technology
ITIL	Information Technology Infrastructure Library
MOU	Memorandum of Understanding
MTPOD	Maximum Tolerable Period of Disruption
NIMS	National Incident Management System
NRC	Nuclear Regulatory Commission
ROI	Return On Investment
RPO	Recovery Point Objective
RTO	Recovery Time Objective
SAN	Storage Area Network
SCADA	Supervisory Control and Data Acquisition
VITA	Virginia Information Technologies Agency
ABSTRACT
Brotherton, Heather M. M.S., Purdue University, May 2011. Data center recovery best practices: before, during, and after disaster recovery execution. Major Professor: J. Eric Dietz.
This qualitative multiple case study analysis reviews well-documented past information technology disasters with the goal of identifying practical best practices for before, during, and after disaster recovery execution. The topic of cyberinfrastructure resiliency is explored, including barriers to cyberinfrastructure resiliency. Factors explored include: adherence to established procedures, staff training in recovery procedures, chain of command structure, recovery time and cost, and mutual aid relationships. Helpful tools and resources are included to assist planners.
CHAPTER 1. INTRODUCTION
1.1. Statement of purpose
The purpose of this research is to bridge the gap of unmet needs in the area of cyberinfrastructure business continuity and disaster recovery. Information systems are complex and vital to modern infrastructure. Loss of computer information system availability can financially cripple companies and potentially cause basic necessities such as clean water to be unavailable. In many cases, organizations fail to implement business continuity measures due to the high cost of remote failover systems and training. Cyberinfrastructure resiliency is dependent upon creating practical, attainable implementations. Through this research, the effectiveness of various business continuity and disaster recovery practices will be explored to increase information systems resiliency.
1.2. Research Question
What are best practices before, during, and after disaster recovery execution?
1.3. Scope
The scope of the research is identification of best practices for business continuity and disaster recovery. Factors affecting the success of cyberinfrastructure incident recovery will be identified through case study analysis. Success will be determined by reviewing factors such as practicality, recovery time, and business impact. Practical tools and resources to assist best practice implementation and execution will also be identified.
1.4. Significance
Aside from IT professionals, few people think about the impacts of information system failure. Growing dependence upon computer information systems has created vulnerabilities that have not been uniformly addressed. Information systems are the ubiquitous controllers of critical infrastructure. Many business processes and services depend upon computer information systems, resulting in myriad factors to consider in data center contingency planning. These systems experience failures on a regular basis, but most failures go unnoticed due to carefully crafted redundant mechanisms that seamlessly continue processing. However, massive failures have occurred that resulted in widespread, severe negative impact on the public. While most large corporations have remote failover locations, there are many organizations important to critical functions that do not have the resources to develop and implement business continuity and disaster recovery plans. Practical, understandable planning and recovery guidance, developed through the findings of this research, may help
ensure the stability of cyberinfrastructure and, by extension, the safety and well-being of all.
1.5. Assumptions
Assumptions for this study include:
• Examination of the experiences of organizations that have sustained catastrophic information systems failures will yield information that will contribute to the disaster recovery best practices body of knowledge.
• The use of qualitative case study analysis is appropriate to study the phenomenon of interest.
• Existing publicly available documents are the best source of the actions and policies in place at the time of the incident.
1.6. Limitations
Limitations include:
• Contact with primary actors from cyberinfrastructure failures is infeasible due to:
o Difficulty identifying actors
o Limitations on what may be discussed due to risk of liability
o Degraded memory of actual events and policies active at the time of the incident
• Highly detailed information will not be available in the documentation. Therefore, this research will not address topics that cannot be examined based on the detail of the available documentation.
• Observation of large-scale cyberinfrastructure failure is not feasible due to inherent unpredictability; observation of other information systems failures and recovery will lack external validity.
1.7. Delimitations
Delimitations include:
• Many sources to assist in business continuity and disaster planning exist; this research will not attempt to add to planning, but will focus on the successes and failures of the planning and methods employed before, during, and after information system recovery execution.
• Possible causal relationships will not be examined in this exploratory research study.
• Information systems failures that are not well documented will not be addressed.
• The number of case studies will be limited to ensure in-depth coverage of recovery methods employed.
• Realistic simulation of catastrophic failures is neither ethical nor feasible and will not be attempted for the purpose of study.
1.8. Summary
This chapter is an introduction to the disaster recovery best practices
research project. The purpose of the research is to meet the needs of cyberinfrastructure resiliency. Cyberinfrastructure resiliency is defined as the ability of an infrastructure-level information system to tolerate and recover from
adverse incidents with minimal disruption. The scope of the project is defined in this chapter, as well as the significance, assumptions, limitations, and delimitations. The following chapter will review literature on topics related to cyberinfrastructure resiliency.
CHAPTER 2. LITERATURE REVIEW
This chapter provides an overview of the importance of systems resiliency and introduces the concept of mutual aid. Computer information system vulnerabilities and threats are discussed. The barriers to systems resiliency and the challenges associated with removing these barriers to implement resiliency are highlighted. Potential uses of mutual aid agreements as a pragmatic, cost-effective alternative resiliency tool for risk mitigation are discussed. Literature related to systems resiliency is reviewed to provide a background of the problems and to support the exploration of the employment of mutual aid agreements.
2.1. Critical cyberinfrastructure vulnerability
The Clinton, Bush, and Obama administrations have recognized society’s dependency on cyberinfrastructure in presidential communications. Presidential Decision Directive 63 declared "cyber-based systems essential to the minimum operations of the economy and government. They include, but are not limited to, telecommunications, energy, banking and finance, transportation, water systems and emergency services, both governmental and private" (Clinton Administration, 1998, p. 1). This communication set forth policy to implement cyberinfrastructure protections by 2000 (Clinton Administration, 1998).
However, despite this directive, in 2003 the Northeast portion of the United States suffered an extended, widespread power outage due in large part to the failure of a computer system (U.S.-Canada Power System Outage Task Force, 2004). Transportation, communication, and water were unavailable, leaving many stranded in subways and trapped in elevators. In some cases, people were unable to make non-cash purchases for essentials such as flashlights (Barron, 2003). Findings published by the New York Independent System Operator state that "the root cause of the blackout was the failure to adhere to the existing reliability rules" (New York Independent System Operator, 2005, p. 4). "ICF Consulting estimated the total economic cost of the August 2003 blackout to be between $7 and $10 billion" (Electricity Consumers Resource Council (ELCON), 2004, p. 1).
In more recent history, Google announced a directed attack from China (Scherr & Bartz, 2010). This announcement was shortly followed by an announcement from the Obama administration regarding initiatives to protect critical resources such as power and water from cyber attack (Scherr & Bartz, 2010). No initiatives to date have resulted in substantial hardening of cyberinfrastructure; in fact, the problem appears to be growing. Losses of intellectual property alone from 2008 to 2009 were approximately one trillion dollars (Internet Security Alliance (ISA)/American National Standards Institute (ANSI), 2010).
2.2. Barriers to cyberinfrastructure resiliency
Computer information systems are inherently difficult to protect. They remain in a state of constant flux due to technological advances and updates to patch known vulnerabilities (Homeland Security, 2009). Each patch or fix applied runs the risk of causing an undocumented conflict due to customization, as well as creating a new vulnerability. Constant connection to the Internet has increased the usefulness of computers, but it has also increased vulnerability. Information systems are highly complex; even information technology experts are segmented. Upper-level managers tend to be "digital immigrants," resulting in increased difficulty in convincing them to fund cybersecurity projects (Internet Security Alliance (ISA)/American National Standards Institute (ANSI), 2010, p. 12). This disconnect is the doom of continuity planning: without high-level backing to push policy change and supply resources, there is little chance of success (Petersen, 2009).
Funding alone will not make a resilient cyberinfrastructure; collaboration among departments is necessary to create and maintain a plan that addresses the business requirements of an organization (Caralli, Allen, Curtis, White, & Young, 2010). There must also be organizational understanding of and commitment to the practices that contribute to the documentation required to have an up-to-date continuity plan. These cultural changes require strong, actively committed leadership to enact.
Leadership lacking a fundamental understanding of the importance of failover testing can render an otherwise solid continuity plan useless. In some cases, companies have disaster recovery plans but are reluctant to test live systems due to the possibility of service interruptions (Balaouras, 2008). This short-sightedness can lead to disastrous, costly consequences. Planned testing can be scheduled during low-traffic periods, when the staff can be prepared to quickly recover from any outage. These tests serve to identify system and failover plan weaknesses and make the staff more comfortable with the failover and recovery process.
A common and somewhat illogical barrier to planning for resiliency is the idea that some disasters cannot be planned for because they are too large (Schaffhauser, 2005). The National Incident Management System (NIMS) provides a framework for managing incidents of any size and complexity (FEMA). Information and training for NIMS is freely available on the Federal Emergency Management Agency training website; the site address is listed in Appendix A. The use of this framework is highly recommended because it is widely used and provides a structure for integrating outside organizations into the command and incident response structure.
2.3. Mutual aid
Mutual aid agreements have evolved over human history as a means to pool resources to solve a common problem. The redundant resources required to maintain systems continuity may not be economically feasible for many
organizations. Rather than forgoing remote failover locations, it may be advisable to pool resources by forging reciprocal agreements.
The September 2010 San Bruno gas pipeline explosion is a good example of the advantages of an existing mutual aid compact. San Bruno’s disaster activated 42 fire agencies and 200 law enforcement officers (Jackson, 2011). “85 pieces of fire-fighting apparatus” were also provided for on-site response (Jackson, 2011). The resources required for this incident were far beyond what the city’s budget could feasibly maintain. The California Mutual Aid System, along with an emergency operations plan, ensured the city was able to quickly and effectively respond to this unforeseen explosion (Jackson, 2011).
The possibility that the utilization of IT mutual aid agreements will allow organizations to make better use of available resources is worth exploring (Swanson, Bowen, Wohl Phillips, Gallup, & Lynes, 2010). Collocation of critical services provides systems redundancy without the need to build a dedicated recovery data center. Reciprocal relationships are generally defined by a memorandum of understanding (Swanson, Bowen, Wohl Phillips, Gallup, & Lynes, 2010). A memorandum of understanding, often referred to as an MOU, defines protocol, costs, resources available, and compatibility requirements. It may be desirable to include nondisclosure agreements in the MOU.
Staffing is a key resource that could be negotiated for through mutual aid agreements. Sharing staffing increases the likelihood that adequately trained staff will be available should a catastrophic event occur. Some catastrophes may make staff unavailable due to personal impact, and additional staff may be
required to maintain or recover operations to prevent or reduce business impact (Schellenger, 2010). Partnering with another organization to pool staffing resources can ensure efficient contingency operations through cross-trained staff. The end result may be cost savings: fewer contractors and consultants would be necessary, and business impact could be minimized as a result of extra staff that is familiar with the computer system.
Another possible advantage of mutual aid agreements is the ability to share training expenses. General conceptual information and in-house training can be shared between partner organizations. This may not only save the costs of developing and providing training, but will also provide a "common language" for the partnered organizations (FEMA, 2006, p. 4). Ideally, employees on incident management teams would receive additional specialized training alongside their counterparts to ensure good communication between the teams. The ability to communicate efficiently and effectively will also contribute to the reduction of downtime.
2.3.1. Mutual Aid Association
Mutual aid agreements are common for police, fire departments, and utilities. Associations have been formed to fill the gaps in situations where an organization lacks the necessary resources to respond adequately to an incident. These relationships have been used to the benefit of society at large, allowing seamless performance of incident response duties. This is possible due to the predetermined procedures and protocols that exist in mutual aid agreements. Organizations generally hold regular training with reciprocal partners. According
to Hardenbrook, utilities "showed the most advanced levels of cooperation" during the Blue Cascades exercise (2004, p. 4). The Blue Cascades II exercise focused on information technology dependencies. The FEMA website has links to a few mutual aid associations, such as the Emergency Management Assistance Compact (EMAC). EMAC emerged in 1949 in response to concern about nuclear attack (EMAC). In 1996 the U.S. Congress recognized EMAC as a national disaster compact through public law (EMAC). EMAC is designed to assist states, but this model may also work for nonprofit, educational, and business organizations. Creation of a similar association for information technology may be warranted due to the special skills, equipment, and resources required for response to a large-scale event.
2.4. Training
Training is a key factor in business continuity and disaster recovery. Human error is often cited as the primary cause of systems failure (U.S.-Canada Power System Outage Task Force, 2004). In many cases, the incident is initiated by another type of failure (software, hardware, fire, etc.), but the complicating factor becomes human error (U.S.-Canada Power System Outage Task Force, 2004). Automation of "easy tasks" leaves "complex, rare tasks" to the human operator (Patterson, et al., 2002, p. 3). Humans "are not good at solving problems from first principals…especially under stress" (Patterson, et al., 2002, p. 3). "Humans are furious pattern matchers" but "poor at solving problems from first principals, and can only do so for so long before" tiring (Patterson, et
al., 2002, p. 3). Automation "prevents …building mental production rules and models for troubleshooting" (Patterson, et al., 2002, p. 4). The implication is that technologists are not efficient at solving problems without experience. Training provides the opportunity to build "mental production rules" and allows the technologist to respond to incidents more quickly and accurately.
2.5. Testing
Surveyed literature reinforces the importance of testing and experimentation. Testing provides the opportunity to assess the effectiveness of business continuity procedures, equipment, and configuration. Part of the reasoning for testing is that “emergency systems are often flawed…only an emergency tests them, and latent errors in emergency systems can render them useless" (Patterson, et al., 2002). Incident response procedures vary in complexity. Some procedures are employed on a regular basis; these situations are not the focus of the testing discussed here. Large-scale recovery and continuity procedures are rarely employed by an organization; however, the effectiveness of these plans is decisive in the organization's survival in the event of a large-scale disaster. Disasters have not only been historically costly, but have also resulted in permanent closures (Scalet, 2002). The costs of neglecting business continuity and disaster recovery testing are too high to risk.
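To make the idea of routine failover testing concrete, the following minimal sketch shows how a scheduled drill might fail a service over to its standby site during a low-traffic window and record whether recovery completed within the recovery time objective (RTO). The sketch is illustrative only and is not drawn from any of the case studies; the host names, health endpoint, and 300-second RTO target are hypothetical, and the actual failover mechanism is left to a site-specific callable.

"""Minimal failover drill sketch (illustrative only).

Assumptions: 'primary.example.org' and 'standby.example.org' are
hypothetical hosts exposing an HTTP health endpoint, and the RTO
target of 300 seconds is an arbitrary example value.
"""
import time
import urllib.request

PRIMARY = "http://primary.example.org/health"   # hypothetical endpoint
STANDBY = "http://standby.example.org/health"   # hypothetical endpoint
RTO_SECONDS = 300                               # example recovery time objective


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the service answers its health check."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def run_failover_drill(trigger_failover) -> dict:
    """Trigger a failover and measure how long until the standby serves traffic.

    `trigger_failover` is a callable supplied by the operator that performs
    the actual switch (for example a DNS change or cluster command); it is
    left abstract here because the mechanism is site specific.
    """
    start = time.monotonic()
    trigger_failover()
    while not is_healthy(STANDBY):
        if time.monotonic() - start > RTO_SECONDS:
            return {"met_rto": False, "elapsed_s": time.monotonic() - start}
        time.sleep(5)  # poll the standby until it reports healthy
    return {"met_rto": True, "elapsed_s": time.monotonic() - start}


if __name__ == "__main__":
    # In a real drill the callable would invoke the site's failover mechanism;
    # here a no-op stands in so the script can be read end to end.
    result = run_failover_drill(trigger_failover=lambda: None)
    print("Drill result:", result)

A drill script of this kind would be run during a planned maintenance window; the value lies less in the code than in exercising the procedure and keeping a record of whether the RTO was met.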
2.6. Summary
Critical resource and service dependencies upon information systems have created the necessity to protect the underlying cyberinfrastructure. Barriers to the resilience of complex and often fragile systems must be removed. Leadership must be educated on the requirements of systems resiliency. Practices that support maintained system and business process documentation must be integrated into the organizational culture. The cost of redundant cyberinfrastructure renders implementing resiliency out of reach for many organizations. The cultivation of reciprocal relationships is one option to reduce the cost of maintaining remote failover. Training and testing are key factors in implementing effective business continuity and disaster recovery procedures.
CHAPTER 3. FRAMEWORK AND METHODOLOGY
The purpose of this research is to examine data center recovery planning, execution, and post-execution activities to identify best practices that emerge from the analysis. Qualitative methods will be applied to facilitate the exploration of this topic. This chapter details the research methodology employed as well as the data collection and analysis methodologies.
3.1. Framework
Information technology business continuity and disaster recovery planning has become a popular topic due to increased information system interdependency. Organizations cannot afford downtime, primarily for financial reasons. Methodologies have emerged to guide organizations through the planning, implementation, and maintenance lifecycle phases. Execution is addressed from a theoretical point of view, but how does execution play out in real-life, high-impact situations? Execution of cyberinfrastructure disaster recovery procedures and protocols remains virtually unexamined. Research of documented, high-impact cyberinfrastructure recovery processes may uncover valuable information that may enrich understanding of best practices. Best practices revealed or reinforced through this research will be documented for future use.
3.2. Researcher Bias
I present here my personal bias on this topic to inform the reader of beliefs that may encroach upon the findings of this research. Preparedness, in my mind, enables us to deal more effectively with adverse conditions. I wholeheartedly believe that documentation and practice exercises contribute to incident mitigation, quicker recovery time, and reduced personal stress during an emergency. I acknowledge that not every contingency can be included in planning and that the ingenuity of the incident responders is the key to success. I believe that an all-hazards approach, an established chain of command, and well-trained staff enable a more coordinated and efficient recovery process.
3.3. Methodology
A collective case study will be utilized in this qualitative phenomenological study. This method is used because creating reliably accurate quantitative measures is not feasible in the study of high-impact cyberinfrastructure recovery processes. Primarily due to the rare occurrence of this type of event, it is highly unlikely that the opportunity to observe the actual phenomenon of interest will arise. Quantitative methods are impractical because, while the cases used will have some timeline and procedural documentation, the accuracy of these measures is questionable due to the high-stress nature of the recovery situations and the lack of highly detailed procedural information.
Lab research was also considered; while this would produce high internal validity, it is not feasible to realistically simulate a true disaster situation. Therefore, external validity would be low and would likely result in unrealistic findings.
3.4. Data Collection
Purposeful sampling methods were employed. The criteria for selection included:
• High-impact cyberinfrastructure incident
• Documented resolution
Phenomenon-related documents, artifacts, and archival records were used rather than interviewing, which also reduces the possible impact of researcher bias. Multiple cases were included in the case study. This method of data collection may not produce findings generally applicable to information systems in every sector. The area of interest is high-impact cyberinfrastructure; the findings using this methodology are expected to be highly generalizable to critical infrastructure information systems.
3.5. Authorizations
Authorization for this research was granted by the Purdue University College of Technology and the Purdue Institute of Homeland Security. The advisory committee of the researcher approved this research to add to the body of
knowledge related to information systems business continuity and disaster recovery. IRB approval was obtained for all written communication.
3.6. Analysis
Cross-case analysis was used to create a multidimensional profile of disaster recovery processes and protocols. Recurring themes and practices, both those with positive results and those with negative results, were identified. Factors explored include:
• Adherence to established procedures
• Staff training in recovery procedures
• Chain of command structure
• Recovery time and cost
• Mutual aid relationships
3.6.1. Triangulation
The purpose of including more than one case study is to collate the commonalities. The identification of common problems and successes contributes to the understanding of best practices for disaster recovery. Generalizable practices from other disciplines will also be used to reinforce the identified and recommended best practices.
3.7. Summary
This chapter details the methodology, sampling, and analysis techniques used in this research. Rationales for the methods employed were also discussed. Findings and sources for the case studies are included in the following chapters.
CHAPTER 4. CASE STUDIES
4.1. Commerzbank
4.1.1. Background
Commerzbank, established in 1870, is the second largest bank in Germany (Availability Digest, 2009). In 2001, Commerzbank was the 16th largest bank in the world (Editorial Staff of SearchStorage.com, 2002). The bank has overcome many adversities since its establishment, such as World War I and socialism (Availability Digest, 2009). The bank has survived calamities in the United States as well, including a 1992 flood in Chicago and the 1993 World Trade Center bombing (Parris, Who Survives Disasters and Why, Part 2: Organizations, 2010). Commerzbank’s New York offices are “located on floors 31 to 34 at the World Financial Center” (Hewlett-Packard, 2002). This location is “only 300 feet from the World Trade Center towers” (Editorial Staff of SearchStorage.com, 2002).
4.1.2. World Trade Center Attacks
On September 11, 2001, the World Trade Center suffered the largest terrorist attack in United States history. Nearly 3,000 people died that day as a result of the attacks (Schwartz, Li, Berenson, & Williams, 2002). The impact to the economy of the city of New York alone was $83 billion (Barovik, Bland, Nugent, Van Dyk, & Winters, 2001). Site cleanup took over eight months (Comptroller of the City of New York, 2002). Not all businesses were able to recover from the devastation inflicted by the attacks (Scalet S. D., 2002). The overall economic impacts continue today, and the daily lives of each resident of the United States have been affected, if only indirectly.
4.1.2.1. Ramifications
Commerzbank was so near the World Trade Center impact sites that the debris caused the windows to shatter (Editorial Staff of SearchStorage.com, 2002). The interior of the building that housed Commerzbank was covered in debris and glass, creating an unsafe environment and choking building equipment. The data center air conditioning failed, leading to high temperatures, which had a cascading effect on the data center computers (Hewlett-Packard, 2002, p. 2). Most of the local data center disks failed, causing failover to Commerzbank’s remote site (Hewlett-Packard, 2002, p. 2). Commerzbank had a redundant, fault-tolerant system with remote failover that allowed it to remain operational throughout the event (Hewlett-Packard, 2002, p. 2). The bank lost equipment at that site, but its ability to do business remained intact.
4.1.2.2. Response
Initially, links were directed to the Rye backup site to restore communications with the “Federal Reserve and the New York Clearing House” that were lost after the first collision (Availability Digest, 2009). It became apparent that the World Trade Center was under attack when the second jet hit, and Commerzbank initiated immediate evacuation (Parris, Who Survives Disasters and Why, Part 2: Organizations, 2010). When the building lost power, Commerzbank’s backup power generator took over, but the HVAC system failed due to the debris, causing that site’s data center to shut down (Hewlett-Packard, 2002, p. 2). Automated failover processes continued as employees traveled to the recovery site (Editorial Staff of SearchStorage.com, 2002). The recovery site at Rye, New York, can be operated by 10 staff members, and 16 reported to the backup site on September 11th (Hewlett-Packard, 2002, p. 2). This site served as the primary data center, and in the days that followed EMC, Commerzbank’s storage vendor, worked around the clock to restore data that was backed up to tape rather than replicated (Editorial Staff of SearchStorage.com, 2002). EMC added “multiple terabytes” of storage to augment the backup site capacity during the following 36 hours, allowing restoration of “mission-critical” data as well as creation of new backups (Editorial Staff of SearchStorage.com, 2002).
4.1.2.3. Mitigation in place
Commerzbank’s “primary site was well-protected, with its own generator, fuel storage tank, cooling tower, UPS, batteries, and fire suppression system” (Parris, Who Survives Disasters and Why, Part 2: Organizations, 2010). Commerzbank was in the midst of virtualizing storage and had finished the majority of the conversion before the attacks (Mears, Connor, & Martin, 2002). The IT staff at Commerzbank designed and maintained a business continuity plan that included regular testing and a call tree (Hewlett-Packard, 2002, p. 3). This provided the capability to meet the zero-downtime requirement set forth by the business (Hewlett-Packard, 2002). To reach this goal, Commerzbank shadowed “everything” to the remote site (Parris, Who Survives Disasters and Why, Part 2: Organizations, 2010). The remote site, located 30 miles from the World Trade Center site at Rye, was the cornerstone of that plan (Hewlett-Packard, 2002).
Boensch describes the activities of Commerzbank’s Disaster Recovery (DR) site in non-disaster mode: “Our DR site is really dual purpose. The AlphaServer GS160 system is a standby production site in case of a disaster. But on a regular day-to-day basis, it’s up and running as a test and development system. Actually, the only things that are redundant in an active/active configuration are the StorageWorks data disks — they are truly dedicated both locally and remotely. We also use the site for training” (Hewlett-Packard, 2002, p. 4).
The primary site at the World Trade Center maintained local duplicate drives and “extra CPUs” (Hewlett-Packard, 2002, p. 2). There was also a “disaster-tolerant cluster” in the active/active data configuration described above to provide failover capacity in seconds (Parris, Using OpenVMS Clusters for Disaster Tolerance). Commerzbank used:
EMC's Symmetrix Remote Data Facility (SRDF) hardware and software to safeguard its customer transactions, financial databases, e-mail and other crucial applications. SRDF replicates primary data to one or more sites, making copies remotely available almost instantly. (Editorial Staff of SearchStorage.com, 2002)
This system provided “a standard, immediately functional environment for critical decision-support and transactional data” (Editorial Staff of SearchStorage.com, 2002). The facilities were physically connected via “Fibre Channel SAN,” providing a storage transfer rate of almost 1TB per second (Parris, Who Survives Disasters and Why, Part 2: Organizations, 2010). The remote site maintained servers that “were members of the cluster” at the World Trade Center site. These servers continued to serve using replicated “remote disks to the main site” after the storage there failed (Parris, Who Survives Disasters and Why, Part 2: Organizations, 2010). Commerzbank’s “follow-the-sun personnel staffing model meant help was available” around the clock (Parris, Who Survives Disasters and Why, Part 2: Organizations, 2010). Previously established vendor relationships with EMC and Compaq, later to become part of Hewlett-Packard (HP), ensured the vendors were on hand to assist with any services or equipment required to recover.
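The mitigation described above rests on synchronous replication: the remote copy is updated before a write is acknowledged, so the standby site is always current. The following conceptual sketch, which uses invented class and method names rather than the SRDF product's actual interface, illustrates that rule and why it yields a recovery point objective of effectively zero.

"""Conceptual sketch of synchronous replication (invented names, not SRDF's API)."""
from dataclasses import dataclass, field


@dataclass
class StorageSite:
    """A very small stand-in for one site's storage array."""
    name: str
    blocks: dict = field(default_factory=dict)

    def write(self, block_id: int, data: bytes) -> None:
        self.blocks[block_id] = data


@dataclass
class SynchronousMirror:
    """Acknowledge a write only after both sites hold the data.

    Because the remote copy is updated before the acknowledgement, losing
    the primary site loses no acknowledged writes (a recovery point
    objective of effectively zero), at the cost of added latency per write.
    """
    primary: StorageSite
    remote: StorageSite

    def write(self, block_id: int, data: bytes) -> None:
        self.primary.write(block_id, data)
        self.remote.write(block_id, data)  # completes before we acknowledge

    def fail_over(self) -> StorageSite:
        """After losing the primary, the remote copy is already current."""
        return self.remote


if __name__ == "__main__":
    mirror = SynchronousMirror(StorageSite("primary"), StorageSite("rye-standby"))
    mirror.write(1, b"transaction record")
    survivor = mirror.fail_over()
    assert survivor.blocks[1] == b"transaction record"
    print(f"{survivor.name} holds all acknowledged writes after failover")

The trade-off, as with any synchronous scheme, is that every write pays the latency of the remote update, which is one reason such configurations are typically limited to relatively nearby sites such as the 30-mile separation described above.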
4.1.2.4. Corrective actions
Commerzbank’s corporate vice president, Rich Arenaro, felt that the disaster recovery part of the business continuity plan worked. All critical data was available, but it still took nearly four hours to resume normal business operations (Mears, Connor, & Martin, 2002). Therefore, the bank had failed to meet the zero-downtime business requirement. The servers were “somewhat inflexible and required way too much human intervention.” Rye’s backup servers were not identical to those at the primary site, causing application compatibility problems with the operating systems (Egenera, 2006). "Our strategy had been based on a false one-to-one ratio of technology, meaning if I buy a server here and one for Rye, I'm protected," Arenaro says. "The reality is when you are faced with that situation, having hardware really is the least of your worries. It's really having your data and your systems available and ready to use" (Mears, Connor, & Martin, 2002).
Commerzbank corrected this by virtualizing its servers and eliminating proprietary operating systems. The virtualized Linux servers use “SUSE Linux and the support model of the open source community” rather than the HP operating system (Egenera, 2006). Another problem was that the hardware residing “on the server itself—the disk, network interface card and storage interface—give that server a fixed identity”; this also caused delays as the servers were manually reassigned (Egenera, 2006).
The virtualized environment provides a pool of servers with shared storage and networking hardware to “run any application on demand” (Egenera, 2006). The new system “is designed for SAN connectivity and boot”; any BladeFrame server “can assume any identity at any time. That’s what we were missing and what we grappled with on 9/11” (Egenera, 2006). The cooling requirements for the data center have also decreased due to the virtualized servers. The overall physical complexity has decreased as well: 140 servers were consolidated into 48 blades (Egenera, 2006). The virtualized configuration has reduced hardware troubleshooting time. Configuring new servers now takes less than an hour; it previously took up to 16 hours (Egenera, 2006).
The primary site and the backup site contain servers that are members of active/active clusters. Applications as well as data are stored on a SAN, allowing any services to be switched seamlessly between locations using bi-directional synchronous replication. The Rye site is now an active part of daily processing and handles 40% of the processing load (Egenera, 2006). “We live every day in the recovery portion of the DR mode. Having the assets active takes the mystery out of continuity. We’re not praying that it works, not planning that it works—we know it works because it’s an active part of the process” (Egenera, 2006).
4.1.2.5. Discussion
This case study provides an example of disaster recovery done correctly. The IT department was involved in contingency planning and performed regular testing, and every staff member knew what to do. The failover processes were sufficiently automated to allow the evacuation process to focus on safety without concern for heroics to save the business. Post-incident review showed some weakness in the technical contingency plan. The plan’s focus needed to be shifted from recovery to continuity to meet Commerzbank’s business needs. The company identified the problem, found a suitable solution, and implemented the solution.
The remaining weakness, based on the information available, is that there is no mention of a third cluster outside of New York. If an incident occurred that severely impacted New York on a larger scale, having only two clusters, both located in the New York area, may not provide the seamless zero downtime the company requires. This global company has the resources to commit to this more comprehensive configuration. It also has facilities around the world to take advantage of for co-location. Floor space use was reduced by 60% through server virtualization; this extra space should be taken advantage of to host remote clusters between Commerzbank locations to ensure continuity (Egenera, 2006).
In this case, like that of Katrina, the disaster destroyed the hardware at the site. There was little that preparedness could do to save the equipment. However, unlike Katrina, the recovery plan worked. Commerzbank had many advantages in this case; New York’s infrastructure did not suffer the damage New Orleans suffered. Commerzbank did not have to shoulder the burden of rebuilding a city, only its primary location. Also, Commerzbank had the resources necessary to provide for its uptime requirements.
The lesson that can be learned from Commerzbank is not to be complacent. Disasters of various scales happen on a daily basis; most are not terribly severe and impact a small number of people. Failure to plan for a large-scale, severe-impact event will increase the financial burden and stress of incidents that do occur. If possible, defray the costs of maintaining hot sites by integrating them into daily processing as Commerzbank has done. During planning, walk through as many scenarios as imaginable; this will help ensure that all details are covered.
4.1.3. Conclusion
Commerzbank survived 9/11 with relative ease while many others suffered unrecoverable losses. Many did not recover due to failure to plan and prepare for the possibility of massive hardware and personnel losses. Commerzbank understood the bank’s vulnerabilities and tolerances and made the investments necessary to mitigate them. Past experience had taught the company how to survive, and high-level management and staff were trained to manage incidents. This vigilance paid off in reduced downtime and minimized financial impact to the company.
4.2. FirstEnergy
4.2.1. Background
FirstEnergy (FE), founded in 1997 and located in Akron, Ohio, is ranked 179 on the 2010 list of Fortune 500 companies (FirstEnergy, 2008; FirstEnergy, 2009; Fortune, 2010). This unregulated utility supplies electricity to “Illinois, Maryland, Michigan, New Jersey, Ohio, and Pennsylvania” (FirstEnergy, 2009). FirstEnergy has remained highly profitable despite a history of poor practices that put the public at risk. One of the most notable resulted in a $5.45 million fine issued by the Nuclear Regulatory Commission (NRC). This fine regarded “reactor pressure vessel head degradation”; FirstEnergy was notified of the problem in 2002 by the NRC (Merschoff, 2005). The plant was operated for nearly two years after the company was aware the equipment was unsafe to operate (Merschoff, 2005). FirstEnergy employees supplied the NRC with misinformation, and at least two employees were indicted (Associated Press, 2006).
4.2.2. Northeast Blackout of 2003
In 2003, the Northeast region suffered a blackout, the largest in US history, leaving several Northeast US cities and parts of Canada without power (Minkel, 2008). News reports claimed this blackout was primarily due to a software bug that stalled the utility’s control room alarm system for over an hour. The operators were deprived of the alerts that would have prompted them to take the necessary actions to mitigate the grid shutdowns and failures. The primary energy grid monitoring server failed shortly after the failure of the alarm system; the backup server took over and then failed after a short period. The failure of the backup server overloaded the remaining server’s processing ability, bringing computer response time to a crawl, which further delayed operators’ actions due to a refresh rate of up to 59 seconds per screen (U.S.-Canada Power System Outage Task Force). The operators’ actions were slowed while they waited for information and service requests from the server to load.
4.2.2.1. Ramifications
4.2.2.1.1. General
In a matter of minutes the blackout cascaded through the power grid, taking down over 263 plants (Associated Press, 2003) and leaving eight states and parts of Canada without power (Barron, J., 2003). This blackout affected water supply, transportation, and communication. One hospital was completely without power (Barron, 2003), and governmental systems to detect border crossings, port landings, and unauthorized access to vulnerable sites failed (Northeast Blackout of 2003). The estimated cost of this blackout was $7-10 billion (Electricity Consumers Resource Council (ELCON), 2004).
4.2.2.1.2. FirstEnergy
Immediately following the outage, FirstEnergy’s stock value fell as investors were cautioned, citing the possibility of fines and lawsuits (From Reuters and Bloomberg News, 2003). A US-Canadian task force assigned to investigate “found four violations of industry reliability standards by FirstEnergy” (Associated Press, 2003). The FirstEnergy violations included not reacting to a power line failure within 30 minutes as required by the North American Electric Reliability Council, not notifying nearby systems of the problems, failing to analyze what was going on, and inadequate operator training (Associated Press, 2003). No fines were assessed because at that time no regulatory entity had the authority to impose fines (Associated Press, 2006). However, FirstEnergy stockholders sued for losses due to negligence, and the company settled in July 2004, agreeing to pay $89.9 million to stockholders (The New York Times Company, 2004).
4.2.2.2. Response
4.2.2.2.1. MISO
The Midwest Independent System Operator (MISO), located in Carmel, Indiana, is the group responsible for overseeing power flow across the upper Midwest (Associated Press, 2003) (Midwest ISO). The MISO state estimator tool malfunctioned due to a power line break at 14:20 Eastern Daylight Time (EDT) (U.S.-Canada Power System Outage Task Force). This was one of the two tools MISO used, both of which were under development, to assess electric system state and determine the best course of action (U.S.-Canada Power System Outage Task Force). The state estimator (SE) mathematically processes raw data and presents it in the electrical system model format. This information is then fed into the real-time contingency analysis (RTCA) tool to “evaluate the reliability of the power system” (U.S.-Canada Power System Outage Task Force, p. 48). At 14:15 the SE tool produced a solution with a high degree of error. The operator turned off the automated process that runs the SE every five minutes in order to perform troubleshooting. Troubleshooting identified the cause of the problem as an unlinked line, and the linkage was manually corrected. The SE was manually run at 13:00, producing a valid solution (U.S.-Canada Power System Outage Task Force). The real-time contingency analysis (RTCA) tool successfully completed at 13:07. The operator left for lunch, forgetting to re-enable the automated tool processing. This was discovered and re-enabled at about 14:40. The previous linkage problem recurred and the tools failed to produce reliable results. The tool was not successfully run again until “16:04 about two minutes before the start of the cascade” (U.S.-Canada Power System Outage Task Force, p. 48).
4.2.2.2.2. FE
The Supervisory Control and Data Acquisition (SCADA) system monitoring alarm function failed at 14:14 and began a cascading series of application and
server failures; by 14:54 all functionality on the primary and backup servers had failed (U.S.-Canada Power System Outage Task Force). FE’s IT staff were unaware of any problems until 14:20, when their monitoring system paged them because the Energy Management System (EMS) consoles had failed. At 14:41 the primary control system server failed and the backup server took over processing. The FE IT engineer was then paged by the monitoring system (U.S.-Canada Power System Outage Task Force). A “warm reboot” was performed at 15:08 (U.S.-Canada Power System Outage Task Force). IT staff did not notify the operators of the problems, nor did they verify with the EMS system operators that functionality was restored (U.S.-Canada Power System Outage Task Force). The alarm system remained nonfunctional. IT staff were notified of the alarm problem at 15:42, and they discussed the “cold reboot” recommended during a support call with General Electric (GE). The operators advised them not to perform the reboot because the power system was in an unstable state (U.S.-Canada Power System Outage Task Force). Reboot attempts were made at 15:46 and 15:59 to correct the EMS failures (U.S.-Canada Power System Outage Task Force).
An American Electric Power (AEP) operator, who was still receiving good information from FE’s EMS, called FE operators to report a line trip at 14:32. Shortly thereafter operators from MISO, AEP, PJM Interconnection (PJM), and other FE locations called to provide system status information (U.S.-Canada Power System Outage Task Force). FE operators became aware that the EMS
systems had failed at 14:36, when an operator reporting for the next shift reported the problem to the main control room (U.S.-Canada Power System Outage Task Force). The “links to remote sites were down as well” (U.S.-Canada Power System Outage Task Force, p. 54). The EMS failure resulted in the Automatic Generation Control (AGC), which works with affiliated systems to automatically adjust to meet load, being unavailable from 14:54 to 15:08 (U.S.-Canada Power System Outage Task Force). FE operators failed to perform contingency analysis after becoming aware that there were problems with the EMS system (U.S.-Canada Power System Outage Task Force). By 15:46 it was too late for the operators to take action to prevent the blackout (U.S.-Canada Power System Outage Task Force).
4.2.2.3. Mitigation in place
FirstEnergy did have mitigation in place. There were several server nodes that could host all functions, with one server on “hot-standby” for backup with automatic failover (U.S.-Canada Power System Outage Task Force). FE had an established relationship with the EMS vendor, GE, which provided support to the IT staff when a problem occurred that the IT staff was not experienced with. There were also established mutual aid relationships with other utility operators. The operators had the ability to monitor affiliated electric systems and request support. There were also established communication procedures that dictated that the operators make calls under specific conditions.
FirstEnergy also had a tree-trimming policy, which is a standard mitigation tactic for electric companies. The purpose of the policy is to avoid lines that will require immediate repair for safety reasons and will increase stress on the electric system. This is a non-technical mitigation measure that is very important for protecting the reliable functioning of the electric system and its monitoring tools.
4.2.2.4. Corrective actions
4.2.2.4.1. Regulatory
Federal Energy Regulatory Commission (FERC) regulations are no longer voluntary; FERC can now “impose fines of up to a million dollars a day” (Minkel, 2008). The Energy Policy Act of 2005 provided FERC the authority to set and enforce standards (Minkel, 2008). FERC has also created a prototype real-time monitoring system for the nation’s electric grid (Minkel, 2008). Future smart or supergrid systems are also under development. According to Arshad Mansoor, Electric Power Research Institute’s power delivery and utilization vice president, such a system would provide more resiliency by “monitoring and repairing itself” (Minkel, 2008). Project Hydra, scheduled to be in service in downtown Manhattan in 2010, is a joint supergrid venture between the Department of Homeland Security and Consolidated Edison Company of New York (Minkel, 2008). More testing and infrastructure upgrades are required before this promising technology can be implemented on a large scale (Minkel, 2008).
4.2.2.4.2. FirstEnergy
FirstEnergy implemented a new EMS system that was installed at two locations to provide resiliency (Jesdanun, 2004). The new system has improved alarm, diagnosis, and contingency analysis capabilities (NASA, 2008). The new system also provides more visual status information and cues (NASA, 2008). FirstEnergy created an operator certification program and an emergency response plan, and updated protocols. Communication requirements were established for “computer system repair and maintenance downtimes between their operations and IT staffs,” and “tree trimming procedures and compliance were tightened” (NASA, 2008).
4.2.2.5. Discussion

The primary cause of this outage appears to be human error. The electrical system operators were "unaware" of the problem for over an hour as the electrical system began to degrade. (U.S.-Canada Power System Outage Task Force) However, there were repeated warnings from communications with operators at various locations indicating there was a problem with the EMS. The operators were aware that there was a problem at 14:36, which gave them more than an hour to take action. The discussion between FE operators and IT staff indicated that the operators were aware that the electrical system state required action. Operators' actions may have been hampered from 14:54 to 15:59 by EMS screen refresh rates of up to "59 seconds per screen." (U.S.-Canada Power System Outage Task Force, p. 54)
FE's IT staff failed to notify the operators at 14:20, when they became aware of EMS system failures. Doing so could have provided the EMS operators with 16 more minutes to determine and execute the correct course of action. Also, the FE EMS was not configured to produce alerts when it fails, which is a standard EMS feature. This would have provided another six minutes for the operators to perform manual actions. Based on the operators' failure to act on the many other warnings they received, it is hard to make a case that the operators would have acted in a timely manner even with an additional 32 minutes of notice. It is possible that the operators were too dependent upon the automated systems and overconfident that the situation would correct itself. The North American Electric Reliability Council (NERC) found FE in violation for failure to use "state estimation/contingency analysis tools". (U.S.-Canada Power System Outage Task Force, p. 22) The EMS had been "brought into service in 1995", and it had been decided to replace the aging system "well before August 14th". (U.S.-Canada Power System Outage Task Force, pp. 55-56) NERC also found FE in violation for insufficient monitoring equipment. (U.S.-Canada Power System Outage Task Force) It was later determined that the software had a programming error that contributed to the alarm failure. According to Kema transmission services senior vice president Joseph Bucciero, "the software bug surfaced because of the number of unusual events occurring simultaneously - by that time, three FirstEnergy power lines had already short-circuited." (Jesdanun, 04) The three
lines were lost because FE failed to perform tree trimming according to its internal policy. The lines sagged, as occurs on hot days, and touched trees. (NASA, 2008)
4.2.3. Conclusion

This outage serves as an example of how many small, mostly human, errors can result in disaster. A more resilient system requiring less human interaction to perform emergency tasks could have prevented this outage. Poor communication between IT and operations staff was a large factor, as was the operators' failure to heed the warnings of other operators. The FirstEnergy operators were provided with information outside of their EMS sufficient to understand that the EMS was likely providing unreliable information. The largest contributing factor was FirstEnergy's failure to be proactive. They did not trim trees, they did not replace their old EMS, they did not communicate appropriately with other energy operators, and they did not train employees how to act in a crisis when the EMS could not be relied upon. There were contributing factors outside of FirstEnergy, but if any one of the factors contributed by FirstEnergy had been removed, the widespread outage might not have occurred.
4.3. Tulane
4.3.1. Background

Tulane University is a private institution located in New Orleans, Louisiana, with an extension in Houston, Texas, for the Freeman School of Business. (Gerace, Jean, & Krob) The university was established in 1834 as a tropical disease research medical school. (Alumni Affairs, Tulane University, 2008) A post-Civil War endowment from Paul Tulane transformed the financially struggling public university into the private university that survives today. (Alumni Affairs, Tulane University, 2008) Tulane maintains a community service oriented focus, and its contributions have shaped the city of New Orleans over the decades. In 1894 Tulane's College of Technology brought electricity to the city of New Orleans. (Alumni Affairs, Tulane University, 2008) Tulane is currently New Orleans's largest employer. (Tulane University) Since the university was established, Tulane has weathered the Civil War and many hurricanes. Tulane has adapted to the hurricane-prone New Orleans environment and has integrated buildings that can "withstand hurricane force winds" into the campus landscape. (Alumni Affairs, Tulane University, 2008) Only Katrina and the Civil War have prevented Tulane from offering instruction. (Tulane University, 2009)
4.3.2. Hurricane Katrina

Two days before the beginning of Tulane's 2005 fall semester, Hurricane Katrina devastated New Orleans. (Blackboard Inc., 2008) This was "the worst natural disaster in the history of the U.S." (Cowen, 05) The real damage to New Orleans began hours after Katrina passed, as the levees succumbed to the damage they suffered during the storm.
4.3.2.1. Ramifications

The ramifications of this disaster reached far beyond Tulane's campus. However, Tulane's data center is the focus of this case study; therefore, the direct impact on Tulane and the cascading effects are discussed. The hurricane's property damages alone were in excess of $400 million. (Alumni Affairs, Tulane University, 2008) Over a week after Katrina, "eighty percent of Tulane's campus was underwater." (Alumni Affairs, Tulane University, p. 66) The New Orleans campus was closed for the fall semester of 2005. (Cowen S., Messages for Students, 05) Students were displaced and attended other colleges as "visiting" students. (Gulf Coast Presidents, 2005) Some students were asked to pay fees at the hosting university; Tulane promised to address tuition issues as soon as it gained access to its student records. (Cowen S., Student Messages, 05)
As university administration began planning for Tulane's recovery from Hurricane Katrina, they had no access to "computer records of any kind". (Alumni Affairs, Tulane University, 2008, p. 65) Tulane's bank was not operational, and the administration did not know what funds were in the inaccessible account. (Alumni Affairs, Tulane University, 2008) Accounts receivable servers were unrecoverable because they operated independently of central IT. (Lawson, A Look Back at a Disaster Plan: What Went Wrong and Right, 05) Research at Tulane University suffered as well. Specimens from long-running studies were destroyed. Engineering faculty returned to campus "to service critical equipment and retrieve important servers", which saved several experiments. (Grose, Lord, & Shallcross, 2005) Over 150 research projects suffered damage. (Alumni Affairs, Tulane University, 2008) Medical teams were forced to destroy dangerous germ specimens used in research to avoid possible outbreaks caused by their inadvertent release. In addition, Tulane's hospital was closed for six months, but "was the first hospital to reopen in downtown New Orleans". (Oversight and Investigations Subcommittee of the House Committee on Energy and Commerce, p. 1) Tulane reopened in January 2006 for the spring semester. The school lost $125 million due to being closed for the fall semester of the 2005-2006 school year. (Alumni Affairs, Tulane University, 2008) Prior to reopening, Tulane had to streamline its academic programs. This made funding available for the daunting task of rebuilding Tulane and New Orleans. New Orleans had no infrastructure to
support Tulane. Tulane provided housing, utilities, and schools to support its students and staff. (Alumni Affairs, Tulane University, 2008) Despite Tulane's amazing recovery, loss of tuition income and disaster-related financial losses forced staff reductions and furloughs. (Lord, 2008)
4.3.2.2. Response

On Monday, August 29, 2005, Tulane University was flooded after the levees damaged by Hurricane Katrina broke. (Searle, 2007) Tulane was fortunate to have a few days of warning prior to the hurricane. On August 25, Tulane's IT staff initiated online data backups according to the data center disaster recovery plan. (Lawson, 2005) On August 28, Tulane brought its information systems down. (Lawson, 2005) Backup generators and supplies were placed into campus buildings. (Krane, Kahn, Markert, Whelton, Traber, & Taylor, 2007) On the 30th, generators began to fail as Tulane's campus flooded; as a result, communication systems failed "with loss of e-mail systems and both cell and landline phones. Text messaging remains functional and becomes the main source of communication." (Krane, Kahn, Markert, Whelton, Traber, & Taylor, 2007) Senior administration staff sheltered in the Reily Student Recreation Center command post along with other essential staff during the hurricane. On Wednesday, August 31, Tulane's "Electrical Superintendant Bob Voltz" shut off power to the Reily building. (Alumni Affairs, Tulane University, 2008) On Thursday,
the staff was rescued by helicopter from the now flooded campus after several unsuccessful rescue attempts. (Alumni Affairs, Tulane University, 2008) Tulane's top recovery priority was paying its employees. (Anthes, 2008) This effort was complicated because payroll employees failed to take the payroll printers and supplies as specified in the disaster plan. (Lawson, A Look Back at a Disaster Plan: What Went Wrong and Right, 05) Police escorted Tulane IT staff to retrieve Tulane's backup data and computers from their 14th-floor offsite data center in New Orleans. (Alumni Affairs, Tulane University, 2008) Tulane's recovered backup tapes were processed at SunGard in Philadelphia. (Anthes, 2008) SunGard's willingness to take Tulane as a customer allowed payroll to be completed "two days late", according to Tulane CIO John Lawson. (Lawson, A Look Back at a Disaster Plan: What Went Wrong and Right, 05) As of September 3, 2005, Tulane still listed restoration of communications and IT systems as an urgent issue. (Cowen S., Student Messages, 05) Tulane's President Scott Cowen held live chats in September 2005 to address community concerns. (Cowen S., Student Messages, 2005) Baylor University in Houston hosted Tulane's redirected website and invited Tulane to resume operations at Baylor. However, this process did not go as smoothly as planned because the IP address assigned was not static. (Lawson, A Look Back at a Disaster Plan: What Went Wrong and Right, 05) This was quickly corrected, and Tulane used the redirected emergency site to communicate with stakeholders, providing "a continuous and unbroken chain of
updates via its Web site." (Schaffhauser, 2005) School of Medicine classes resumed in three weeks despite beginning with:

none of the necessary infrastructure that maintains the functions of any medical school was available to Tulane's SOM. Information technology support, network communication servers, the University's payroll system, and e-mail were down, and student, resident, and faculty registration systems were not operational. Student and resident rosters did not exist, nor were there any methods to confirm credentials or grades. (Krane, Kahn, Markert, Whelton, Traber, & Taylor, 2007)

Clinical students were able to resume because the Association of American Medical Colleges maintains a database on medical students, which had been updated in the days before Katrina hit. (Testa, 2006) The database records, along with the Baylor registration website and newly created paper files, allowed Baylor and Tulane to gather the information needed to resume classes. (Testa, 2006) This resumption was particularly vital for seniors. Unfortunately, not all of the college students of New Orleans were so lucky. About 100,000 were displaced, many with no academic or financial records. (DeCay, 2007) Email "was the first system to be brought back online". (McLennan, 2006) Blackboard provided systems to allow Tulane and other affected Gulf Coast universities to establish online courses. (McLennan, 2006) This system was utilized by Tulane to provide a six-week "mini fall semester". (McLennan, 2006) Tulane's own Blackboard system was quickly restored to allow retrieval of course material. (McLennan, 2006) There was no help desk to assist students or instructors during the "mini fall semester". (McLennan, 2006)
4.3.2.3. Mitigation in place

Tulane's IT had plans that covered "how to prepare for a hurricane." (Anthes, 2008) The staff was trained and comfortable enacting the disaster plan. They knew the backups could be completed in 36 hours. (Lawson, A Look Back at a Disaster Plan: What Went Wrong and Right, 05) Offsite backups were maintained on the 14th floor of a building in New Orleans. (Anthes, 2008) Tulane also maintained a website for emergency information and phone contacts. (McLennan, 2006) In case of a category 4 or higher hurricane, the data center would be shut down and evacuated. (Lawson, A Look Back at a Disaster Plan: What Went Wrong and Right, 05) The remotely hosted emergency website for Tulane would be activated prior to shutdown. (Lawson, A Look Back at a Disaster Plan: What Went Wrong and Right, 05)
4.3.2.4. Corrective actions

Today the university has a disaster recovery plan, updated yearly, that includes offsite backup servers for websites, e-mail, and other critical systems. (Anthes, 2008) There are also documented protocols for recovery from a disaster, which were missing during the recovery from Katrina. (Anthes, 2008) The recovery plan has also been amended to cover more than hurricanes, and IT staff now participate in preparedness planning. (Anthes, 2008) (Gerace, Jean, & Krob, 2007)
As of 2008, Tulane had a contract with SunGard for a mobile data center in emergencies. (Anthes, 2008) Katrina's effect on the New Orleans backup data center made it clear that backups needed to be maintained at a more distant location; as a result, "backups are taken to Baton Rouge 3 times a week". (Anthes, 2008) Employees have been provided with USB storage devices to prepare personal backups for emergencies. (Anthes, 2008) An alternate recovery site has been established in Philadelphia, and there is now a hardened onsite command center at Tulane. (Lord, 2008) "Energy efficient systems were installed in the down town campus", which can be operated longer using emergency generators. (Alumni Affairs, Tulane University, 2008) Tulane also maintains a "digital ham-radio network that can transmit simple e-mail", and emergency updates to the website can be published directly by the university's public relations staff. (Lord, 2008) "So as not to be dependent on the media to track potentially disastrous hurricanes, Tulane has enlisted a private forecaster to supply e-mail updates." (Lord, 2008) "Students are required to have notebook computers", which can facilitate continuity during a disaster, and the university now has online classes. (Gerace, Jean, & Krob, 2007) (Lord, 2008)
4.3.2.5. Discussion

Tulane's situation is an extreme, but not unique, example. There were many things they did right, and in the end they recovered. It is debatable whether the plan for offsite disaster recovery would have been worth the investment in
dollars. Itemized financial reports for Tulane were not available for review. It is clear that the absence of an offsite recovery contract was a deliberate financial decision. (Lawson, A Look Back at a Disaster Plan: What Went Wrong and Right, 05) In retrospect, this was probably a poor financial gamble in a hurricane-prone area, especially considering that the destruction of the levees was a known risk. (Kantor, 2005) This decision also created additional stress for Tulane's staff and students. Tulane did an excellent job of recovering payroll to ensure its staff was not without desperately needed financial resources. The medical students were also well cared for thanks to the help of outside partnerships. The continued medical program would not have been possible had there not been an existing, if informal, mutual aid relationship with Baylor. Unfortunately, the loss of Tulane's data center made for a difficult fall 2005 semester for most students. They not only had to relocate but were without financial or academic records from Tulane. For those students, the approximately $300,000 per year expenditure would have provided some peace of mind. (Lawson, A Look Back at a Disaster Plan: What Went Wrong and Right, 05) As a result of this, as well as other adverse conditions at Tulane, many students did not return. In 2008, enrollment at Tulane was down by 5,300 students from its pre-Katrina numbers. (Lord, 2008) This resulted in financial distress for Tulane, the closing of its engineering school, and the consolidation of other programs and colleges. (Lord, 2008)
Nonetheless, Tulane made herculean efforts to reopen one semester after Katrina forced the university to close. This ensured not only the survival of Tulane but the revival of New Orleans as well. The medical students and hospital provided much needed health care for New Orleans residents. Architecture students are designing and building affordable, energy-efficient homes. (Brown, 2008) The damage caused to Tulane and New Orleans was beyond what prevention or infrastructure protection could have averted. Efforts to preserve the ability to recover from a complete loss of IT and infrastructure proved to be the most valuable in this case. No one institution was capable of recovering New Orleans, but Tulane has kept it alive.
4.3.3. Conclusion

Tulane has learned from Katrina how to protect the data that is the lifeblood of the university. The aftermath of Katrina has also made clear that students are Tulane's customers and that the university cannot survive without them, and further that New Orleans is dependent upon Tulane. Universities have a history of providing for the communities they are a part of in times of disaster; this was true in the aftermath of 9/11 and of Katrina. As the city's largest employer, a medical provider, and an educator, Tulane has persevered, shored up its weaknesses, and become more independent.
4.4. Commonwealth of Virginia
4.4.1. Background

The state of Virginia outsourced its information technology to Northrop Grumman in 2005. (Schapiro & Bacque, Agencies' computers still being restored, 2010) The contract was to span 10 years at a cost of $2.4 billion, with Northrop Grumman becoming the largest single vendor in Virginia's history. (Lewis) The Virginia Information Technologies Agency (VITA) was established in 2003 and is the state agency charged with ensuring that the state's information technology needs and the terms of the contract with Northrop Grumman are met. (Lewis) This was to be the flagship partnership to show that the public sector could benefit through private outsourcing of information technology. However, "(d)elays, cost increases and poor service have dogged the state's largest-ever outsourcing contract, the first of its kind in the country". (Schapiro & Bacque, Agencies' computers still being restored, 2010) Virginia had entered the contract with the expectation that it would provide modernized services for the "same cost as maintaining their legacy services." (Stewart, 2006) At this point, the state no longer expects to see any cost savings under the original contract period but hopes that savings will be realized under an extended contract. (Joint Legislative Audit and Review Commission, 2009)
Since the beginning of the contract with Northrop Grumman, the state of Virginia has suffered two major outages. In addition, the state paid an additional $236 million to cover a hardware refresh. (Schapiro & Bacque, Agencies' computers still being restored, 2010) This process, scheduled to be completed in July 2009, is significantly behind schedule. There have been ongoing issues with Northrop Grumman's poor performance in several areas. (Joint Legislative Audit and Review Commission, 2009) These issues include inadequate disaster recovery and unreliable backup completion. As recently as October 2009, a "lack of network redundancy" was recognized as a "major flaw" in the system. (Joint Legislative Audit and Review Commission, 2009, p. 105)
4.4.1.1. Oversight issues

Until the latter part of March 2010, VITA could make changes to the contract with Northrop Grumman without consulting the General Assembly. (Joint Legislative Audit and Review Commission, 2009) This limited the governor's ability to oversee the state's IT services. The Information Technology Investment Board (ITIB) is charged with oversight of VITA but could not provide full-time oversight. The members of the ITIB attended meetings irregularly and lacked the technical knowledge required to provide adequate governance. (Joint Legislative Audit and Review Commission, 2009) VITA's oversight was restructured to eliminate the ITIB. VITA and the state CIO now report to the Office of the Secretary of Technology. This new structure creates oversight by the
Governor. The new structure became effective March 16, 2010, after passing through the House and Senate via an emergency clause. (VITA)
4.4.1.2. Notable service failures

The state has been plagued with a litany of service failures throughout the contract with Northrop Grumman. In 2009, prison phone service failed, and the failure was prioritized according to the number of employees affected. The technicians were given 18 hours to resolve the issue according to the assigned prioritization. Service was restored six and a half hours later, following an escalation request initiated by the prison. (Joint Legislative Audit and Review Commission, 2009) Another service failure, noted in the JLARC 2009 report, left the Virginia State Police without internet access for three days. (Kumar & Helderman, 2009) On June 20, 2007, the state of Virginia suffered a widespread outage. (VITA, 2007) The outage was caused by several "near simultaneous hardware failures" in a legacy server scheduled for refresh. (The Virginia Information Technology Infrastructure Partnership) This failure occurred after the annual disaster recovery test, which had been held in April.
4.4.2. August 2010 outage

On Wednesday, August 25, an outage occurred that impacted 27 state agencies. (News Report, 2010) Thirteen percent of the state's file servers were unavailable
during the outage. (Schapiro & Bacque, Agencies' computers still being restored, 2010) "The computer troubles were traced to a hardware malfunction at the state's data center near Richmond, which caused 228 storage servers to go offline." (Kravitz, Statewide computer meltdown in Virginia disrupts DMV, other government business, 2010) The hardware that failed was one of the SAN's two memory cards. (Lewis) According to Jim Duffey, Virginia Secretary of Technology, the outage was "unprecedented" based on the "uptime data" on the EMC SAN hardware that caused the widespread failure. (News Report, 2010) "Officials also said a failover wasn't triggered because too few servers were involved." (News Report, 2010) "Workers restored at least 75 percent of the servers overnight." (Kravitz, Statewide computer meltdown in Virginia disrupts DMV, other government business, 2010)
4.4.2.1. Ramifications

The SAN failure negatively impacted "483 of Virginia's servers." (Schapiro & Bacque, Agencies' computers still being restored, 2010) Virginia's Department of Motor Vehicles (DMV) was the most visibly impacted agency. Drivers were not able to renew licenses at DMV offices during the outage, forcing the DMV to open on Sunday and work through Labor Day to clear the backlog of expired licenses. (Schapiro & Bacque, Northrop Grumman regrets computer outage, 2010) Some drivers were ticketed for expired licenses before law enforcement agencies were requested to stop issuing tickets to affected drivers. (Charette,
2010) According to the Virginia State Police, while "they will not cite drivers whose licenses expired during the blackout", those who had already received tickets must unfortunately "go through the court system" to request relief. (Kravitz & Kumar, Virginia DMV licensing services will be stalled until at least Wednesday, 2010) In addition, drivers who renewed licenses the day of the blackout will need to visit the DMV again because the data and pictures from that day's transactions were lost. (Schapiro & Bacque, Northrop Grumman regrets computer outage, 2010) This also increases the likelihood that some of the licenses and IDs issued that day could be illicitly sold. The DMV was not the only agency negatively impacted by the SAN outage. According to the Department of Social Services, about 400 welfare recipients would receive benefit checks up to two days late. Employees at this agency also worked overtime to reduce and eliminate delays where possible. (Schapiro & Bacque, Agencies' computers still being restored, 2010) Internet services used by citizens to make child support and tax payments were unavailable as well. (Schapiro & Bacque, Northrop Grumman regrets computer outage, 2010) "At the state Department of Taxation, taxpayers could not file returns, make payments or register a business through the agency's website." (Schapiro & Bacque, Agencies' computers still being restored, 2010) Three days after the outage began, "(f)our agencies continue(d) to have 'operational issues'"; these agencies included the Departments of Taxation and Motor Vehicles. Many
other agencies continued to suffer negative effects from the outage. (Schapiro & Bacque, Agencies' computers still being restored, 2010)
4.4.2.2. Response

At approximately noon on August 25, a data storage unit sent an error message. (Wikan, 2010) The cause of the error message was determined to be that "one of the two memory boards on the machine needed replacement." (Wikan, 2010) "A few hours later, a technician replaced the board". (Wikan, 2010) Shortly after the board was replaced, the storage area network (SAN) failed. It was later discovered that the wrong board might have been replaced. (Wikan, 2010) "VITA and Northrop Grumman activated the rapid response team and began work with the appropriate vendors to restore service." (VITA) Work continued through the night to restore services but did not restore data access to the affected servers. (Wikan, 2010) (VITA) On Thursday, the SAN was shut down overnight to replace all components. (Wikan, 2010) The storage provider, EMC, determined that the best course of action was to perform an extensive maintenance and repair process, and VITA and Northrop Grumman, in consultation, agreed that this was the best way to proceed. (VITA) The 24 affected agencies were notified prior to the SAN shutdown to allow them to take appropriate action. (VITA) SAN service was restored at "2:30 a.m. Aug. 27." (Wikan, 2010) Over half of the attached servers were operational Friday
morning. (VITA) VITA began working with the operational customers to confirm service availability and perform data restoration. (VITA, 2010) Unfortunately, the DMV remained unable to "process driver's licenses at its customer centers. Some other agencies continue(d) to be impacted." (VITA) VITA continued data restorations over the weekend; the DMV restore took "about 18 hours." (Wikan, 2010) As of Monday, August 30, "Twenty-four of the 27 affected agencies were up and running". (VITA) However, three key agencies still suffered service outages Monday through Wednesday: "the Department of Motor Vehicles, Department of Taxation and the State Board of Elections." (VITA)
4.4.2.3. Mitigation in Place

The recovery in this SAN outage was facilitated by mitigation measures already in place. Not only was there a "fault-tolerant" SAN, but there were also magnetic tape backups, and the staff had just performed recovery exercise testing. The established relationship with the hardware vendor EMC brought additional expertise to resolve the SAN outage. VITA also has two data centers, a primary and a failover data center. Examination of the documents available on the VITA web site would suggest that every recommended best practice is being implemented and executed. The SAN hardware used is best in class and has excellent reliability. VITA also had a rapid response team whose mission was to reach incident resolution rapidly. (Nixon, 2010) Yet an outage in one system had a serious negative impact on
several agencies and, more importantly, the citizens of Virginia for more than one week. This incident is one of many that the state of Virginia has suffered since the beginning of the contract with Northrop Grumman. Professing the use of industry standards and best practices does not by itself result in a reliable, stable cyberinfrastructure. In this case, there was still a single point of failure that resulted in more than a minor inconvenience for Virginia, implying a serious lack of foresight in planning for recovery from this "unprecedented" outage. IT professionals experience one-in-a-million or one-in-a-billion faults many times throughout their careers. Disasters that are disregarded as near impossible do occur. The state invested in a long-term partnership with Northrop Grumman to avoid outages such as the one that occurred in late August. Virginia made the necessary financial investments that should have resulted in a more stable, modern infrastructure managed by experienced IT staff.
4.4.2.4. Corrective Actions

The corrective actions were largely unknown at the time of this report; an independent review has been ordered. Agilisys Inc. was chosen to conduct a 10- to 12-week audit beginning November 1, 2010. (VITA, 2010)
4.4.2.5. Discussion

From a technical perspective, it is difficult to do more than speculate about exactly what happened and extrapolate what should have been done. VITA participated in disaster recovery exercises in the second quarter of 2010. (VITA, 2010) The exercise involved restoring service after losing a data center. Provided that the exercise was adequately rigorous, performing a restore for an outage affecting 13 percent of servers should have been relatively easy. The major complicating factor resulting in delayed recovery was reported to be tape restoration and data validation. (Wikan, 2010) More emphasis should be placed on data restoration and validation activities in future exercises. Incidents resulting in partial data loss or corruption are far more likely than loss of an entire data center. Activities that improve restoration time for data recovery are valuable to avoid serious negative business impacts from a relatively common incident. Practicing these restorations will provide insight into process and technology enhancements that might improve recovery time. In this case, the data recovery process from tape left the DMV unable to issue or update driver's licenses or IDs for a week. A data restoration exercise might have revealed this weakness, and another solution might have been put in place to mitigate the recovery time issue.
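To make the restoration and validation exercises recommended above more concrete, the sketch below shows one way a restore drill could be scored automatically by comparing restored files against a manifest of checksums captured at backup time. It is an illustrative example only, with hypothetical paths and manifest format; it does not describe VITA's or Northrop Grumman's actual tooling.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_restore(restore_root: Path, manifest_file: Path) -> list[str]:
    """Compare restored files against a manifest of relative paths and digests
    recorded when the backup was taken; return a list of discrepancies."""
    manifest = json.loads(manifest_file.read_text())  # {"relative/path": "digest", ...}
    problems = []
    for relative_path, expected_digest in manifest.items():
        restored = restore_root / relative_path
        if not restored.exists():
            problems.append(f"missing after restore: {relative_path}")
        elif sha256_of(restored) != expected_digest:
            problems.append(f"checksum mismatch: {relative_path}")
    return problems


if __name__ == "__main__":
    # Hypothetical locations; a real exercise would point these at the
    # restored volume and at the manifest written during the backup run.
    issues = validate_restore(Path("/restore/dmv"), Path("/backups/dmv_manifest.json"))
    print("restore validated" if not issues else "\n".join(issues))
```

Run as part of a scheduled exercise, a check of this kind surfaces missing or corrupted records during a drill rather than, as happened here, during an actual outage.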
The Northrop Grumman decision to allow days between backups is highly questionable. (Availability Digest, 2010) It is difficult to justify anything less than daily backups for agencies like the DMV, the Department of Taxation, and Child Support. Loss of payment records for the latter two agencies would cause major inconveniences and bad press. Loss of four days of identification data for licenses and IDs is inexcusable. The root of this decision likely lies in the bottom line: Northrop Grumman is trying to make a profit and failed to implement sufficient redundancy for the customer's business needs. It is difficult to imagine a situation where allowing days between backups is anything less than negligence.
Additional resiliency mechanisms should be built into the databases and storage. This is advisable for all high availability databases and might have avoided the data loss and corruption that occurred. One possible mechanism is local auditing copies maintained onsite in intermediate storage until there is confirmation that the data was written to the SAN. The local copy would then be held until the backup copy is confirmed as processed. This would entail maintaining onsite transaction records and data for up to 48 hours. Maintaining local daily backups for daily transactions is also an advisable practice to avoid loss of records. Another option is to maintain clone Business Continuance Volumes (BCVs), essentially regularly scheduled copies between the two SANs. This creates mirrored storage systems, with hot scheduled copies occurring as often as every minute, using technology such as Oracle Data Guard or SQL Server mirroring and log shipping. Most database engines have a way to replicate themselves in a near real-time state; the replicated copy is stored on separate physical hardware in order to eliminate data loss. The use of both options presented would significantly reduce the possibility of data loss.
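The local intermediate-copy mechanism described above can be sketched as a simple write path in which every transaction is kept onsite until both the SAN write and the covering backup are confirmed. The sketch below is a minimal illustration under those assumptions; the class and method names are hypothetical, not any vendor's actual interface.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class LocalAuditCopy:
    """A transaction record held in onsite intermediate storage until the
    SAN write and the backup run that covers it are both confirmed."""
    payload: bytes
    record_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    written_at: float = field(default_factory=time.time)
    san_confirmed: bool = False
    backup_confirmed: bool = False


class IntermediateStore:
    RETENTION_SECONDS = 48 * 3600  # hold local copies for up to 48 hours

    def __init__(self) -> None:
        self._records: dict[str, LocalAuditCopy] = {}

    def record(self, payload: bytes) -> str:
        """Write the local copy before the primary SAN write is attempted."""
        copy = LocalAuditCopy(payload)
        self._records[copy.record_id] = copy
        return copy.record_id

    def confirm_san_write(self, record_id: str) -> None:
        self._records[record_id].san_confirmed = True

    def confirm_backup(self, record_id: str) -> None:
        self._records[record_id].backup_confirmed = True

    def purge(self) -> None:
        """Drop only records that are fully confirmed and past retention;
        anything unconfirmed stays available for replay after a SAN fault."""
        now = time.time()
        self._records = {
            rid: copy for rid, copy in self._records.items()
            if not (copy.san_confirmed and copy.backup_confirmed
                    and now - copy.written_at > self.RETENTION_SECONDS)
        }

    def unconfirmed(self) -> list[LocalAuditCopy]:
        """Records that would need to be replayed if the SAN copy is suspect."""
        return [c for c in self._records.values() if not c.san_confirmed]
```

Because unconfirmed records are never purged, data lost or corrupted on the SAN during an incident like the August 2010 failure could, in principle, be replayed from the onsite copies rather than recovered from multi-day-old tape.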
The fact that "too few servers were involved" to trigger failover is baffling. (News Report, 2010) Any fault with the potential to cause the impact experienced by this outage should initiate a failover. The IT staff should have initiated a manual failover prior to making the SAN repair for the initial hardware failure. This suggestion assumes that the failover would have eliminated the dependence on the faulty SAN. In addition, if the SAN was still operating, why did the technician perform the repair during business hours? The technician should have created a cold backup to tape prior to performing the repair during off-hours. The technician should have been aware that a backup had not occurred for four days and understood the potential data loss that could result. (Availability Digest, 2010) VITA's staff may need additional training to help them identify situations where initiating a failover is appropriate. Training may also be required to identify when to perform a manual backup, as well as which situations can wait for an after-hours repair.
It is likely that required change management processes were not followed. The VITA webpage professes a commitment to the principles of the Information Technology Infrastructure Library (ITIL). (VITA) Following ITIL principles, the SAN repair would have been subject to a change management process. An emergency change request should have been submitted explaining the problem, the proposed fix, and the steps to be taken. Affected customers, process owners, and the change management board (or equivalent) should have been notified. Either there was no change request, no one reviewed the change request, the request was not understood, or the proposed steps were not executed.
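As an illustration of the kind of emergency change record that ITIL-style change management calls for in this situation, the sketch below captures the elements mentioned above: the problem, the proposed fix, the planned steps, and the parties that must be notified before work begins. The field names are hypothetical and are not drawn from VITA's actual process.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class EmergencyChangeRequest:
    """Minimal emergency change record: what is broken, what will be done,
    in what steps, and who must be notified before work starts."""
    summary: str                    # the problem being addressed
    proposed_fix: str               # e.g. replace the suspect SAN memory board
    planned_steps: list[str]        # ordered steps, including a pre-change backup
    affected_customers: list[str]   # agencies that depend on the component
    notified_parties: list[str] = field(default_factory=list)
    approved_by_change_board: bool = False
    submitted_at: datetime = field(default_factory=datetime.utcnow)

    def ready_to_execute(self) -> bool:
        """Work should not begin until approval is recorded and every
        affected customer appears in the notification list."""
        return self.approved_by_change_board and all(
            customer in self.notified_parties for customer in self.affected_customers
        )


if __name__ == "__main__":
    request = EmergencyChangeRequest(
        summary="SAN reporting memory board error",
        proposed_fix="Replace suspect memory board during off-hours window",
        planned_steps=["cold backup to tape", "manual failover", "board swap", "verify writes"],
        affected_customers=["DMV", "Department of Taxation"],
    )
    print(request.ready_to_execute())  # False until approval and notifications are recorded
```

Had a record of this kind been required and reviewed, the four-day gap since the last backup and the business-hours timing of the repair might well have been flagged before the board was replaced.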
Monitoring tools may also have played a role in this outage. The IT staff either ignored alerts, did not understand them, or had the monitoring tools incorrectly configured. Monitoring alerts should have notified the staff of the problem, identified which SAN controller was having the problem, and alerted staff to failed write attempts to the networked storage. Additional training could have ensured properly implemented monitoring tools and the IT staff's ability to understand the alerts.
This outage also displays the weakness of consolidated, centralized services. The financial motivation to move to centralized services is strong. It is important to balance the cost savings against the risk being taken; the savings may not justify the risk for many governmental organizations. Strategic long-term planning focused on business needs rather than cost savings is a requirement. Perhaps the DMV data is not a place to cut corners. Distributed servers were implemented to avoid single points of failure, and while even a distributed system is not free from failure, the possibility of widespread failure and data loss is reduced. Strategic planners would do well to be skeptical of vendor claims. EMC may claim the SAN outage is unprecedented, but they do not claim this is the first
outage. It is foolish to depend heavily on a single piece of hardware for any business-critical service. Northrop Grumman has failed to deliver a reliable, quality service due to poor strategic planning. Northrop Grumman has enough experience to deliver quality service. However, they have chosen to design IT services for the state of Virginia that allow a single hardware failure to cause outages for several agencies spanning days. There are many areas for improvement revealed by this outage. Ultimately, the outage was a result of human error. Human error will occur; however, there need to be as many safeguards as possible in place to keep human error from escalating into a fiasco. Training would help with human error. Training is an ongoing process that must be maintained along with the process of constant improvement. Northrop Grumman, EMC, and VITA share the blame for poorly automated redundancies and backups. A single technician dealing with an "unprecedented" outage will always be likely to make a mistake in a moment of uncertainty. The partnership between VITA and Northrop Grumman was established as part of a risk transference plan. Outsourcing expensive IT services to a company that specializes in IT should result in lower cost due to bulk discounts, enhanced services, and access to high quality IT staff. The total costs of outsourcing IT should go down over time due to falling hardware costs. (Lee) These expected financial benefits are nonexistent in the Virginia-Northrop Grumman contract. The effectiveness of this partnership should be reviewed in terms of value to the
customer, in this case the citizens of Virginia. The quality of the partnership should be reviewed using the dimensions of fitness for use and reliability. (Lee & Kim, 1999) The events of the last few years have shown that the service Northrop Grumman provides is of questionable fitness for use and reliability. Poorly strategized and executed services have not only cost Virginians money but have been a source of inconvenience and delay. Some Virginians had to go to court to combat expired-license tickets; those who cannot find the time to do this may also face increases in insurance premiums. These issues seem small in comparison to the compromised integrity of the licenses issued by the Virginia DMV just prior to the outage. These licenses are legal and nearly untraceable and could fetch high prices on the black market. Also, consider the safety of those working in prisons without phone service for hours. The phone outage described was not an isolated event. The 10-year contract with Northrop Grumman has left little possibility of exiting the contract and requesting new outsourcing bids. Virginia recently reviewed the partnership, and it was decided that it was too costly to exit the contract. Northrop Grumman argued that Virginia did not provide it with adequate access to information that would have allowed it to create a realistic refresh schedule and budget. Virginia denied this but agreed to extend the project timeline and paid an additional $236 million to cover the hardware refresh. (Schapiro & Bacque, Agencies' computers still being restored, 2010) This was done in part for political reasons. Northrop Grumman agreed to move their
headquarters to Virginia. (Squires, 2010) Virginia hopes to create new jobs and get better service. Meanwhile, Northrop Grumman will pay out approximately $350,000 in fees due to the August 2010 outage. The lesson to be learned from this partnership contract is that it is unwise to commit to a lengthy contract. (Lee) Contract law as it relates to IT is in its infancy. There are few who understand both IT and law well enough to write or defend such a contract properly. A shorter agreement may have been best for embarking on the hardware refresh. Perhaps the hardware refresh should have been negotiated as a separate contract from the services outsourcing. At the least, an exit clause that would allow Virginia to exit the contract without risking the waste of millions in public funds would be advisable. Public safety and security are too important to place in the hands of a single provider without any recourse to correct serious issues. The contract with Northrop Grumman appears to have too much wiggle room to hold Northrop Grumman accountable for failures. For Virginians, the important concern is the implementation of corrective actions to see that this never happens again, and further that Northrop Grumman is held accountable in a manner that motivates it to stop ignoring issues raised by those charged with oversight. Northrop Grumman has a responsibility to provide high-quality services. Northrop Grumman is responsible for its vendors, employees, configurations, and processes. It must deliver resilient IT services and well-trained staff. Taxpayers should no longer pay for the
negligence of the outsourced contractor. This partnership was intended to transfer the risks of IT services to Northrop Grumman, but Virginia keeps paying without realizing the expected benefits of partnership. The best protection for Virginians may lie in contract law. Future outsourcing contracts should not favor the vendor and exploit the state. Referring to the outsourcing as a partnership may have been a good political move. However, it is important to remember that the relationship in an outsourcing situation cannot be a true partnership because business motives are not shared. (Lee) The outsourcing contract should have clearly defined service level agreements, and failure to meet these expectations should result in equally clear penalties. These penalties should have enough financial impact to ensure the vendor does not determine that paying the penalty fees makes better financial sense than providing the contracted services. The contract between Virginia and Northrop Grumman has exit penalties that are too expensive to be a feasible option to exercise. (Joint Legislative Audit and Review Commission) Virginia is effectively trapped in a bad contract with no recourse. Future outsourcing contracts must ensure that if Virginia is not receiving contracted services that provide value to its citizens, the contract can be cancelled, allowing Virginia to seek satisfactory services. (Lee) These outsourcing contract improvements can only be achieved through requirements identification, contract negotiation, and rigorous contract review prior to contract finalization. The contract review must be performed by an experienced IT
contract lawyer. It is very probable that Northrop Grumman's standard contracts provided at least the basis for the outsourcing contract. The use of vendor contracts "even as a starting point" is highly inadvisable because the contract will favor the vendor. (Lee, p. 13) This problem is illustrated in the case of the contract between Virginia and Northrop Grumman. After the contract is in effect, it must be strictly managed by the outsourcing organization. This may require the establishment of an internal IT auditing team charged with conducting ongoing service reviews of the vendor. The team should be composed of experienced IT service auditors. This will unfortunately require additional expense, but auditing activities will ensure that the outsourcing organization realizes the expected value of the contract. Therefore, the expense of maintaining an auditing team should be included in the outsourcing project costs.
4.4.3. Conclusion

Virginia's August 2010 outage provides a case study illustrating the risks of outsourcing. Virginia chose an experienced government contractor and made appropriate investments. However, it failed to negotiate a contract that provided effective recourse to enforce the contract terms. VITA also failed to complete a manual that would have provided additional leverage to enforce contract terms. (Joint Legislative Audit and Review Commission) In order to mitigate outsourcing risks, a strong, well-defined, and well-managed contract is
necessary. An experienced IT contract lawyer is recommended to negotiate and manage the outsourcing contract. The outsourcing organization must fulfill contractual obligations to effectively employ mechanisms to enforce vendor contract terms. Vigilance on the part of the outsourcing organization is required to ensure the vendor delivers quality services that meet business requirements. This means investing in auditing to ensure that the vendor is taking appropriate action to provide contracted services.
CHAPTER 5. ANALYSIS
5.1. Best Practice Triangulation

Each of these case studies highlighted strengths and weaknesses of various mitigation techniques. Tulane's investment in backup tapes paid off, but the investment in an offsite data center did not. The factor that contributed most to Tulane's recoverability was the aid provided by other universities and vendors. This type of relationship has proven very useful in sectors such as education and utilities. (Hardenbrook, 2004) Many of these types of organizations work cooperatively on a daily basis to pool resources.
5.1.1. Before-Planning

Effective planning must begin with the organization's business requirements and establish the maximum tolerable period of disruption (MTPOD), recovery time objectives (RTO), and recovery point objectives (RPO). MTPOD relates to how long the business can be "down" before the organization's viability is damaged. The case studies provide an array of tolerances, as shown in Table 5.1. Using established tolerances and objectives based on organizational characteristics provides direction in terms of what mitigation techniques to implement.
Table 5.1 Tolerance and objectives

Organization     MTPOD                RTO                  RPO
Commerzbank      More than 1 week     Less than 1 hour     Last transaction
FirstEnergy      Less than ½ hour     Less than ¼ hour     N/A
Tulane           Less than 1 month    Less than 1 week     Previous business day
Virginia         More than 1 day      Less than 1 hour     1 hour
Table 5.1 above reflects the estimated MTPOD, RTO, and RPO for each organization based on artifacts included in each case study. These estimates are open to debate; for example, Commerzbank's estimated MTPOD is listed as more than one week. One week was chosen as the point at which the viability of Commerzbank would be threatened. This tolerance was determined by looking only at the Commerzbank American division and estimating at what point customers would switch to a competitor. Any outage would be costly for Commerzbank, but a weeklong outage would damage the bank's reputation and cause attrition among customers. Customers tend to be tolerant of short outages, but when an outage impacts their ability to be profitable, they must look elsewhere. Commerzbank America has a relatively small customer base in a highly competitive sector and would therefore have difficulty recovering from customer loss. FirstEnergy provides electricity, a critical infrastructure resource; any outage will immediately inconvenience the customer base. Outages also result in lost revenue because electricity cannot be stored for later use. Extended outages strain other providers and potentially result in cascading critical service outages. There are now mandatory guidelines as well, and failure to meet these guidelines
carries strict, enforceable fines. In addition, electrical outages tend to be highly publicized and investigated, damaging the company's reputation. FirstEnergy is investor owned; therefore, outages would reduce the value of company shares. Investors sued FirstEnergy for lost revenue in the past and could potentially do so again. All of these factors were included in the half-hour estimate of MTPOD for FirstEnergy. Tulane University weathered hurricane seasons for more than a century before experiencing damage that threatened the university's viability. Review of case study artifacts revealed a repeating theme in this case: hurricanes had become routine. The general approach was to send everyone away for a few days, return and clean up when the storm passed, and get back to business as usual. This reveals that outages of "a few days" had no real impact on the organization. However, an outage of one month or more impacts the university's ability to maintain semester operations, most notably the ability to provide its primary service, education. Tulane's IT is vital to its education and research missions. Without these two activities, university income is critically impacted. In determining this MTPOD, only the university itself was considered; the university hospital was not included. Including the hospital would reduce the MTPOD to hours or less due to possible loss of life. Loss of life will not necessarily result in irrevocable viability damage to the organization, but it must be avoided at all costs and therefore would be heavily weighted.
Determining the MTPOD for the Commonwealth of Virginia is more complex. Some services, such as 911 service, are critical infrastructure and cannot be down without compromising public safety. Other services may suffer very little during an extended outage. Obviously, prison guards should never be without phone services. However, do any of these factors really damage the viability of the state? It would be very hard to argue that they do. This estimate comes down to cost and public impact, with public impact weighted most heavily. Also, the state's IT was outsourced; therefore, the impact on the viability of Northrop Grumman must be included. To date there has been little impact on Northrop Grumman, but possible contractual changes made after the conclusion of the third-party investigation may have greater impact. Recovery time objectives were based on reducing impact to the organization's ability to maintain functionality. Recovery point objectives (RPO) were based on organizational tolerance for lost data, including transactional data. FirstEnergy stands out in this group with a not applicable (N/A) rating in Table 5.1. This is based on the assumption that, for operational purposes, historical data is not critical to maintaining ongoing services. Real-time data is critical to FirstEnergy operators. Past data is important for prediction, future planning, and tool development, but loss of this data would have little operational impact, as other data sources could be utilized for the purposes mentioned. The MTPOD, RTO, and RPO provide planning direction as mentioned above. Commerzbank's tolerances and objectives make it apparent that it must
employ business continuity measures to ensure as close to zero downtime as possible. Expenditures in IT to ensure this are warranted and practical for the organization: it can afford to make the necessary investments, and downtime is far too costly. The case study artifacts reveal that Commerzbank actively works on business continuity and continually improves its IT infrastructure. Organizations in this category would be advised to avoid the delays of tape-based restores and to maintain two hot sites in an active/active cluster configuration. It is important to note that one hot site must be significantly geographically distant; the location should be in another part of the country, or in another country when possible. Tulane is a good example of an organization with all the right pieces that failed due to poor placement. Tulane had tape-based backup and recovery, which were appropriate for its budget and MTPOD. The backup data center was new and not fully complete, but location was the problem. It was near enough to be affected by incidents that affected the university, rendering it practically useless. They were lucky that the building's upper floors, where the tapes were located, were not flooded, allowing retrieval of the backup tapes. This site at the time would have been a warm site at best; strategic placement would have made it a major asset. Katrina destroyed the infrastructure of New Orleans and Tulane; it is hard to imagine how on-campus classes could have resumed. A functional emergency operations center (EOC) and backup data center could minimally have provided
student and employee records and possibly online coursework. Contingency planning should include a backup data center that allows virtual operations where possible. Virtual operations are useful in a variety of situations, such as primary site destruction, pandemics, inclement weather, and transportation interruptions. For organizations that can continue operations without the use of information technology services, investing in IT-based mitigation may not be appropriate. Be very cautious, however, in ensuring that the organization truly has no IT dependencies. Performing an exercise to walk through a mock year would help to identify dependencies. Organizations that fall into this category are likely to be very small with very few employees. Payroll and billing functions would be very simple and probably paper based. Even in these circumstances, multiple copies maintained at different locations would be advisable to prevent lost records, lost revenue, or liability issues. Organizations in this category are not representative of the average.
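The tolerances in Table 5.1 can be treated as simple planning inputs. The sketch below, which uses hypothetical figures expressed in hours, shows how an organization might flag a proposed mitigation strategy whose expected recovery time or data-loss window exceeds its stated objectives; it is a minimal illustration, not a planning tool used by any of the studied organizations.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjectives:
    """Planning tolerances, all expressed in hours."""
    mtpod: float  # maximum tolerable period of disruption
    rto: float    # recovery time objective
    rpo: float    # recovery point objective (acceptable data-loss window)


@dataclass
class MitigationStrategy:
    name: str
    expected_recovery_hours: float  # how long restoration is expected to take
    data_loss_window_hours: float   # age of the most recent recoverable copy


def gaps(objectives: RecoveryObjectives, strategy: MitigationStrategy) -> list[str]:
    """Return the objectives a proposed strategy would fail to meet."""
    findings = []
    if strategy.expected_recovery_hours > objectives.rto:
        findings.append(f"{strategy.name}: expected recovery exceeds RTO")
    if strategy.expected_recovery_hours > objectives.mtpod:
        findings.append(f"{strategy.name}: expected recovery exceeds MTPOD")
    if strategy.data_loss_window_hours > objectives.rpo:
        findings.append(f"{strategy.name}: data-loss window exceeds RPO")
    return findings


if __name__ == "__main__":
    # Hypothetical figures loosely following the Tulane row of Table 5.1:
    # MTPOD under a month, RTO under a week, RPO of the previous business day.
    tulane_like = RecoveryObjectives(mtpod=30 * 24, rto=7 * 24, rpo=24)
    tape_only = MitigationStrategy("tape restore at distant site", 5 * 24, 24)
    print(gaps(tulane_like, tape_only) or "meets stated objectives")
```

A strategy that satisfies Tulane's tolerances would clearly fail Commerzbank's, which is the point of establishing these objectives before selecting mitigation techniques.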
5.1.1.1. Staff training in recovery procedures

Staff training levels are more apparent in some of the cases than in others. For example, FirstEnergy staff was inadequately trained, and there was poor communication between operations and IT staff. There is an inherent bias in the documentation available: evidence of bad training is abundant, while evidence of good training is scarce. Therefore, less documented evidence of the quality of Commerzbank's staff training was available. However, the fact that
employees began assembling at the backup site, in the midst of the chaos and the transportation and communication problems of 9/11, is a testament to Commerzbank's preparedness training. Again, these are the two most extreme examples, but training is the difference between staff who fail to perform and staff who coolly navigate themselves to safety from just a few hundred feet away from the largest terrorist attack in U.S. history. The stress levels of the two staffs during the first phases are not comparable. In these cases, the well-trained staff, under conditions of extreme stress, performed very well with very little warning. The poorly trained staff failed to act despite many warnings and hours in which to act. FirstEnergy staff disregarded these warnings without attempting to verify the current situation.
5.1.2. During-Plan execution
5.1.2.1. Adherence to established procedures

Staff adherence to established procedures appeared to have a strong correlation with the success of continuity and recovery efforts. FirstEnergy and the Commonwealth of Virginia suffered comparatively minor incidents. These organizations could have completely averted disaster had their staff followed procedure and taken appropriate action. Figure 5.1 represents each organization's adherence to procedure on a continuum relative to one another. At the extremes of the continuum
represented in Figure 5.1 are Commerzbank and FirstEnergy. Commerzbank appears to have executed its plan flawlessly despite encountering unexpected technical difficulties. FirstEnergy failed to adhere to many industry standards and procedures.
Figure 5.1 Adherence to established procedures

The independent review of the 2003 Northeast Blackout found FirstEnergy primarily responsible for the blackout. The operators failed to respond appropriately to calls from partner operators alerting them to problems detected. Forty minutes before the outage, the operators knew the monitoring equipment was not working and still failed to take corrective action. Established internal procedures were inadequate to maintain reliable operations. FirstEnergy IT staff was aware of the problems with the EMS but did not alert the operators to the issue. This communication was not required at the time of the studied incident but was later addressed. However, the primary cause of the outage was failure to follow procedure. As a result, some areas were without power for up to a week, and FirstEnergy's board was sued for financial losses caused by negligence. Based on the Commonwealth of Virginia case study artifacts, it is apparent that ITIL standard practices were not followed. Artifacts indicate the use of ITIL for this organization; therefore, ITIL adherence was used as the basis for
placement on the continuum in Figure 5.1. ITIL specifies standards for communication during incidents and also focuses on continuity of operations. The Commonwealth of Virginia outage was still under review at the completion of this study. After the independent review is complete, adherence to established procedure may be more accurately determined. However, it is not disputable that a minor hardware problem, which was not itself an outage, was acted upon inappropriately. This resulted in a weeklong outage for some agencies and millions in reported losses. Tulane University had well-established pre-incident disaster procedures and a staff that was trained and comfortable with the procedures. They also had the luxury of knowing days ahead that the hurricane was coming. The execution went according to plan for the most part. There were critical parts of the plan left unexecuted: the payroll printer and related materials were not taken to safety. This failure further complicated the task of issuing payroll and likely added additional cost during the recovery execution process. Commerzbank's adherence to procedure saved the company millions, if not billions, in lost revenue. The transaction system never went down during the events of 9/11. Despite the loss of primary facilities and unforeseen technical issues, they were fully operational within hours. Commerzbank serves as a model for the financial sector for business continuity and disaster recovery planning. Other peer institutions never recovered from 9/11.
5.1.2.2. Chain of command structure

All of the organizations included in the study had well-established chain of command communication structures, though some were more effective than others. Both Tulane and Commerzbank experienced communication disruptions due to the magnitude of the disasters and the resulting damage to infrastructure. Commerzbank had designated call trees and an alternate location to maintain the chain of command despite communication and transportation difficulties. Katrina was so severe that the damage to New Orleans' infrastructure was prolonged and the disaster itself lasted longer. Both organizations struggled with the limitations of communication providers and overloaded cell towers. Tulane's critical staff members now carry cell phones from more than one provider and maintain local and non-local numbers to avoid future communication disruptions.

Tulane has also developed a computer security incident response plan, which follows many principles from the National Incident Management System (NIMS) (Tulane University, 2009). This plan defines roles, incident phases, and incident levels, which delineate which roles are activated (Tulane University, 2009). The plan could be adapted using NIMS to provide an incident command structure to manage cyberinfrastructure incidents, as in Figure 5.2 below. The contacts listed are cumulative; for example, if a level 3 incident were to occur, the Chief Information Officer (CIO), infrastructure director, required infrastructure staff, and process owners would all be contacted. Each activated role would have a responsibilities checklist to be used for incidents of a specific level. NASCIO has a toolkit that could be used as a template; information on where to find the NASCIO toolkit is available in Appendix A. A minimal sketch of such a cumulative, level-based contact mapping appears at the end of this subsection.

Figure 5.2 Sample IT incident command structure

Neither FirstEnergy nor the Commonwealth of Virginia experienced disruptions in their chains of command. FirstEnergy staff disregarded communications from the Midwestern coordinating operator and failed to communicate as the voluntary industry standards of the time dictated. There was no apparent deviation from the chain of command in the case of the Virginia outage, though it would be reasonable to speculate that the independent review will reveal failures to follow some communication protocols.
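The sketch below illustrates, under assumed level definitions, how such a cumulative contact structure might be recorded; the incident levels and role names are hypothetical illustrations and are not taken from Tulane's plan or the NASCIO toolkit.

# Illustrative sketch only: the level definitions and role names below are
# assumptions, not the roles defined in Tulane's plan or the NASCIO toolkit.
INCIDENT_LEVEL_ROLES = {
    1: ["required infrastructure staff"],
    2: ["infrastructure director"],
    3: ["chief information officer (CIO)", "process owners"],
}

def contacts_for(level):
    """Return the cumulative contact list for an incident of the given level.

    Contacts are cumulative: a level 3 incident activates every role
    listed for levels 1 through 3.
    """
    contacts = []
    for lvl in sorted(INCIDENT_LEVEL_ROLES):
        if lvl <= level:
            contacts.extend(INCIDENT_LEVEL_ROLES[lvl])
    return contacts

if __name__ == "__main__":
    print(contacts_for(3))  # all roles from levels 1 through 3

Each role returned by such a lookup could then be paired with its level-specific responsibilities checklist, as described above.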
5.1.2.3. Mutual aid relationships

The role of previously established relationships with vendors and partners was apparent in all of the case studies. Each continuity or recovery effort was assisted through external relationships, and the use of these relationships to provide additional resources was integral to recovery success, reducing the duration of the outage in most of the cases. The assistance Baylor provided was vital to Tulane's future viability. The relationships utilized are represented in Table 5.2 below.
Table 5.2 Aid relationships utilized during recovery

Organization     Aid Provider
Commerzbank      EMC
FirstEnergy      GE, MISO and affiliated ISOs
Tulane           Baylor and Blackboard
Virginia         EMC and Affected Agencies
5.1.3. After-Plan improvement

All of the studied organizations employed some form of post-incident evaluation to improve future response and resiliency. FirstEnergy and Virginia underwent mandatory third-party incident reviews to determine what steps were necessary to prevent future incidents. Commerzbank and Tulane were unhappy with the response and recovery provisions in place at the time of their incidents and have made changes to increase resiliency.
5.1.3.1. Recovery time and cost

5.1.3.1.1. Downtime

There is no single way to determine the cost of downtime for every organization, nor is there a simple way to determine the cost of recovery; these figures vary by sector and other organizational factors. Organizations that have experienced disaster recovery events, including those in this study, have not made the financial ramifications available to the public. Further, most of the literature and tools available to help determine these costs and the return on investment (ROI) are provided by commercial entities attempting to sell disaster recovery or business continuity solutions, and they are therefore of questionable validity. For the purposes of this study, a combination of recent studies is used for illustrative purposes.

A study commissioned by CA Technologies in 2010 claims that "the average North American organization loses over $150,000 a year through IT downtime" (CA Technologies, 2010). A 2011 Symantec survey reports median downtime costs of $3,000 per day for small businesses and $23,000 per day for medium-sized businesses. Based on these figures, it would not be financially feasible for these types of organizations to invest in high availability systems. However, the losses are still substantial, and investment in daily data backups maintained offsite would be advisable and affordable; a rough illustration of this arithmetic follows Figure 5.3 below. Investing in high availability for critical infrastructure information systems is more likely to be a good investment, as illustrated in Figure 5.3 below. Increased downtime translates into increased costs. Some sectors, such as utilities, finance, and parts of the public sector, have regulatory standards that must be met, and downtime could result in fines as well as lost revenue.
Figure 5.3 Reported average downtime revenue losses in billions (CA Technologies, 2010, p. 5)
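As a rough illustration of the arithmetic behind the survey figures above, the sketch below annualizes the reported per-day downtime costs under an assumed number of downtime days per year; the downtime-day count and the offsite-backup cost are hypothetical assumptions chosen for illustration, not values reported in the cited surveys.

# Rough illustration only. The per-day downtime costs are the medians cited
# above (Symantec, 2011); the assumed downtime days per year and the assumed
# offsite-backup cost are hypothetical values chosen for illustration.
DOWNTIME_COST_PER_DAY = {"small business": 3_000, "medium business": 23_000}

ASSUMED_DOWNTIME_DAYS_PER_YEAR = 3     # assumption, not a survey figure
ASSUMED_OFFSITE_BACKUP_COST = 5_000    # assumed yearly cost of daily offsite backups

for size, cost_per_day in DOWNTIME_COST_PER_DAY.items():
    annual_loss = cost_per_day * ASSUMED_DOWNTIME_DAYS_PER_YEAR
    print(f"{size}: estimated annual downtime loss ${annual_loss:,} "
          f"vs. assumed backup cost ${ASSUMED_OFFSITE_BACKUP_COST:,}")

Under these assumptions, even a few days of downtime per year exceeds the cost of a modest offsite backup arrangement, which is consistent with the recommendation above.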
5.1.3.1.2. Resiliency investment

As with most IT projects, determining time and cost benefits is difficult because the goal is a moving target. According to a 2010 study conducted by Forrester Research, "more and more applications are considered critical"; as a result, recovery times have increased by 1.5 hours (Dines, 2011). The average application and data classifications reported are shown in Figure 5.4 below. As organizations become more dependent upon information systems and define more applications and data as critical, the cost of resiliency rises. As tolerance for downtime decreases, the cost of resiliency also rises. Economic realities dictate that most organizations cannot maintain redundancy for all applications and data.
Figure 5.4 Reported critical applications and data classifications
There is currently no accepted standard for how much to invest in resiliency. One rule of thumb for disaster recovery investment is to earmark one week's worth of yearly revenue for mitigation (Outsource IT Needs LLC). There are many other ways to compute how much to invest in IT business continuity; most are far more complex. A 2010 Forrester study found that respondents reported committing six percent of the IT operating budget to resiliency investments (Balaouras, 2010). When creating a resiliency budget, it is important to note that many functions fall under the umbrella of IT operational resiliency, such as "security management, business continuity management, and IT operations management" (Caralli, Allen, Curtis, White, & Young, 2010).

Another factor related to resiliency investment is the probability of a disaster. The fields of insurance and economics have complex equations to determine risk and ensure profitability; those equations are outside the scope of this qualitative study. However, this study uses a 2010 Forrester market study as anecdotal evidence to provide a simplified method for calculating a spending baseline. Forrester reports that "24 percent of respondents have declared a disaster and failed over to an alternate site in the past five years"; this yields a 4.8 percent probability of experiencing a disaster requiring failover to a remote site in a given year (Dines, 2011). The average cost of downtime per hour was $145,000, and the average recovery time was reported to be 18.5 hours (Dines, 2011). Multiplying the average cost per hour by the average recovery time yields an average recovery cost of $2,682,500. Spreading the cost of a major disaster over a five-year period yields an annualized disaster cost of $536,500, and multiplying this by the 4.8 percent risk probability yields $25,752. These figures provide a range for disaster recovery investment: a minimum of $25,752 and a maximum of $536,500. The average of the two is $281,126; this figure, based on the Forrester study, represents a practical yearly investment in disaster recovery. A worked sketch of this baseline calculation follows at the end of this subsection.

An average annual budget of $281,126 is, relatively speaking, not a large investment; careful long-term planning along with integrated, iterative implementation would allow this modest yearly investment to yield substantial results over time. A five-year resiliency implementation plan would allow long-term planning to be implemented through a series of short-term goals, with an overall five-year budget of $1,405,630. The first year would likely be dedicated to reviewing organizational needs and looking for cost-effective ways to implement the resiliency plan. The following years could focus on modular implementation and on integrating resiliency into new projects.
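The baseline described above can be reproduced in a few lines; the sketch below simply restates the arithmetic from the Forrester figures cited in this subsection (Dines, 2011) and is not a general risk model.

# Reproduces the simplified spending-baseline arithmetic described above,
# using the Forrester figures cited in this subsection (Dines, 2011).
failover_rate_five_years = 0.24                       # 24% declared a disaster within five years
annual_probability = failover_rate_five_years / 5     # 0.048, i.e. 4.8% per year

cost_per_hour = 145_000                               # average downtime cost per hour
recovery_hours = 18.5                                 # average reported recovery time

recovery_cost = cost_per_hour * recovery_hours                 # $2,682,500
annualized_cost = recovery_cost / 5                            # $536,500
minimum_investment = annualized_cost * annual_probability      # $25,752
maximum_investment = annualized_cost                           # $536,500
baseline = (minimum_investment + maximum_investment) / 2       # $281,126
five_year_budget = baseline * 5                                # $1,405,630

print(f"Annual failover probability: {annual_probability:.1%}")
print(f"Average recovery cost: ${recovery_cost:,.0f}")
print(f"Investment range: ${minimum_investment:,.0f} to ${maximum_investment:,.0f} per year")
print(f"Baseline annual investment: ${baseline:,.0f}")
print(f"Five-year budget: ${five_year_budget:,.0f}")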
5.1.3.2. Findings

One best practice identified through this case study is to integrate redundant systems into daily processing functions. Commerzbank is a good example of this configuration, which it instituted after 9/11 as a result of evaluating the weaknesses of its recovery efforts. One advantage is that less human intervention is needed, so less manpower is required to recover, and response and recovery can begin immediately; in life-threatening situations, staff can focus on evacuation. Possible liability issues related to both staff and external stakeholders can also be reduced by removing the question of due diligence. Another advantage is that testing is far less disruptive because the recovery systems are already processing part of the load.

Integrating redundant systems into daily processing does not mean that all systems must be redundant. Careful planning and classification of applications and data can reduce costs; an illustrative classification sketch appears at the end of this section. For example, in the case of FirstEnergy, access to past data is not business critical, so investment in recovering that data can be reduced and lower-cost tape-based storage and recovery methods are adequate. However, the availability of real-time operations applications and data is critical to FirstEnergy's mission, and investments to support these critical functions are money well spent. A rough order-of-magnitude estimate would place such a system in the hundreds of thousands of dollars, and it could reach into the millions. Gartner reported the cost of a tier IV data center to be about $3,450 per square foot, or $34.5 million for a 10,000-square-foot data center (Cappuccio, 2010). According to Gartner, a tier IV data center would provide less than a half hour of downtime a year (Cappuccio, 2010). The risk and outage costs are high enough to justify such an investment: the losses from the 2003 blackout were widespread and totaled in the billions. FirstEnergy's investments should therefore be scaled to support business continuity and avoid outages.

The August 2010 Virginia outage is an example of an organization that made the "right" technical decisions but failed on an organizational and implementation level. A one-time implementation project will not ensure cyberinfrastructure resiliency; it is an ongoing, continuous-improvement lifecycle process. The Virginia case also provides an example of the hazards of transferring risk through outsourcing. Outsourced IT must be carefully managed and monitored, and the outsourcing party must retain the power to enforce meaningful penalties for contractual failures.

Figure 5.5 Components of a resilient system

Figure 5.5 displays the high-level conceptual relationships among the components of a resilient system. Each triangle in the figure can be further broken down to reflect component relationships; for example, training falls within failover testing and plan updates, and disaster recovery and business continuity plans would include business impact analysis, categorization of data and applications, and recovery sequence. These relationships hold whether IT is maintained internally or externally, and organizations must be vigilant to ensure that each component is rigorously maintained.
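The sketch below shows one way such an application and data classification might be recorded; the tier names, recovery approaches, and example entries are illustrative assumptions in the spirit of the FirstEnergy discussion, not any studied organization's actual classifications.

# Illustrative sketch only: the tiers, recovery approaches, and example
# entries are assumptions, not any studied organization's actual plan.
CLASSIFICATION_TIERS = {
    "mission critical": {
        "recovery_approach": "integrated redundant systems carrying part of the daily load",
        "tolerable_downtime": "minutes",
    },
    "important": {
        "recovery_approach": "warm standby restored from recent offsite replicas",
        "tolerable_downtime": "hours",
    },
    "archival": {
        "recovery_approach": "lower-cost offsite tape backup",
        "tolerable_downtime": "days",
    },
}

# Hypothetical entries: real-time operations data is mission critical,
# while historical data can tolerate slower, cheaper recovery.
EXAMPLE_SYSTEMS = {
    "real-time grid operations applications and data": "mission critical",
    "historical operational data": "archival",
}

for system, tier in EXAMPLE_SYSTEMS.items():
    plan = CLASSIFICATION_TIERS[tier]
    print(f"{system}: {tier} -> {plan['recovery_approach']} "
          f"(tolerable downtime: {plan['tolerable_downtime']})")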
CHAPTER 6. CONCLUSION
The research question this study endeavored to answer is "What are best practices before, during, and after disaster recovery execution?" The multiple case study best practice analysis indicates that, in successful organizations, disaster recovery is one part of an iterative business continuity process. This process is broken into three distinct phases: before, during, and after disaster recovery execution. Strategic planning occurs during the before phase; this planning includes determining the MTPOD, RTO, and RPO to help determine appropriate investment. Training and rigorous testing occur in the before phase as well. Best practice during the disaster recovery execution phase includes effective management, organization, and execution of business continuity and disaster recovery plans; adherence to policy, the chain of command, and the utilization of aid relationships are important elements of this phase. In the after, post-recovery phase, best practice involves reviewing the situation and the response to identify areas that need improvement. The after phase not only helps plan future mitigation but also identifies supporting government policy needs in critical infrastructure sectors. The iterative cycle then begins again in the before planning phase as improvement implementation begins.
The purpose of this research was "to bridge the gap of unmet" cyberinfrastructure resiliency needs. An assumption was made that the high cost of implementation was the most significant barrier. While this may be true, surprisingly, the two most avoidable disasters were not caused by any direct lack of funds. Virginia had already allocated the funds; the company to which it outsourced failed to meet contracted requirements. FirstEnergy is a Fortune 500 company and was one prior to the 2003 Northeast Blackout; it is therefore unlikely that a lack of available funds was a contributing factor. In these two cases, it is arguable that a lack of management oversight and urgency, rather than finances, explains any shortfall in allocated funds. Both organizations understood and implemented backup equipment but failed to ensure that all mitigation measures were followed. Commerzbank and Tulane are veterans in dealing with adversity; the weaknesses revealed in each appeared to be a failure to fully understand the complexities of the recovery process in a truly catastrophic loss.

The real barrier appears to be an inability to admit that large-scale disasters do and will happen. Unfortunately, this cognitive avoidance is part of human nature. Some large-scale disasters are caused by catastrophes and others by human error; both are illustrated in this study. Required preparations must be strictly enforced or mandated by the government to induce compliance: car insurance, home insurance, and retirement savings are all forgone by most people unless they are mandatory. The extensive complexity of data center and information systems is difficult to grasp; add to this the tendency to disregard possible calamity, and the end result is a crumbling, tenuous cyberinfrastructure. Regulation may be far off because of the specialized workforce required to audit information systems.
CHAPTER 7. FUTURE RESEARCH
This study has revealed areas in need of further research. Two are well-known issues with ongoing research: educating a workforce capable of managing critical information systems, and increasing resiliency by eliminating dependence upon third parties for power. Hydrogen fuel cells and solar power continue to leap forward and may provide the power required to create grid-independent data centers; sustainability and reduced power consumption are also required to build independent data centers.

Another area requiring further research is providing understandable business continuity investment guidance based on the probability of a major disaster and the cost of such events. Tables based on industry, size, and location would be particularly useful in determining appropriate spending. Current spending appears to be based on confidence, fear, or sales pitches; a rational, fact-based method would allow this information to be presented to a CFO in a meaningful manner.

Lastly, the area of cyberinfrastructure policy needs to be investigated. The current, mostly unregulated IT climate values speed over safety. Companies must move fast to push out the next new product, and little time is spent on ensuring security and resiliency. This will likely continue until minimum regulations are in place. Such policies, and the ability to enforce them, would be very helpful to organizations that want to be secure and resilient but are struggling with vendors. Related to this is IT contract law; this field desperately needs to be brought to maturity to protect organizational investments.
BIBLIOGRAPHY
Alumni Affairs, Tulane University. (2008 29-9). Tulane University: Renaissance. Retrieved 10 27-12 from issue: http://issue.com/thebooksmithgroup/docs/tulane Anthes, G. (2008 31-3). Disaster survivor: Tulane's people priority. Retrieved 10 17-12 from Compuerworld: http://www.computerworld.com/s/article/print/314109/Tulane_University Associated Press. (06 20-1). FirstEnergy to pay $28M fine, saying workers hid damage. Retrieved 11 5-1 from USA Today: http://www.usatoday.com/news/nation/2006-01-20-nuke-plant-fine_x.htm Associated press. (03 19-11). Investigators pin origin of Aug 2003 blackout on FirstEnergy failures . Retrieved 11 6-1 from Windcor Power Systems. Availability Digest. (2009 7). Commerzbank Survives 9/11 with OpenVMS Clusters. Retrieved 11 3-1 from Availability Digest: http://www.availabilitydigest.com/public_articles/0407/commerzbank.pdf Availability Digest. (2010 10). The State of Virginia – Down for Days. Retrieved 2010 8-11 from www.availabilitydigest.com: http://www.availabilitydigest.com/public_articles/0510/virginia.pdf Balaouras, S. (2010 2-9). Business Continuity And Disaster Recovery Are Top IT Priorities For 2010 And 2011 Six Percent Of IT Operating And Capital Budgets Goes To BC/DR. Retrieved 2011 7-2 from Forrester.com: http://www.forrester.com/rb/Research/business_continuity_and_disaster_r ecovery_are_top/q/id/57818/t/2 Balaouras, S. (2008 Winter). The State of DR Preparedness. Retrieved 6 29-6 from Disaster Recovery Journal: http://www.drj.com/index.php?Itemid=10&id=794&option=com_content&ta sk=view Barovik, H., Bland, E., Nugent, B., Van Dyk, D., & Winters, R. (2001 26-11). For The Record Nov. 26, 2001. Retrieved 11 13-1 from Time: http://www.time.com/time/magazine/article/0,9171,1001334,00.html
Barron, J. (2003 15-8). Power Surge Blacks Out Northeast. Retrieved 2009 2-11 from New York Times: http://www.nytimes.com/2003/08/15/nyregion/15POWE.html Barron, J. (2003 15-8). Power Surge Blacks Out Northeast. Retrieved 2009 2-11 from The New York Times: http://www.nytimes.com/2003/08/15/nyregion/15POWE.html Blackboard Inc. (2008 24-10). Blackboard & Tulane University. Retrieved 10 2712 from Blaceboard: http://www.blackboard.com/CMSPages/GetFile.aspx?guid=39a0b112221d-4d04-be80-f2024d16943a Brown, K. (2008 1-2). House No. 3 Rises for URBANbuild. Retrieved 2011 2-1 from Tulane University New Wave: http://tulane.edu/news/newwave/020108_urbanbuild.cfm CA Technologies. (2010 11). The Avoidable Cost of Downtime. Retrieved 2011 28-1 from Arcserve: http://arcserve.com/us/~/media/Files/SupportingPieces/ARCserve/avoidab le-cost-of-downtime-summary.pdf Cappuccio, D. (2010 17-3). Extend the Life of Your Data Center, While Lowering Costs. Retrieved 2011 28-1 from Gartner: http://www.gartner.com/it/content/1304100/1304113/march_18_extend_lif e_of_data_center_dcappuccio.pdf Caralli, R., Allen, J., Curtis, P., White, D., & Young, L. (2010 5). CERT® Resilience Management Model, Version 1.0 Process Areas, Generic Goals and Practices, and Glossary. Hanscom AFB, MA. Charette, R. (2010 31-8). Virginia's Continuing IT Outage Creates Political Fireworks. Retrieved 2010 6-11 from IEEE Specrtum: http://spectrum.ieee.org/riskfactor/computing/it/virginias-continuing-itoutage-creates-political-fireworks Clinton Administration. (1998 22-5). The Clinton Administration's Policy on Critical Infrastructure Protection: Presidential Decision Directive 63. Retrieved 2010 2-5 from Computer Security Resource Center National Institute of Standards and Technology Federal Requiements: http://csrc.nist.gov/drivers/documents/paper598.pdf Collett, S. (2007 4-12). Five Steps to Evaluating Business Continuity Services. Retrieved 2009 9-11 from CSOonline.com: http://www.csoonline.com/article/221306/Five_Steps_to_Evaluating_Busin ess_Continuity_Services
Comptroller of the city of New York. (02 04-9). One Year Later, The Fiscal Impact of 9/11 on New York City. Retrieved 11 13-1 from The New York City Comptroller's Office: http://www.comptroller.nyc.gov/bureaus/bud/reports/impact-9-11-yearlater.pdf Cowen. (n.d.). Letter to students. Retrieved 2010 27-12 from Tulane University: http://renewal.tulane.edu/students_undergraduate_cowen2.shtml Cowen, S. (05 8-12). Messages for Students . Retrieved 10 27-12 from Tulane.edu: http://www.tulane.edu/students.html Cowen, S. (05 2-9). Messages for Students . Retrieved 2010 27-12 from Tulane University : http://www.tulane.edu/studentmessages/september.html Cowen, S. (05 3-9). Student Messages. Retrieved 10 27-12 from Tulane University: http://www.tulane.edu/studentmessages/september.html Cowen, S. (05 8-9). Student Messages. Retrieved 10 27-12 from Tulane University: http://www.tulane.edu/studentmessages/september.html Cowen, S. (2005 14-9). Student Messages. Retrieved 10 27-12 from Tulane University: http://www.tulane.edu/studentmessages/september.html DeCay, J. (2007 3-5). Advising Students After An Extreme Crisis: Assisting Katrina Survivors. Retrieved 10 17-12 from Dallas County Community College District: http://www.dcccd.edu/sitecollectiondocuments/dcccd/docs/departments/do /eduaff/transfer/conference/conference_cvc.pdf Denial-of-service attack. (n.d.). Retrieved 2009 2-11 from Wikipedia: http://en.wikipedia.org/wiki/Denial-of-service_attack Dines, R. (2011). Market Study The State of Disaster Recovery Preparedness. (R. Arnold, Ed.) Disaster recovery Journal , 24 (1), 12-22. Editorial Staff of SearchStorage.com. (2002 6-3). Bank avoids data disaster on Sept. 11. Retrieved 11 3-1 from SearchStorage.com: http://searchstorage.techtarget.com/tip/0,289483,sid5_gci808783,00.html Egenera. (2006). Case Study: Commerzbank North America. Retrieved 2011 3-1 from Egenera: www.egenera.com/1157984790/Link.htm
Electricity Consumers Resource Council (ELCON) . (2004 09-02). The Economic Impacts of the August 2003 Blackout . Retrieved 2009 02-11 from ELCON: http://www.elcon.org/Documents/EconomicImpactsOfAugust2003Blackout .pdf EMAC. (n.d.). The History of Mutual Aid and EMAC. Retrieved 2011 20-2 from EMAC: http://www.emacweb.org/?321 FEMA. (n.d.). Incident Command System (ICS). Retrieved 2011 20-2 from FEMA: http://www.fema.gov/emergency/nims/IncidentCommandSystem.shtm FEMA. (2006 30-11). Private Sector NIMS Implementation Activities. From http://www.fema.gov/pdf/emergency/nims/ps_fs.pdf FirstEnergy. (08 27-2). Company history. From FirstEnergy: http://www.firstenergycorp.com/corporate/Corporate_Profile/Company_His tory.html FirstEnergy. (09 27-2). Corporate profile. Retrieved 11 5-1 from FirstEnergy: http://www.firstenergycorp.com/corporate/Corporate_Profile/index.html Forrester, E. C., Buteau, B. L., & Shrum, S. (2009). Service Continuity: A Project Management Process Area at Maturity Level 3. In E. C. Forrester, B. L. Buteau, & S. Shrum, CMMI® for Services: Guidelines for Superior Service (pp. 507-523). Boston, MA: Addison-Wesley Professional. Fortune. (10 3-5). Fortune 500. Retrieved 11 5-1 from CNNMoney.com: http://www.firstenergycorp.com/corporate/Corporate_Profile/Company_His tory.html From Reuters and Bloomberg News. (03 19-8). FirstEnergy Shares Fall After Blackout. Retrieved 11 6-1 from Los Angeles Times: http://articles.latimes.com/2003/aug/19/business/fi-wrap19.1 Gerace, T., Jean, R., & Krob, A. (2007). Decentralized and centralized it support at Tulane University: a case study from a hybrid model. In Proceedings of the 35th annual ACM SIGUCCS fall conference (SIGUCCS '07). New York: ACM. Grose, T., Lord, M., & Shallcross, L. (2005 11). Down, but not out. Retrieved 2010 28-12 from ASEE PRISM: http://www.prismmagazine.org/nov05/feature_katrina.cfm
Gulf Coast Presidents. (2005). Gulf Coast Presidents Express Thanks, Urge Continued Assistance . Retrieved 10 27-12 from Tulane University: http://www.tulane.edu/ace.htm Hardenbrook, B. (2004 8-9). Infrastructure Interdependencies Tabletop Exercise BLUE CASCADES II. Seattle, WA. Hewlett-Packard. (2002 7). hp AlphaServer technology helps Commerzbank tolerate disaster on September 11. Retrieved 11 3-1 from hp.com: http://h71000.www7.hp.com/openvms/brochures/commerzbank/commerzb ank.pdf?jumpid=reg_R1002_USEN Homeland Security. (2009 8). Information Technology Sector Baseline Risk Assesment. Retrieved 2010 17-5 from Homeland Security: http://www.dhs.gov/xlibrary/assets/nipp_it_baseline_risk_assessment.pdf Homeland Security. (2009 August). Information Technology Sector Baseline Risk Assessment. Retrieved 2010 17-5 from Homeland Security: http://www.dhs.gov/xlibrary/assets/nipp_it_baseline_risk_assessment.pdf Internet Security Alliance (ISA)/American National Standards Institute (ANSI). (2010). The Financial Management of Cyber Risk An Implementation Framework for CFOs. USA: Internet Security Alliance (ISA)/American National Standards Institute (ANSI). Jackson, C. (2011 2). California’s Mutual Aid System Provides Invaluable Support During San Bruno Disaster. Retrieved 2011 20-2 from Western City: http://www.westerncity.com/Western-City/February-2011/Californiarsquos-Mutual-Aid-System-Provides-Invaluable-Support-During-SanBruno-Disaster/ Jesdanun, A. (04 12-2). Software Bug Blamed For Blackout Alarm Failure. Retrieved 11 6-1 from CRN: http://www.crn.com/news/security/18840497/software-bug-blamed-forblackout-alarm-failure.htm?itc=refresh Joint Legislative Audit and Review Commission. (2009 2009-13). Review of Information Technology Services in Virginia. Retrieved 2010 05-11 from http://jlarc.state.va.us/: jlarc.state.va.us/meetings/October09/VITA.pdf Kantor, A. (2005 8-9). Technology succeeds, system fails in New Orleans. Retrieved 11 2-1 from USA Today: http://www.usatoday.com/tech/columnist/andrewkantor/2005-09-08katrina-tech_x.htm
Krane, N. K., Kahn, M. J., Markert, R. J., Whelton, P. K., Traber, P. G., & Taylor, I. L. (2007 8). Surviving Hurricane Katrina: Reconstructing the Educational Enterprise of Tulane University School of Medicine. Retrieved 10 17-12 from Academic Medicine: http://journals.lww.com/academicmedicine/Fulltext/2007/08000/Surviving_ Hurricane_Katrina__Reconstructing_the.4.aspx Kravitz, D. (2010 28-8). Statewide computer meltdown in Virginia disrupts DMV, other government business. Retrieved 2010 6-11 from The Washington Post : http://www.washingtonpost.com/wpdyn/content/article/2010/08/27/AR2010082705046.html Kravitz, D., & Kumar, A. (2010 31-8). Virginia DMV licensing services will be stalled until at least Wednesday. Retrieved 2010 6-11 from Washingtonpost.com: http://www.washingtonpost.com/wpdyn/content/article/2010/08/30/AR2010083004877.html Kumar, A., & Helderman, R. (2009 14-10). Outsourced $2 Billion Computer Upgrade Disrupts Va. Services. Retrieved 2010 6-11 from Washingtonpost.com : http://www.washingtonpost.com/wpdyn/content/article/2009/10/13/AR2009101303044.html Lawson, J. (05 9-12). A Look Back at a Disaster Plan: What Went Wrong and Right. Retrieved 10 28-12 from The Chronicle of Higher Education: http://chronicle.com/article/A-Look-Back-at-a-Disaster/10664 Lawson, J. (2005 9-12). Katrina and Tulane: a Timeline. Retrieved 12 2010-17 from The Chronicle of Higher Education : http://chronicle.com/article/KatrinaTulane-a-Timeline/21840 Lee. (1996). IT outsourcing contracts: practical issues for management. Industrial Management & Data Systems , 96 (1), 15 - 20. Lee, J., & Kim, Y. (1999). Effect of Partnership Quality on IS Outsourcing Sucess: Conceptual Freamwork anf Empirical Validation. Journal of Management Information Systems , 15 (4), 29-61. Lewis, B. (n.d.). Massive Computer Outage Halts Some Va. Agencies. Retrieved 2010 5-11 from HamptonRoads.com: http://hamptonroads.com/print/566771 Lord, M. (2008 11). WHEN DISASTER STRIKES Recovering from Katrina’s damage, two New Orleans engineering schools make emergency preparation a priority. Retrieved 10 28-12 from ASEE PRISM: http://www.prism-magazine.org/nov08/feature_03.cfm#top
Massachusetts Institute of Technology Information Security Office . (1995). MIT BUSINESS CONTINUITY PLAN. Retrieved 2010 17-5 from Information Services and Technology : http://web.mit.edu/security/www/pubplan.htm McIntyre, D. A. (2009 2-9). Gmail's outage raises new concern about the Net's vulnerability. Retrieved 2009 25-11 from Newsweek: http://www.newsweek.com/id/214760 McLennan, K. (2006). Selected Distance Education Disaster Planning Lessons Learned From Hurricane Katrina . Retrieved 10 28-12 from Online Journal of Disatnce Learning Administration: http://www.westga.edu/~distance/ojdla/winter94/mclennan94.htm Mears, J., Connor, D., & Martin, M. (02 2-9). What has changed. Retrieved 11 41 from Network World. Merschoff, E. (05 21-4). EA-05-071 - Davis-Besse (FirstEnergy Nuclear Operating Company). Retrieved 11 5-1 from USNRC: http://www.nrc.gov/reading-rm/doccollections/enforcement/actions/reactors/ea05071.html Michigan State University Disaster Recovery Planning . (n.d.). Planning Guide. Retrieved 2010 17-5 from Michigan State University Disaster Recovery Planning : http://www.drp.msu.edu/Documentation/StepbyStepGuide.htm Midwest ISO. (n.d.). About Us. Retrieved 2011 28-3 from Midwest ISO: http://www.midwestmarket.org/page/About%20Us Minkel, J. (08 13-8). The 2003 Northeast Blackout--Five Years Later. Retrieved 11 6-1 from Scientific American: http://www.scientificamerican.com/article.cfm?id=2003-blackout-fiveyears-later NASA. (2008 3). Powerless. Retrieved 2011 6-1 from Process Based Mission Assurance NASA Safety Center: http://pbma.nasa.gov/docs/public/pbma/images/msm/PowerShutdown_sfc s.pdf New York Independent System Operator. (2005 2). ISO. Retrieved 2010 17-3 from http://www.nyiso.com/public/webdocs/newsroom/press_releases/2005/bla ckout_rpt_final.pdf News Report. (2010 1-9). Northrop Grumman Vows to Find Cause of Virginia Server Meltdown as Fix Nears. Retrieved 2010 6-11 from Government Technology: http://www.govtech.com/policy-management/102482209.html
News Report. (2010 30-8). Work Continues on 'Unprecedented' Computer Outage in Virginia . Retrieved 2010 6-11 from Government Technology: http://www.govtech.com/security/102485974.html Nixon, S. (2010 13-11). VITA Briefing. Retrieved 2010 7-11 from www.vita.virginia.gov: http://www.vita.virginia.gov/uploadedFiles/091310_JLARC_Final.pdf Outsource IT Needs LLC. (n.d.). How Much Should You Spend on Disaster Recovery? Calculating the Value of Business Continuity. Retrieved 2011 7-2 from Outsource IT Needs, LLC: http://outsourceitneeds.com/DisasterRecovery.pdf Oversight and Investigations Subcommittee of the House Committee on Energy and Commerce. (2007 1-8). Testimony of M.L. Lagarde, III . Retrieved 2010 27-12 from Committee on Energy and Commerce: http://energycommerce.house.gov/images/stories/Documents/Hearings/P DF/110-oi-hrg.080107.Lagarde-Testimony.pdf Parris, K. (n.d.). Using OpenVMS Clusters for Disaster Tolerance. Retrieved 11 3-1 from hp.com: http://h71000.www7.hp.com/openvms/journal/v1/disastertol.pdf?jumpid=re g_R1002_USEN Parris, K. (2010). Who Survives Disasters and Why, Part 2: Organizations. Retrieved 11 3-1 from www2.openvms.org/kparris/: http://www2.openvms.org/kparris/Bootcamp_2010_Disasters_Part2_Orga nizations.pdf Patterson, D., Brown, A., Broadwell, P., Candea, G., Chen, M., Cutler, J., et al. (2002). Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies. Computer Science Technical Report, Computer Science Division, University of California at Berkeley, Computer Science Department, Mills College and Stanford University; IBM Research, Berkeley. Petersen, R. (2009 9). Protecting Cyber Assets. Retrieved 2010 15-6 from EDUCAUSE Review: http://www.educause.edu/EDUCAUSE%2BReview/EDUCAUSEReviewMa gazineVolume44/ProtectingCyberAssets/178440
Scalet, S. D. (2002 1-9). IT Executives From Three Wall Street Companies Lehman Brothers, Merrill Lynch and American Express - Look Back on 9/11 and Take Stock of Where They Are Now. Retrieved 2009 9-11 from CIO: http://www.cio.com/article/31295/IT_Executives_From_Three_Wall_Street _Companies_Lehman_Brothers_Merrill_Lynch_and_American_Express_L ook_Back_on_9_11_and_Take_Stock_of_Where_They_Are_Now?page= 3&taxonomyId=1419 Scalet, S. D. (2002 01-09). IT Executives From Three Wall Street CompaniesLehman Brothers, Merrill Lynch and American Express-Look Back on 9/11 and Take Stock of Where They Are Now . Retrieved 2009 09-11 from CIO: http://www.cio.com/article/31295/IT_Executives_From_Three_Wall_Street _Companies_Lehman_Brothers_Merrill Lynch_and _American_Express_Look_Back_on_9_11_and _Take_Stock_of_Where_They _Are_Now?page=3&taxonomyId=1419 Schaffhauser, D. (2005 21-10). Disaster Recovery: The Time Is Now. Retrieved 2010 17-12 from Campus Technology: http://campustechnology.com/articles/2005/10/disaster-recovery-the-timeis-now.aspx Schapiro, J., & Bacque, P. (2010 28-08). Agencies' computers still being restored. Retrieved 2010 5-11 from Richmond Times-Dispatch: http://www2.timesdispatch.com/member-center/share-this/print/ar/476845/ Schapiro, J., & Bacque, P. (2010 3-9). Northrop Grumman regrets computer outage. From Richmond Times-Dispatch: http://www2.timesdispatch.com/news/state-news/2010/sep/03/vita03-ar485147/ Schapiro, J., & Bacque, P. (2010 2-9). Update: McDonnell lays out concerns to Northrop Grumman. Retrieved 2010 8-11 from Richmond Times-Dispatch: http://www2.timesdispatch.com/news/2010/sep/02/10/vita02-ar-483821/ Schellenger, D. (2010). Dealing with ther Personal Dimention of BC/DR. Disaster Recovery Journal , 23 (2). Scherr, I., & Bartz, D. (2010 3-2). U.S. unveils cybersecurity safeguard plan. Retrieved 2010 30-6 from Reuters: http://www.reuters.com/article/idUSTRE62135H20100302 Scherr, I., & Bartz, D. (2010 2-3). U.S. unveils cybersecurity safeguard plan. Retrieved 2010 13-4 from Reuters: http://www.reuters.com/article/idUSTRE62135H20100302
Schwartz, S., Li, W., Berenson, L., & Williams, R. (2002 11-9). Deaths in World Trade Center Terrorist Attacks --- New York City, 2001. Retrieved 11 13-1 from CDC: http://www.cdc.gov/mmwr/preview/mmwrhtml/mm51spa6.htm Searle, N. (2007). Baylor College of Medicine's Support of Tulane University School of Medicine Following Hurricane Katrina. Retrieved 2010 17-12 from Academic Medicine: http://journals.lww.com/academicmedicine/Fulltext/2007/08000/Surviving_ Hurricane_Katrina__Reconstructing_the.4.aspx Slater, D. (2009 28-10). Business Continuity and Disaster Recovery Planning: The Basics. Retrieved 2009 9-11 from CSOonline.com: http://www.csoonline.com/article/204450/Business_Continuity_and_Disast er_Recovery_Planning_The_Basics Squires, P. (2010 2-9). Northrop Grumman to pay for cost of independent review. Retrieved 2010 8-11 from virginiabusiness.com: http://www.virginiabusiness.com/index.php/news/article/northropgrumman-to-pay-for-cost-of-independent-review/ Stewart, L. (2006 10-10). VITA Update to JLARC. Retrieved 2010 5-11 from www.vita.virginia.gov: jlarc.state.va.us/meetings/October06/VITA.pdf Swanson, A., Bowen, P., Wohl Phillips, A., Gallup, D., & Lynes, D. (2010 5). Contingency Planning Guide for Federal Information Systems. NIST Special Publication 800-34, Revision 1 . Gaithersburg, MD. Swanson, M., Wohl, A., Pope, L., Grance, T., Hash, J., & Thomas, R. (2002 June). Contingency Planning Guide for Information Technology Systems Recommendations of the National Institute of Standards and Technology NIST Special Publication 800-34. Retrieved 2010 27-5 from Computer Security Division Computer Resource Center National National Institute of Standards and Technology: http://csrc.nist.gov/publications/nistpubs/80034/sp800-34.pdf Testa, B. (2006 8). In Katrina’s Wake: Intensive Care for an Institution. Retrieved 2010 17-12 from Workforce Management: http://www.workforce.com/section/recruiting-staffing/archive/featurekatrinas-wake-intensive-care-institution/244929.html The Clinton Administration’s Policy on Critical Infrastructure Protection: Presidential Decision Directive 63. (1998 22-5). Retrieved 2010 02-05 from Computer Security Resource Center National Institute of Standards and Technology Federal Requirements: http://csrc.nist.gov/drivers/documents/paper598.pdf
The New York Times Company. (04 29-7). FirstEnergy settles suits related to blackout. Retrieved 11 13-1 from NYTimes.com: NYTimes.com The Virginia Information Technology Infrastructure Partnership. (n.d.). The Virginia Information Technology Infrastructure Partnership ANNUAL REPORT Improving Technology and Wiring Virginia for the 21st Century July 1, 2006, through June 30, 2007. Retrieved 2010 6-11 from www.vita.virginia.gov: http://www.vita.virginia.gov/uploadedFiles/IT_Partnership/ITP2007Annual Report.pdf Thibodeau, P., & Mearian, L. (2005 9-12). After Katrina, users start to weigh long-term IT issues. Retrieved 12 2010-15 from Computerworld: http://www.computerworld.com/s/article/104542/After_Katrina_users_start _to_weigh_long_term_IT_issues Tulane University. (n.d.). About Tulane. Retrieved 10 29-12 from Tulane University: http://tulane.edu/about/ Tulane University. (2009 2009-2). Ellen DeGeneres to Headline 'Katrina Class' Commencement. Retrieved 1 2010-2 from Tulane Admission: http://admission.tulane.edu/livecontent/news/34-ellen-degeneres-toheadline-katrina-class.html Tulane University. (09 3). Tulane University Computer Incident Response Plan Part of Technology Services Disaster Recovery Plan. Retrieved 2011 20-2 from Information Security @ Tulane: http://security.tulane.edu/TulaneComputerIncidentResponsePlan.pdf U.S. Department of Transportation. (n.d.). iFlorida Model Deployment Final Evaluation Report. Retrieved 2009 24-10 from http://ntl.bts.gov/lib/31000/31000/31051/14480.htm U.S.-Canada Power System Outage Task Force. (2004 April). Final Report on the August 14, 2003 Blackout in the United State and Canada: Causes and Recommendations. From https://reports.energy.gov Virginia Community College. (1998 25-3). Virginia Community College Utility Data Center Contingency Management/Disaster Recovery Plan. Retrieved 2009 9-11 from Virginia Community College: http://helpnet.vccs.edu/NOC/Mainframe/drplan.htm VITA. (n.d.). Information Technology Infrastructure Library (ITIL). Retrieved 2010 8-11 from www.vita.virginia.gov: http://www.vita.virginia.gov/library/default.aspx?id=545
VITA. (n.d.). Information Technology Investment Board (ITIB). Retrieved 2010 63 from www.vita.virginia.gov: http://www.vita.virginia.gov/ITIB/ VITA. (2010 1-11). Network News. Retrieved 11 13-1 from Vita: http://www.vita.virginia.gov/communications/publications/networknews/def ault.aspx?id=12906 VITA. (2007 1-7). Network News Volume 2, Number 7 From the CIO. Retrieved 2010 6-11 from www.vita.virginia.gov: http://www.vita.virginia.gov/communications/publications/networknews/def ault.aspx?id=3594 VITA. (2010 1-6). Network News Volume 5, Number 6 . Retrieved 2010 27-11 from www.vita.virginia.gov: http://www.vita.virginia.gov/communications/publications/networknews/def ault.aspx?id=12080 Wikan, D. (2010 13-9). Northrop Grumman to pay for computer outage investigation. Retrieved 2010 7-11 from www.wvec.com: http://www.wvec.com/news/local/Northrop-Grumman-to-pay-for-computeroutage-investigation-102796459.html
APPENDICES
Appendix A. Recommended Resources

NASCIO IT Disaster Recovery and Business Continuity Tool-kit: Planning for the Next Disaster
http://www.nascio.org/publications/documents/NASCIO-DRToolKit.pdf
This easy-to-follow, workbook-style 14-page document covers before, during, and after best practices.

Carnegie Mellon Computer Emergency Response Team Resilience Management Model
http://www.sei.cmu.edu/library/abstracts/reports/10tr012.cfm
This detailed 259-page document covers resiliency management from a cross-disciplinary perspective. It includes best practices and CMMI-based generic goals and objectives to guide the process of planning and implementing operational resiliency.

FEMA's Emergency Management Institute
http://www.training.fema.gov/IS/
Free online courses that provide testing and certificates of subject proficiency. Covers a variety of topics such as emergency management, workplace violence, and preparedness.
Appendix B. NASCIO IT Disaster Recovery and Business Continuity Tool-kit: Planning for the Next Disaster
NASCIO: Representing Chief Information Officers of the States
IT Disaster Recovery and Business Continuity Tool-kit: Planning for the Next Disaster

Without the flow of electronic information, government comes to a standstill. When a state's data systems and communication networks are damaged and its processes disrupted, the problem can be serious and the impact far-reaching. The consequences can be much more than an inconvenience. Serious disruptions to a state's IT systems may lead to public distrust, chaos and fear. It can mean a loss of vital digital records and legal documents. A loss of productivity and accountability. And a loss too of revenue and commerce. Disasters that shut down a state's mission critical applications for any length of time could have devastating direct and indirect costs to the state and its economy that make considering a disaster recovery and business continuity plan essential. State Chief Information Officers (CIOs) have an obligation to ensure that state IT services continue in the state of an emergency. The good news is that there are simple steps that CIOs can follow to prepare for Before, During and After an IT crisis strikes. Is your state ready?
Disaster Recovery Planning 101 Disaster recovery and business continuity planning provides a framework of interim measures to recover IT services following an emergency or system disruption. Interim measures may include the relocation of IT systems and operations to an alternate site, the recovery of IT functions using alternate equipment, or execution of agreements with an outsourced entity.
IT systems are vulnerable to a variety of disruptions, ranging from minor short-term power outages to more-severe disruptions involving equipment destruction from a variety of sources such as natural disasters or terrorist actions. While many vulnerabilities may be minimized or eliminated through technical, management, or operational solutions as part of the state's overall risk management effort, it is virtually impossible to completely eliminate all risks. In many cases, critical resources may reside outside the organization's control (such as electric power or telecommunications), and the organization may be unable to ensure their availability. Thus effective disaster recovery planning, execution, and testing are essential to mitigate the risk of system and service unavailability. Accordingly, in order for disaster recovery planning to be successful, the state CIO's office must ensure the following:
1. Critical staff must understand the IT disaster recovery and business continuity planning process and its place within the overall Continuity of Operations Plan and Business Continuity Plan process.
2. Develop or re-examine disaster recovery policy and planning processes including preliminary planning, business impact analysis, alternate site selection, and recovery strategies.
3. Develop or re-examine IT disaster recovery planning policies and plans with emphasis on maintenance, training, and exercising the contingency plan.

NASCIO Staff Contact: Drew Leatherby, Issues Coordinator, [email protected]

NASCIO represents state chief information officers and information technology executives and managers from state governments across the United States. For more information visit www.nascio.org. Copyright © 2007 NASCIO. All rights reserved. 201 East Main Street, Suite 1405, Lexington, KY 40507. Phone: (859) 514-9153 Fax: (859) 514-9166. Email: [email protected]
How to Use the Tool-kit

This tool-kit represents an updated and expanded version of business continuity and disaster preparedness checklists utilized for a brainstorming exercise at the "CIO-CLC Business Continuity/Disaster Recovery Forum" at NASCIO's 2006 Midyear Conference. This expanded tool-kit evolved from the work of NASCIO's Disaster Recovery Working Group, www.NASCIO.org/Committees/DisasterRecovery. Along with NASCIO's DVD on disaster recovery, "Government at Risk: Protecting Your IT Infrastructure" (view video or place order at www.NASCIO.org/Committees/DisasterRecovery/DRVideo.cfm), these checklists and accompanying group brainstorming worksheets will serve as a resource for state CIOs and other state leaders to not only better position themselves to cope with an IT crisis, but also to help make the business case for disaster recovery and business continuity activities in their states.

The tool-kit is comprised of six checklists in three categories that address specific contingency planning recommendations to follow Before, During and After a disruption or crisis situation occurs. The Planning Phase, Before the disaster, describes the process of preparing plans and procedures and testing those plans to prepare for a possible network failure. The Execution Phase, During the disaster, describes a coordinated strategy involving system reconstitution and outlines actions that can be taken to return the IT environment to normal operating conditions. The Final Phase, After the disaster, describes the transitions and gap analysis that takes place after the disaster has been mitigated. The tool-kit also provides an accompanying group activity worksheet, "Thinking Sideways," to assist in disaster recovery planning sessions with critical staff.

IT Disaster Recovery and Business Continuity Checklists

Before the Crisis
(1) Strategic and Business Planning Responsibilities (Building relationships; What is the CIO's role on an ongoing basis? Role of enterprise policies?)
(2) Top Steps States Need to Take to Solidify Public/Private Partnerships Ahead of Crises (Pre-disaster agreements with the private sector and other organizations.)
(3) How do you Make the Business Case on the Need for Redundancy? (Especially to the state legislature, the state executive branch and budget officials.)
(4) General IT Infrastructure and Services (Types of redundancy; protecting systems.)

During the Crisis
(5) Tactical Role of CIOs for Recovery During a Disaster (Working with state and local agencies and first responders; critical staff assignments; tactical use of technology, e.g. GIS.)

After the Crisis
(6) Tactical Role of CIOs for Recovery After a Disaster Occurs (Working with state and local agencies, and critical staff to resume day-to-day operations, and perform gap analysis of the plan's effectiveness.)
Before the Crisis

(1) Strategic and Business Planning Responsibilities (Building relationships; What is the CIO's role on an ongoing basis? Role of enterprise policies?)

! CIOs need a Disaster Recovery and Business Continuity (DRBC) plan including: (1) Focus on capabilities that are needed in any crisis situation; (2) Identifying functional requirements; (3) Planning based on the degrees of a crisis from minor disruption of services to extreme catastrophic incidents; (4) Establish service level requirements for business continuity; (5) Revise and update the plan; have critical partners review the plan; and (6) Have hard and digital copies of the plan stored in several locations for security. Notes:

! CIOs should conduct strategic assessments and inventory of physical assets, e.g. computing and telecom resources, identify alternate sites and computing facilities. Also conduct strategic assessments of essential employees to determine the staff that would be called upon in the event of a disaster and be sure to include pertinent contact information. Notes:

! CIOs should ask and answer the following questions: (1) What are the top business functions and essential services the state enterprise can not function without? Tier business functions and essential services into recovery categories based on level of importance and allowable downtime. (2) How can the operation's facilities, vital records, equipment, and other critical assets be protected? (3) How can disruption to an agency's or department's operations be reduced? Notes:

! CIOs should create a business resumption strategy: Such strategies lay out the interim procedures to follow in a disaster until normal business operations can be resumed. Plans should be organized by procedures to follow during the first 12, 24, and 48 hours of a disruption. (Utilize technologies such as GIS for plotting available assets, outages, etc.) Notes:

! CIOs should conduct contingency planning in case of lost personnel: This could involve cross-training of essential personnel that can be lent out to other agencies in case of loss of service or disaster; also, mutual aid agreements with other public/private entities such as state universities for "skilled volunteers." (Make sure contractors and volunteers have approved access to facilities during a crisis.) Notes:

! Build cross-boundary relationships with emergency agencies: CIOs should introduce themselves and build relationships with state-wide, agency and local emergency management personnel – you don't want the day of the disaster to be the first time you meet your emergency management counterparts. Communicate before the crisis. Also consider forging multi-state relationships with your CIO counterparts to prepare for multi-state incidents. Consider developing a cross-boundary DR/BC plan or strategy, as many agencies and jurisdictions have their own plans. Notes:
! Intergovernmental communications and coordination plan: Develop a plan to communicate and coordinate efforts with state, local and federal government officials. Systems critical for other state, local and federal programs and services may need to be temporarily shut down during an event to safeguard the state’s IT enterprise. Local jurisdictions are the point-of-service for many state transactions, including benefits distribution and child support payments, and alternate channels of service delivery may need to be identified and temporarily established. Make sure jurisdictional authority is clearly established and articulated to avoid internal conflicts during a crisis. Notes:
! Establish a crisis communications protocol: A crisis communications protocol should be part of a state’s IT DR/BC plan; Designate a primary media spokesperson with additional single point-of-contact communications officers as back-ups. Articulate who can speak to whom under different conditions, as well as who should not speak with the press. In a time of crisis, go public immediately, but only with what you know; provide updates frequently and regularly.
! Testing: CIOs should conduct periodic training exercises and drills to test DR/BC plans. These drills should be pre-scheduled and conducted on a regular basis and should include both desk-top and field exercises. Conduct a gap analysis following each exercise. Notes:
! A CIO’s approach to a DR/BC plan will be unique to his or her financial and organizational situation and the availability of trained personnel. This still leaves the question as to who writes the plans. If a CIO chooses from one of the many consultants that provide Continuity of Operations planning, he or she should make sure that staff maintains a close degree of involvement and, when completed, that the consultant(s) provide general awareness training of the plan. If CIOs choose to conduct planning in-house, have an experienced and certified business continuity planner review it for any potential gaps or inconsistencies. Notes:
! Communicate to rank and file employees that there is a plan, the why and how of the plan, and their roles during a potential disruption of service or disaster. Identify members of a possible crisis management team. Have in place their roles, actions to be taken, and possible scenarios. Have a list of their office, home, and cell or mobile phone numbers. Notes:
(2) Top Steps States Need to Take to Solidify Public/ Private Partnerships Ahead of Crises (Pre-disaster agreements with the private sector and other organizations.)
! Utilize preexisting business partnerships: Keep the dialogue open with state business partners; periodically call them all in for briefings on the state’s disaster recovery and business continuity (DR/BC) plans. Notes:
! Set up “Emergency Standby Services and Hardware Contracts:” Have contracts in place for products and services that may be needed in the event of a declared emergency. Develop a contract template so a contract can be developed with one to two hours work time.
! Be sure essential IT procurement staff are part of the DR/BC plan and are aware of their roles in executing pre-positioned contracts in the event of a disaster; also be sure to include pertinent contact information. Notes:
! CIOs should develop “Emergency Purchasing Guidelines” for agencies and have emergency response legislation in place. Notes:
! Outsourced back-up sites may be time limited; therefore back-up, back-up outsourcing may be necessary for continuity leap-frog.
! Think outside the box: CIOs can partner with anyone, e.g. universities, local government, lottery corporations, local companies and leased facilities with redundant capabilities. Notes:
! Place advertisements in the state’s “Contract Reporter” every quarter; continuous recruitment is a good business practice. Notes:
(3) How do you Make the Business Case on the Need for Redundancy? (Especially to the state legislature, the state executive branch and budget officials.)
Risk assessment of types of disasters that could lead to the need for business continuity planning:
! Geological hazards – Earthquakes, Tsunamis, Volcanic eruptions, Landslides/mudslides/subsidence;
! Meteorological hazards – Floods/flash floods, tidal surges, Drought, Fires (forest, range, urban), Snow, ice, hail, sleet, avalanche, Windstorm, tropical cyclone, hurricane, tornado, dust/sand storms, Extreme temperatures (heat, cold), Lightning strikes;
! Biological hazards – Diseases that impact humans and animals (plague, smallpox, Anthrax, West Nile Virus, Bird flu);
! Human-caused events – Accidental: Hazardous material (chemical, radiological, biological) spill or release; Explosion/fire; Transportation accident; Building/structure collapse; Energy/power/utility failure; Fuel/resource shortage; Air/water pollution, contamination; Water control structure/dam/levee failure; Financial issues: economic depression, inflation, financial system collapse; Communications systems interruptions;
! Intentional – Terrorism (conventional, chemical, radiological, biological, cyber); Sabotage; Civil disturbance, public unrest, mass hysteria, riot; Enemy attack, war; Insurrection; Strike; Misinformation; Crime; Arson; Electromagnetic pulse.
" Education and awareness: Craft an education and awareness program for IT staff, lawmakers and budget officials to ensure all parties are on the same page with regards to your DR/BC plan and the need for such a plan. Prepare key talking points that outline the rationale for DR/BC planning. Utilize outside resources such as this tool-kit and NASCIO’s DVD on disaster recovery, “Government at Risk: Protecting Your IT Infrastructure,” to help make the business case for disaster recovery and business continuity activities in your state. Notes:
" For federally declared states of emergency the financial aspect has been somewhat lessened by the potential of acquiring funding grants from state or federal organizations such as FEMA. Additional funding for state cybersecurity preparedness efforts is available to states through the U.S. Department of Homeland Security’s State Homeland Security Grants Program. Notes:
" Establish metrics for costs of not having redundancy: How much will it cost the state if certain critical business functions go down – e.g. ERP issues on the payment side; citizen service issues (what it would do to the DMV for license renewals); impacts on eligibility verifications for social services, etc. How long can you afford to be down? How much is this costing you? How long can you be without a core business function? Notes:
" Up-front savings: States obtain greater leverage for fair pricing and priority service in the event of an emergency before the emergency occurs, rather than after the emergency has occurred. Notes:
" Consider channels of delivery: Child support payments channeled through a broker agency. Notes:
! Consider cycles of delivery: Identify the most important periods of delivery; e.g. the last week or final few days of the month may be the most critical period to protect with back-up. Notes:
! Realize that as the adoption rate for electronic business processes and online services grows, employees with knowledge of business rules and paper processes will retire and will no longer be around for manual backup. Notes:
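To make the downtime-cost metric above concrete, the short Python sketch below multiplies an assumed hourly impact for each critical business function by the length of an outage. The function names and dollar figures are hypothetical placeholders for illustration only, not figures from this tool-kit; a state would substitute its own functions and rates from its risk assessment.

    # Rough downtime-cost sketch; all function names and rates are hypothetical assumptions.
    HOURLY_IMPACT = {
        "ERP payment processing": 50_000,              # assumed dollars lost per hour of outage
        "DMV license renewals": 12_000,
        "Social services eligibility checks": 20_000,
    }

    def downtime_cost(hours_down: float, hourly_impact=HOURLY_IMPACT) -> float:
        """Estimate the total cost of an outage of `hours_down` hours across all listed functions."""
        return sum(rate * hours_down for rate in hourly_impact.values())

    if __name__ == "__main__":
        for hours in (4, 24, 72):
            print(f"{hours:>3} hours down: ${downtime_cost(hours):,.0f}")

Comparing figures like these against the annual cost of maintaining redundancy gives budget officials a simple break-even argument.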
(4) General IT Infrastructure and Services (Types of redundancy; protecting systems.)
! CIOs need to ensure that information is regularly backed up. Agencies need to store their back-up data securely off site in a location that is accessible but not too near the facility in question. Such locations should be equipped with hardware, software and agency data, ready for use in an emergency. (Restore functions should be tested on a regular basis; a minimal restore-verification sketch appears at the end of this checklist.) These “hot sites” can be owned and operated by an agency or outsourced.
Notes:
! Mobile communication centers can be utilized in the event that traditional telecommunications systems are down. Notes:
! Protect current systems: Controlled access; uninterruptible power supply (UPS); back-up generators with standby contracts for diesel fuel (use priority and back-up fuel suppliers that also have back-up generators to operate their pumps in the event of a widespread power outage). Notes:
! Self-healing primary point-of-presence facilities that automatically restore service. Notes:
! Approach enterprise backup as a shared service: Other agencies may have excess capacity that can provide redundancy. Notes:
! Strategic location: Locate critical facilities away from sites that are vulnerable to natural and manmade disasters. Notes:
! Provide secure remote access to state IT systems for essential employees (access may be tiered based on critical need). Notes:
! Interactive voice response (IVR) systems that access back-end databases: There may be no backup operators available to connect patrons to services, so seek diversity of inbound communications. Notes:
! Self-healing communications systems that automatically re-route communications or use alternate media. Notes:
! Hot sites: A disaster recovery facility that mirrors an agency’s applications and databases in real time; operational recovery is provided within minutes of a disaster. These can be provided at remote locations or outsourced to one or multiple contractors. (A replication-lag sketch appears at the end of this checklist.) Notes:
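The back-up item earlier in this checklist notes that restore functions should be tested on a regular basis. The Python sketch below is one minimal way to automate that check, under the assumption that each backup ships with a manifest of SHA-256 checksums recorded at back-up time; the paths and manifest format are hypothetical, not something prescribed by this tool-kit.

    # Minimal restore-verification sketch (hypothetical paths and manifest format).
    # Assumes a manifest of "sha256  relative/path" lines written when the backup was taken.
    import hashlib
    from pathlib import Path

    def sha256_of(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def verify_restore(restore_dir: str, manifest_file: str) -> bool:
        """Return True only if every file in the manifest restored with a matching checksum."""
        ok = True
        for line in Path(manifest_file).read_text().splitlines():
            if not line.strip():
                continue
            expected, rel_path = line.split(maxsplit=1)
            restored = Path(restore_dir) / rel_path
            if not restored.exists() or sha256_of(restored) != expected:
                print(f"MISSING OR MISMATCHED: {rel_path}")
                ok = False
        return ok

    # Example (hypothetical locations):
    # verify_restore("/mnt/test_restore", "/mnt/test_restore/backup_manifest.txt")

A scheduled job can run this against a periodic test restore and alert staff on any mismatch.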
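For the hot-site item above, one simple operational control is to compare measured replication lag and failover-drill duration against the agency’s recovery point and recovery time objectives (RPO/RTO). The sketch below only illustrates the comparison; the thresholds are assumed values, and how lag or drill time is measured depends entirely on the replication product the agency actually uses.

    # Hypothetical RPO/RTO check for a hot site (all thresholds are assumptions).
    RPO_SECONDS = 300    # assumed recovery point objective: at most 5 minutes of data loss
    RTO_SECONDS = 1800   # assumed recovery time objective: operational within 30 minutes

    def hot_site_ok(replication_lag_s: float, last_failover_drill_s: float) -> bool:
        """Compare measured replication lag and the last failover drill time against the
        assumed RPO/RTO targets. Obtaining these measurements is product-specific and is
        left to the agency's replication and monitoring tooling."""
        ok = True
        if replication_lag_s > RPO_SECONDS:
            print(f"ALERT: replication lag {replication_lag_s:.0f}s exceeds RPO {RPO_SECONDS}s")
            ok = False
        if last_failover_drill_s > RTO_SECONDS:
            print(f"ALERT: last failover drill took {last_failover_drill_s:.0f}s, over RTO {RTO_SECONDS}s")
            ok = False
        return ok

    # Example: hot_site_ok(replication_lag_s=42, last_failover_drill_s=1500) returns True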
During the Crisis
(5) Tactical Role of CIOs for Recovery During a Disaster (Working with state and local agencies and first responders; critical staff assignments; tactical use of technology, e.g. GIS.)
! Decision making: Prepare yourself for making decisions in an environment of uncertainty. During a crisis you may not have all the information necessary; however, you will be required to make immediate decisions. Notes:
! Execute DR/BC plan: Retrieve copies of the plan from secure locations. Begin systematic execution of plan provisions, including procedures to follow during the first 12, 24, and 48 hours of the disruption. Notes:
! Implement your emergency employee communications plan: Inform your internal audiences – IT staff and other government offices – at the same time you inform the press. Prepare announcements to employees to transition them to alternate sites or implement telecommuting or other emergency procedures. Employees can maintain communication with the central IT office using phone exchange cards printed with two numbers: (1) the first number is where employees call in and leave their contact information; (2) the second number is where employees call in every morning for a standing all-employee conference call with updates on the emergency situation. Notes:
! Shut down non-essential services to free up resources for other critical services. Identify critical business applications and essential services and tier them into recovery categories based on level of importance and allowable downtime, e.g. tier III applications are shut down first. Be sure to classify critical services separately for internal vs. external customers. (A small tiering sketch appears at the end of this checklist.) Notes:
! Intergovernmental communications and coordination plan: Communicate and coordinate efforts with state, local and federal government officials. Systems critical for other state, local and federal programs and services may need to be temporarily shut down during an event to safeguard the state’s IT enterprise. Local jurisdictions are the point-of-contact for many state transactions, including vehicle and voter registration, and alternate channels of service delivery may need to be identified and temporarily established. Make sure jurisdictional authority is clearly established and articulated to avoid internal conflicts during a crisis. Notes:
! Communicate, communicate, communicate: Engage your primary media spokesperson immediately and have additional communications officers on stand-by if needed. Immediately get the word to the press; let the media – and therefore the public – know that you are dealing with the situation. Notes:
! Back-up communications: In the event wireless, radio and Internet communications are inaccessible, Government Emergency Telecommunications Service (GETS) cards can be utilized for emergency wireline communications. GETS is a federal program that prioritizes calls over wireline networks; callers dial the universal GETS access number and enter a Personal Identification Number (PIN) for priority access. Notes:
! Leverage technology/think outside the box: In a disaster situation the state’s GIS systems can be utilized to monitor power outages and system availability. For emergency communications, the “State Portal” can be converted to an emergency management portal. Web 2.0 technologies such as weblogs, wikis and RSS feeds can also be utilized for emergency communications (a minimal RSS sketch appears at the end of this checklist). Notes:
! CIOs must be effectively engaged with the On-Scene Coordinator (OSC) and the Incident Command System (ICS) – the federal framework for managing disaster response that outlines common processes, roles, functions, terms, responsibilities, etc. ICS supports FEMA’s National Incident Management System (NIMS) approach; states must understand both NIMS and ICS. Notes:
! Execute “Emergency Standby Services and Hardware Contracts”: If necessary, execute pre-placed contracts for products and services needed during the crisis. The Governor may also have to temporarily suspend some of the state’s procurement laws and execute “Emergency Purchasing Guidelines” for agencies. Notes:
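As a companion to the tiering item above, the sketch below shows one way to record recovery tiers and allowable downtime in a simple structure and to list services in the order they would be shut down (least critical first). The service names, tiers, downtime figures and audiences are hypothetical examples, not classifications from this tool-kit.

    # Hypothetical service registry: tier III services are shut down first, tier I last.
    from dataclasses import dataclass

    @dataclass
    class Service:
        name: str
        tier: int                  # 1 = most critical, 3 = least critical
        allowable_downtime_h: int  # assumed maximum tolerable downtime, in hours
        audience: str              # "internal" or "external" customers

    SERVICES = [
        Service("Benefit eligibility verification", 1, 4, "external"),
        Service("Agency intranet portal", 3, 72, "internal"),
        Service("DMV license renewals", 2, 24, "external"),
        Service("ERP payment processing", 1, 8, "internal"),
    ]

    def shutdown_order(services):
        """Sort so the least critical services (highest tier, longest allowable downtime) come first."""
        return sorted(services, key=lambda s: (-s.tier, -s.allowable_downtime_h))

    if __name__ == "__main__":
        for s in shutdown_order(SERVICES):
            print(f"Tier {s.tier} ({s.audience}): {s.name}")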
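The communications item above mentions RSS feeds as one Web 2.0 channel for emergency updates. The sketch below builds a minimal RSS 2.0 document with Python’s standard library; the channel title, link and messages are hypothetical placeholders rather than an actual state portal configuration.

    # Minimal RSS 2.0 feed for emergency updates (hypothetical channel and items).
    import xml.etree.ElementTree as ET
    from email.utils import formatdate

    def build_emergency_feed(messages):
        """Return an RSS 2.0 XML string with one <item> per (title, description) pair."""
        rss = ET.Element("rss", version="2.0")
        channel = ET.SubElement(rss, "channel")
        ET.SubElement(channel, "title").text = "State Emergency Updates"       # assumed title
        ET.SubElement(channel, "link").text = "https://example.gov/emergency"  # placeholder URL
        ET.SubElement(channel, "description").text = "Status updates during the declared emergency"
        for title, description in messages:
            item = ET.SubElement(channel, "item")
            ET.SubElement(item, "title").text = title
            ET.SubElement(item, "description").text = description
            ET.SubElement(item, "pubDate").text = formatdate(usegmt=True)
        return ET.tostring(rss, encoding="unicode")

    if __name__ == "__main__":
        print(build_emergency_feed([
            ("Data center status", "Primary site on generator power; hot site active."),
        ]))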
After the Crisis
(6) Tactical Role of CIOs for Recovery After a Disaster Occurs (Working with state and local agencies and critical staff to resume day-to-day operations and perform a gap analysis of the plan’s effectiveness.)
! Preliminary damage and loss assessment: Conduct a post-event inventory and assess the loss of physical and non-physical assets. Include both tangible losses (e.g. a building or infrastructure) and intangible losses (e.g. financial and economic losses due to service disruption). Be sure to include a damage and loss assessment of hard copy and digital records. Prepare a tiered strategy for recovery of lost assets. Notes:
! Employee transition: Once agencies have recovered their data, CIOs need to find interim space for displaced employees, either at the hot site or another location. Coordinate announcements to employees to transition them to an alternate site or implement telecommuting procedures until normal operations are reestablished. Notes:
! Contractual performance: Review the performance of strategic contracts and modify contract agreements as necessary. Notes:
! Lessons learned: Evaluate the effectiveness of the DR/BC plan and how people responded. Examine all aspects of the recovery effort and conduct a gap analysis to identify deficiencies in the plan’s execution. Update the plan based on the analysis: note what went right (so it can be duplicated) and what went wrong (so it can be tagged and avoided in the future). Correct problems so they do not recur. Notes:
! Budgetary concerns: Following a disaster and resumption of IT services, there may be a need for emergency capital expenditures to aid in the recovery process. Be prepared to work with the state budget officer and/or the state’s legislative budget committees. Notes:
Appendix 1. Thinking Sideways
Instructions: Use this worksheet in conjunction with each checklist as a group brainstorming tool.
A. Conduct a gap analysis on Checklist ___. Focus on what is missing and include key policy issues unique to state governments, best practices and innovative ideas that can be shared across jurisdictions.
B. Describe how states and the private sector can work together to tackle these issues through the transference of knowledge and experience.
C. How can CIOs use this information to secure funding and other resources for business continuity?
Appendix 2. Additional Resources

State Government Resources
New York State Office of General Services (OGS) emergency contracts, prepared through the new National Association of State Procurement Officials (NASPO) Cooperative Purchasing Hazardous Incident Response Equipment (HIRE) program, are available at: (New York is the lead state for this multi-state cooperative.)
Pennsylvania’s Pandemic Preparation Website: (Also see Government Technology’s article regarding Pennsylvania’s new Website:)
Washington State, Department of Information Technology, Tech News, “Enterprise Business Continuity: Making Sure Agencies are Prepared,” December 2005:

Federal Government Resources
The Federal Emergency Management Agency’s (FEMA’s) National Incident Management System (NIMS) – NIMS was developed so responders from different jurisdictions and disciplines can work together better to respond to natural disasters and emergencies, including acts of terrorism. NIMS’ benefits include a unified approach to incident management; standard command and management structures; and emphasis on preparedness, mutual aid and resource management:
FEMA’s Emergency Management Institute – A federal resource for emergency management education and training.
GAO Report, Information Sharing: DHS Should Take Steps to Encourage More Widespread Use of Its Program to Protect and Share Critical Infrastructure Information, GAO-06-383, April 17, 2006:
GAO Report, Continuity of Operations: Agency Plans Have Improved, but Better Oversight Could Assist Agencies in Preparing for Emergencies, GAO-05-577, April 28, 2005:
U.S. Department of Homeland Security (DHS), Safe America Foundation –
National Institute of Standards and Technology (NIST) – Special Publication 800-34, Contingency Planning Guide for Information Technology Systems: Recommendations of the National Institute of Standards and Technology:

National Organization, Academia and Consortium Resources
Business Continuity Institute (BCI) – BCI was established in 1994 to enable members to obtain guidance and support from fellow business continuity practitioners. The BCI has over 2,600 members in more than 50 countries. The wider role of the BCI is to promote the highest standards of professional competence and commercial ethics in the provision and maintenance of business continuity planning and services:
Disaster Recovery Institute (DRI) – DRI International (DRII) was first formed in 1988 as the Disaster Recovery Institute in St. Louis, MO, by a group of professionals from industry and from Washington University in St. Louis who forecast the need for comprehensive education in business continuity. DRII established its goals to: promote a base of common knowledge for the business continuity planning/disaster recovery industry through education, assistance, and publication of the standard resource base; certify qualified individuals in the discipline; and promote the credibility and professionalism of certified individuals:
The National Association of State Procurement Officials (NASPO) has completed work on disaster recovery as it relates to procurement:
U.S. Computer Emergency Readiness Team (US-CERT)/ Coordination Center – Survivable Systems Analysis Method:
The Council of State Archivists (CoSA) – CoSA is a national organization comprising the individuals who serve as directors of the principal archival agencies in each state and territorial government. CoSA’s Framework for Emergency Preparedness in State Archives and Records Management Programs is available at:

After the Disaster
Hurricane Katrina not only impacted more than 90,000 square miles and almost 10 million residents of the Gulf Coast but also affected how governments will manage such disasters in the future. A collection of articles opens the dialogue about disaster response in a new book, “On Risk and Disaster: Lessons from Hurricane Katrina.” The book, edited by Ronald J. Daniels, Donald F. Kettl (a Governing contributor) and Howard Kunreuther, warns of the inevitability of another disaster and the need to be prepared to act. It addresses the public and private roles in assessing, managing and dealing with disasters and suggests strategies for moving ahead in rebuilding the Gulf Coast. To see a table of contents and sample text, visit: Published by the University of Pennsylvania Press, the book sells for $27.50.

Articles and Reports
“Cleaning Up After Katrina,” CIO Magazine, March 15, 2006:
“Continuity of Operations Planning: Survival for Government,” Continuity Central:
“Disaster and Recovery,” GovExec.com:
“Disaster Recovery: How to protect your technology in the event of a disaster,” Bob Xavier, November 27, 2001:
VITA
EDUCATION
Candidate for M.S. in Computer Information Technology, Purdue University, May 2011. G.P.A. 3.8/4.00
Honors B.A. in Communication, Public Relations, Purdue University, December 1998. G.P.A. 3.57/4.00

PUBLICATIONS
“Disaster Recovery and Business Continuity Planning: Business Justification,” H. M. Brotherton, Journal of Emergency Management, 67-60, 2010. DOI: 10.5055/jem.2010.0019 http://pnpcsw.pnpco.com/cadmus/testvol.asp?year=2010&journal=jem

EMPLOYMENT
ITaP Web and Applications Administration, Graduate Assistant, May 2009-Present
• Met with customers to define project requirements and create an articulate design rationale that best meets those requirements
• Performed project planning and management
• Developed and updated documentation of administration policies and procedures
• Granted and implemented development and deployment access for web developers
• Migrated and created websites in Apache, IIS, and ColdFusion environments
• Customized application portals using XML, JavaScript, HTML, and CSS
• Created Tivoli Storage Manager nodes
• Assisted with SharePoint training development
• Built a Tomcat web server
• Researched BMC Remedy BI development and implementation
• Designed, developed, and implemented the Applications Administration website and forms using XHTML, JavaScript, CSS and PHP
Social Security Administration, Social Insurance Specialist, April 1999-January 2009
• Served as Site LAN Coordinator for the office; duties included verifying systems updates, reporting systems problems, changing daily backup tapes, and resolving systems issues by making necessary changes on site
• Prepared and delivered presentations to special interest groups
• Employed creativity and problem solving to deal effectively with situations of competing or conflicting priorities
• Exercised professionalism and discretion in handling confidential information
• Analyzed, interpreted, and implemented policy and balanced tasks in a fast-paced work environment
• Learned new policies, tools and technology on a daily basis to keep up with constantly changing workloads
• Assumed responsibility for maintaining quality standards in processing claims
• Worked both independently and as a team member to meet office goals

Before & Afterthoughts, Owner-Manager, November 2001-April 2003
Formed an S-corporation, performed all bookkeeping, and paid and managed employees.

Underhill Games, Co-owner/Board Member, May 2001-April 2002
Researched business models, informed the board members about the various corporation types and recommended forming an S-corporation, attended trade conventions as company representative, created and edited the online store, and set up payment and shipping for business merchandise.

Purdue University West Lafayette, Residence Hall Counselor, August 1997-May 1998
Planned and implemented programs and activities, enforced University policies, resolved disputes between residents, and served as a contact for University resource referrals.

REFERENCES
Dr. J. Eric Dietz, Computer and Information Technology, Purdue University, Purdue Homeland Security Institute (PHSI), Gerald D. and Edna E. Mann Hall, Room 166, 203 S. Martin Jischke Drive, West Lafayette, IN 47907-1971
Jeffrey Sprankle, Computer and Information Technology, Purdue University, Knoy Hall, Room 221, 401 N. Grant Street, West Lafayette, IN 47907
PUBLICATION