Integrating the resiliency needs of business and IT functions - IBM

IBM Global Technology Services

Integrating the resiliency needs of business and IT functions
A holistic approach to recovering from an adverse event

July 2011

Contents

Executive summary
IT risk strategy considerations
Availability and recovery objectives
Functional components associated with information protection
Business priorities
Challenges of a combined local availability and remote recovery design
A holistic design approach
Key performance indicators to measure success
Summary

Executive summary

Regardless of size or industry, organizations depend upon the nearly continuous flow and processing of information. Driven by business requirements, they increasingly need to provide faster and broader access 24×7. As a result, determining how to protect information and manage the risk associated with information technology (IT) is a critical focus of both business units and IT organizations across the globe.

Around-the-clock business operations place greater resiliency demands on the underlying IT functions that support critical business processing. Therefore, it is no longer feasible to separate availability from recovery, business continuity from disaster readiness or system failover from continuous processing. A new, combined business and IT functionality has emerged, which has made it necessary to address the protection of these functions with strategies and techniques designed and integrated into a singular, seamless approach.

This white paper details the benefits of implementing an enhanced resiliency approach that combines local operational availability with remote disaster recovery. Such an approach can help to provide better IT risk management and a more holistic way of managing critical information processing across the business continuum. This paper will also explore how effective IT risk management can help ensure that information is available for processing on a continual basis by preventing—and at times even predicting—an adverse action that might impact an organization’s ability to transact its core functions and services.

IT risk strategy considerations

Developing an IT risk strategy that helps ensure continuous data access requires attention to a number of challenging considerations. Traditionally, organizations have handled some of these considerations (such as redundant operations, high-availability infrastructures, continuous operations, failover capabilities and disaster recovery) somewhat separately from an implementation standpoint. Although each of these considerations has a different way of addressing how often data is accessed, how it is protected and how frequently it is backed up to ensure recovery, they can complement and overlap each other to deliver uninterruptible service. Together, they provide a full information-access and data-protection strategy that can help ensure instantaneous response, near-zero downtime and limited critical-information loss.


Some of the key questions to consider when developing an IT risk strategy include:

● What are the critical risk factors and how do they relate to overall risk tolerance?
● What are the boundaries of availability regarding component failures, system failures, application failures and site failure?
● How does recovery come into play for site failure or complete destruction? For loss of data?
● Do the requirements reflect the perspective of the end users, operations or business processes?

Availability and recovery objectives

The common metric used to determine availability involves service level agreements (SLAs), which identify varying levels of availability, serviceability, performance, operation and other attributes that pertain to a system’s ability to deliver uninterruptible service. This level of service is often specified in terms of “targets” that identify the optimal levels of availability that must be delivered to the end users of the service, thus indicating the amount of system “up time” that is required to drive business results.

Likewise, if a catastrophic event occurs, the amount of time required to resume business operations is defined in terms of recovery objectives that consider the impact to the business from both an outage and data loss perspective. The two most common recovery objective indicators are the recovery time objective (RTO) and the recovery point objective (RPO).

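Figure 1: Objectives for recovering after an adverse event (timeline showing the last backup, the event and the point at which data is restored, with the recovery point objective and the recovery time objective marked against time)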

The RTO measures the time required to return critical operations to normal processing after an adverse event. It is most often defined in minutes or hours and is based upon the financial impact, financial loss, penalties or fines to the business. The RPO is the point in time, prior to the outage, to which systems and data must be restored. It is commonly measured in terms of information loss based upon business criticality.

SLAs, RTOs and RPOs accompany a detailed design of the resiliency implementation. The design specifies the associated cost, management and ongoing support required to deliver the functional components that comprise the enhanced resiliency effort. Establishing a baseline for these components regarding how information is most effectively managed and protected is critical to understanding how their combined synergies can create a more effective, efficient approach to managing information flows across the enterprise.
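The RTO and RPO definitions can be made concrete with a small calculation. The following Python sketch is illustrative only: the class name, function name, timestamps and target values are assumptions and do not come from this paper. It measures the actual recovery time (event to restored service) and the actual data loss window (last backup to event) for one incident and compares each against its objective.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    """Targets agreed with the business (the values used below are illustrative)."""
    rto: timedelta  # maximum tolerated time from the event to restored operations
    rpo: timedelta  # maximum tolerated age of the data that is restored


def assess_recovery(last_backup, event, restored, objectives):
    """Compare one recovery against its objectives.

    Returns (recovery_time, data_loss_window, rto_met, rpo_met).
    """
    if not (last_backup <= event <= restored):
        raise ValueError("expected last_backup <= event <= restored")
    recovery_time = restored - event        # measured against the RTO
    data_loss_window = event - last_backup  # measured against the RPO
    return (
        recovery_time,
        data_loss_window,
        recovery_time <= objectives.rto,
        data_loss_window <= objectives.rpo,
    )


# Hypothetical incident: backup at 02:00, failure at 09:30, service back at 14:00.
objectives = RecoveryObjectives(rto=timedelta(hours=8), rpo=timedelta(minutes=30))
result = assess_recovery(
    datetime(2011, 7, 1, 2, 0),
    datetime(2011, 7, 1, 9, 30),
    datetime(2011, 7, 1, 14, 0),
    objectives,
)
print(result)
```

In this hypothetical incident the recovery time of 4.5 hours is within the 8-hour RTO, but the 7.5-hour gap since the last backup exceeds the 30-minute RPO. This shows why the two objectives have to be designed for separately.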

Functional components associated with information protection

The following table shows the various functional components associated with information protection. The first column shows the business priority of the information being addressed, followed by availability and recovery definitions, specific operating requirements and the most common delivery approaches.

Business priority   Availability and recovery objective   Operating requirements    Delivery approach

Mission critical    SLA - 99.999%                         Continuous availability   Full redundancy
                    RTO - minutes, RPO - near zero        Immediate failover        Active/active

Key operations      SLA - 99.99%                          High availability         Duplicate systems
                    RTO - < 8 hours, RPO - minutes        Rapid recovery            Dedicated recovery

Key support         SLA - 99.9%                           Standby operations        Idle/spare capacity
                    RTO - < 24 hours, RPO - minutes       System/data copies        Hybrid recovery

Noncritical         SLA - 99.5%                           System restoration        Recovery operations
                    RTO - > 72 hours, RPO - hours         System/data backup        Rebuild through acquisition

RTO: Recovery time objective
RPO: Recovery point objective

Figure 2: Functional components of an enhanced resiliency approach
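To give a sense of scale for the SLA percentages in Figure 2, the following Python sketch converts each target into the downtime it permits per year. The function name and the 365-day, around-the-clock measurement period are assumptions made for illustration. Actual SLAs often define their own measurement windows and exclusions.

```python
# Assumption: availability is measured around the clock over a 365-day year,
# with no excluded maintenance windows.
MINUTES_PER_YEAR = 365 * 24 * 60

# SLA targets copied from Figure 2.
SLA_TARGETS = {
    "Mission critical": 99.999,
    "Key operations": 99.99,
    "Key support": 99.9,
    "Noncritical": 99.5,
}


def allowed_downtime_minutes(sla_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime (in minutes) that an availability target permits."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period_minutes * (100 - sla_percent) / 100


for priority, sla in SLA_TARGETS.items():
    minutes = allowed_downtime_minutes(sla)
    print(f"{priority}: {sla}% allows about {minutes:.1f} minutes per year")
```

Under these assumptions the four tiers allow roughly 5 minutes, 53 minutes, 8.8 hours and 43.8 hours of downtime per year respectively.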

Business priorities

#1: Mission-critical functions

Mission-critical functions are the highest priority workloads of a business. They include functions in which an outage could jeopardize a person’s life or cause extreme financial loss. Accessibility to the systems, applications and data that support these functions is paramount to processing critical business transactions. Should the functions be impaired, the business may be subject to legal action or financial penalties.

The extreme importance of these functions mandates that they be continuously available, with fully redundant capability in place to minimize any interruption that might inhibit 24×7 access and availability. Delivering this capability requires that the underlying IT—defined as systems, applications, data, networks and all associated infrastructures—be duplicated and readily available for automated takeover of the primary function should an incident occur.


IT duplication of mission-critical functions is essential because an outage or interruption of these functions could have a devastating impact on the organization. Recovery requirements must be designed to provide immediate access to live systems and accurate information. This is best accomplished using a fully redundant approach, where dedicated capacity and replicated data with complete network access is in place to ensure business continuance in as short a timeframe as possible.

#2: Key operations

Key operations are characterized by their relative importance in enabling the business to deliver its products and services. Examples include SAP systems for inventory control or product distribution, or online access to product websites that support internet sales. These functions tend to be governed more by internal audit and controls than by the external forces that drive mission-critical functions. Because outages to key operations will adversely affect the organization’s ability to deliver quality service, key operations must be considered critical functions. Availability designs must be in place to provide adequate failover capability if it becomes necessary to initiate manual intervention to provide immediate access to systems and information.

From a recovery standpoint, there must be dedicated systems, data and infrastructure that can be manually reconfigured to resume processing in accordance with defined business needs. This recovery design requires that all system images, applications and data are transmitted electronically to the remote site to ensure that the recovery objectives are in accordance with times and data accuracy metrics established by the business.

#3: Key support capabilities

Key support capabilities are the business and IT components that augment the critical processing of mission-critical and key operations as they relate to the analysis, audit and tracking of business functions. Although they are important to maintaining the integrity of critical operations, these functions can tolerate some level of delay during an outage. For example, call centers and internal help desks are processing services that are support oriented, but they do not need to be available or recovered as a top priority function.

Availability considerations for key support include having some form of standby operations that can be brought online within a fairly aggressive timeframe. This requires having excess capacity on hand, in the form of processor and storage resources, that can be reconfigured for immediate use in place of the failing function. Most often, organizations use a hybrid recovery design that includes both dedicated and shared assets for faster implementation.

To facilitate hybrid recovery, copies of system images, applications and data must be electronically stored on dedicated resources to meet the fairly aggressive recovery objectives established by the critical business functions. At the time of an incident, resources from a shared asset pool (either spare internal equipment or a vendor subscription in an alternate site) can be configured and made available for connection to the dedicated resources.

#4: Noncritical functions

Noncritical functions are the lowest business priority. These typically include back-office, offline processing capabilities, such as the accounting and reconciliation of client accounts. These functions can tolerate extended downtime without significant impact on the business and often can be temporarily executed using manual workarounds.

It is common to take a more relaxed approach when determining business resiliency requirements for noncritical functions. Regarding availability, organizations restart systems and accompanying applications with data recovered from either online storage or local tape pool backups. This is a largely manual approach, and the organization must be able to make spare resources readily available within the local site in the event of an operational outage.

The recovery design for noncritical functions is typically more traditional than for other functions. The organization rebuilds various systems and restores accompanying applications and data from backup media (again using either online storage or tape backup) based on established recovery objectives. The technology needed to rebuild an environment after an operational outage depends on how the business chooses to approach the recovery, for example, by moving a workload from a particular resource, by locating spare assets, by using vendor-subscribed assets or by acquiring resources at the time of an event. How the environment is rebuilt and how long the recovery takes will differ with each approach.

Challenges of a combined local availability and remote recovery design

With the emergence of new, combined business and IT functionality, it is mandatory to design infrastructures that can support availability and recovery from both a business function and an IT operational standpoint. Often, availability and recovery are intertwined, with local availability being substituted for disaster recovery, or remote recovery being used as protection against wide-scale outages in production. Although the different approaches provide the technical design points required to support the stated objectives somewhat consistently, they neglect to fully address the metrics needed to meet both availability and recovery requirements.

For example, designing in-region redundancy that utilizes system failover and data replication techniques provides optimal availability for handling system or single-site-related events, but it does not protect against a regional outage. Conversely, an out-of-region recovery design protects against a regional outage, but it may be very limited in its ability to remotely recover single systems or files when the majority of the production site has not been impacted.

Some of the challenges associated with designing a combined local availability and remote recovery approach include:

● Technology limitations when designing system failover capabilities. Currently, the ability to automate system failovers for mission-critical and key-operations functions is limited to an in-region design due to timings required for internal processor synchronization.
● Integrity concerns when replicating data over long distances. Data synchronization, and ultimately accuracy and availability, is a concern because latency is introduced as distance increases. The impact on production must be evaluated as it relates to performance and any potential degradation in the overall delivery of service.
● “Partial recovery” of single systems. Recovering single systems requires not only resource and data availability, but also the ability to reconnect the system into the primary processing environment upon restoration. This requires integrated connectivity that, from a performance, throughput and access perspective, must once again address latency at a system and application level. A critical factor in a successful recovery design is understanding the intricate data dependencies at the application level so that the failed function can be integrated back into the mainstream processing environment.
● Single points of failure. Providing seamless access to the secondary environment is another key to meeting availability objectives. The design must take into consideration the need for standalone and isolated infrastructures that can run in the event of a primary site outage.
● The ability to exercise the design. The design must be able to test the complete solution and not be relegated to component or unit testing. The ability to not impact production during test events is yet another key factor in a successful design.
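The relationship between distance and latency noted in the list above can be approximated with a back-of-the-envelope calculation. The following Python sketch is a rough lower bound under stated assumptions, not a sizing tool: it assumes light travels about 200 km per millisecond in optical fiber, that each synchronous write waits for one round trip to the remote site, and that the cable route equals the stated distance. The function names and the `overhead_ms` parameter are hypothetical.

```python
# Assumption: signal speed in optical fiber is roughly 200 km per millisecond
# (about two-thirds of the speed of light in a vacuum).
FIBER_KM_PER_MS = 200.0


def round_trip_ms(distance_km, overhead_ms=0.0):
    """Lower-bound round-trip delay for one synchronous write acknowledgment.

    overhead_ms is a hypothetical allowance for equipment and protocol delay.
    """
    if distance_km < 0 or overhead_ms < 0:
        raise ValueError("distance_km and overhead_ms must be non-negative")
    return 2 * distance_km / FIBER_KM_PER_MS + overhead_ms


def max_serial_writes_per_second(distance_km, overhead_ms=0.0):
    """Upper bound on strictly sequential synchronous writes per second."""
    rtt = round_trip_ms(distance_km, overhead_ms)
    return float("inf") if rtt == 0 else 1000.0 / rtt


for km in (50, 1000):
    print(km, "km:", round_trip_ms(km), "ms round trip,",
          max_serial_writes_per_second(km), "serial writes/s at most")
```

Under these assumptions an in-region site 50 km away adds at least 0.5 ms to every synchronous write, while an out-of-region site 1,000 km away adds at least 10 ms, which caps a strictly sequential workload at about 100 writes per second. Real routes and equipment add further delay, so the production impact should be measured rather than estimated this way.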

A holistic design approach

An effective design takes into consideration the detailed requirements for both availability and disaster recovery. An organization must identify these requirements and confirm their relative importance because they relate not only to the vitality of the business, but to how each will be supported from a financial, operational and technology perspective. Furthermore, establishing business priorities allows an organization to properly identify the metrics (SLAs, RTOs and RPOs) that will be used to monitor the design implementation.

In addition to gaining a deeper understanding of the business requirements, it is critical that the design and ultimate implementation of the resiliency approach be in concert with the dynamics of the evolving business and IT strategies. This requires that the organization designs availability and recovery into every aspect of the production environment, from business functions and processes to the underlying applications, organization and supporting infrastructure that deliver products and services.

It is paramount to design availability and recovery into the very fabric of production operations as new systems, applications and data management techniques are deployed. Experience has proven that trying to retrofit these capabilities after new systems are brought online is often extremely difficult, if not impossible.


Key performance indicators to measure success

When determining the best business and IT strategy to use to address availability and disaster recovery, it is important to take into consideration the metrics collectively defined by the business units and IT organization for managing how resiliency will be deployed across the enterprise. These metrics should be jointly agreed upon by both the business units and the IT organization. They also should be used as the baseline for defining a combined capability that provides consistent access for critical functions on a daily basis while supporting a reasonable recovery effort should an event occur.

Financial impact is also a key factor in defining a combined resiliency approach. Significant capital acquisition costs and ongoing run rates are associated with the additional facilities, hardware, software, networks and support services required to enable, operate and maintain enhanced availability and recovery. Historically, funding has been directed toward either an availability or a recovery design, providing only half of the solution. A more cohesive design identifies what is truly critical from both perspectives, with funding allocated to a combined strategy. Within this holistic approach, allocation costs can be broken down into specific availability and recovery requirements based on business priorities and the need to meet previously defined service level and recovery objectives.

Establishing the business priorities, understanding the operational relationships, defining their associated metrics and assessing the financial impact of an enhanced strategy serve as the guidelines for developing the actual solutions that will be architected to support enhanced resiliency.

Furthering an enhanced resiliency effort requires a detailed analysis to balance the specific characteristics that are associated with delivering the various capabilities necessary to support and sustain business operations. This includes trade-offs associated with in-region availability—such as system failover, continuous processing and local operational recovery—as they relate to maintaining daily operations. Likewise, from a recovery perspective, the ability to resume business functions from a site outage or regional event may require relaxing some of the recovery objectives in support of meeting daily service-level objectives.

Summary

As organizations strive to create more robust capabilities to support the accessibility of the key resources that drive business processing, they continue to see the convergence of IT- and business-focused approaches to managing availability and recovery requirements, because the approaches and designs that support operational processing functions can no longer be separate and discrete. Key to the convergence of these approaches is having detailed knowledge of what each delivers relative to the importance of the actual business outcomes they support. Not until these individual approaches are understood and rationalized relative to how they jointly contribute to the design of a true resiliency strategy can they be utilized to fully address the ultimate needs of the business.

© Copyright IBM Corporation 2011 Produced in the United States of America July 2011 All Rights Reserved IBM, the IBM logo and ibm.com are trademarks of International Business Machines Corporation in the United States, other countries or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml

Other company, product or service names may be trademarks or service marks of others.

For more information

To learn more about the benefits of implementing an enhanced resiliency approach, please contact your IBM marketing representative or IBM Business Partner, or visit the following website: ibm.com/services/continuity

Additionally, financing solutions from IBM Global Financing can enable effective cash management, protection from technology obsolescence, improved total cost of ownership and return on investment. Also, our Global Asset Recovery Services help address environmental concerns with new, more energy-efficient solutions. For more information on IBM Global Financing, visit: ibm.com/financing

BUW03023-USEN-00
