Automation and orchestration framework for large-scale enterprise cloud migration
J. Hwang, K. Bai, M. Tacci, M. Vukovic, N. Anerousis
With the promise of low-cost access to flexible and elastic compute resources, enterprises are increasingly migrating their existing workloads to cloud environments. However, the heterogeneity and complexity of legacy IT infrastructure make it challenging to streamline processes of migration at an enterprise scale. In this paper, we present Cloud Migration Orchestrator (CMO), a framework for automation and coordination of large-scale cloud migration based on the IBM Business Process Management (BPM) technology with pre-migration analytics. CMO seamlessly automates complex and error-prone tasks, spanning from on-premise data center analysis, using correlations between occurrences of middleware components, to parallel migration execution by integrating various vendor migration tools. CMO offers self-service capability with a “one-click” migration execution and provides a solution for retaining IP addresses to further minimize workload remediation efforts. We present a taxonomy of network challenges, based on experience with migration of legacy environments and discuss how to automate and optimize network configurations. For each step of the migration process, starting from pre-migration assessment through the post-migration configuration, we discuss lessons learned from real-world deployments and demonstrate how the novel CMO framework reduces human activities through automation. Finally, we discuss efficiency of migration capabilities, including a fourfold process improvement (with respect to traditional approaches) using automation and orchestration.
Introduction
Enterprises are increasingly moving their existing IT infrastructure to the cloud, given the promise of low-cost access to on-demand elastic computing resources [1]. Migration of IT infrastructure to cloud platforms transfers a company's data, applications, and services to one or more cloud operating environments. Heterogeneity and complexity of the source environment, coupled with regulatory and governance constraints (e.g., approvals required to deploy discovery scripts), demand a significant level of manual intervention and supervision while accessing on-premise data centers. In turn, this has an impact on the duration and quality of execution of the processes, increases dependency on the migration engineers' skill level, and makes migrating enterprise-scale workloads technically challenging and error prone [2–4]. A typical process for migration to the cloud starts with the discovery stage, which feeds into the analysis of the source environment and its fit to the target (e.g., cloud) environment [5, 6]. During the planning stage, related servers are grouped into so-called "waves" that are sequentially scheduled for migration. Before the migration is executed, preconfiguration is performed, which includes provisioning of the target environment and network setup. Once the migration is completed, post-configuration tasks are executed, such as backup switching. The process completes with the quality assurance step [7, 8].

Digital Object Identifier: 10.1147/JRD.2015.2511810
© Copyright 2016 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied by any means or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor. 0018-8646/16 © 2016 IBM
IBM J. RES. & DEV., VOL. 60, NO. 2/3, PAPER 1, MARCH/MAY 2016. J. HWANG ET AL. 1:1
Figure 1. CMO framework. (UI: user interface; XML: Extensible Markup Language; csv: comma-separated values.)
The existing migration frameworks [9, 10] have mostly focused on how to migrate legacy applications into the cloud software-as-a-service environment; that is, application modernization has been the key motivation for cloud migration. Although these frameworks provide a way to derive a transformation model from business models (process, rules, and motivation) down to the level of component architectures and platform-specific design, enterprise-scale migration requires broader approaches that include source data center analysis (considering middleware relations and affinity), target cloud design based on the catalogs provided by various cloud providers, and parallelized migration execution by orchestrating various existing migration tools.

Technical challenges for an enterprise-scale migration are driven by the heterogeneity of the source and target environments, the number of available tools, and the many unanticipated events and exceptions that can derail the process [11, 12]. For example, the large variety of server platforms affects the choice of migration techniques that need to be considered. On-premise servers typically run on different platforms, different physical hardware, and various hypervisors that involve different image formats. As a result, there is no one-size-fits-all migration approach, and this spurs a variety of methods [13, 14]. Challenges in managing human activities in migration include process governance, access to common content (e.g., target design, decisions, what-if analysis), and timely notifications for process tracking.

In this paper, we examine each step in the process of migration to the cloud and describe how we applied analytics and automation to improve its efficiency and
optimize the process variability. The target migration method of CMO is image migration with reconfiguration, and the target cloud providers are IaaS (infrastructure-as-a-service) companies such as IBM SoftLayer and Amazon Web Services. We present CMO (Cloud Migration Orchestrator) and, specifically, the enhancements we made to integrate the discovery and analysis phase. We offer insights from CMO deployments and discuss the improvements it introduces in the migration process, in terms of time and labor savings.
Cloud Migration Orchestrator
In this section, we introduce CMO, a framework based on BPM (Business Process Management) [15] principles that automates and orchestrates migration activities, including discovery, analysis, pre-migration configuration, and execution of migration [16, 17]. Figure 1 illustrates the CMO framework. We have extended the base CMO [16] to integrate discovery and analytical capabilities, to provide flexibility in the choice of migration tooling, and to support post-configuration steps, thereby enabling end-to-end automation and orchestration of the entire migration process (each step is explained in the following sections). Benefits of BPM include monitoring, context sharing, and orchestration of the entire process, allowing for the variability in migration scenarios that CMO supports.

Discovery and analysis (discovery and analytics engines)
Customers have started to centralize IT infrastructure through a variety of methods, primarily data center consolidation [18–20]. However, existing methods
for IT infrastructure discovery that provide insights to drive decisions about large-scale, complex relocation requirements are labor-intensive and impractical. They require credentials to access the customer's machines and lengthy approvals to deploy discovery agents in the customer's environment. The migration function was traditionally performed by internal IT departments at enormous cost; it was impractical to offer this capability as a service. To overcome these technical and governance limitations, CMO integrates with ALDM (Analytics for Logical Dependency Mapping) [21, 22] (see the next section for details). The innovation impact of ALDM is twofold. First, ALDM removes the requirement to obtain credentials to run discovery, thereby reducing discovery time; the simplicity of ALDM's script-based approach removes the effort required to deploy agents and configure complex tooling. In addition, ALDM's technical underpinnings enable rapid and accurate discovery, thereby improving time-to-value. A set of analytical capabilities has been developed on top of the ALDM data source to support reasoning about a workload's fitness for the cloud and the cost of migration, along with advanced server grouping to schedule migration waves.

Pre-migration configuration (configuration engine)
Once the source environment is identified and the target environment designed (taking into account any optimizations), the next step is to provision the target environment. Again, many of these tasks tend to be executed manually, or challenges arise from the complexity of the environment (e.g., source and target network configuration challenges) [23]. This step can often be significantly delayed because network circuits are not ready. CMO employs cloud APIs (application program interfaces) to provision the network elements, configure them, and provision the target environment and resources as needed (e.g., ESX [Elastic Sky X]).
The ability to fully automate this stage depends on the cloud provider and the availability of APIs; certain actions may still require human involvement. This is where the BPM underpinnings of CMO help, enabling orchestration of both human and automated tasks.

Migration execution (migration engine)
CMO integrates multiple migration technologies, such as vConverter, VMware** SRM (Site Recovery Manager), vReplication (vSphere Replication) [24], and RackWare [25], through APIs. This is a key differentiator for CMO as an orchestrator. As mentioned, there is no single plan, process, or solution for migration to the cloud. The process varies depending on the source environment and the desired target. Based on the availability of tooling, support for source and target pairs, licensing restrictions, and network configuration, different tools may be applicable. For example, an image-based migration is a per-unit-based
method, such as physical-to-virtual conversion, which is supported by VMware converter. SRM, on the other hand, may be more suitable for virtual-to-virtual scenarios and bulk migration.

Post-migration configuration (configuration engine)
This step includes the setup and configuration of backup systems and services. The customer's workloads and underlying images may require upgrades and customization on the target side. Examples include an upgrade of middleware or a deployment of a new configuration script. This step can also be derailed by a number of human and technical challenges, such as unanticipated yet required fixes, backup configuration requirements, and challenges involving the availability of the backup service.

Quality assurance (migration engine)
The customer (or application/workload owner) typically performs the final user acceptance testing. Often, the delivery engineers verify the infrastructure and then hand control to the customer. This handover is yet another example of required orchestration, to ensure smooth transitioning between the distributed teams.
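The scenario-to-tool pairings described under "Migration execution" above might be sketched as a simple lookup. The tuple keys and the "manual review" fallback below are illustrative assumptions, not CMO's actual selection logic, which also weighs licensing, network configuration, and API availability.

```python
# Hypothetical mapping from (source platform, target platform, scale)
# to a migration tool, following the pairings described in the text.
TOOL_BY_SCENARIO = {
    ("physical", "virtual", "single"): "vConverter",
    ("virtual", "virtual", "bulk"): "VMware SRM",
    ("virtual", "virtual", "single"): "vReplication",
    ("physical", "bare-metal", "single"): "RackWare",
}

def pick_tool(source, target, scale):
    # Unknown combinations fall back to a human decision.
    return TOOL_BY_SCENARIO.get((source, target, scale), "manual review")

print(pick_tool("physical", "virtual", "single"))  # vConverter
```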
Discovery and analysis
Planning migration waves and migrating to the cloud may seem straightforward: one might argue that the essential aspect of any application or business function is its codebase, and therefore migrating to the cloud simply equals migrating the codebase to the cloud [9]. However, there are vast differences between migrating some code to the cloud and migrating an entire business function, application, and its services to the cloud. Consistent application and business functions are paramount to an enterprise user deploying to the cloud; consistency is critical to serving clients and preserving the profitability of the business operation. Migrating critical applications and infrastructure to the cloud, while maintaining business as usual, requires substantial human resources to manage sensitive data and applications during the migration. In Figure 1, we provide a typical migration scenario to illustrate the degree of complexity of the IT transformation problem. We also introduce our discovery framework and an example of how the discovered information can serve migration planning [26, 27]. Let us assume a migration engineer (a.k.a. field practitioner) in an IT services company is asked to migrate a client's critical production software from an old data center (called the source or on-premise environment) to a new data center (called the target environment). The practitioner can first deploy a set of discovery tools,
including network observation and configuration analysis, to discover significant information about the existing systems. In addition, the practitioner will use our system and methodology, which follows a model-driven approach to enrich basic infrastructure discovery with more comprehensive information about dependencies between applications and data (e.g., business objects, tables, files, and other information entities). Typically, this information cannot easily be found by the tool sets used in the first step, such as a network observation tool. Using the overall discovered information, the practitioner can then make a cost-effective migration plan of what needs to move where, when, and by what means. A migration plan is needed because it is infeasible to migrate the tens of thousands of server images typical of enterprise customers all at once. Thus, a migration engineer needs to partition the servers into smaller batches, each of which can then be migrated during a small change window (e.g., during a weekend). Creating a migration plan is an extremely complicated task (think of it as a massive scheduling problem with many constraints). At a minimum, the plan needs to satisfy the following three constraints. First, each server needs to pass the cloud fitness validation. CMO collects resource catalogs and server specifications from the targeted cloud provider. It then analyzes the collected data and verifies the fitness of each server, that is, whether or not the server with its current specifications can be migrated into the designated target cloud environment [28]. Next, each migration batch (wave) should be small enough that all of its servers can be migrated within the change window, but large enough to maximize productivity. Scheduling too many servers in the same change window means that some servers will not get migrated, requiring a change to the global migration plan to accommodate the left-over servers.
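A minimal sketch of this cloud-fitness check follows, assuming a simplified dictionary format for server specifications and the provider catalog; the field names, the `best_fit` helper, and the sample entries are hypothetical. Dissimilarity here is the sum of target-to-source resource ratios, so its minimum value of 4 indicates an exact match (the formalization appears in the cloud fitness validation section below).

```python
# Illustrative sketch; not CMO's actual data model or solver.

def feasible(src, tgt):
    """A catalog entry fits if every resource meets or exceeds the
    source and the required software is available on the target."""
    return (tgt["cores"] >= src["cores"] and
            tgt["mem_gb"] >= src["mem_gb"] and
            tgt["disk_gb"] >= src["disk_gb"] and
            tgt["net_gbps"] >= src["net_gbps"] and
            set(src["software"]) <= set(tgt["software"]))

def dissimilarity(src, tgt):
    """Sum of target/source resource ratios; 4 means an exact match."""
    return (tgt["cores"] / src["cores"] + tgt["mem_gb"] / src["mem_gb"] +
            tgt["disk_gb"] / src["disk_gb"] + tgt["net_gbps"] / src["net_gbps"])

def best_fit(src, catalog):
    candidates = [t for t in catalog if feasible(src, t)]
    if not candidates:
        return None  # invalid machine: flag for downgrade or reinstall
    return min(candidates, key=lambda t: dissimilarity(src, t))

src = {"cores": 4, "mem_gb": 1, "disk_gb": 100, "net_gbps": 1,
       "software": ["RHEL6"]}
catalog = [
    {"name": "small", "cores": 4, "mem_gb": 1, "disk_gb": 100,
     "net_gbps": 1, "software": ["RHEL6", "Win2008"]},
    {"name": "large", "cores": 8, "mem_gb": 16, "disk_gb": 500,
     "net_gbps": 10, "software": ["RHEL6", "Win2008"]},
]
print(best_fit(src, catalog)["name"])  # "small": dissimilarity is exactly 4
```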
Lastly, servers that depend on each other should be migrated in the same batch. Otherwise, the servers may be separated by a slow network link (e.g., one of them is migrated to the target environment while the other stays in the source environment), which often leads to significantly reduced performance. Traditionally, the task of creating the migration plan is performed manually; of course, manually creating a plan for tens of thousands of servers with millions of connections between them would be impossible. Figure 1 also shows the discovery framework, which has a lightweight front-end visualization that collects information from servers and a back-end that processes and analyzes the collected data to provide high-level insights into the IT environment.

Cloud fitness validation (cloud-fitness)
We formulate the cloud fitness validation problem as a CSP (Constraint Satisfaction Problem) [29]. A CSP is
defined by a set of variables and constraints. Each variable has a nonempty domain of possible values. Each constraint involves some subset of the variables and specifies the allowable combinations of values for that subset. A state of the problem is defined by an assignment of values to some or all of the variables, called a consistent or legal assignment. A complete assignment is one in which every variable is satisfied, and a solution to a CSP is a complete assignment that satisfies all the constraints with an objective function [29].

First, we define resources and software elements as variables. Let $C_i$ be the number of cores of machine $i$, $M_i$ the memory size of machine $i$, $D_i$ the disk size of machine $i$, and $N_i$ the network bandwidth of machine $i$. Similarly, let $S_i$ be the list of software (and versions) of machine $i$, and denote by $S_{ij}$ the $j$-th software and version of machine $i$. We add a superscript $t$ to each variable to represent the target catalog; for example, $C_i^t$ represents the $i$-th item from the list of numbers of CPU cores supported at the target cloud. The variables are thus

$$\mathrm{Variables} = \{C, C^t, M, M^t, D, D^t, N, N^t, S, S^t\}.$$

The domains include the pairs of each combination:

$$\mathrm{Domains} = \{(C, C^t), (M, M^t), (D, D^t), (N, N^t), (S, S^t)\}.$$

We set up constraints to match each pair that can fit into the target cloud environment:

$$\mathrm{Constraints}: \quad C \le C^t,\; M \le M^t,\; D \le D^t,\; N \le N^t,\; S \subseteq S^t.$$

To solve this problem, we use a recursive backtracking search [29]. The result is the list of all feasible assignments $A$ (at this point, the software requirements are already satisfied), from which we can find the best fit in terms of dissimilarity. We define the dissimilarity of machine $i$ as

$$\mathrm{Dissimilarity} = \frac{C_k^t}{C_k} + \frac{M_k^t}{M_k} + \frac{D_k^t}{D_k} + \frac{N_k^t}{N_k},$$

where $k$ is the index of an element in $A_i$, $1 \le k \le |A_i|$. The dissimilarity is a measure of inequality, always positive, and the value 4 (the minimum under the constraints above) denotes an identical pair between the source server and the target catalog; for example, the source machine $i$ has 4 cores, 1 GB memory, 100 GB disk, and 1 Gb/s network, and the target catalog can supply exactly the same resources. The best fit for machine $i$ is the one with minimum dissimilarity:

$$\mathrm{BestFit}_i = \min \{\mathrm{Dissimilarity}(A_i)\}.$$

After we find the best-fit candidate, we validate that the target is an exact copy of the source machine or the
Table 1 Example of the host middleware installation. (DBMS: DB management system; FS: file system; APS: application platform suite; WLG: IBM web logic app server; HACMP: high-availability cluster multi-processing.)
target will have more resources than the source machine. Based on this validation, we notify users of any mismatched resources that will not affect the migration. Finally, we need to deal with the list of invalid machines. Based on the reasons for failure (i.e., resource mismatches or software mismatches), we recommend options to the user, such as whether they need to downgrade some resources or apply software reinstallations.

Server dependency
Finding either a "server-to-server" or an "application-to-server" dependency is an extremely hard problem due to dynamic complexities such as the relations of business applications and communication relations. Thus, the major obstacle in automating this step is identifying critical dependencies among servers, to enable full end-to-end automation of the migration process. Since it is not uncommon for a plan to take weeks, if not months, to develop, we attempt to reduce the time it takes to devise the plan, leading to much reduced migration cost. We achieve this by automatically identifying the dependencies between servers and their components, and then systematically grouping those servers into logical groups or batches. We approach the problem of finding dependencies among hosts by focusing on relationships among the middleware components installed on the hosts. We are interested in finding correlations between occurrences of different middleware components. The problem of discovering correlations is analogous to finding collocations in texts and can be cast as modeling the conditional probability of an event given another event. More formally, we let

$$\forall x, y \in \{\mathrm{MW\ configurations}\} \Rightarrow \text{find } p(x \mid y), \tag{1}$$

where $x$ and $y$ are members of a set of middleware (MW) configurations. Examples of MW are application servers, databases, etc. The task is to find the conditional distribution $p(x \mid y)$. The problem can also be examined from an information retrieval perspective. Suppose one wants to find all the hosts on which a particular
middleware $y$ is present. Among the hosts that are retrieved, we ask what the chance is that they also contain another middleware $x$. We now use a concrete example to show how to compute the features and expectations. Suppose $S = \{MW_1, MW_2, \ldots, MW_N\}$ is the collection of all occurrences of all middleware components. Table 1 shows an example of a set of hosts $H$ and their installed middleware $MW_N$. Further suppose that we have no prior knowledge and that any member of $S$ is equally likely to co-occur with any given middleware $MW_i$, that is, $p(MW_j \mid MW_i) = 1/N$. We define a binary feature function:

$$\delta(MW_i, MW_j) = \begin{cases} 1 & \text{if } MW_i \text{ and } MW_j \text{ are on the same host} \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

Note that this feature is separable:

$$\delta(MW_i, MW_j) = \delta_1(MW_i)\,\delta_2(MW_j). \tag{3}$$

Equation (3) is equivalent to Equation (2) because $MW_i$ and $MW_j$ co-occur if and only if they are both found on the same host. The empirical expectation of feature (2) is thus

$$d' = \sum_{i=1}^{N} \delta(MW_i, MW_j) = N_{ij},$$

where $N_{ij}$ is the number of hosts containing both $MW_i$ and $MW_j$. The model's expectation of feature (2) is then

$$d = \frac{1}{N} \sum_{i=1}^{N} \sum_{MW \in S} \delta_1(MW_i)\,\delta_2(MW_j).$$

Once $d'$ and $d$ are computed, we can obtain the optimal gain $G$ of maximizing the data likelihood as

$$G = L(\lambda) - L(0),$$

where $\lambda = \ln\!\left(\tilde{E}[\delta]/E[\delta]\right)$, and $\tilde{E}[\delta]$ and $E[\delta]$ are the empirical and model expectations, as in the Kullback-Leibler divergence [30].
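The empirical side of this computation can be sketched directly from a host-to-middleware table. The hosts below and the `p` helper are illustrative stand-ins for ALDM's discovered inventory, in the spirit of Table 1, not the paper's actual data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical host -> installed-middleware table.
hosts = {
    "h1": {"DBMS", "FS"},
    "h2": {"DBMS", "APS"},
    "h3": {"DBMS", "FS"},
    "h4": {"WLG"},
}

# single[m]   = number of hosts running middleware m
# pair[(a,b)] = number of hosts running both a and b (symmetric)
single, pair = Counter(), Counter()
for mws in hosts.values():
    for m in mws:
        single[m] += 1
    for a, b in combinations(sorted(mws), 2):
        pair[(a, b)] += 1
        pair[(b, a)] += 1

def p(x, y):
    """Empirical conditional probability p(x | y): of the hosts running
    middleware y, the fraction that also run x."""
    return pair[(x, y)] / single[y] if single[y] else 0.0

print(p("FS", "DBMS"))  # 2 of the 3 DBMS hosts also run FS
```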
Preconfiguration
Before actually copying the source data or image over to a target data center, a proper migration-ready environment
needs to be configured, with appropriate computing resources, virtual network setup, resource/network security protection, and network connectivity. At the same time, source servers should pass compliance checks and must be adjusted if they fail them. In this section, we show how CMO automates this complex process using BPM with custom Java** libraries and a REST (Representational State Transfer) API.

Target provisioning
CMO makes use of BPM to orchestrate the entire provisioning automation. BPM supports integration with custom Java and JavaScript, which CMO utilizes to maximize automation. In particular, CMO includes custom Java libraries that provide a REST API wrapper to access external services, configure target-side resources remotely using SSH (Secure Shell), and integrate migration tools, e.g., VMware SRM. BPM communicates with an API server that serves as a proxy to external services, providing encapsulation, flexibility, and integration of non-Java technologies such as Python to access the SoftLayer** API [31]. CMO automates target provisioning based on customer preferences, which determine what kind of computing infrastructure is planned. CMO automatically provisions and configures five types of resources: the Vyatta** [32] firewall, Vyatta bridge, vCenter**, ESXi, and VLANs (virtual local area networks). Here, vCenter refers to VMware's vCenter Server**, whereas ESX refers to VMware's ESX hosts, which provide the virtualization layer hosting server VMs migrated from the customer data centers. In SoftLayer, while the Gateway (a Vyatta appliance) and ESX are provisioned as bare-metal hardware, the other resources can be provisioned as hardware, cloud VMs, or even ESX-hosted VMs.
For instance, if customers plan to migrate into a VMware environment, CMO provisions the appropriate computing resources, including a virtual server with vCenter and bare-metal servers with ESXi, and registers the ESXi servers to the vCenter once provisioning is done. The provisioning order is significant: VLANs need to be associated with a provisioned Gateway, and other computing resources need to be on the provisioned VLANs. Likewise, to register ESXi servers to a vCenter, the ESXi servers need to be provisioned first. In SoftLayer, provisioning these resources may take anywhere from 10 minutes to 4 hours depending on the type (e.g., virtual server or hardware). We use a polling technique supported by BPM to check the provisioning status and notify users through email when provisioning is complete or exceptions have occurred. Once provisioned, these resources are configured through embedded custom Java libraries.
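The polling technique might look like the following sketch. `check_status`, the status strings, and the timeout values are assumptions standing in for the provider's actual API (e.g., a SoftLayer call), not CMO's BPM implementation.

```python
import time

def wait_for_provisioning(check_status, resource_id,
                          interval_s=60, timeout_s=4 * 3600):
    """Poll a provisioning-status callable until the resource is ready,
    mirroring the 10-minute-to-4-hour window described in the text.
    `check_status(resource_id)` is a hypothetical stand-in for the
    cloud provider's status API."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = check_status(resource_id)
        if status == "READY":
            return True  # a real orchestrator would notify users here
        if status == "FAILED":
            raise RuntimeError(f"provisioning failed for {resource_id}")
        time.sleep(interval_s)
    raise TimeoutError(f"{resource_id} not ready within {timeout_s}s")
```

In a BPM flow, the "READY"/"FAILED" branches would map to user notifications and exception-handling activities rather than return values.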
Since, as a migration orchestrator, CMO manages migration tools such as vConverter and RackWare, it must provision these standalone servers in the target data center when the target infrastructure is ready. CMO allows customers to choose which migration tool to provision in the target environment, to avoid wasting computing resources on unused migration tools.

Network setup
Cloud providers that are ready for enterprise-scale workloads (e.g., thousands of servers) offer high security standards with dedicated (hardware-based) firewalls and VLANs. One requirement is that the network firewalls provide HA (high availability) to prevent a single point of network failure. CMO automates and orchestrates the network configuration between the on-premise data center and the target data center [33]. First, CMO collects information about the network properties (gateway IP, VLANs, and subnets) of the on-premise data centers in preparation for establishing an IPSec (Internet Protocol Security) tunnel (VPN connection) between the on-premise data centers and the provisioned target infrastructure. CMO uses BPM to orchestrate both human and system tasks and to facilitate notifications, automation, and exception handling (e.g., a deadline expiration triggers an exception) to ensure the timely delivery of services. By establishing private tunnels that conform to the security requirement, the gateway safeguards the security of the WAN (wide area network) connection. CMO then automatically creates a custom firewall script from the VLAN and subnet information and runs it on the network appliances to set up base security zones and firewall rules. A major difference between the CMO migration framework and standalone migration tools is this automated network configuration. Working with large enterprise customers, one of the most important issues in private cloud migration is network security.
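Generating the base firewall script from the collected VLAN and subnet information can be sketched as a simple renderer. The rule syntax, zone names, and subnets below are purely illustrative assumptions and do not reflect Vyatta's actual configuration grammar.

```python
# Hypothetical generator for a base firewall configuration.

def firewall_script(vlans):
    """Render allow-rules so each on-premise subnet can reach its
    paired target VLAN, with a default-deny footer."""
    lines = []
    for vlan_id, (src_subnet, dst_subnet) in sorted(vlans.items()):
        lines.append(f"zone vlan{vlan_id} allow {src_subnet} -> {dst_subnet}")
    lines.append("default deny")
    return "\n".join(lines)

# Example pairings of on-premise subnets to target-side subnets.
vlans = {101: ("10.0.1.0/24", "172.16.1.0/24"),
         102: ("10.0.2.0/24", "172.16.2.0/24")}
print(firewall_script(vlans))
```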
Network security can mean that the target private cloud satisfies the customer's security policy. In the context of migration, however, it can also mean that during the long-running process of migrating data centers to the cloud, the connection between the servers already migrated and those yet to be migrated must be secure. Thus, our orchestration approach also establishes a secure WAN connection between the target cloud and the origin data center for the duration of the migration. Another network feature of CMO is IP retention, meaning the computing resources retain their IP addresses from the on-premise data center even after they are migrated to the target cloud. For example, since we have observed that many servers are interconnected using IP addresses (not hostnames), retaining IP addresses
could be essential in migration planning, to avoid redesigning the entire network for the new installation. CMO provides two ways to achieve this: using the GRE (generic routing encapsulation) protocol or extending the LAN with ARP (Address Resolution Protocol) forwarding through OpenVPN. The GRE protocol can provide packet encapsulation when the source gateway supports GRE. If not, CMO deploys a pair of network bridge VMs (i.e., Vyatta images) for pair-wise VLANs on the target cloud and the on-premise data center, respectively. This connection allows ARP packets to be forwarded between the on-premise data center and the target data center. The IP retention feature provides a great advantage in minimizing the potentially complex remediation work incurred by an IP change. CMO automatically provisions and configures GRE or VLAN bridges on the target cloud and provides detailed step-by-step instructions (screenshots and scripts) to assist personnel in creating and configuring the on-premise counterpart network.

Server preparation
Moving into cloud data centers is not simply a matter of running the same source machines intact; rather, servers need to be compliant with target management requirements. In particular, security and licenses for operating systems, applications, and middleware are essential to ensure that the future target infrastructure follows the security standards and license agreements defined by the cloud provider and the software providers [34]. Migration tools such as vConverter and RackWare require a certain configuration on the source servers in order to install their agents automatically. The tools need to access the servers through predefined network ports, so the firewall must be configured beforehand. Also, RackWare requires the Windows** PowerShell execution policy to be set to Unrestricted, mainly because RackWare needs to install its agent. If these requirements are not met, the migration tool will fail.
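A pre-migration check of the required network ports can be sketched as follows. The port numbers in `REQUIRED_PORTS` are illustrative placeholders, since the actual ports depend on each migration tool's documentation.

```python
import socket

# Hypothetical per-tool port requirements for the pre-flight check.
REQUIRED_PORTS = {"vConverter": [445, 902], "RackWare": [22]}

def ports_reachable(host, ports, timeout_s=3.0):
    """Return the subset of `ports` that cannot be reached on `host`;
    an empty list means the firewall pre-check passed."""
    blocked = []
    for port in ports:
        try:
            # A successful TCP connect means the port is open end to end.
            with socket.create_connection((host, port), timeout=timeout_s):
                pass
        except OSError:
            blocked.append(port)
    return blocked
```

A real preparation step would run such checks (plus the PowerShell policy check on Windows) before scheduling the wave, so failures surface before the change window opens.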
Migration execution
CMO integrates and orchestrates third-party migration tools to perform server migration. Currently supported migration tools include VMware SRM, vReplication, vConverter, and RackWare. Each migration tool has its own purpose: SRM by design supports bulk migration from VMware to VMware; vReplication is intended to migrate a single server at a time but uses the same underlying technique as SRM; vConverter migrates a server P2V (physical-to-virtual); and RackWare provides single-image migration to bare-metal environments (Xen, KVM [Kernel-based Virtual Machine], and VMware). Although CMO aims to automate the entire end-to-end process, some migration tools do not support
proper APIs. VMware does not provide APIs for vReplication and provides only limited APIs for SRM, but vConverter has a full SOAP (simple object access protocol) API that CMO can use. RackWare supports only CLI (command-line interface) based calls, not REST APIs. Therefore, CMO provides limited automation for SRM, no automation for vReplication, and fully automated tasks for vConverter and RackWare. When CMO is unable to provide full automation for a migration tool, it interacts with the user by providing detailed instructions and screenshots covering the acquisition, installation, configuration, and operation of the tool. If network bandwidth allows, parallel migrations can best minimize migration time. Instead of relying on a single migration tool, using multiple tools simultaneously can offer significant time savings. Tools such as VMware SRM and vReplication provide data transfer optimizations such as deduplication, compression, and omitting zero blocks, so that the total amount of transferred data is minimized. This optimization performs better when similar images are migrated together, because the cross-image optimization can take advantage of image similarity. For example, each image contains disk space for an operating system, and if multiple images use the same operating system, those disk regions are likely alike and can effectively be transferred just once. Also, because tools like RackWare and vConverter can dynamically designate the destination environment when initiating a migration, distributing work across these tools can further reduce the overall migration time.
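The parallel execution idea can be sketched with a thread pool dispatching independent per-server migrations. `migrate` is a placeholder for a real tool invocation (a vConverter SOAP call or a RackWare CLI command), not CMO's orchestration code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def migrate(server, tool):
    """Placeholder for invoking one migration tool on one server."""
    return f"{server} migrated via {tool}"

def run_parallel(assignments, max_workers=4):
    """Run independent per-server migrations concurrently, one future
    per (server, tool) assignment; bandwidth limits would cap
    max_workers in practice."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(migrate, s, t): s for s, t in assignments}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results

print(run_parallel([("web01", "vConverter"), ("db01", "RackWare")]))
```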
Post-configuration and quality assurance
Even after the images are migrated successfully and run independently, exactly as they did in the source data center, the servers still need to be reconfigured to perform well in the new environment. Post-configuration includes not only application configuration, but also compliance settings, security settings, management tool installation, DNS (Domain Name System) configuration, AAA (authentication, authorization, and accounting) configuration, and backup and storage configuration. CMO provides two ways to configure the target environment. First, to configure individual servers, CMO runs predefined native scripts: Visual Basic** scripts for Windows machines and bash scripts for Linux** machines. The scripts can execute any operation allowed in the native operating system, including installing software, configuring network interfaces, verifying compliance, creating users, and much more. Although these initial scripts may solve configuration problems on each machine, they do not solve problems associated with the architectural configurations
and infrastructure configurations. Therefore, the second way of configuring the target environment is through external APIs, with which CMO resolves many configuration problems. For example, the target environment is often not optimized because the servers keep the same structure as in the source data center, which can be inefficient in the new cloud data center. Through the source discovery process, CMO helps identify servers that need a load balancer for scalability. CMO also supports pluggable API interfaces to adapt to the environment, such as calling Infoblox** DNS appliances to register domain names and hostnames. Before finalizing the entire migration process and delivering the migrated target environment to the customer, delivery engineers verify the infrastructure in acceptance testing. Quality assurance in CMO currently means correctness of configurations and connectivity rather than QoS (quality of service). CMO suggests circling back to the discovery phase and verifying configurations and connectivity by comparing the source and target data centers. QoS assurance is not currently supported by CMO, but it is a likely next goal. Once the target environment passes acceptance testing and no further migrations remain, the last step CMO performs is cleaning up the target environment. When setting up the target environment, CMO deploys migration tools in the target data center and establishes secure network connectivity between the data center where CMO resides and the target data center. CMO eventually needs to release the resources used by the migration tools and disconnect the secure network from the target data center. However, depending on the customer's choice, the secure network between the source and target data centers can be retained and maintained by the customer, in which case CMO does not remove it.
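The per-server native-script dispatch described above might be sketched as follows; the script names, inventory format, and OS labels are illustrative assumptions, not CMO's actual interface:

```python
# Map each server's OS family to the native post-configuration script a
# CMO-style tool would run on it (VBScript on Windows, bash on Linux).
POSTCONFIG_SCRIPTS = {
    "windows": ("cscript", "postconfig.vbs"),  # hypothetical script names
    "linux": ("bash", "postconfig.sh"),
}

def plan_postconfig(servers):
    """Return (hostname, interpreter, script) tuples for each server;
    servers with an unsupported OS family are flagged for manual work."""
    plan, manual = [], []
    for host, os_family in servers:
        entry = POSTCONFIG_SCRIPTS.get(os_family)
        if entry is None:
            manual.append(host)
        else:
            plan.append((host, *entry))
    return plan, manual

plan, manual = plan_postconfig(
    [("web01", "windows"), ("db01", "linux"), ("legacy01", "hpux")]
)
```

Architectural fixes (load balancers, DNS registration) would go through the external-API path instead of per-server scripts like these.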
Evaluation
In this section, we demonstrate how the CMO framework introduces labor savings arising from a self-service model and automation. We also discuss rationalization of migratable workloads and the efficiency of migration capabilities, including a significant process improvement from bulk migration over the traditional approach. To perform realistic migration analytics, we examine real datasets from three enterprise data centers of different sizes: small, medium, and large. The three data centers vary in business function and include a variety of operating systems. Table 2 summarizes the data centers used in our analysis. The datasets are composed of raw data taken from the output of various command-line tools and structured data from ALDM [21]. The raw data used in our analysis consists of netstat, ifconfig, and
Table 2. Data centers used in the experiments.
uname output, as well as hardware information such as the number of processors, memory capacity, and disk space. To compare the discovered source server specifications with the available target resources, we need to identify the target resource catalog. The target data center is located in IBM SoftLayer, and the target cloud type is a private virtual server. The resource limitations are a maximum of 8 cores, 64 GB of memory, 8,100 GB of disk, and 1 Gb/s of network bandwidth. These resource limitations are used for the cloud-fitness analysis. Identifying the cloud-fitness of the source environment is an important first step in creating migration wave plans. Given the discovered source specifications, CMO compares them against the target catalog (resource limitations), assuming that only private virtual servers can be provisioned. A violation of any resource limitation (i.e., number of cores, memory size, disk size, or network bandwidth) marks a server as "cloud-unfitted." In addition to the resource limitations, installed software, including the operating system, can prevent servers from migrating. That is, while OSes such as Windows and Linux distributions can be provisioned in the target, Solaris**, AIX*, and HP-UX are not supported at the target data center; servers running unsupported OSes can therefore be migrated only into the (expensive) bare-metal environment. In this paper, we perform the migration analytics only for virtual servers, which are in the highest demand. We use the CSP formulation introduced earlier. Figure 2 illustrates the cloud-fitness results for the three data centers shown in Table 2. From our analysis, 73%/76%/84% of servers in the large/medium/small data centers, respectively, are "cloud-fitted." The figure also breaks down which resources cause cloud-unfitness.
For the small data center, memory size and disk size are the most influential factors, whereas the medium and large data centers are most constrained by the number of cores and memory size. Nevertheless, no single component dominates cloud-unfitness.
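The cloud-fitness screen can be sketched as a simple constraint check against the target catalog; the field names are illustrative, and CMO's actual CSP formulation is richer than this per-server filter:

```python
# Target catalog limits for a SoftLayer private virtual server,
# as given in the text.
LIMITS = {"cores": 8, "memory_gb": 64, "disk_gb": 8100, "bandwidth_gbps": 1}
SUPPORTED_OS = {"windows", "linux"}  # Solaris, AIX, HP-UX are unsupported

def cloud_fitness(server):
    """Return the list of violated constraints; empty means 'cloud-fitted'."""
    violations = [k for k, limit in LIMITS.items() if server.get(k, 0) > limit]
    if server.get("os") not in SUPPORTED_OS:
        violations.append("os")  # would need the (expensive) bare metal path
    return violations

fit = cloud_fitness({"cores": 4, "memory_gb": 32, "disk_gb": 500,
                     "bandwidth_gbps": 1, "os": "linux"})
unfit = cloud_fitness({"cores": 16, "memory_gb": 128, "disk_gb": 500,
                       "bandwidth_gbps": 1, "os": "solaris"})
```

A server with an empty violation list would count toward the "cloud-fitted" percentages reported in Figure 2.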
Figure 2. Reasons for cloud-unfitness.
As explained before, grouping servers into migratable sizes is important for wave planning. In Figure 3, we show the distribution of paired middleware components within the IBM HTTP Server (IHS-WAS) group in the large data center. Some paired middleware components clearly stand out from the others. For example, given that we observe the DB2* database, there is over a 30% probability that we also observe the EMC PowerPath server-resident management tool installed on the same host to enhance performance and information availability, in this case the performance of DB2. We can also see that, given IBM WebSphere Application Server (WAS), there is only a slim chance, around 1%, of seeing the DB2 database co-located on the same host. This finding confirms the common practice that IBM WebSphere* and the DB2 database are not installed on the same host in a typical IT environment; where it does happen, as in this case, the practitioner needs to pay attention. Similar findings are observed across all other subspaces and accounts. Hence, this intra-host correlation analysis is very useful to the migration team. After the migration analysis is done, the target environment needs various resources, including network gateway appliances, VLANs with subnets, and computing resources. It is important to know the estimated time taken to provision these resources through the automated provisioning system using APIs. The network appliance (a bare-metal Vyatta gateway) takes the longest, due to a special sales-approval process, hardware installation, and router configuration to let network traffic flow through the gateway; we observed that provisioning a Vyatta gateway takes up to 8 hours. Provisioning bare-metal servers (up to 4 hours) and virtual servers (up to 30 minutes) is fully automated, so these do not take much time. However, VLAN provisioning is currently not automated through APIs, so CMO creates a
Figure 3. Relations in the IHS-WAS subspace. The y-axis shows P(A|B). (PGS: PostgreSQL server; HPUX_SARP: HP-UX system activity report package; PowerPath: EMC PowerPath; GFS2: Global File System 2; MYS: MySQL server; LVM: Logical Volume Manager; IHS: IBM HTTP Server; WAS: WebSphere.) (Figure adapted from [5].)
ticket with all the details, which takes longer when unexpected events happen; we could provision VLANs within 4 hours. Once the target environment is set up, an image-based migration (block-level data copy) is triggered to start copying source data to the target environment. The efficiency attributable to bulk migration reduces the end-to-end migration time. Bulk migration takes advantage of various optimizations such as block deduplication, skipping empty data blocks, and transfer-protocol optimization. To observe how VMware SRM performs, we transferred 25 images simultaneously. Table 3 shows the results when the 25 VMs contain only operating systems without any data, and when the 25 VMs are filled with the same data. We can easily observe that VMware SRM's data transfer optimization is excellent. To see the relation between data quantity and transfer time, one large image was transferred, filled with random data. As expected, the result shows a proportional relation between transfer size and transfer time. The maximum and minimum times verify that transfer optimization is more effective when images are similar, even under an unstable network. While CMO uses live image migration, users need to decide when servers are cut over to the target; here, "cut over" refers to shutting down the source server and operating the target server. Once the cutover time is determined after the data transfer is done, CMO shuts down the source machines, transfers the delta (changed data), and turns on the target machines. The time between
Table 3. Comparison of data transfer time (VMware SRM). Time is expressed in hours:minutes:seconds.
the shutdown and the startup depends on how much data has changed (dirty data). Figure 4 illustrates the elapsed cutover time as a function of the rate of data change per machine. Three workload behaviors are experimented with, in which the location of changed data is random, interleaved, or chunked; here, a "chunk" is a block of consecutive bytes of a certain size. Overall, the total cutover time is linearly proportional to the rate of data change, but it varies with how the dirty data is distributed, meaning different workloads have different cutover times. While CMO is intended to orchestrate the entire end-to-end process, the savings attributable to orchestration and automation can be measured empirically. We collected the current end-to-end timeline for a baseline of 100 VMs from migration practitioners and measured the CMO timeline from our experimental results. We do not disclose the segmented timeline, due to IBM confidentiality, but instead provide the total comparison ratio of the two cases. One assumption is that the images are already virtual and managed in VMware vCenter and ESXi. Also, all images are of average size relative to the network bandwidth available for migration, and images can be offline for the entire change window. Another assumption is that the network can be fully saturated for data migration and that the peak rate of data change does not exceed the network bandwidth available for migration. Moreover, no changes are made to the image or application, including IP address changes, and no design is done for the target environment beyond documenting the target infrastructure. Post-migration tools include only backup-changing tools. Finally, the change can occur on any day of the week, and images include only Development/Test and Pre-Production images.
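As a first-order illustration of the cutover relationship above (delta-transfer time proportional to dirty data), consider this sketch; the linear change model, parameter names, and fixed overhead are assumptions, not CMO's actual predictor:

```python
def estimate_cutover_seconds(image_gb, change_rate_per_hour,
                             hours_since_sync, bandwidth_gbps,
                             fixed_overhead_s=120.0):
    """First-order cutover estimate: shut down the source, transfer the
    delta (dirty data) over the link, then start the target.
    `fixed_overhead_s` lumps shutdown/boot time into an assumed constant."""
    dirty_gb = image_gb * change_rate_per_hour * hours_since_sync
    transfer_s = (dirty_gb * 8) / bandwidth_gbps  # GB -> Gb over a Gb/s link
    return fixed_overhead_s + transfer_s

# 100-GB image, 1%/hour change rate, 2 h since last sync, 1-Gb/s link:
t = estimate_cutover_seconds(100, 0.01, 2, 1.0)
```

A model like this captures only the linear trend in Figure 4; the spread between random, interleaved, and chunked workloads comes from how the transfer tool batches the dirty blocks, which this sketch ignores.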
The discovery step includes installing the ALDM scripts, executing them, collecting and compressing the data, and uploading the data to the analysis server. Script installation is performed manually, so each server requires a login and a script download. To collect sufficient data, the scripts usually run for one week, collecting statistics every 15 minutes to track performance and
Figure 4. Rate of data change vs. cutover time (VMware SRM).
network details. In contrast, CMO provides an automated installation/collection/compression/upload process, so no human intervention is required. Currently, the analysis step requires human interaction to uncompress the data, run parsing, and verify the output, whereas CMO requires only a "one-click" human activity to trigger the analysis process; by "one-click," we mean a simple operation with the minimum actions required. Humans still need to verify the output, adjust parameters, and rerun if needed. Wave planning is currently done mainly through experience and discussion, but CMO accepts preferences and finds optimized wave plans. The current pre-configuration, migration, and post-configuration steps need manual activities, but CMO uses APIs and BPM workflows to automate the same actions. The user-acceptance step is performed manually in both cases, because users need to verify that all services are up and running without problems. Overall, our results show that CMO reduces the total time spent on the end-to-end migration by a factor of four.
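For a rough sense of the discovery volume implied by this cadence (one week of sampling at 15-minute intervals), per server and for a hypothetical fleet size:

```python
SAMPLE_INTERVAL_MIN = 15
COLLECTION_DAYS = 7

def discovery_samples(servers):
    """Number of metric snapshots collected per server, and in total,
    during the one-week ALDM-style discovery window described above."""
    per_server = (24 * 60 // SAMPLE_INTERVAL_MIN) * COLLECTION_DAYS
    return per_server, per_server * servers

# 525 servers is an illustrative fleet size, not one of the paper's datasets.
per_server, total = discovery_samples(525)
```

Every one of those snapshots is handled manually in the traditional process but flows through the automated pipeline under CMO.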
Conclusion
In this paper, we have presented the Cloud Migration Orchestrator (CMO), an automation and orchestration framework
for large-scale enterprise cloud migration, built on IBM Business Process Management (BPM). The CMO process includes source-environment discovery and cloud-fitness assessment, target-environment provisioning (through APIs), network configuration, migration execution, and post-configuration. CMO aims to provide a self-service model through automation, in which customers can control the entire end-to-end process rather than waiting for slow administrative processes. Through our experiments, we have shown how CMO helps reason about the migration process by discovering and analyzing the source data center, and how it reduces end-to-end time with large-scale bulk migration. The remaining challenges for CMO concern migrating dynamically changing workloads with stricter requirements, such as server compliance and custom reconfiguration. Quality assurance also needs to be orchestrated and automated for a complete end-to-end migration. Thus, future work includes further broadening CMO's capabilities, including customized bootstrapping, more sophisticated post-configuration, and quality-of-service assurance.
*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.
**Trademark, service mark, or registered trademark of VMware, Sun Microsystems, SoftLayer, Inc., an IBM Company, Vyatta, Inc., Microsoft Corporation, Linus Torvalds, Infoblox, Inc., or Oracle in the United States, other countries, or both.

References
1. M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "A view of cloud computing," Commun. ACM, vol. 53, no. 4, pp. 50–58, Apr. 2010.
2. X. Meng, J. Shi, X. Liu, H. Liu, and L. Wang, "Legacy application migration to cloud," in Proc. IEEE Int. Conf. Cloud Comput., 2011, pp. 750–751.
3. E. Keller, S. Ghorbani, M. Caesar, and J. Rexford, "Live migration of an entire network (and its hosts)," in Proc. ACM Workshop Hot Topics Netw., 2012, pp. 109–114.
4. M. Menzel and R. Ranjan, "CloudGenius: Decision support for web server cloud migration," in Proc. Int. Conf. WWW, 2012, pp. 979–988.
5. K. Bai, N. Ge, H. Jamjoom, E. E. Jan, L. Renganarayana, and X. Zhang, "What to discover before migrating to the cloud," in Proc. IFIP/IEEE Int. Symp. Integr. Netw. Manage. (IM), 2013, pp. 320–327.
6. J. Jermyn, J. Hwang, K. Bai, M. Vukovic, N. Anerousis, and S. Stolfo, "Improving readiness for enterprise migration to the cloud," in Proc. Middleware, Bordeaux, France, 2014, pp. 5:1–5:7.
7. P. Jamshidi, A. Ahmad, and C. Pahl, "Cloud migration research: A systematic review," IEEE Trans. Cloud Comput., vol. 1, no. 2, pp. 142–157, Jul. 2013.
8. M. A. Chauhan and M. A. Babar, "Towards process support for migrating applications to cloud computing," in Proc. Int. Conf. Cloud Serv. Comput. (CSC), Nov. 22–24, 2012, pp. 80–87.
9. W. Nowakowski, M. Smialek, A. Ambroziewicz, and T. Straszak, "Requirements-level language and tools for capturing software system essence," Comput. Sci. Inf. Syst., vol. 10, no. 4, pp. 1499–1524, Oct. 2013.
10. A. Menychtas, C. Santzaridou, G. Kousiouris, T. A. Varvarigou, L. Orue-Echevarria, J. Alonso, J. Gorroñogoitia, H. Bruneliere, O. Strauß, T. Senkova, B. Pellens, and P. Stuer, "ARTIST methodology and framework: A novel approach for the migration of legacy software on the cloud," in Proc. Int. Symp. Symbolic Numer. Algorithms Sci. Comput., 2013, pp. 424–431.
11. H. Teyeb, A. Balma, N. Ben Hadj-Alouane, and S. Tata, "Optimal virtual machine placement in large-scale cloud systems," in Proc. IEEE Int. Conf. Cloud Comput., 2014, pp. 424–431.
12. A. Khajeh-Hosseini, I. Sommerville, J. Bogaerts, and P. Teregowda, "Decision support tools for cloud migration in the enterprise," in Proc. IEEE Int. Conf. Cloud Comput., 2011, pp. 541–548.
13. B. Pfitzmann and N. Joukov, "Migration to multi-image cloud templates," in Proc. IEEE Int. Conf. Serv. Comput., 2011, pp. 80–87.
14. A. Khajeh-Hosseini, D. Greenwood, and I. Sommerville, "Cloud migration: A case study of migrating an enterprise IT system to IaaS," in Proc. IEEE Int. Conf. Cloud Comput., 2010, pp. 450–457.
15. W. M. P. van der Aalst, A. H. M. ter Hofstede, and M. Weske, "Business process management: A survey," in Proc. Int. Conf. BPM, 2003, pp. 1–12.
16. J. Hwang, W. Huang, M. Vukovic, and N. Anerousis, "Enterprise-scale cloud migration orchestrator," in Proc. IFIP/IEEE Int. Symp. Integr. Netw. Manage. (IM), 2015, pp. 1002–1007.
17. C. Liu, B. T. Loo, and Y. Mao, "Declarative automated cloud resource orchestration," in Proc. 2nd ACM Symp. Cloud Comput. (SOCC), 2011, pp. 26:1–26:8.
18. J. Zhang, L. Renganarayana, X. Zhang, N. Ge, V. Bala, T. Xu, and Y. Zhou, "EnCore: Exploiting system environment and correlation information for misconfiguration detection," in Proc. Int. Conf. ASPLOS, 2014, pp. 687–700.
19. R. Filepp, L. Shwartz, C. Ward, R. D. Kearney, K. Cheng, C. C. Young, and Y. Ghosheh, "Image selection as a service for cloud computing environments," in Proc. IEEE Int. Conf. SOCA, Dec. 2010, pp. 1–8.
20. M. Hajjat, X. Sun, Y.-W. E. Sung, D. Maltz, S. Rao, K. Sripanidkulchai, and M. Tawarmalani, "Cloudward bound: Planning for beneficial migration of enterprise applications to the cloud," in Proc. ACM SIGCOMM, 2010, pp. 243–254.
21. IBM Redbook, IT Infrastructure Discovery, Analytics for Logical Dependency Mapping (ALDM). [Online]. Available: http://www.ibm.com/services/ALDM
22. M. Nidd, K. Bai, J. Hwang, M. Vukovic, and M. Tacci, "Automated business process discovery," in Proc. IFIP/IEEE Int. Symp. Integr. Netw. Manage. (IM), 2015, pp. 794–797.
23. J. T. Piao and J. Yan, "A network-aware virtual machine placement and migration approach in cloud computing," in Proc. Int. Conf. Grid Cooperative Comput. (GCC), 2010, pp. 87–92.
24. VMware, Data Center Virtualization and Cloud Infrastructure. [Online]. Available: http://www.vmware.com/products/
25. RackWare, RackWare Management Module (RMM). [Online]. Available: http://www.rackwareinc.com/products/
26. C. Ward, N. Aravamudan, K. Bhattacharya, K. Cheng, R. Filepp, R. Kearney, B. Peterson, L. Shwartz, and C. C. Young, "Workload migration into clouds: Challenges, experiences, opportunities," in Proc. IEEE Int. Conf. Cloud Comput., 2010, pp. 164–171.
27. M. Marzolla, O. Babaoglu, and F. Panzieri, "Server consolidation in clouds through gossiping," in Proc. IEEE Int. Symp. WoWMoM, 2011, pp. 1–6.
28. J. Hwang, Y.-W. Huang, M. Vukovic, and J. Jermyn, "Cloud transformation analytics services (case study of cloud fitness validation for server migration)," in Proc. IEEE Int. Conf. Serv. Comput. (SCC), 2015, pp. 387–394.
29. M. A. Salido, A. Giret, and F. Barber, "Distributing constraints by sampling in non-binary CSPs," in Proc. IJCAI Workshop Distrib. Constraint Reasoning, 2003, pp. 79–87.
30. S. Dharanipragada, M. Franz, J. S. McCarley, K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, "Statistical methods for topic segmentation," in Proc. INTERSPEECH, 2000, pp. 516–519.
31. SoftLayer, SoftLayer API Reference. [Online]. Available: http://sldn.softlayer.com/reference/softlayerapi
32. Brocade, Vyatta vRouter. [Online]. Available: http://www.brocade.com/en/products-services/software-networking/sdn-controllers-applications/sdn-controller.html
33. B. Zhang, J. Hwang, L. Ma, and T. Wood, "Towards security-aware virtual server migration optimization to the cloud," in Proc. ICAC, 2015, pp. 71–80.
34. J. Hwang, "Computing resource transformation, consolidation and decomposition in hybrid clouds," in Proc. CNSM, 2015, pp. 144–152.
Received May 6, 2015; accepted for publication June 1, 2015.

Jinho Hwang IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY 10598 USA (jinho@us.ibm.com). Dr. Hwang is a Research Staff Member at the IBM T. J. Watson Research Center, working on cloud transformation centered on software-defined clouds to enable customers to adapt quickly to heterogeneous cloud environments. Dr. Hwang received his Ph.D. degree from The George Washington University for his work on innovative cloud virtualization technologies with respect to computing resources. He was with The George Washington University from 2005 to 2006 as a visiting scholar, and with the POSCO ICT R&D center in South Korea from 2007 to 2010. He interned at the IBM T. J. Watson Research Center and AT&T Labs-Research. He has published more than 30 papers at top international conferences and journals (two best paper awards) and filed more than 10 patent applications. Dr. Hwang is a member of the Institute of Electrical and Electronics Engineers (IEEE) and the Association for Computing Machinery (ACM).
Maja Vukovic IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY 10598 USA (maja@us.ibm.com). Dr. Vukovic is a Research Staff Member at the IBM T. J. Watson Research Center in the Cloud Platform and Services department. Dr. Vukovic is an IBM Master Inventor and a member of the IBM Academy of Technology. She received her Ph.D. degree from the University of Cambridge, United Kingdom, for her work on context-aware service composition. Dr. Vukovic has more than 60 publications at top international conferences and journals, and more than 50 patent applications. She is a senior member of IEEE.

Nikos Anerousis IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY 10598 USA (nikos@us.ibm.com). Dr. Anerousis is a senior manager and Principal Research Staff Member in the Cloud Platform and Services department at the IBM T. J. Watson Research Center. He received Ph.D., M.Phil., and M.S. degrees in electrical and computer engineering from Columbia University, and a B.S. degree in electrical engineering and computer science from the National Technical University of Athens. In addition to his current focus on cloud automation technologies, he has an extensive background in software engineering, networks, middleware, and services. He has worked and consulted extensively on virtually all aspects and stages of the innovation lifecycle, from design, financing, prototyping, and development to strategy and steady-state operations. He is an internationally recognized expert in the field of distributed systems and network management, having authored numerous peer-reviewed publications on the subject, and he has delivered tutorials and keynote presentations.
Kun Bai IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY 10598 USA (kunbai@us.ibm.com). Dr. Bai is a Research Staff Member at the IBM T. J. Watson Research Center. He received a B.S. degree in computer science from Huazhong University of Science and Technology, China, in 1997, an M.S. degree in computer science from the University of Alberta, Canada, in 2004, and a Ph.D. degree in information science and technologies from Pennsylvania State University in 2009. He subsequently joined IBM, where he has worked on mobile computing and data center relocation and migration. In 2013, he received an IBM Outstanding Innovation Award for his work on analytics for logical dependency and mapping. He is author or coauthor of 4 patents and 26 technical papers. Dr. Bai is a member of IEEE and ACM.

Michael Tacci IBM Global Technology Services, Philadelphia, PA 19130 USA (mtacci@us.ibm.com). Mr. Tacci is a Senior Architect in the IT Optimization and Migration Services department. He received an A.A.S. degree in electrical engineering and a B.S. degree in computer science from the State University of New York and the university's Watson School of Engineering, respectively. He joined IBM in 2007 in the Global Business Services Division. During his tenure at IBM, Mr. Tacci has worked as a developer and lead designer of numerous products in the area of IT discovery and analytics, most notably Analytics for Logical Dependency Mapping. He has filed six patent applications since his tenure with IBM Global Technology Services. Mr. Tacci has extensive industry experience in aerospace and informatics prior to joining IBM.