Research Computing in the Cloud: Functional Considerations for Research

ECAR Working Group Paper | July 15, 2015

ECAR working groups are where EDUCAUSE members come together to create solutions to today’s problems and provide insight into higher education IT’s tomorrow. Individuals at EDUCAUSE member institutions are invited to collaborate on projects that address core technology challenges and advance emerging technologies of importance to colleges and universities. More information can be found at the ECAR working groups website.

Introduction

Transitioning from an exclusively local computation and storage infrastructure to the use of cloud services—particularly in the public cloud—is becoming commonplace in higher education. Each use case requires careful analysis to identify the appropriate strategy. Thus far, however, this trend has not been fully tested for research computing. Research computing involves unique needs and issues, including long-term use and manipulation of data, sensitive data, restricted-use data, and network performance. What research computing might realistically best be done in the public cloud? What limitations persist? What special cost issues exist? This paper explores those questions to start a conversation among CIOs, other IT leaders, and researchers about what research computing services are currently available in the cloud, which services will be feasible in the cloud, and which services are not ready. It touches on the roles we play with data providers and cloud vendors and the realistic capabilities, benefits, and drawbacks of the cloud for research computing. This paper focuses on what is technically possible in the cloud rather than the business model and cost structures of using the cloud[1] and will simply note other issues, such as constraints on the handling of research data,[2] in passing.

For example, multicenter collaborative research projects could leverage public cloud resources for many of the same reasons that businesses engage with them—scalability, rapid deployment, ease of access, and so on. A cloud-first strategy, however, is not really a strategy; it is just a statement. The strategy must be driven by the business need. We do not perform a detailed review of the financial considerations for using public cloud services in this paper. Although public cloud services may not be a money saver, they may allow resources to be focused on higher-value services. This paper also documents some of the issues that must be considered to create an effective research computing cloud strategy, highlights those services that could work well in the cloud for research computing, and looks at what doesn't work well for research computing in the cloud, given the frameworks of current public cloud offerings.

© 2015 EDUCAUSE. CC BY-NC-ND.

Overarching Issues

The issues outlined below transcend any of the specific areas that come up later in our discussion of what works well and what doesn't for research computing in the cloud. By touching on these issues, this section serves as a starting point for later consideration as we continue to discuss this topic in higher education.

Commodity Infrastructure vs. Application Services

Sidebar: Cloud Service Models

Software as a service (SaaS) allows customers to use applications managed by the supplier on a cloud infrastructure.

Platform as a service (PaaS) allows customers to deploy their own applications on a cloud infrastructure.

Infrastructure as a service (IaaS) allows customers to provision and manage basic computing resources on a shared platform.

Taken from the ECAR working group paper Preparing the IT Organization for the Cloud: An Introduction.

The first overarching issue is to distinguish two common aspects of cloud-based services: making infrastructure a commodity versus using the cloud to offer highly specialized application services. Commodity infrastructure cloud services focus on computation and data storage, plus sometimes computation with specific (usually large) data resources close to the computing facilities in terms of high-bandwidth, low-latency networking. The most commonly recognized cloud service model is the provision by the cloud vendor of computational and/or storage infrastructure. Moving this infrastructure to the cloud transforms it from a local capital investment—which includes not only machine room space, power, cooling, and hardware but also the attendant operational cost of running all this—to a model that rents infrastructure on an as-needed basis. Computational and storage infrastructure thus becomes a commodity, with the attendant advantages of large-scale acquisition and ongoing modernization by the cloud vendor and the elimination of the need for a periodic large capital outlay and the corresponding highly technical/financial/political decisions about what specific hardware to buy. For computational infrastructure, there are limited choices of the type (e.g., number of cores and size of memory) of hardware and the nature of provisioned software—simple "bare" virtual machines or virtual machines configured with an operating system and perhaps a standard set of research tools (e.g., R[3] or a database management system). Although the choices may be limited, they can be made separately for each of several sets of research users to be supported.

Alternatively, there are specific cloud-hosted application services (SaaS). These include, for instance, turnkey services for specialized symbolic manipulation (e.g., network Mathematica/Alpha from Wolfram, machine learning services being hosted on Azure and other platforms, learning analytics, etc.). Currently these services are seeing very limited use in research computing, though they may see greater adoption in the future, in part due to an emerging trend for vendors to release certain services only in the cloud—for instance, the Microsoft machine learning service—so that there may not be a noncloud alternative. This trend may be discipline specific (e.g., in bioinformatics, there may be some tools that are only web based), but a greater move to SaaS will essentially democratize access to certain computational approaches and research methodologies.

Private Cloud vs. Public Cloud vs. Hybrid Cloud

A second overarching issue hinges on the distinction between cloud as a technology versus cloud as a business arrangement. The key technology properties of the cloud include the well-scaled provision of virtual machines, storage volumes, etc., as services rather than as specific hardware assets that the user acquires and maintains. These technology properties are often combined with the business model of a (usually large) commercial cloud vendor that provides services to a wide variety of customers. In this case, the resulting infrastructure is called a public cloud. By contrast, with a private cloud, an organization might use the same virtualization technologies as a commercial cloud provider but would offer services to its internal users rather than to commercial customers. As a specific example, a university might invest in a set of servers and use them to present computational services (as with the Amazon EC2 set of computing platforms) and storage services (as with the Amazon S3 set of storage platforms). Here, the user would see the same (or similar) technical interfaces as with a commercial cloud provider, but these services would be provided by the organization's own IT group. Further, to the extent that the technical interfaces are convincingly very similar to those of the commercial cloud vendor, the university user would not care whether they are provided by the university or by the commercial vendor. This would permit a hybrid cloud possibility, in which the university itself would provide a baseline set of computing and storage resources, with periods of very high demand spilling over to use commercial computing and storage resources. While the university may lack the advantages of extreme scale enjoyed by successful commercial providers, it has corresponding advantages of proximity to the user (reflected in lower networking costs and opportunities for more immediate technical support). Though described here in terms of a single university providing these services, a university system or a consortium of universities could alternatively provide these services to their members. More complex SaaS arrangements would also be possible. Finally, this paper encourages readers to be flexible in their consideration of the technical and business aspects of cloud-based services.
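To make the "same technical interfaces" point concrete, the minimal Python sketch below shows how S3-style client code can target either a commercial provider or a campus-run, S3-compatible object store by changing only the endpoint. This is our illustration, not an example from the report; the endpoint URL, bucket name, and the choice of the boto3 library are assumptions.

```python
# Minimal sketch (not from the paper): the same S3-style client code can target
# either a commercial public cloud or an on-premises, S3-compatible private cloud
# simply by changing the endpoint. The endpoint URL and bucket name below are
# hypothetical placeholders.
import boto3

def make_s3_client(use_private_cloud: bool):
    if use_private_cloud:
        # Campus-operated object store exposing an S3-compatible API
        return boto3.client("s3", endpoint_url="https://s3.research.example.edu")
    # Default commercial (AWS) endpoint
    return boto3.client("s3")

s3 = make_s3_client(use_private_cloud=True)
s3.upload_file("results.csv", "my-research-bucket", "run-42/results.csv")
```

Because the calling code is identical in both cases, bursting from a private baseline to a public provider during peak demand (the hybrid pattern described above) does not require rewriting the research workflow.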

Special Cost Issues

Although this paper doesn't address the financial cost trade-offs of moving to the cloud, there are some cost considerations unique to research computing that TCO calculators don't account for. In particular, we want to note instances where the technological barriers to research computing in the cloud essentially render it cost-ineffective. For many scientific research projects, individual and aggregate data sets are very large. This has obvious significance with respect to storage costs, usually charged in dollars per byte per month. These costs might be particularly problematic when making the data available for a long period of time after project completion. Less obvious, and sometimes of overwhelming significance, are the costs of data movement. One common pattern is for the cloud vendor to charge per byte for data to be moved between the cloud storage facility and the campus, a nonlocal computational resource, or any other site on the Internet. These charges create a dynamic that contrasts sharply with the style of some scientific projects, such as those at the Large Hadron Collider[4] (in which replicas of multiterabyte data sets are frequently and nimbly copied or moved from one university or national lab site to another), when the data is the subject of a broad collaboration, or when the project chooses to change its cloud provider. In some cases, large commercial cloud providers may be willing to waive data-transfer charges in alternate arrangements in which data centers of the commercial cloud are attached to high-speed networks such as Internet2[5] or ESnet.[6]

Consequently, it is important that these issues be taken into consideration early in the planning process, especially when writing the data-management section of research proposals. The cost implications of data access, sharing, and archiving policies must be carefully reviewed while keeping in mind that data movement may only be implicit and therefore easy to overlook. Commitments to making data generally available, or even sharing it among project collaborators, should be examined to ensure that they will not cause an unexpected and substantial increase in costs. We make no attempt to analyze these costs in comparison with the corresponding costs of conventional noncloud approaches but stress that these widely appreciated trade-offs will look different in the context of big data science.
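As a back-of-the-envelope illustration of why data movement deserves early attention, the sketch below shows how storage and egress charges scale with data size and sharing patterns. All unit prices, data sizes, and sharing assumptions are hypothetical placeholders chosen by us for illustration; they are not vendor pricing and are not drawn from the report.

```python
# Illustrative cost sketch (not vendor pricing): the unit prices below are
# hypothetical placeholders chosen only to show how storage and data-movement
# charges scale with data size, retention period, and sharing commitments.
STORAGE_PRICE_PER_GB_MONTH = 0.03   # assumed $/GB/month for object storage
EGRESS_PRICE_PER_GB = 0.09          # assumed $/GB for data moved out to the Internet

dataset_tb = 200            # size of the project data set, in terabytes (assumed)
retention_months = 36       # how long the data must stay available after the project
full_copies_per_year = 5    # collaborators who each pull a complete copy each year

dataset_gb = dataset_tb * 1000
storage_cost = dataset_gb * STORAGE_PRICE_PER_GB_MONTH * retention_months
egress_cost = dataset_gb * EGRESS_PRICE_PER_GB * full_copies_per_year * (retention_months / 12)

print(f"Storage over retention period: ${storage_cost:,.0f}")
print(f"Egress for shared copies:      ${egress_cost:,.0f}")
```

Under these assumed numbers the implicit data-movement charges rival or exceed the storage charges, which is exactly the kind of surprise a data-management plan should surface before, not after, the award.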

What Works Well for Research Computing in the Cloud

Several use cases have emerged in which the scalability and elasticity of commercial cloud resources, combined with relatively inexpensive overall costs, make a cloud solution both reasonable and attractive for research computing. Researchers want to experiment with multiple ways of solving a problem, and using virtual instances of operating systems, software installations, configurations, and customizations allows them to find the one that works best for them. This flexibility, along with the agility that the cloud promises (e.g., templates can be used to configure instances of computational devices, unconstrained by the underlying physical infrastructure, in a very short time), makes the cloud an attractive option for researchers. In the administrative computing space, we note that some computing services (e.g., Workday ERP) are only available in the cloud. This trend is also appearing in the research computing domain. In this section we will discuss examples in the cloud that support research well.

High Throughput

Traditional high-performance computing workloads are characterized by tightly coupled scientific applications for which substantial processing capability, high-speed and low-latency interconnects, and high-performance parallel input/output are required. That is, they require high performance. A different class of problems, though, may be described as high-throughput computing (HTC). In this parallel, perhaps complementary, paradigm, we are more concerned with the overall number of floating-point operations that can be completed over a longer timescale—think weeks or months, not seconds. HTC comprises many independent serial jobs that may be similar but operate on different data and are not tightly coupled—for example, parameter sweeps rather than message passing interface (MPI) jobs, or hundreds to tens of thousands of single-core jobs rather than a few many-core jobs.

One HTC application being used in many disciplines is the Monte Carlo simulation. In a Monte Carlo simulation the scientist runs a large number (perhaps thousands) of jobs that are identical except for a different random number. Thus, each job computes how a physical, biological, or other situation plays out for one specific point in a probability distribution. Often, the size of input or output data is modest. The input specific to a given job (and not shared by the other jobs of the broader batch) is often just the one random number, and the output is also quite compact. This Monte Carlo approach is heavily used in areas as diverse as high-energy physics (including within the Compact Muon Solenoid[7] collaboration at the Large Hadron Collider), the social sciences, and engineering. Further, these approaches are often the entry point for a scientist into scientific computation. Often, a scientist who has never done scientific computation before can, with the help of a computationally oriented consultant from the local statistics department, adapt this method with good effect. For scientific research where this computational model dominates, cloud solutions are likely to be particularly appropriate.

Consider a typical condo-style campus cluster: perhaps 5,000–50,000 compute cores, shared by dozens to hundreds of researchers. In most instances, resource manager and job scheduler parameters set reasonable limits on the number of cores (or other resources) that can be used by any individual researcher or user at a given time. And the upper bound of cores or job slots available is fixed by the on-premises resources at hand—if your institution has a 10,000-core cluster, you cannot run 100,000 single-core jobs simultaneously. In a cloud scenario, though—at least when contemplating current commercial cloud offerings—the number of cores, and thus the scope of possible analyses, is limited only by the upper limit on your credit card or purchase order. If a researcher wants to run 200,000 simultaneous single-core jobs in the next hour, most campus-based research computing centers would be hard pressed to deliver those capabilities. But the scalability and elasticity of commercial cloud resources, along with relatively low overall costs for basic resource requirements, make a cloud solution both reasonable and attractive.

That said, the basic win in the cloud is that you can get more resource than you can get locally, and that you trade capital expenditures (CAPEX) for operating expenditures (OPEX),[8] which is a win if your demand is bursty (which is often the case with research computing). If the aggregate demand of a campus community is relatively consistent over very long periods of time, it may be cheaper to invest in the CAPEX and build. One economic argument is that when demand is really spiky, the use of computational clouds allows you to acquire more capacity as you need it, without upfront expense. Although this might be the case for individual researchers or projects, it might not be when aggregated over the campus community.

One final technical comment: Since HTC computations are made up of many independent serial jobs, the isolation that virtual machine (VM) technology provides among the several jobs that might run on a single multicore physical server is an advantage. Increasingly, however, this isolation can also be provided on conventional campus clusters using the emerging container technology available in recent Linux releases. Moreover, the overhead of containers may be less than that of VMs.
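To make the Monte Carlo / HTC pattern described above concrete, the toy sketch below runs many independent trials that differ only in their random seed, with tiny inputs and compact outputs. This is our own illustration, not code from any of the projects named above; in a cloud setting each trial could be dispatched to its own inexpensive single-core instance or scheduler slot, whereas here the trials are simply mapped across local processes.

```python
# Minimal sketch of the Monte Carlo / HTC pattern: many independent single-core
# jobs, identical except for their random seed, each with a tiny input (the seed)
# and a compact output (one number). Illustrative only.
import random
from concurrent.futures import ProcessPoolExecutor

def one_trial(seed: int) -> float:
    """Estimate pi for one point in the probability distribution (toy payload)."""
    rng = random.Random(seed)
    hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0 for _ in range(100_000))
    return 4.0 * hits / 100_000

if __name__ == "__main__":
    seeds = range(1000)  # 1,000 independent jobs; the per-job input is just the seed
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(one_trial, seeds))
    print(sum(results) / len(results))  # aggregate the compact per-job outputs
```

Because each trial is independent, the same pattern scales from a laptop to thousands of cloud cores without any change to the per-job code, which is exactly the elasticity argument made in this section.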

CPU-Bound and Memory Access–Bound Computations

In many cases, including the HTC computations described above but also MPI computations that are relatively loosely coupled, the key resource needed is many hours of crunching that requires CPU cycles (speed of processor is limiting) and memory accesses (amount of memory or memory bandwidth is limiting) but only relatively light access to interprocess communication and shared data files. These cases may be seen as a gray area between the HTC examples that form a sweet spot for cloud computing and the cases where cloud-based infrastructure may present difficult challenges. The ability to scale up rapidly in the cloud may be advantageous. For those instances, however, where the computing is memory intensive or where message-passing communication is required, the cloud currently falls short. See more on these limitations in the section on challenges to research computing in the cloud.

Services Only Available in the Cloud

As was mentioned earlier, some vendors are beginning to provision services only in the cloud, so that they are not (and likely will never be) available as on-premises solutions; the applications and capabilities exist only in the cloud. Because these services exist only in the cloud, by default they work well in this environment. The impact of this approach, however, is a need to be vigilant about your requirements and make sure that they can be accommodated in the user agreements—for instance, when working with personally identifiable information or sensitive data—and to understand when rights may be signed away (and if that is agreeable). Similarly, rights touching on intellectual property, privacy, disclosure issues, or terms of use (e.g., who will keep a copy of the data) may prove to be very important (even if they are nontechnical) and pertinent to understanding whether research computing can be done in the cloud. This comes into sharp relief when considering the role that grant-funding agencies (and their requirements) may play.

Challenges to Research Computing in the Cloud

While the cloud offers strengths and fits several kinds of research computing needs well, weaknesses in the way cloud resources are engineered and provisioned present specific and persistent challenges in other areas.

Message Passing Interface (MPI)

For models and simulations that rely heavily on tightly coupled MPI message passing, interconnect latency and bandwidth are fundamentally important to performance. Climate models, for example, may need to scale to several tens of thousands of cores while also requiring MPI latencies of less than 5 microseconds to achieve such scaling. The standard Ethernet provided in most commercial cloud offerings—one gigabit per second (1 Gbps), or in some cases 10 Gbps—may provide sufficient bandwidth for such jobs. The latency of those interconnects, however, may be too high to permit these computations to scale to the degree needed. Also, cloud architectures rely on server virtualization. Current virtualization technologies add significantly to interconnect latencies, thus rendering the cloud ineffective for tightly coupled jobs even if a low-latency interconnect such as InfiniBand is used. The increased latency from both the underlying network technology and the virtualization layer decreases the application performance for tightly coupled codes so drastically that most current cloud options are not a realistic platform.
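The per-message latencies cited above are typically measured with a simple two-rank "ping-pong" exchange. The sketch below is our hedged illustration of such a microbenchmark, assuming the mpi4py library and an available MPI runtime; the message size and repetition count are arbitrary choices, not values from the report.

```python
# Illustrative ping-pong microbenchmark (an assumption, not from the report):
# measures round-trip latency between two MPI ranks, the quantity that limits
# scaling for tightly coupled codes. Run with, e.g., "mpirun -n 2 python pingpong.py".
import time
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
msg = bytearray(8)      # tiny message, so timing is dominated by latency, not bandwidth
reps = 10_000

comm.Barrier()
start = time.perf_counter()
for _ in range(reps):
    if rank == 0:
        comm.Send(msg, dest=1)
        comm.Recv(msg, source=1)
    elif rank == 1:
        comm.Recv(msg, source=0)
        comm.Send(msg, dest=0)
elapsed = time.perf_counter() - start

if rank == 0:
    # One repetition is one full round trip
    print(f"average round-trip latency: {elapsed / reps * 1e6:.1f} microseconds")
```

Running such a test on a virtualized cloud instance versus a bare-metal InfiniBand cluster is a quick way to see the order-of-magnitude latency gap that makes tightly coupled codes scale poorly in most current cloud offerings.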

Computing Requiring Very Large Memory

Commercial cloud computational resources, driven by their need to serve a broad IT market, will make provisions for a limited range of memory sizes. Until recently, for example, computing instances typically included only a few gigabytes of RAM. Some scientific applications, however, require very large memory footprints; plant genomics applications provide one of several application areas where this is the case. To serve these needs, for example, some compute resources at NSF centers support memory footprints at and above 6 TB. At present, commercial cloud providers do not include such very large memories in their offerings. Note, though, that this reflects business decisions by commercial cloud providers rather than any deep technical issue and thus may change over time. For example, Amazon Web Services (AWS) currently offers memory-optimized instances with up to 244 GB of RAM and 32 virtual CPUs,[9] and Microsoft Azure offers up to 16 cores and 112 GB of memory.[10] Judging the severity of this challenge will therefore require tracking over time.

Large Data

The need for more storage is a constant issue for research computing. You might think that cloud options would provide an easy solution—on-demand scale with no capital startup costs or lengthy lead times required. In reality, it is not so simple. One issue is cost. Although the upfront cost per unit is enticing, you must also factor in other costs, which often amount to a significant charge over the span of a research project.

Large amounts of data can accumulate over the span of a research project, and data analysis often continues beyond the funding cycle of a grant. Regulation and policy might also mandate that the data be made available to others for a certain period of time after the funding ends. Normally an institution would rely on tiered storage or archiving and backup services to lower the cost of keeping research data available either near-line or offline, but rarely will an investigator ask for data to be deleted. Engagements with cloud providers should take into consideration the availability and the cost of maintaining the data, either online (for the necessary computing and analysis) or near-line (for preservation purposes). In particular, the cost of transport and backup services must be factored into the expenses. For example, Amazon Glacier has a very complex charging model for transferring data "in," "out to Amazon cloud," "between AWS regions," and "out to the Internet," to name just a few examples.[11]

Big Data

As data storage and transport are being considered with cloud options, you may want to distinguish between different characteristics of big data and what makes it big. First, there is the "big" that comes from a very large volume of files, where the files are typically small in size. Examples include log files on servers, click-throughs on software systems and user interactions, or online learning systems like MOOCs. Another type of big data consists of files that are very large in size—such as image visualizations and video files—and sometimes highly complex. Large files are also being generated by a range of laboratory equipment, often with large volume and high velocity.

Edge Cases

Beyond the classic high-performance computing use cases, other considerations must be addressed when developing a campus research computing cloud strategy. The answer to the question about the utility of the public cloud to support these edge cases is a typical research response: It depends.


Long-Term Use and Manipulation of Data

A case can be made that data are the most critical asset produced today by academic research. Secondary use of data, combined with other research digital assets, is enabling new discoveries. Issues for long-term use and manipulation of these data fall into at least two general categories: policy and operations. Policy issues for sharing, standards, and recognition of data set publication as scholarly work are important but are not addressed here.[12] The cloud may be of great value for addressing some of the operational issues.

Research data are often intended to be retained for long-term use and manipulation. Such data can indeed be kept in a cloud environment for those purposes, and, with sufficient funding, this arrangement might work well.[13] The difficulties come into play when looking for a sustainable business model. The choices here are clear: You either need a place to put the data for the long term, or you have to pay for it in the cloud. You can't just put it on a dusty server in the corner anymore. One example of how this might play out is when you want to open data for many people to use and you think that using the data is going to be computationally intensive. In this case, the data could be hosted in an appropriate commercial cloud as public read data, and people who wanted to use it could pay to provision their own computational resources.

Sensitive and Restricted-Use Data

Many considerations apply to sensitive and restricted-use data. Data are a key asset of our research activity. Applying adequate security measures to protect against data loss and ensure data integrity is becoming increasingly important. Any security plan for managing the risk to sensitive and restricted-use data includes administrative, physical, and technical safeguard standards. A common concern that has hindered the adoption of public cloud services for research involving sensitive or restricted-use data is the lack of transparency by cloud vendors about how they are implementing the safeguard standards. The higher education community also needs to recognize that any stance on data security will be highly sensitive to local interpretations—such as legal requirements, policy concerns, or approaches to risk management—that must be addressed. Consider, for example, a case in which the data must be stored in the United States. Many providers can provide assurance that your data will be located within the United States, but you have to know to ask. In addition, you need to document how this is accomplished—what controls are in place? Research today is a global, collaborative activity, and this can result in more complexity. Would the location of the data store in and of itself trigger other countries' privacy laws? Many countries are particularly sensitive to having their data stored outside their borders, especially after the Edward Snowden revelations.

Network Speed and Latency

Network performance in using cloud services takes on three components:

• First, you need the kind of network performance associated with the successful use of Internet services generally—good (but not necessarily excellent) throughput and latency.

• Second, in the important special case when large data objects (e.g., scientific data sets of the big data variety—say, over 10 GB) are in play, very good end-to-end network throughput is needed. Accomplishing this using TCP/IP requires not only plentiful network capacity (perhaps at least 10 Gbps) but also very small end-to-end packet loss and end-to-end round-trip latency that is stable and not more than, say, 50 msec. When this case is particularly relevant, placing the on-campus servers that make use of the cloud service on the campus Science DMZ[14] will be needed.

• Third, in the important special cases when there are very frequent interactions between on-campus servers and the cloud service (e.g., when the application is "chatty," as with applications that exhibit a very large number of small database interactions), low round-trip latency, say below 10 msec, may be needed.[15]

Given that the speed of light through a fiber-based wide-area network is about 200 km/msec, the latency targets given above translate to a requirement that the cloud service be hosted within a few hundred or a thousand miles (depending on the specific situation) of the campus. Many research universities leverage the dedicated services of Internet2 to transport large data sets. Peering arrangements between networks enable access to commercial clouds. Where these networks exchange traffic could have a significant impact on performance. The number of hops and saturation levels of ports can result in dropped packets and generally poor performance. If certain performance is required, a campus may need to select a cloud provider that has an access point within a certain physical proximity.

Over a typical campus WAN connection, the transfer times in the sidebar might be unacceptable without the use of some kind of transfer acceleration product such as GridFTP, Tsunami, or Aspera. These are provided as open-source or commercially available products, but support for the chosen platform has to be provided on both ends of the transmit/receive connection (cloud provider and local). Special attention must be given to environments that share the WAN connectivity, because the transfer acceleration products can easily saturate typical campus resources over an extended period of time.
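A quick worked calculation (ours, not the report's) shows how the round-trip latency targets above turn into the distance limits the paper cites; it uses only the paper's figure of roughly 200 km of fiber per millisecond of one-way propagation.

```python
# Worked example of the distance bound implied above, using the paper's figure
# of roughly 200 km of fiber per millisecond of one-way propagation. The RTT
# budgets are those given in the bullets; everything else follows from them.
KM_PER_MSEC_ONE_WAY = 200          # approximate speed of light in fiber
KM_PER_MILE = 1.609

for rtt_msec in (10, 50):          # "chatty" target and bulk-transfer target
    one_way_msec = rtt_msec / 2
    max_km = one_way_msec * KM_PER_MSEC_ONE_WAY
    print(f"RTT budget {rtt_msec} ms -> at most ~{max_km:,.0f} km "
          f"(~{max_km / KM_PER_MILE:,.0f} miles) of fiber, before any "
          f"routing detours, switching, queuing, or virtualization delay")
```

Real paths are never straight-line fiber, so the practical limits are tighter than these idealized figures—hence the paper's guidance of "a few hundred or a thousand miles."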

How long does it take to move one petabyte of data?

A petabyte is a thousand terabytes, the equivalent of a million gigabytes, a billion megabytes, or eight billion megabits.

Internet Connection    Transfer Time
10 Mbps                25.4 years
100 Mbps               2.53 years
1 Gbps                 92.6 days
10 Gbps                9.25 days

Calculated using speeds cited at https://fasterdata.es.net/home/requirements-and-expectations.
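The sidebar figures can be reproduced with a few lines of arithmetic, as in the sketch below; it assumes an ideal, fully utilized link with no protocol overhead, packet loss, or competing traffic, so real transfers will take longer.

```python
# Reproducing the sidebar figures (to within rounding): time to move one
# petabyte at a given line rate, assuming the link is fully and continuously
# utilized—no protocol overhead, packet loss, or competing traffic.
PETABYTE_BITS = 1e15 * 8  # one petabyte expressed in bits

for label, bits_per_sec in [("10 Mbps", 1e7), ("100 Mbps", 1e8),
                            ("1 Gbps", 1e9), ("10 Gbps", 1e10)]:
    seconds = PETABYTE_BITS / bits_per_sec
    days = seconds / 86_400
    if days > 365:
        print(f"{label}: {days / 365.25:.2f} years")
    else:
        print(f"{label}: {days:.2f} days")
```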

Packet Loss

Packet loss—when packets of data don't arrive at their destination—plays two distinct roles in accessing remote cloud services. First, and perhaps not obviously, high-throughput TCP flows of big-data objects require very low end-to-end packet loss. As noted above, this may call for the on-campus computers that are used to access the remote cloud service to be placed on the campus Science DMZ. Second, if there are frequent back-and-forth interactions between the on-campus computer and the remote cloud service, then it is very important that packet loss be infrequent, because each lost packet forces a time-out and retransmission that stalls the exchange.
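The sensitivity of bulk TCP throughput to loss can be quantified with the well-known Mathis et al. approximation, which bounds a single flow's throughput by roughly MSS / (RTT × √p), where p is the packet loss rate. The sketch below is our illustration; the formula's use here, the segment size, RTT, and example loss rates are our assumptions, not figures from the report.

```python
# Mathis et al. approximation for loss-limited TCP throughput (illustrative):
# throughput <= MSS / (RTT * sqrt(p)). Even tiny loss rates cap a single flow
# far below a 10 Gbps link at WAN round-trip times.
from math import sqrt

MSS_BITS = 1460 * 8          # typical maximum segment size, in bits (assumed)
RTT_SEC = 0.05               # 50 ms round trip, as in the bulk-transfer case above

for loss_rate in (1e-6, 1e-4, 1e-2):
    throughput_bps = MSS_BITS / (RTT_SEC * sqrt(loss_rate))
    print(f"loss {loss_rate:.0e}: ~{throughput_bps / 1e6:,.0f} Mbps per flow")
```

Under these assumptions, even a one-in-a-million loss rate limits a single flow to a few hundred megabits per second over a 50 ms path, which is why Science DMZ architectures work so hard to keep end-to-end loss near zero.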


Tuning and Optimization in a Virtual Environment

Tuning and optimization represent new challenges compared to traditional "bare metal" high-performance computing (HPC) environments but offer unique advantages as well. A key design point is to realize that clouds offer immense scalability (with the commensurate need for failure detection and resiliency), at the cost of ultimate individual node performance. CPU frequency is typically sacrificed for core density, and the CPUs themselves may be from past generations. Exacerbating the issues is the tendency of VM environments to export an older, more limited virtual CPU to the hosted OS, which can severely dampen performance. For example, with a mixture of Westmere and Sandy Bridge CPUs, the VM would typically export the lowest common denominator of features, which would lead to only 25% of the expected performance for encryption on the newer cores.

In general, we recommend following in the footsteps of Google, Facebook, Amazon, and other large-scale compute consumers. Take advantage of the intrinsic scalability, where users do not have to limit their requests artificially to smaller jobs to fit into limited local clusters (sometimes only for reasons of getting scheduled within a reasonable period of time). Architect your systems to handle failure gracefully, or use systems like Spark, Chapel, or Hadoop to handle it for you.[16] Be latency tolerant but aggressively parallel—take advantage of the thousands of cores available for a single task. Make use of cached on-cloud data to improve performance and limit financial exposure. We also recommend spinning up a new cluster per job, both for the sake of security (reducing the ability of jobs to accidentally or deliberately interfere with others) and so users do not have to wait in a queue for their tasks to begin, optimizing for their productivity rather than for a typical fixed cluster configuration.
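As a small illustration of the "latency tolerant but aggressively parallel" pattern recommended above, the PySpark sketch below fans a large parameter sweep across whatever cores a disposable, per-job cluster provides and lets the framework handle scheduling and retries of failed tasks. This is our illustration, assuming Spark with PySpark is available; the application name, workload function, and partition count are placeholders.

```python
# Minimal PySpark sketch (illustrative, not from the report) of a framework
# that distributes work and transparently re-runs failed tasks, so the cluster
# can be treated as disposable and spun up per job.
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("htc-style-sweep").getOrCreate()
sc = spark.sparkContext

def simulate(param: int) -> float:
    """Toy per-task workload; a real job would evaluate one parameter-sweep point."""
    return (param * 31 % 97) / 97.0

# Fan the sweep out across however many cores the cluster has; Spark handles
# scheduling, data locality, and retries of failed tasks.
params = sc.parallelize(range(1_000_000), numSlices=1000)
total = params.map(simulate).reduce(add)
print(f"aggregate result: {total:.2f}")

spark.stop()
```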

Software Licensing

While software licensing in the cloud (including software delivered as a service, or SaaS) can present some special challenges with respect to contracts, on-premises versus off-site usage, and availability of the same tools wherever a researcher is working, there is also flexibility in cloud pricing that allows researchers to select what is right for their usage patterns. MathWorks, for example, offers MATLAB Distributed Computing Server pricing on the basis of a number of cores per year for those who have a relatively constant MATLAB workload, and it also offers pay-as-you-go pricing for burstier workloads. Similarly, SAS also offers a variety of options for licensing in the cloud. Not all software providers have developed licensing models that consider cloud computing, so researchers need to be aware of what is available—especially for more specialized software or software from smaller companies. Another consideration that is not specific to but may be exacerbated by computing in the cloud is software licensing when collaborating with nonacademic partners. Academic software licenses are often specifically tied to use by members of the academic community, and therefore special consideration needs to be given when working with corporate or even government laboratory collaborators. Because the cloud may facilitate such collaborations, institutions should be particularly aware of the licensing implications.

Switching between Cloud Providers

Ideally, customers should not have to lock their services in to one specific cloud provider and should be able to use multiple providers for redundancy (if the data management plan requires it) in a seamless fashion. Universities looking to mitigate risks will want to position themselves to take advantage of various competitive offers. There might also be cases where data moved to a public cloud might need to be moved back into a private cloud due to regulatory issues or cost factors or in the event of a service provider going out of business or failing to provide the service in a manner consistent with the campus requirements. Multiple issues arise when attempting to move data in and out of cloud platforms, including portability and interoperability issues between various public and private cloud solutions, as well as issues relating to the portability and interoperability of data and applications. The lack of mature, open computing standards has been a hurdle thus far,[17] but there are strategies to mitigate these issues when developing an IT plan.

Think about standards before moving data and applications to the cloud so that those resources can be taken back out (making sure that they are in a format that will enable removal). If you can't answer the question "How do I get it back out?" (i.e., an exit strategy), then you shouldn't be putting it in. While this is not unique to research computing, the problem is exacerbated by the nature of research computing. Software used in research computing is often task specific and frequently optimized for a given set of hardware specifications. Sometimes there is no defined standard for cutting-edge R&D work. Moving applications is not trivial and requires knowledge of architecture specifications that are often not exposed to the cloud consumer. Without proper planning, moving data out of or between clouds could be a very costly and painful process. The development, adoption, and implementation of standards will mitigate this risk over time.[18] For now, such risk elevates the importance of documenting an exit strategy.

Conclusion

While several studies have considered the costs of commercial cloud offerings for use as computational and storage resources supporting academic research, there have not been significant studies analyzing the requirements of different types of research computing approaches against the capabilities available in the cloud. What is needed is a living document that critically compares the capabilities of cloud architectures (whether commercial or private) to the diversity of needs for modeling, simulation, analysis, and data analytics. Such a document would support a computational scientist, for example, in assessing viable options for organizing the resources needed to conduct a program of computational research. How might the resources of the cloud, national agency resources, and campus cyberinfrastructure best combine to support such research? The document might also support campus or national cyberinfrastructure leaders in assessing how to meet the diversity of computational and storage needs of their researcher constituencies with appropriate combinations of investments in directly purchased and operated infrastructure and payments for cloud offerings.


Authors

Special thanks go to the following ECAR-CCI Working Group members who authored this report:

Guy T. Almes (ECAR-CCI Co-Chair), Director, Academy for Advanced Telecommunications and Learning Technologies, Texas A&M University
Celeste Anderson, Director, External Networking Group, University of Southern California
Curtis W. Hillegas (ECAR-CCI Co-Chair), Associate CIO, Research Computing, Princeton University
Timothy Lance, President, NYSERNet
Rob Lane, Manager, Research Computing Services, Columbia University
Clifford A. Lynch, Executive Director, Coalition for Networked Information
Ruth Marinshaw, CTO, Research Computing, Stanford University
Gregory E. Monaco, Director for Research & Cyberinfrastructure Initiatives/Great Plains Network, Kansas State University
Clare van den Blink, Vice President, Information Technology Services, and CIO, Pace University
Eduardo Zaborowski, Director of Research Computing, Albert Einstein College of Medicine, Yeshiva University
Ralph J. Zottola, CTO, Strategy, Research, and Communications, University of Massachusetts Central Office

Contributors

We would like to additionally thank the following contributors to this report:

Daniel Andresen, Director, Institute for Computational Research in Engineering and Science, Kansas State University
Dave Lifka, Director, Center for Advanced Computing, Cornell University

Citation for This Work ECAR-CCI Working Group. Research Computing in the Cloud: Functional Considerations for Research. Research bulletin. Louisville, CO: ECAR, July 15, 2015. Available from http://www.educause.edu/ecar.

Notes

1. For more on cost and the cloud, see the ECAR working group paper TCO for Cloud Services: A Framework.

2. For more information about how to work with research data, see Douglas Blair et al., Research Data Storage: A Framework for Success, ECAR working group paper, July 15, 2014, and Blair et al., The Compelling Case for Data Governance, ECAR working group paper, March 17, 2015.

3. See "R (programming language)," Wikipedia.

4. See The Large Hadron Collider, CERN.

5. See Internet2.

6. See ESnet.

7. See Compact Muon Solenoid.

8. As was noted in the ECAR working group paper TCO for Cloud Services, "On-premises and cloud-based solutions typically have different expense cycles and significant differences in capital versus operational expenses. On-premises solutions may have more upfront (capital) expenses and lower ongoing expenses while cloud-based solutions may have a more consistent level of annualized expenses." It further states, "Cloud computing services are not capital assets because they are not owned by the university and thus are generally subject to indirect costs." In addition, this subject is discussed in the blog "Federal Indirect Costs Affect Total Cost of Ownership," which discusses how "current rules require applying federal indirect cost rates to direct-charged computing costs such as cloud services but not to capital equipment." Finally, the growth of cloud services and infrastructure in higher education and how IT funding has traditionally been assigned via CAPEX or OPEX has caused some tension; this is being looked at more closely in the current ECAR working group project on "IT Funding Models: Current Limitations and Needs."

9. Jeff Barr, "Now Available—New Memory-Optimized EC2 Instances (R3)," April 10, 2014.

10. See "Azure Subscription and Service Limits, Quotas, and Constraints."

11. According to "Amazon Glacier Pricing," "Glacier is designed with the expectation that retrievals are infrequent and unusual, and data will be stored for extended periods of time. You can retrieve up to 5% of your average monthly storage (pro-rated daily) for free each month. If you choose to retrieve more than this amount of data in a month, you are charged a retrieval fee starting at $0.01 per gigabyte."

12. The reader is encouraged to read recent commentaries by Philip Bourne, Associate Director of Data Science at the National Institutes of Health, at PEBOURNE. Bourne is an open data evangelist leading the charge at the NIH.

13. Note, also, that the economics and functional requirements of using cloud storage for long-term preservation as opposed to (perhaps long-term) active access and use are quite different. In addition, there are reasons to believe that current cloud storage options are not terribly attractive, at least on a cost basis, for archiving purposes if one wants a high level of verification of the stability of data. To learn more, see the detailed blog posts on this topic by David Rosenthal regarding Amazon S3 and Glacier at DSHR's Blog.

14. See Science DMZ.

15. Sometimes even very intelligent scientists can miss this issue. An amusing example concerns remote access to systems on the astronomical observatories atop Mauna Kea on the island of Hawaii. This remote access, from a base station at the foot of the mountain, worked well, despite the fact that it required round-trip network traffic upon each mouse click or keystroke from the user at the base station. Pleased with the success of this remote access, the astronomers decided to extend it to remote access from a laboratory in California. This worked in a sense, but the interactions were so sluggish as to make it impractical to use. In hindsight, it's obvious that the very frequent network interactions between the mouse/keyboard at the user site and the servers at the observatory atop Mauna Kea, when combined with the 2,500-mile distance between California and Hawaii, were a fatally flawed combination. The fact that even astronomers did not anticipate a problem that was, at root, caused by the finite speed of light was somewhat embarrassing.

16. For more information, see Hadoop: ECAR-WG Technology Spotlight.

17. IEEE P2301, Draft Guide for Cloud Portability and Interoperability Profiles (CPIP), is under development; see NIST Cloud Computing Standards Roadmap, Special Publication 500-291, Version 2.

18. Ibid.
