Minimizing Variable Costs for Shared Data

AWS Whitepaper: Minimizing Variable Costs for Shared Data November 2015

Amazon Web Services, Inc. 410 Terry Avenue North Seattle, WA 98109-5210 Cage Code: 66EB1 DUNS Number: 965048981 NAICS: 518210

This Amazon Web Services, Inc. (AWS) package is provided for informational purposes only. The services discussed in this package are standard commercial services. This package may include a set of suggested solutions for this opportunity that are based on our limited information, and should not be construed as a binding offer from AWS. For current prices for AWS services, please refer to the AWS website at www.aws.amazon.com. This package includes Amazon Web Services, Inc. commercial, financial, or trade secret data that includes confidential, and/or trade secret information.


Table of Contents

1.0 Introduction
2.0 Shared Data on AWS
3.0 Sharing Data with Amazon S3
    3.1 How to Host and Share Data
        3.1.1 Managing Costs of Sharing Data
    3.2 How to Access Shared Data
4.0 Case Studies
    4.1 US Geological Survey (USGS) Landsat
    4.2 National Oceanic and Atmospheric Administration (NOAA) Next Generation Weather Radar (NEXRAD)
    4.3 Additional Public Data Sets
5.0 Getting Started



1.0 Introduction

Whether an organization is seeking to share data publicly (“open data”) or only with trusted partners, Amazon Web Services, Inc. (AWS) can facilitate a data sharing model that eliminates the need to maintain and ship multiple copies of large data sets while simplifying costs associated with hosting and retrieving the data. This paper focuses specifically on using Amazon Simple Storage Service (Amazon S3) and its Requester Pays feature, which is optimal for lowering costs and level of effort for the data curator.

2.0 Shared Data on AWS

Sharing data on AWS enables innovation by making data on our flexible and low-cost computing resources easily accessible to a wide audience. Previously, large data sets, such as the mapping of the human genome, required hours or days to locate, download, customize, and analyze. Geospatial data from federal agencies or individual consortiums required a lengthy ordering process, management of physical containers, and high labor costs to facilitate the entire process. Now, similar data sets can be accessed and analyzed by users around the world via an AWS centralized data repository in Amazon S3 and Amazon Elastic Compute Cloud (Amazon EC2) instances or Amazon Elastic MapReduce (Amazon EMR) (hosted Hadoop) clusters.

By hosting this important data where it can be quickly accessed and easily processed with elastic computing resources, AWS hopes to enable several benefits for data curators and the users that access shared data:

• Lower Cost of Data Administration – Shared data on AWS eliminates the need for redundant copies and shipping of physical media. Cloud storage allows data curators to maintain only one authoritative copy of data and make it available to an entire ecosystem of users, whether it’s a group of trusted partners or the entire world. A data curator’s time and resources can then be redirected from shared data administration to other efforts.

• Lower Cost of Innovation – Data that is shared on AWS can be readily processed and analyzed with a variety of tools that are also available through the AWS cloud. When data is shared via the cloud, anyone can work with data of any size without needing to worry about time to acquire data or provisioning storage and compute to use the data. Both experimentation and production use cases can work in parallel against the same copy of data, all with a reduced price tag.

• Greater Access to Shared Data – AWS helps make shared data available to over 1,000,000 customers worldwide who have access to AWS’s comprehensive toolkit for gathering, storing, analyzing, and working with data at any scale. This further fuels use and analysis of large data sets that would not normally reach beyond their points of origin.

The cloud-based model for hosting shared data provides numerous benefits for both hosting organizations and the groups and users that want to access the data, including cost savings, freed time and resources, greater access to information, and expanded capacity for research and innovation.

There are several options for hosting and sharing data sets on AWS. For organizations that want to share data at scale, whether publicly or with trusted partners, and at lower variable costs, we recommend hosting the data on Amazon S3. This option is especially useful for organizations that want to share data but have neither a method to recoup the costs of creating and sharing the data nor access to extensive funding for the physical shipment of data.

3.0 Sharing Data with Amazon S3

Amazon S3 allows customers to store and retrieve any amount of data, at any time, from anywhere on the web. Amazon S3 is an object storage and publishing service designed to be highly scalable and flexible without the need for customers to maintain file servers. Customers only pay for the Amazon S3 resources that they use, with no minimum fees or long-term contracts. They can store any type and amount of data that they want and read the same data a million times or only in specific circumstances.

Amazon S3 is one of AWS’s foundational services and has been around almost as long as AWS itself, having premiered in early 2006. Amazon S3 holds trillions of objects and regularly peaks at millions of requests per second. Designed for 99.999999999% durability and up to 99.99% availability of objects over a given year, Amazon S3 provides customers with a stable and reliable storage system. In addition, Amazon S3 is integrated with a variety of partner tools and storage gateways, with offerings available from well-known vendors and a variety of open-source tools.

Using Amazon S3 to host data allows data curators to provide broad access to a variety of users around the world and enables sharing with organizations and consortiums that interact with the data curator and its peers, all from one centralized copy of the data. In addition, Amazon S3 storage (buckets) can be configured so that data transfer fees are paid by the user requesting the data and not the data curator hosting the data. With this configuration, data curators will still be responsible for the storage fees associated with hosting their data, but the variable costs associated with sharing the data become the responsibility of the requester.

3.1 How to Host and Share Data

To use Amazon S3, data curators need an AWS account. If they don’t already have one, they’ll be prompted to create one when they click the Sign Up button on the Amazon S3 page. Once the account has been created and the data curator is signed in, the curator is ready to create a bucket, the primary storage unit for Amazon S3. After naming the bucket and choosing which of the worldwide AWS regions will host it, the data curator is ready to add data to the bucket. Data curators can have numerous buckets per account, and the total volume of data and number of objects that can go into an account’s bucket is virtually unlimited. For more detailed steps in this process, please see the Amazon S3 Getting Started Guide.
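The bucket creation and upload steps above can also be scripted. The following is a minimal sketch, assuming Python with the third-party boto3 SDK and AWS credentials already configured; the bucket name, region, and function names are hypothetical and are not part of this paper. The first helper only builds the arguments, so it runs without an AWS account.

```python
# Hypothetical values for illustration only; they do not come from this paper.
EXAMPLE_BUCKET = "example-shared-data"
EXAMPLE_REGION = "us-west-2"


def create_bucket_kwargs(bucket: str, region: str) -> dict:
    """Build the keyword arguments for boto3's S3 create_bucket call."""
    kwargs = {"Bucket": bucket}
    # us-east-1 is the default location and is not passed as a constraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket_and_upload(bucket: str, region: str, local_path: str, key: str) -> None:
    """Create the bucket, then upload one local file as an object.

    Needs boto3 and valid AWS credentials, so it is not called below.
    """
    import boto3  # third-party; imported lazily so the helper above works without it

    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(**create_bucket_kwargs(bucket, region))
    s3.upload_file(local_path, bucket, key)


if __name__ == "__main__":
    print(create_bucket_kwargs(EXAMPLE_BUCKET, EXAMPLE_REGION))
```

A curator would call `create_bucket_and_upload` once per data file, or loop over a directory. The Amazon S3 console achieves the same result without code.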




By default, all Amazon S3 resources (buckets, objects, and related subresources such as configurations) are private: only the AWS account that created a resource can access it. The data curator can optionally grant access permissions to others by writing an access policy. For buckets that the data curator wants to provide for public data use, the curator can add policies that make the data available to any set of users, ranging from anonymous users from any location to just those users that meet certain identity restrictions (e.g., IP addresses, multi-factor authentication). From there, customers can provide access to the bucket via a URL, a website, or a special invitation. The Amazon S3 Console User Guide provides more information on how to set up Amazon S3 to particular specifications.

3.1.1 Managing Costs of Sharing Data

Amazon S3 has costs associated with storing data and transferring data outside of Amazon S3, which are detailed on the service’s pricing page. Data curators using Amazon S3 are responsible for the cost of storing their data, which is calculated based on the number of bytes being stored each month. These costs are pennies per gigabyte across all of the available regions, representing a significant cost savings over traditional, on-premises storage.

In a traditional data sharing scenario, public data curators would also be responsible for the transfer costs associated with users accessing the data shared online. However, a data curator can configure an Amazon S3 bucket to be a Requester Pays bucket. With Requester Pays buckets, the requester, instead of the data curator, pays the cost of the request and the data download from the bucket (unless the requester transfers data within the same AWS region as the data curator’s bucket, in which case the data transfer is free for both parties).
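As a rough sketch of the two curator-side settings discussed in this section (an access policy and the Requester Pays configuration), the Python below builds the documents that would be sent to Amazon S3. The function names, bucket name, and IP range are hypothetical. The last function assumes the third-party boto3 SDK and valid credentials, and is shown only to indicate where the documents would be applied.

```python
import json
from typing import Optional


def read_only_policy(bucket: str, allowed_cidr: Optional[str] = None) -> str:
    """Return a bucket policy (JSON text) granting read access to all objects.

    With allowed_cidr=None the grant applies to any caller; otherwise it is
    limited to requests arriving from that IP range.
    """
    statement = {
        "Sid": "ReadOnlyAccess",
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{bucket}/*",
    }
    if allowed_cidr is not None:
        statement["Condition"] = {"IpAddress": {"aws:SourceIp": allowed_cidr}}
    return json.dumps({"Version": "2012-10-17", "Statement": [statement]}, indent=2)


# Request payment configuration that marks a bucket as Requester Pays.
REQUESTER_PAYS = {"Payer": "Requester"}


def apply_curator_settings(bucket: str, policy_json: str) -> None:
    """Attach the policy and enable Requester Pays (needs boto3 and credentials)."""
    import boto3  # third-party; imported lazily so the builders above run without it

    s3 = boto3.client("s3")
    s3.put_bucket_policy(Bucket=bucket, Policy=policy_json)
    s3.put_bucket_request_payment(
        Bucket=bucket, RequestPaymentConfiguration=REQUESTER_PAYS
    )


if __name__ == "__main__":
    # 203.0.113.0/24 is a documentation-only address range used as a placeholder.
    print(read_only_policy("example-shared-data", "203.0.113.0/24"))
```

Keeping the policy builder separate from the call that applies it lets a curator review the JSON before changing a live bucket.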
The Requester Pays option can be easily enabled through the Amazon S3 console by clicking Enable in the Requester Pays section of a bucket’s properties. The Requester Pays configuration can be applied to as many buckets as needed. With this option, data curators can take the personnel, time, and funds once applied to responding to and billing customer requests and reapply them instead to improving the quality, applicability, and reach of the public data.

To provide a better picture of the potential cost savings available by using Amazon S3 for storage and enabling the Requester Pays configuration, we encourage readers to use the AWS Simple Monthly Calculator. The calculator can provide an estimate of how much it would cost to store a certain amount of data in Amazon S3 per month (using the Storage field) as a comparison to other storage option costs. The calculator can also provide the estimated monthly cost that would be associated with user requests (using the PUT/COPY/POST/LIST Requests and GET and Other Requests fields) and data transfer out (using the Data Transfer Out field), both of which would be paid for by the requesters if the data curator configures the Amazon S3 buckets to be Requester Pays buckets.
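The cost split described in this section can be illustrated with a small calculation. This is only a sketch: the `Rates` fields and the numbers in the usage example are placeholders chosen for easy arithmetic, not AWS prices, and real estimates should come from the pricing page or the calculator.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Rates:
    """Per-unit prices supplied by the caller. Placeholders, not AWS prices."""

    storage_per_gb_month: float
    transfer_out_per_gb: float
    per_1000_get_requests: float


def monthly_costs(
    rates: Rates,
    stored_gb: float,
    downloaded_gb: float,
    get_requests: int,
    requester_pays: bool,
    same_region: bool = False,
) -> Tuple[float, float]:
    """Return (curator_cost, requester_cost) for one month."""
    storage = stored_gb * rates.storage_per_gb_month
    # Transfers within the bucket's own region are free for both parties.
    transfer = 0.0 if same_region else downloaded_gb * rates.transfer_out_per_gb
    requests = get_requests / 1000 * rates.per_1000_get_requests
    if requester_pays:
        # The curator keeps only the storage bill.
        return storage, transfer + requests
    # Without Requester Pays, the curator carries every cost.
    return storage + transfer + requests, 0.0


if __name__ == "__main__":
    placeholder = Rates(0.5, 0.25, 0.125)  # made-up rates for illustration
    print(monthly_costs(placeholder, 100, 40, 8000, requester_pays=True))
    print(monthly_costs(placeholder, 100, 40, 8000, requester_pays=False))
```

With these placeholder rates, enabling Requester Pays moves the transfer and request charges from the curator to the requester, while storage stays with the curator in both cases.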




3.2 How to Access Shared Data

When data is publicly shared in Amazon S3 buckets, users from around the world can access it using simple HTTP requests without contacting the data curator. This not only frees the time and resources of the data curator, but also gives the public users of the data greater freedom to access, analyze, and innovate with the data at a far lower cost than ordering physical media.

When a data curator’s buckets have been made publicly available without Requester Pays enabled, any user is able to access the bucket to read and download data via HTTP with no need to authenticate their identity. In this situation, the users requesting the data would not need to create an AWS account to download the data, and the data curator would pay for the transfer and storage of the data. The identity of individual requesters could not be determined, but requests and data transfers would still be logged.

However, when the Requester Pays option is enabled on a bucket, requesters will need to create their own AWS account to access the data. All requests must be authenticated in order to properly identify the requester and apply the charges associated with requesting the data against the requester’s AWS account. To ensure that charges are applied correctly in the Requester Pays model, the requester must include x-amz-request-payer in their requests (either in the HTTP header or as a parameter in a REST request) to show that they understand that they will be charged for the request and the data download.

Hosting and sharing data sets on Amazon S3 can be an ideal approach for both data curators and the users that need access to the data. And with the Requester Pays bucket configuration, data curators can minimize the variable costs associated with public data while also streamlining the process of getting the data to users around the world.
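The two access modes described in this section can be sketched from the requester’s side as follows. The function names, bucket, and key are hypothetical. The URL helper assumes the common virtual-hosted-style address form, and the download helper assumes the third-party boto3 SDK, which signs the request with the requester’s own credentials.

```python
from urllib.parse import quote


def public_object_url(bucket: str, key: str) -> str:
    """URL for an object in a publicly readable bucket without Requester Pays.

    Assumes the virtual-hosted-style form; no AWS account is needed to fetch it.
    """
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"


def requester_pays_headers() -> dict:
    """Header acknowledging that the requester accepts the charges.

    The header alone is not enough: the request must also be authenticated.
    """
    return {"x-amz-request-payer": "requester"}


def download_as_requester(bucket: str, key: str, dest_path: str) -> None:
    """Fetch an object from a Requester Pays bucket (needs boto3 and credentials)."""
    import boto3  # third-party; imported lazily so the helpers above run without it

    s3 = boto3.client("s3")
    # RequestPayer makes boto3 send the x-amz-request-payer header.
    response = s3.get_object(Bucket=bucket, Key=key, RequestPayer="requester")
    with open(dest_path, "wb") as out:
        out.write(response["Body"].read())


if __name__ == "__main__":
    print(public_object_url("example-shared-data", "scenes/sample file.txt"))
    print(requester_pays_headers())
```

The first helper covers the anonymous case, where the curator pays for transfer. The other two cover the Requester Pays case, where charges go to the account that signed the request.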

4.0 Case Studies

AWS hosts several sets of geospatial data for public access, helping data curators share valuable information around the world and with a variety of organizations and users. We enable data curators to transition from heavily siloed systems (where redundant copies of data sets are stored in multiple locations) to a transparent environment (where one copy of a data set can be centrally maintained, updated, and shared).

4.1 US Geological Survey (USGS) Landsat

The Landsat program is a joint effort of the US Geological Survey (USGS) and the National Aeronautics and Space Administration (NASA). First launched in 1972, the Landsat series of satellites has produced the longest continuous record of Earth’s land surface as seen from space. NASA is in charge of developing remote-sensing instruments and spacecraft, launching the satellites, and validating their performance. USGS develops the associated ground systems, then takes ownership of and operates the satellites, as well as managing data reception, archiving, and distribution. Since late 2008, Landsat data has been made available to all users free of charge. Carefully calibrated Landsat imagery provides the US and the world with a long-term, consistent inventory of vitally important global resources.

AWS has made Landsat 8 data freely available on Amazon S3 so that anyone can use our on-demand computing resources to perform analysis and create new products without needing to worry about the cost of storing Landsat data or the time required to download it. All Landsat 8 scenes from 2015 are available along with a selection of cloud-free scenes from 2013 and 2014. All new Landsat 8 scenes are made available each day, often within hours of production.

Hosting Landsat data on AWS has enabled product development and research by a variety of AWS customers, including startups, mapping software companies, commercial satellite companies, universities, systems integrators, marketing research firms, and agriculture technology firms. Because the data is maintained in a single location, these customers are able to focus on their own product development without needing to dedicate resources to acquiring or storing their own redundant copies of the data.

4.2 National Oceanic and Atmospheric Administration (NOAA) Next Generation Weather Radar (NEXRAD)

AWS has entered into a research agreement with the National Oceanic and Atmospheric Administration (NOAA) to explore sustainable models to increase the output of open NOAA data. Publicly available NOAA data drives multibillion-dollar industries and critical research efforts. Under this new agreement, AWS and its collaborators will look at ways to push more NOAA data to the cloud and build an ecosystem of innovation around it.

The Next Generation Weather Radar (NEXRAD) is a network of 160 high-resolution Doppler radar sites throughout the US and select overseas locations whose data is managed by NOAA. NEXRAD detects precipitation and atmospheric movement and disseminates data in five-minute intervals from each site. As part of the NOAA Big Data Project, the real-time feed and full historical archive of original resolution (Level II) NEXRAD data, from June 1991 to present, are now freely available on Amazon S3 for anyone to use. This is the first time the full NEXRAD Level II archive has been accessible to the public. Now anyone can use the data on demand in the cloud without worrying about storage costs and download time.

4.3 Additional Public Data Sets

AWS’s model of sharing geospatial data can be applied to many other types of public data. We currently host many other sets of public data, including climate, microbiome, genomic, email, and website statistical data.




5.0 Getting Started

AWS provides a leading cloud platform that is unique in maturity and scale. Getting started on AWS is fast and easy, and we welcome the opportunity to address any questions or concerns about transitioning shared data projects to the cloud. Customers can email [email protected] to start a conversation about hosting their shared data sets on AWS.

AWS also provides hands-on learning via qwikLABS to help individuals gain more technical knowledge of the services and products that we offer. The “Using Open Data with Amazon S3” qwikLAB demonstrates how to upload data to Amazon S3 and make it available for anyone to access via a web browser. Users will learn how to create an Amazon S3 bucket, configure it to host a website, upload objects to it, and use JavaScript to display those objects on a web page. Along the way, users will learn some best practices for creating public data. At the end of this lab, users will have deployed a simple website that makes data easy to access and provides basic documentation of the data.


