A Guide to Earth Science Data - IEEE Computer Society

Computing and Climate

A Guide to Earth Science Data: Summary and Research Challenges

Anuj Karpatne and Stefan Liess | University of Minnesota

Recent growth in the scale and variety of Earth science data has provided unprecedented opportunities for big data analytics research to improve our understanding of the Earth’s physical processes. But Earth science datasets exhibit some unique characteristics (such as adherence to physical properties and spatiotemporal constraints) that present challenges to traditional data-centric approaches.

Earth science datasets capture a variety of information about the Earth’s surface. They are obtained via various acquisition methods, at varying domains of coverage (in both space and time), and with varying data characteristics. For example, observational data about the Earth’s surface can be collected by local sensor recordings (in situ data) or via instruments mounted on remote sensing satellites (remote sensing data). These observations are commonly available at nonuniform locations with sparse coverage in space and time, often because a limited number of in situ sensors are unevenly distributed across the world. This makes it necessary to convert them to fixed spatial and temporal grids using various interpolation, aggregation, and sampling techniques. Such techniques range from simple linear interpolation methods to reanalysis techniques that use physically constrained simulations. Another source of information about the Earth’s climatic processes is climate simulations, which are generated by global climate models, also known as general circulation models (GCMs). Given the volume and variety of Earth science data, it’s important to understand their different types and properties to make the best use of big data analytics approaches for Earth science research. Furthermore, Earth science datasets exhibit some unique domain-specific characteristics, such as adherence to physical laws and properties and the presence of spatiotemporal constraints, that both differentiate Earth


Computing in Science & Engineering

1521-9615/15/$31.00 © 2015 IEEE

Copublished by the IEEE CS and the AIP November/December 2015

[Figure 1 image. Panels: global snapshot for time t1 and global snapshot for time t2, each showing gridded layers for NPP, pressure, precipitation, and SST. Labels: latitude, longitude, time, grid cell, zone.]

Figure 1. Schematic of multiple grid-based remote sensing datasets such as sea-surface temperature (SST), precipitation, pressure, and net primary productivity (NPP), represented on a spatial grid, with each grid cell covering a range of latitude and longitude values, and varying with time.

science data from other datasets in real-world applications and make it challenging to use traditional data-centric methods for Earth science research. In this article, we provide a brief overview of the diverse types of datasets used in Earth science research along with some of the major challenges faced by data science research in analyzing Earth science data.

Types of Earth Science Data

Earth science data can be broadly categorized into the following three categories:

■ observational data, which include sensor measurements from various sources, such as ground-based stations and satellite instruments;
■ reanalysis data, which involve physical models in conjunction with observational data for interpolating values over large areas with missing or poor-quality observations; and
■ climate simulations, which are generated using physical models of the Earth’s climatic processes.

We provide a brief description of each of these categories of Earth science data and their respective data sources in the following subsections.

Observational Data

Observational datasets commonly used in Earth science research can be broadly classified into station-based and gridded data.

Station-based data. Because of the large-scale nature of the Earth’s observed systems, Earth science datasets are rarely measured directly on a regular coordinate system, except in small-scale experiments that resemble laboratory studies. Sensor observations in Earth science research are generally obtained from irregularly and nonuniformly spaced stations, such as weather stations over land, instruments on ships and ocean buoys, and balloon measurements over a vertical section of the atmosphere (www.ncdc.noaa.gov/oa/climate/ghcn-daily; www.sparc-climate.org/data-center). They provide the most direct and therefore least error-prone data sources available because they don’t employ complex postprocessing operations for handling missing values or nonuniformity present in the data. However, in situ sensors generally suffer from physical limitations that introduce errors and uncertainty in these datasets. For example, balloon measurements don’t strictly represent a vertical section of the atmosphere, because weather balloons are free to move in the horizontal direction during their ascent, which makes the analysis of such datasets difficult. Hence, in addition to in situ measurements, remote sensors such as ground-based radar imagery and satellite instruments are used to measure diverse properties of the Earth over large distances.

Gridded data. Most Earth science sensor recordings are postprocessed using various interpolation, aggregation, and sampling techniques to provide easily accessible datasets on a fixed spatial grid, with a particular spatial resolution, and available at regular time intervals. Figure 1 provides a schematic representation of multiple grid-based datasets such as sea-surface temperature (SST), precipitation,

www.computer.org/cise




pressure, and net primary productivity (NPP). As an example, station-based surface temperature and precipitation measurements are interpolated to 2D horizontal grids (with spatial resolution ranging from 0.5 degrees to 5 degrees; www.cru.uea.ac.uk), and satellite data usually undergo several steps before being released, such as calibration, orbital correction, quality control, and conversion to regular grids (http://disc.sci.gsfc.nasa.gov/AIRS/documentation/v5_docs/AIRS_V5_Release_User_Docs/V5_Data_Release_UG.pdf). Satellites have successfully monitored several attributes of the Earth, such as surface temperature, humidity, clouds, and the chemical composition of the atmosphere. However, certain details of atmospheric thermodynamics and dynamics need to be resolved by in situ measurements, especially when sensor observations are obtained from spatially distant remote sensors such as satellites, which makes it difficult for simple interpolation algorithms to work. For example, these remote sensors can’t provide continuous measurements simultaneously over all locations, because a polar-orbiting satellite moves along its orbit around the Earth and views only part of the surface at any given time.

Reanalysis Data

When observational sensor datasets are sparsely available or irregularly placed, rendering simple interpolation methods ineffective, comprehensive physical models must be used in conjunction with observed sensor recordings to calculate possible values over large areas with missing values or data of poor quality. The analysis and interpolation of observed sensor recordings require physical knowledge of the natural processes involved to fill in missing or poor-quality values. For atmospheric and oceanic data, this knowledge is provided by general circulation models (http://gmao.gsfc.nasa.gov/merra), such as the Goddard Earth Observing System Data Assimilation System Version 5 (GEOS-5; http://gmao.gsfc.nasa.gov/systems/geos5). After generating analysis fields at each time by assimilating observations into the physical model,



the model verifies the consistency of the data products over time and makes appropriate adjustments to the data by balancing between observational uncertainty and acceptable noise in the system. This step is referred to as reanalysis. Multiple reanalysis products are currently available, each with different input sources and a different underlying physical model (http://reanalyses.org).

Climate Simulations

Climate data can be simulated using physical atmospheric or oceanic circulation models, by modeling one component and using the other components as boundary conditions, or by using coupled models of multiple components. The state-of-the-art coupled models also include land surface and chemistry models that cover vegetation changes, volcanic eruptions, and other small-scale influences on the global climate system. Boundary conditions include projections of future anthropogenic influences, such as greenhouse gas emissions, ozone loss, and land surface changes. A recent collection of coupled model simulations can be found in the fifth Coupled Model Intercomparison Project (CMIP5) data archive,1 which can be downloaded from the Earth System Grid Federation at http://pcmdi9.llnl.gov.

Table 1 lists commonly used datasets in Earth science research, along with their respective data sources for accessing them.

Data-Centric Challenges

The analysis of Earth science datasets using big data analytics approaches poses several unique challenges: some are due to the inherent spatiotemporal nature of Earth science data, and others are specific to the target application. The following subsections shed some light on the key characteristics and challenges in analyzing Earth science datasets.

Uncertainty and Incompleteness

Earth science datasets are plagued with noise, uncertainty, and incompleteness due to sensor interference and instrument malfunctions. This issue is particularly acute in the case of remote sensors,

Table 1. Datasets used in Earth science research.

Data type: Spectral reflectance
Sources: Centre National d’Etudes Spatiales (http://missions-scientifiques.cnes.fr/VEGETATION), NASA (http://modis-land.gsfc.nasa.gov), National Oceanic and Atmospheric Administration (NOAA; http://noaasis.noaa.gov/NOAASIS/ml/avhrr.html), US Geological Survey (USGS; http://landsat.usgs.gov)
Use: Spectral reflectance is used to compute vegetation indices, surface temperature, and several other variables that are fundamental to studies in forestry, agriculture, and urbanization

Data type: River discharge
Sources: German Federal Institute of Hydrology (www.bafg.de/cln_030/nn_266918/GRDC/EN), University of Wisconsin-Madison, Center for Sustainability and the Global Environment (SAGE; http://nelson.wisc.edu/sage/data-and-models/riverdata/index.php)
Use: River discharge levels are an important component of the hydrological cycle, which is in turn connected to agriculture and urbanization

Data type: Nighttime lights
Sources: US Department of Defense, NOAA (www.esrl.noaa.gov/gmd/ccgg)
Use: Mapping urbanization dynamics

Data type: Aerosols
Sources: NASA (http://aeronet.gsfc.nasa.gov), World Data Centre for Aerosols (http://ebas.nilu.no), Japan Aerospace Exploration Agency (http://data.gosat.nies.go.jp)
Use: Atmospheric aerosol concentration is often higher in urban areas; it impacts regional temperature and precipitation patterns

Data type: Carbon cycle greenhouse gases
Sources: NOAA (www.esrl.noaa.gov/gmd/ccgg)
Use: Impacts of land disturbances on the carbon cycle

Data type: Digital elevation model
Sources: Ministry of Economy, Trade, and Industry (METI; http://jspacesystems.or.jp/ersdac/GDEM/E/4.html), USGS (http://eros.usgs.gov/#/Find_Data/Products_and_Data_Available/SRTM)
Use: Topography can affect landslide risk, spread of wildfires, agricultural productivity, potential for urbanization, and so on

Data type: Climate records
Sources: National Climatic Data Center (NCDC; www.ncdc.noaa.gov/oa/climate/ghcn-daily), SPARC (www.sparc-climate.org/data-center)
Use: Datasets obtained through in situ or satellite-based sensors, consisting of information about the land, ocean, and atmosphere such as temperature, pressure, sea surface height, precipitation, and so on

Data type: Reanalysis data
Sources: Modern-Era Retrospective analysis for Research and Applications (MERRA; http://gmao.gsfc.nasa.gov/merra), Goddard Earth Observing System Data Assimilation System Version 5 (GEOS-5; http://gmao.gsfc.nasa.gov/systems/geos5), Reanalysis Intercomparison and Observations (RIO; http://reanalyses.org)
Use: Model data generally supplement observational data with physics-based models, such as GCMs, for interpolating and projecting over regions and time intervals where sensor observations are sparse

Data type: Climate simulations
Sources: CMIP5 (Earth System Grid Federation; http://pcmdi9.llnl.gov), CMIP3 (http://www-pcmdi.llnl.gov/ipcc/about_ipcc.php)
Use: Model data from GCMs are provided as is, with no assimilation of observed values; models provide a sense of general climate variability, but their results aren’t expected to provide information about specific time steps

where atmospheric (clouds and other aerosols) and surface (snow and ice) interference is constantly encountered.

Temporal Variability

Ecosystem observations tend to have a high degree of temporal variation. For example, vegetation data such as greenness usually change naturally on multiyear scales, but infrequent and local events such as forest fires and logging can induce short-term events in naturally occurring spatiotemporal processes. Handling such naturally occurring temporal variations is necessary to avoid detection of spurious patterns.
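One simple way to separate natural variation from abrupt events can be sketched as follows. This is a minimal illustration, not a method described in this article: it assumes a monthly series with a repeating 12-month cycle, subtracts each calendar month’s long-term mean (a climatology), and looks at what remains. The function name `monthly_anomalies`, the synthetic greenness values, and the simulated fire are all hypothetical.

```python
from statistics import mean


def monthly_anomalies(series, period=12):
    """Subtract the long-term mean of each calendar month from a series.

    Assumes the series starts at the first month of a cycle and that
    natural variation repeats every `period` time steps.
    """
    # Long-term mean for each position in the cycle (month 0, month 1, ...).
    climatology = [mean(series[m::period]) for m in range(period)]
    return [value - climatology[i % period] for i, value in enumerate(series)]


# Hypothetical greenness index: three identical years of a seasonal cycle.
cycle = [0.2, 0.3, 0.5, 0.7, 0.8, 0.9, 0.9, 0.8, 0.6, 0.4, 0.3, 0.2]
greenness = cycle * 3

# Hypothetical abrupt event (for example, a fire) in month 5 of the third year.
greenness[2 * 12 + 5] = 0.3

anomalies = monthly_anomalies(greenness)

# The seasonal cycle no longer dominates; the largest negative anomaly
# marks the time step of the abrupt drop.
drop_index = min(range(len(anomalies)), key=lambda i: anomalies[i])
print(drop_index, round(anomalies[drop_index], 2))
```

A raw threshold on the greenness values would flag every winter minimum; working on anomalies avoids those spurious detections, at the cost of assuming that the cycle is stable from year to year.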


Spatial Autocorrelation

Tobler’s first law of geography states that “Everything is related to everything else, but near things are more related than distant things.”2 Thus, the spatial dependence of Earth science data needs to be incorporated to obtain physically consistent results.

Spatial Heterogeneity

Climate and ecosystem processes exhibit a high degree of variability in space, due to changes in geography, topography, and climatic conditions in different regions of the Earth. This heterogeneity in the data drives the need for developing local or regional models, each corresponding to a homogeneous group of locations, instead of learning a single global model that’s applicable across all regions around the world.

Multiresolution and Multiscale

Changes occurring on the Earth’s surface appear at different spatial and temporal scales. For example, events such as urbanization, fires, and deforestation tend to impact smaller areas than droughts. The degree of spatial heterogeneity of each dataset determines the grid size necessary to resolve important characteristics. Some datasets, such as population and political borders (important for connecting events to political decision making), are usually available for predefined regions and need to be interpolated to the gridded space. Weighted average values can be attributed to grid cells that cover multiple spatial regions. One common approach is to build a bridge between these disparate scales and develop algorithms that can identify patterns at multiple resolutions without upsampling the entire data to the highest resolution.
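The weighted-average attribution mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the weights are taken to be the areas of overlap between a grid cell and each region, the attribute is an intensive quantity such as population density, and the function name `cell_value` and the numbers are hypothetical.

```python
def cell_value(overlaps):
    """Area-weighted average of a regional attribute for one grid cell.

    overlaps: list of (region_value, overlap_area) pairs, one for each
    region that intersects the cell. Areas can be in any consistent unit.
    """
    total_area = sum(area for _, area in overlaps)
    if total_area == 0:
        raise ValueError("grid cell does not overlap any region")
    return sum(value * area for value, area in overlaps) / total_area


# Hypothetical grid cell: 75% of its area lies in a region with density 100,
# and the remaining 25% lies in a region with density 20.
density = cell_value([(100.0, 0.75), (20.0, 0.25)])
print(density)
```

Area weighting suits intensive quantities such as densities or rates. Extensive quantities such as total population would instead need to be split among cells in proportion to the overlap, which is a different calculation.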

The growing volume, variety, and complexity of Earth science data pose several challenges to big data analytics for conducting Earth science research. This motivates the need for developing novel scientific computing methodologies for addressing these challenges and advancing our understanding of the Earth’s physical processes. To this effect, recent advances in machine learning research, such as the use of sparse structured regularization approaches for incorporating spatiotemporal constraints3,4 and the use of heterogeneous machine learning approaches for handling various forms of heterogeneity in data,5,6 offer promising applications in the Earth science domain, motivating future directions of research in these areas.




Acknowledgments

This research was supported in part by National Science Foundation Awards 1029711 and 0905581, NASA Award NNX12AP37G, the University of Minnesota Doctoral Dissertation Fellowship, and the University of Minnesota Informatics Institute Fellowship. Access to computing facilities was provided by the Minnesota Supercomputing Institute.

References

1. K.E. Taylor, R.J. Stouffer, and G.A. Meehl, “An Overview of CMIP5 and the Experiment Design,” Bulletin Am. Meteorological Soc., vol. 93, no. 4, 2012, pp. 485–498.
2. W. Tobler, “A Computer Movie Simulating Urban Growth in the Detroit Region,” Economic Geography, vol. 46, no. 2, 1970, pp. 234–240.
3. A.R. Goncalves et al., “Multi-Task Sparse Structure Learning,” Proc. 23rd ACM Int’l Conf. Information and Knowledge Management, 2014, pp. 451–460.
4. K. Subbian and A. Banerjee, “Climate Multi-Model Regression Using Spatial Smoothing,” Proc. SIAM Int’l Conf. Data Mining, 2013, pp. 324–332.
5. A. Karpatne et al., “Predictive Learning in the Presence of Heterogeneity and Limited Training Data,” Proc. SIAM Int’l Conf. Data Mining, 2014, pp. 253–261.
6. R.R. Vatsavai, “Gaussian Multiple Instance Learning Approach for Mapping the Slums of the World Using Very High Resolution Imagery,” Proc. 19th ACM Int’l Conf. Knowledge Discovery and Data Mining, 2013, pp. 1419–1426.

Anuj Karpatne is a PhD candidate in the Department of Computer Science at the University of Minnesota, Twin Cities. His research focuses on developing data mining algorithms for scientific applications involving Earth science data. Karpatne has an Integrated M.Tech in mathematics and computing from the Indian Institute of Technology Delhi. Contact him at anuj@cs.umn.edu.

Stefan Liess is a research associate at the University of Minnesota. His research interests cover climate and climate change research with observational data and dynamical models, focusing especially on climate dynamics, interactions between climate and vegetation, and interactions between the troposphere and the stratosphere. Liess received a PhD in meteorology from the University of Hamburg. He’s a member of the American Meteorological Society and the American Geophysical Union. Contact him at [email protected].
