
© IWA Publishing 2016 | Journal of Hydroinformatics | in press | 2016

Big data and hydroinformatics

Yiheng Chen and Dawei Han

ABSTRACT

Big data is popular in the areas of computer science, commerce and bioinformatics, but is at an early stage in hydroinformatics. Big data originated from extremely large datasets that cannot be processed within a tolerable elapsed time using traditional data processing methods. By analogy with object-oriented programming, big data should be considered as objects encompassing the data, its characteristics and the processing methods. Hydroinformatics can benefit from big data technology, with newly emerged data, techniques and analytical tools to handle large datasets, from which creative ideas and new values can be mined. This paper provides a timely review of big data and its relevance to hydroinformatics. A further exploration of precipitation big data is discussed, because the estimation of precipitation is an important part of hydrology for managing floods and droughts and understanding the global water cycle. It is promising that the fusion of precipitation data from remote sensing, weather radar, rain gauges and numerical weather modelling could be achieved through parallel computing and distributed data storage, which would trigger a leap in precipitation estimation, as the available data from multiple sources could be fused to generate a better product than those from single sources.

Yiheng Chen (corresponding author) and Dawei Han
Water and Environment Management Research Centre, Department of Civil Engineering, University of Bristol, Bristol BS8 1TR, UK
E-mail: [email protected]

Key words | big data, data fusion, hydroinformatics, precipitation estimates

doi: 10.2166/hydro.2016.180

INTRODUCTION

The inevitable trend of big data, along with the growing capability to handle huge datasets, is reshaping how we understand the world. According to Google Scholar, the number of publications containing the phrase 'big data' in the title and the number of publications about big data and water are shown in Figure 1, revealing that interest in big data has risen dramatically since 2010; however, research on big data in hydroinformatics is still at a very early stage. This is itself a very simple example of so-called big data analysis, as the result is based on searching a vast number of academic publications, powered by Google Scholar. Google Scholar's index of academic publications provides internet users with a very efficient way to find academic publications. The value of the online search engine is its lightning fast speed, which enables the user to get results from the ocean of online information in mere milliseconds. Another application of big data is precision marketing: the online movie subscription rental service provider Netflix bases its recommendation system on hundreds of millions of accumulated anonymous movie ratings, to improve the probability that users rent the movies recommended by Netflix (Bennett & Lanning).

Although the popularity of big data is related to its commercial value, we believe that the idea of big data can benefit hydroinformatics research for multiple reasons. First, big data analysis encourages the utilization of multiple datasets from various sources to discover big trends. Second, the computing tools developed for big data analysis, e.g., parallel computing and distributed data storage, can help tackle data-intensive jobs in the field of hydroinformatics. Third, the novel correlations found by mining various large datasets have the potential to lead to new scientific exploration. Apart from the companies in the internet industry working closely with data from the internet, scientists have collected substantial amounts of data for hydrology, meteorology and earth observation, with a history much longer than that of the internet.

Figure 1 | Left: the number of publications about big data. Right: the number of publications about big data and water, according to Google Scholar.

The development of the internet and the movement of open data significantly accelerate data sharing and improve the accessibility of archived data. The hydroinformatics community will benefit from the active combination of huge amounts of data and data processing technologies for knowledge discovery and management. Precipitation is one important part of the water cycle in hydrology. The accumulated precipitation datasets from heterogeneous sources, e.g., rain gauges, weather radars, satellite remote sensing and numerical weather models, have reached tens of terabytes in size, with different characteristics, i.e., spatial and temporal coverage, resolution and uncertainties. Data fusion is a possible method to utilize the accumulated datasets to produce a better result with enhanced resolution and minimized uncertainty.

This paper consists of three parts. The first part starts with an explanation of the concept of big data, then introduces the popular Apache Hadoop family for handling large amounts of data and seven classes of data analysis models, and discusses important ideas developed in the big data era. The second part discusses the impact of big data on hydroinformatics, with a focus on the issues of data sharing. The third part emphasizes the future of precipitation data fusion as one promising big data utilization in the area of hydroinformatics.

BACKGROUND

This section aims to introduce the popular term 'big data', starting with the example of the Google Flu Detector (GFD), followed by an explanation of the concept of 'big data'. Once we have huge amounts of data, how to physically store and process the data becomes tricky. The conflict between the boom of big data and the data storage hardware, whose I/O speed is limited by the physical mechanism of the hard disk, stimulated the development of parallel computing and distributed data storage. After discussing how to effectively manage large datasets, seven types of data modelling algorithms are summarized. Furthermore, when the correlation between datasets is successfully modelled, whether to only utilize the correlation or to discover more scientific knowledge is discussed.

Google Flu Detector

Currently, the concept of big data is popular in the analysis of sociology, public health, business and bioinformatics. The ever-expanding internet is attracting people's attention as one major data source. The data on the internet, especially new media, are generated by individuals, reflecting their daily lives, emotions, shopping preferences, etc. Without doubt, these types of data can be readily utilized in the fields of online business, public health and sociology, as these topics mainly focus on individual behaviours. In fact, big data analysis opens a new way for researchers in these areas to find out what is actually happening from the recorded online behaviours of individuals.

Google developed a flu detector that monitors health-seeking behaviour in the form of the online web search queries submitted by millions of users around the world every day. The methodology was to find the best matches among 50 million search terms to fit 1,152 flu data points from the Centers for Disease Control and Prevention (CDC). By analysing the large numbers of search queries, Google found 45 search terms that, when used in a mathematical model, were strongly correlated with the percentage of physician visits for influenza-like symptoms, based on which the GFD estimates the level of weekly influenza activity with a 1-day reporting lag (Ginsberg et al.).
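As a toy illustration of that term-selection step (not Google's actual model; all data below are synthetic), one can rank candidate query terms by their correlation with a reported-illness series and fit a simple linear model on the best-ranked terms:

    import numpy as np

    rng = np.random.default_rng(42)
    weeks, n_terms = 104, 200
    ili = rng.gamma(2.0, 1.0, weeks)              # synthetic stand-in for CDC ILI rates
    queries = rng.normal(0.0, 1.0, (n_terms, weeks))
    queries[:5] += 2.0 * ili                      # a few terms genuinely track illness

    # rank every term by the strength of its correlation with the ILI series
    corr = np.array([np.corrcoef(q, ili)[0, 1] for q in queries])
    top = np.argsort(-np.abs(corr))[:45]          # keep the 45 best-matching terms

    X = np.column_stack([queries[top].T, np.ones(weeks)])
    beta, *_ = np.linalg.lstsq(X, ili, rcond=None)  # simple least-squares fit
    estimate = X @ beta                             # weekly illness estimate
    print(round(float(np.corrcoef(estimate, ili)[0, 1]), 3))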


From the perspective of Google users, they tend to consult the accessible internet rather than immediately consulting a doctor when they feel a bit ill. The GFD predicts influenza activity from user query logs, though with some noise, responding much faster than the CDC with its 2-week reporting lag, which gives Google an advantage over the traditional disease control method. However, the GFD does not always perform well. In 2009, its underestimation of influenza-like illness (ILI) in the United States during the swine flu pandemic forced Google to modify its algorithm, as people's search behaviour changed due to the exceptional nature of the pandemic. In December 2012, it overestimated doctor visits for ILI at more than double the CDC figure (Butler).

Despite the advantages of the quick response and reasonable accuracy of the GFD, the uncertainty from human search behaviour that led the model to depart from the CDC data cannot be ignored. This type of uncertainty is embedded within the mechanism of the analysis, and may only be overcome by an improved algorithm. Regardless of the weaknesses of the GFD, the point is that the apparent value of the data may be only the tip of the iceberg. Google started its business by providing an online search service for internet users, without the purpose of predicting the outbreak and threat of influenza, but the search query logs became extremely valuable after being accumulated for several years. The reason for this is that Google effectively collected the information that search engine users want to know at a certain time and a certain location. The big information pattern, contributed by millions of users around the world, showed the additional big value behind search query data. To summarize, the GFD shows two features of big data analysis: crowdsourcing and by-products.

What is big data

The fashionable term 'big data' is sometimes so hot that many people attempt to embrace it in this data-rich era without a clear understanding. The term 'big data' is simple, but that makes its meaning ambiguous; it is commonly used to describe datasets with quantity and complexity beyond the capacity of normal computing tools to capture, curate, manage and process at a tolerable speed (Snijders et al.). Another explanation of big data refers to developing new insights or creating new values at a large scale instead of a smaller one (Mayer-Schönberger & Cukier). De Mauro et al. investigated 14 existing definitions of big data, and proposed a formal definition:

'Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value.'

This definition can be subdivided into three groups: the characteristics of the datasets; the specific technologies and analytical methods to manipulate the data; and the ideas to extract insights from the data and create new values. Therefore, big data is not just about massive amounts of data. In general, the goal of big data analysis is knowledge discovery from massive datasets, which is a challenging systematic problem. Data analysis systems should: utilize the existing hardware platform with distributed and parallel computing; accommodate a variety of data formats, models, loss functions and methods; be highly customizable for users to specify their data analysis goals through an expressive but simple language; provide useful visualizations of key components of the analysis; communicate with other computational platforms seamlessly; and provide many of the capabilities familiar from large-scale databases (Council).

The expanding data vs. the developing computing power

The typical big data characteristics include high volume (the quantity of data generated), high velocity (the speed of collecting data) and high variety (the category of data) (Laney). The concern is whether existing computing systems can handle the increasingly large data. An International Data Corporation (IDC) report has estimated that the data size of the world will grow from 130 exabytes (10¹⁸ bytes) in 2005 to 40 zettabytes (10²¹ bytes) in 2020, a 40% annual increase (Gantz & Reinsel).


New datasets are continuously being collected from the internet, the Internet of Things, remote sensing networks, e-commerce, wearable devices, etc. Unfortunately, only 3% of all data are properly tagged and ready for use, and only 0.5% of data are analysed, which leaves a large potential market for data utilization (Burn-Murdoch). The actual data size needed depends on the task of data analysis, which further scales down the size of data to be processed. On the other hand, data storage capacity has increased dramatically in the past decades. In 1956, IBM made the first commercial disk drive, with a capacity of 3.75 MB (Oracle). In 1980, the world's first gigabyte-capacity disk drive (2.52 GB), the IBM 3380, was the size of a refrigerator. Twenty-five years later, the first 500 GB desktop hard drive was shipped (Dahl), followed by the 1 TB drive in 2007 (Perenson). In 2014, Western Digital shipped an 8 TB hard drive and announced the world's first 10 TB hard drive (Hartin & Watson). The unit cost of data storage is predicted to drop from $2.00 per GB to $0.20 per GB between 2012 and 2020 (Gantz & Reinsel). The storage of data should no longer be a big problem, owing to massive storage technologies such as Direct Attached Storage (DAS), Network Attached Storage (NAS) and Storage Area Networks (SAN), as well as cloud data storage.

The storage capacity of the hard disk keeps increasing; nevertheless, the I/O speed of the hard disk grows slowly due to the limitation of the hard disk mechanism. The solid state disk (SSD) has a much higher I/O rate and negligible seek time; however, its cost per unit of storage is much higher than that of the hard disk. Regardless of the cost, the SSD also has a lower storage capacity per single device. The I/O speed of the data storage devices, rather than the storage capacity, is the bottleneck of extremely large data processing.

The MapReduce parallel computing

An appropriate software system, apart from the development of the hardware system, is essential for dealing with extremely large datasets. As the improvement in the I/O speed of the hardware did not catch up with the expansion of data storage, the time required to process data increased dramatically without an appropriate algorithm. Parallel computing and distributed storage were developed to counter this issue. MapReduce is a distributed programming model for processing and generating large datasets, developed by Google. The idea of MapReduce is to specify a Map and a Reduce function which are suitable for parallel computing, while the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. As the size of datasets is extremely large for big data problems, a cluster of machines connected in a network is used to overcome the limits on computing power and data storage of a single machine, but the network bandwidth becomes the bottleneck as it is a scarce resource. Thus, the MapReduce system is optimized to reduce data transfer across the network, by sending the code to the local machine and writing the intermediate data to local disk. The MapReduce system minimizes the impact of slow machines, and can handle machine failures and data loss by redundant execution. The success of the MapReduce programming model relies on several things. First, the model automatically deals with the details of parallelization, fault tolerance, locality optimization and load balancing, which makes it easy for programmers even without experience of parallel and distributed computing. Second, the Map and Reduce functions are capable of a variety of applications, such as sorting, data mining, machine learning, etc. Third, MapReduce can scale up to large clusters of thousands of commodity machines, which means the computing resources can be utilized for big purposes (Dean & Ghemawat). Hadoop is an open-source version of the MapReduce framework developed by Apache, freely available to the scientific community. Hadoop contains the Hadoop Distributed File System (HDFS), working together with MapReduce, created after Google published the technical details of the Google File System (Ghemawat et al.). Apache Hadoop also contains Hadoop Common, the common utilities that support the other Hadoop modules, and Hadoop YARN, a framework for job scheduling and cluster resource management.
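To make the programming model concrete, below is a minimal single-machine sketch of the Map/Reduce idea in Python (a simulation of the model, not the Hadoop API); the record layout (station, date, rainfall in mm) is a hypothetical example:

    from collections import defaultdict
    from functools import reduce

    # hypothetical rainfall records: (station_id, date, rainfall_mm)
    records = [
        ("S1", "2016-01-01", 4.2),
        ("S2", "2016-01-01", 0.0),
        ("S1", "2016-01-02", 7.5),
    ]

    def map_fn(record):
        station, _date, rain = record
        yield (station, rain)            # emit intermediate key/value pairs

    def reduce_fn(a, b):
        return a + b                     # associative, so parts can run in parallel

    # the 'shuffle' step: group intermediate values by key
    # (on a cluster, the framework does this across machines)
    groups = defaultdict(list)
    for rec in records:
        for key, value in map_fn(rec):
            groups[key].append(value)

    totals = {key: reduce(reduce_fn, values) for key, values in groups.items()}
    print(totals)                        # {'S1': 11.7, 'S2': 0.0}

Because the reduce function is associative, the per-key reductions can be computed on different machines and merged, which is exactly what allows the runtime to parallelize the job.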


There are many other Apache projects related to Hadoop, including HBase (a scalable, distributed database that supports structured data storage for large tables), Hive (a data warehouse infrastructure that provides data summarization and ad hoc querying), Mahout (a scalable machine learning and data mining library), Pig (a high-level data-flow language and execution framework for parallel computation) and ZooKeeper (a high-performance coordination service for distributed applications), etc. (Apache).

Hadoop MapReduce has a weakness in iterative data analysis, in that the intermediate datasets are stored on the local hard disk. As iterative data analysis requires multiple reads and writes of local intermediate data, this dramatically slows down the analysis. This happens to most machine learning algorithms, e.g., gradient descent. Apache Spark is the latest programming model in the big data world, featuring lightning fast data processing for iterative jobs (Zaharia et al.). Spark achieves its speed by implementing Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets the programmer perform in-memory computation (Zaharia et al.). Spark outperforms Hadoop by 20 times in speed by utilizing RAM instead of the hard disk to store the intermediate data.
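As a minimal sketch of the in-memory idea, the following PySpark fragment caches an RDD and then makes several passes over it; the file path, column layout and the toy trimming loop are assumptions for illustration, not code from the Spark distribution:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "iterative-demo")

    # parse a hypothetical CSV of rainfall values and keep the RDD in memory;
    # without cache(), every pass below would re-read and re-parse the file
    values = (sc.textFile("rainfall.csv")
                .map(lambda line: float(line.split(",")[2]))
                .cache())

    mean = values.mean()
    for _ in range(5):                   # a toy iterative refinement (trimmed mean)
        sd, m = values.stdev(), mean
        mean = values.filter(lambda x: abs(x - m) < 3 * sd).mean()
    print(mean)

Each pass reuses the cached partitions in RAM, which is where Spark's reported speed-up over disk-bound MapReduce for iterative jobs comes from.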


Modelling big data

There are many data-based computational methods, and they can be classified as 'the seven computational giants of massive data analysis' (Council). Data-based computing is facing challenges due to the expansion of data volume and dimensionality. The first giant is basic statistics, including calculating the mean, variance and moments; estimating the number of distinct elements; counting and frequency analysis; and calculating order statistics such as the median. These tasks typically require O(N) calculations for N data points.

The second computational giant is the generalized N-body problem, including nearly any problem involving distances, kernels, or other similarities between pairs or higher-order n-tuples of data points. The computational complexity is typically O(N²) or O(N³). N-body problems are involved in range searches, nearest-neighbour search problems and the nearest-neighbour classification problem. They also appear in nonlinear dimension reduction methods, also known as manifold learning methods. N-body problems are related to kernel computation, in kernel estimators – such as kernel density estimation, kernel regression methods, radial basis function neural networks and mean-shift tracking – and in modern methods such as support vector machines and kernel principal components analysis (PCA). Other instances include k-means, mixtures-of-Gaussians clustering, hierarchical clustering, spatial statistics of various kinds, spatial joins, the Hausdorff set distance, etc.

Graph-theoretic computation is the third giant, including problems involving graph traversal. The graph can be either the data itself or a statistical model in the form of a graph, depending on the nature of the problem. Common statistical computations include betweenness, centrality and commute distances, used to identify nodes or communities of interest. Nevertheless, challenges arise when computing on large-scale, sparse graphs. When the statistical model takes the form of a graph, graph-search algorithms remain important, but there is also a need to compute marginal probabilities and conditional probabilities over graphs, operations generally referred to as 'inference' in the graphical models literature.

The fourth computational giant is linear algebraic computation, including linear systems, eigenvalue problems and inverses, which underlie a large number of linear models, e.g., linear regression, PCA and many variants. Many of them are suitable for generic linear algebra approaches, but there are two important issues. One is that the optimization in statistical learning problems does not necessarily need to be trained to high accuracy, in order to avoid overfitting. Another important difference is that multivariate statistics has its own matrix form, that of a kernel (or Gram) matrix, while computational linear algebra involves techniques specialized to take advantage of certain matrix structures. In kernel methods, such as Gaussian process regression or kernel PCA, the kernel matrix can be too large to be stored explicitly, probably requiring matrix-free algorithms.

Optimization is the fifth giant in massive data analysis. Linear algebraic computations are the main subroutine of second-order optimization algorithms. Non-trivial optimizations will continue and become increasingly common as methods become more sophisticated. Linear programming, quadratic programming and second-order cone programming are involved in support vector machines and recent classifiers, and semidefinite programming appears in manifold learning methods. Other standard types of optimization problems, e.g., geometric programming, are likely to be applied in data analysis in the near future.

The sixth giant is the integration of functions, which is required for fully Bayesian inference, and also in non-Bayesian settings, most notably random effects models. The integrals that appear in statistics are often expectations. The frontier is the high-dimensional integrals arising in Bayesian models for modern problems. The approaches for this problem include Markov chain Monte Carlo, or sequential Monte Carlo in some cases, approximate Bayesian computation (ABC) operating on summary data, and population Monte Carlo, a form of adaptive importance sampling.

Alignment problems are the seventh giant, consisting of problems involving matchings between two or more data objects or datasets, such as data integration and data fusion. The fundamental alignment problems are usually carried out before performing further data analysis.
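To ground the first of these giants in code: the mean and variance of a data stream can be computed in a single O(N) pass without holding the data in memory; a minimal sketch using Welford's online algorithm follows.

    def running_stats(stream):
        # one pass, O(N) time and O(1) memory, numerically stable
        n, mean, m2 = 0, 0.0, 0.0
        for x in stream:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)   # accumulates squared deviations
        variance = m2 / (n - 1) if n > 1 else 0.0
        return n, mean, variance

    print(running_stats(iter([1.2, 0.0, 3.4, 2.2])))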

Correlation vs. causation

The most significant part of the big data concept is the fundamental and innovative ideas that change how people interact with the world. The enrichment of available data enables people to consider the entire system rather than taking a few samples, so that scientists can discover trends or phenomena that cannot be revealed with small data. The idea of big data always encourages one to think bigger, to broaden the horizon to cover a big scope rather than focus on a few small areas. Moreover, big data analysis focuses on correlation rather than causation, in that correlations between datasets do not necessarily lead to causation, or in that making use of the correlation is sometimes more valuable than exploring the causation behind it (Mayer-Schönberger & Cukier). For simplicity, the associations between two variables can be classified into three types, i.e., causation, common response and confounding. Causation means a direct cause-and-effect connection between variables, revealing that they are strongly correlated. Common response means that the association between variables is in fact caused by another, lurking variable: the change in the observed variables is a response to changes in the hidden variable, even though the observed variables have no direct causal link. Two variables are confounded when their effects on a response variable cannot be distinguished from each other; the confounded variables may be either explanatory variables or lurking variables (Moore & McCabe). Even the association between two variables is not that simple, so it is no wonder how complex it is to understand practical problems with multiple variables. Focusing on correlation makes practical data mining applications much easier, without too much effort spent on the causation.

Machine learning, the artificial intelligence method, is the typical algorithm lying behind big data analysis, and it is a 'black box' model. Users feed inputs to the machine learning algorithm and get outputs from it without knowing what really happens in the data training process. This process is practically useful without the causation behind it necessarily being understood, but causation is what scientists are always seeking. For academic purposes, detecting the potential of data correlation is not the ultimate goal. Instead, big data should help the development of science in such a way that novel associations between big datasets can be detected to motivate further research into the causation. From the control theory perspective, scientific exploration is to open the 'black box' of the objective world iteratively. On the other hand, a scientific model developed from analysing large datasets can then be validated through the correlation of the datasets. Figure 2 gives a clear illustration of the ideas stated above. The major difference is that science focuses on causation, either derived from correlation or validated through correlation, while big data analysis in industry focuses on the value of correlation in the data.

Relevance to hydroinformatics

Hydroinformatics, which originated from computational hydraulics, comprises the application of information and communications technologies (ICTs) to the understanding and management of the waters of the world (Abbott), addressing the increasingly serious problems of the equitable and efficient use of water for different purposes. Once the term hydroinformatics was defined, it came to mean integrating artificial intelligence into numerical simulation and modelling, and shifting computation-intensive analysis towards information-based research. The two main lines of hydroinformatics, data mining for knowledge discovery and knowledge management (Abbott), are strongly dependent on information, of which data, both textual and non-textual, is the major carrier.

Figure 2 | The relationship between the datasets, correlation and causation.

Data from smart meters, smart sensors and smart services, remote sensing, earth observation systems, etc., will prompt hydroinformatics into the inevitable big data era. The challenge of big data and data mining for environmental projects is the most pressing one in the near future (Pierson). One simple example of big data analysis is text mining. It was carried out, for the 50th anniversary of Water Resources Research, to produce word clouds, shown in Figure 3, based on highly cited papers from each decade of Water Resources Research, which provide a visual representation of the themes emphasized in each decade (Rajaram et al.).

Data for hydroinformatics

In general, water-related problems are quite complex due to the interrelationships between water-related environmental, social and business factors. The data being generated and collected relevant to hydroinformatics feature huge volumes and multiple types. For the purpose of simplification, and without loss of generality, the data sources for hydroinformatics can be classified into three dimensions, i.e., the natural dimension, the social dimension and the business dimension.

The natural dimension is about water as one important component of the natural environment. Understanding the water cycle, the temporal and spatial distribution of water, and the interaction of water and the environment is part of the objectives of hydroinformatics for improving water resource management and flood and drought management. Water-related data include measurements of precipitation (rainfall, snow and hail), river flow, water quality, soil moisture, soil characteristics, groundwater condition, air temperature and humidity, solar flux, etc. The observation methods developed range from local stations for point measurement to remote sensing – radar, satellites and drones. Earth observation satellites are generating huge volumes of data, including weather- and water-related information. ESA launched SMOS for soil moisture observation in 2009, and will launch ADM-Aeolus for atmospheric dynamics observation in 2017 (ESA). NASA launched SMAP to map soil moisture and determine the freeze or thaw state in 2015 (SMAP). The GPM mission launched in 2015 aims to provide global rain and snow observations, building upon the success of TRMM, launched in 1997 (NASA). EUMETSAT has two generations of active METEOSAT satellites in geostationary orbit and a series of three polar orbiting METOP satellites for weather nowcasting and forecasting and for understanding climate change. Without doubt, the increasing amount of earth observation data, including precipitation, soil moisture, wind speed, etc., will improve the understanding of the global water cycle, and benefit weather forecasting and flood and drought prediction. Unfortunately, although many satellites have been launched or are to be launched, the huge amount of available data is rarely used; only 3–5% of data is used on a daily average, while billions of dollars are invested annually (Selding). Apart from the earth observation data, reanalysis data are another important information source, with high data quality.


In other words, the information source is not limited to observations of the current situation and the archived past situation; the model-generated data cannot be neglected. Reanalysis of archived observations is achieved by combining advanced forecast models and data assimilation systems to create global datasets of the atmosphere, land surface and oceans, since an operational analysis dataset will suffer from inconsistency due to the frequent improvements of the forecast models. The NCEP Climate Forecast System Reanalysis includes over 80 variables, goes back to 1948 and is continuing (National Centers for Environmental Prediction). ECMWF has a series of ERA projects for global atmospheric reanalysis that can be traced back to 1957 (ECMWF). The Japan Meteorological Agency conducted the JRA-55 project for a high-quality homogeneous climate dataset covering the last half century (Kobayashi et al.). The model-generated data are four-dimensional, with three dimensions in space and one in time, and of high spatial and temporal coverage and resolution, resulting in huge volumes of data, which means that hydroinformatics is entering a data-intensive era.

Utilization of the currently available data is challenging due to the uncertainties of the data, the challenges of processing and the lack of ideas for data utilization. In the big data era, it is encouraged to make the best of the huge amount of data with tolerance of the uncertainties. The processing of large amounts of datasets is becoming easier with the development of computing tools. The lack of creative ideas is the main limitation on the utilization of data. A frontier application example is a prototype software package that automatically finds ideal locations for hydro-power based on over 30 freely available remote sensing and environmental datasets in the UK (Leicester).

The social dimension is about the interaction of the water environment and human society. With the digitalization of textual information available online and the explosion of social media, text mining technologies enable a new research area: the public attitude towards certain issues. For instance, five million scientific articles have been analysed to explore the impact of the Fukushima disaster on the media attitude towards nuclear power (Lansdall-Welfare et al.). Similar ideas can be applied to discover water-related issues, e.g., the social attitude towards climate change, water saving, water policy, etc.

Figure 3 | Word clouds of highly cited papers from Water Resources Research in each decade as an example of big data related to water (Rajaram et al. 2015).


Apart from the discovery of public attitudes, the internet is logging the activities of internet users, which can be potentially valuable for discovering real-world situations, as demonstrated by the example of the GFD mentioned in the previous section. Twitter data are now attracting many researchers to dig out water environment-related research. It was found that Twitter content could infer daily rainfall rates in five UK cities, which revealed that the online textual features in Twitter were strongly related to the topic, with significant inference (Lampos & Cristianini). Two Dutch organisations, Deltares and Floodtags, have developed real-time flood-extent maps based on tweets about floods for Jakarta, Indonesia (Eilander). This method gives disaster management a real-time view of the situation with wide coverage. The enrichment of new media data on the internet enables a new model for scientific research. The new model gathers information from what internet users post online. The users are actually playing the role of information collectors, and they deposit the information about what they observe about the environment onto the internet. The internet is like a boundless ocean of data that records how internet users interact with it. This data ocean has valuable potential for scientists to discover novel correlations with real-world situations. The fundamental data mining techniques behind big data applications such as the GFD, estimating precipitation from Twitter, etc., are the same, i.e., to dig out the correlation between the information and the targeted result. The distinction of these analyses is that the social network data application is based on people's mental reactions to certain events, while natural scientific research is mainly based on physically interpretable models. As the behaviour of people is ambiguous to interpret and predict, the big data analysis of social network data is dominated by machine learning or statistical approaches.

The business dimension covers, but is not limited to, water extraction, water treatment, water supply, and waste water collection and treatment. IBM has been a pioneer in utilizing data and computing tools, in collaboration with the National Oceanic and Atmospheric Administration (NOAA), to explore the business of weather. They built one of the first parallel processing supercomputers for weather modelling in 1995, named the Deep Thunder Project. Deep Thunder creates 24- to 48-hour forecasts at 1–2 km resolution, with a lead time of 3 hours to 3 days, and combines them with other data customized for business purposes, such as helping a utility company prepare for the after-effects of a major storm, or helping airlines and airports manage weather-generated delays by rearranging or combining flights more efficiently (IBM). Another possibility, inspired by the big data applications in e-commerce that utilize accumulated user activity logs for recommendation systems, is that smart metering data can be integrated with end-user water consumption data, wireless communication networks and information management systems, in order to provide real-time information on how, when and where water is being consumed, for both the consumer and the utility (Stewart et al.). The information from this combination of data will be valuable to architects, developers and planners seeking to understand water consumption patterns for future water planning. Smart metering is one example of the ambitious ideas of the Internet of Things as a global infrastructure for the information society, enabling advanced services by interconnecting things based on existing and evolving interoperable ICTs (ITU). Furthermore, the operational data collected by companies in the water industry also have potential value for data mining, for optimizing the system and providing more information for decision-making.

The trend of open data

The increasing number of openly available data sources will benefit the research community, as data are the basic material for data-based research. Open data means data that can be freely used, modified, and shared by anyone for any purpose (Opendefinition). Open data is a further development of free data, where data are freely licensed for limited purposes and certain users, while closed data are usually restricted by copyright, patents or other mechanisms. The goals of the open data movement are similar to those of other 'open' movements, such as open source, open hardware, open content and open access. The data owner may not have the appropriate ideas and techniques to produce extra value from the data, while, on the other hand, people with innovative ideas and the ability to process the data may find it difficult to find and access the data they need. The open data movement will activate the combination of data, data mining methods and new ideas to create additional value by removing the barrier between data providers and data users.


Thus, research data and their products can achieve full value and accelerate future research only when they are open. Multiple national governments have created web sites for the open delivery of their data for transparency and accountability, e.g., Data.gov for the US government, Data.gov.uk for the UK government, the European Union Open Data Portal (http://open-data.europa.eu/) and Canada's Open Government portal (http://open.canada.ca/en), etc. For open data in science, the World Data System (WDS) of the International Council for Science was created in 2008, based on the legacy of the World Data Centres, to ensure universal and equitable access to quality-assured scientific data, data services, products and information. The National Climatic Data Center, containing huge amounts of environmental, meteorological and climate datasets, is the world's largest archive of weather data. SWITCH-ON is a European project that works towards sustainable use of water resources, a safe society and the advancement of hydrological sciences based upon open data. The project aims to build the first one-stop-shop portal of open data, water information and its users in one place (SWITCH-ON). EarthCube is a project launched in 2011 that develops a common cyberinfrastructure for the purpose of collecting, accessing, analysing, sharing and visualizing all forms of data and related resources for understanding and predicting a complex and evolving solid Earth, hydrosphere, atmosphere and space environment systems, through the use of advanced technological and computational capabilities (EarthCube). The ongoing movement of open data can boost data-based research and data usage by removing the legal restrictions on data use. Many data portals are being created for data sharing through web services, with much more powerful data search tools where users can find data by location, time, data type, etc.

Issues of data sharing

The trend of open data will motivate data sharing and comprehensive utilization of data by removing the restrictions of patents and copyrights, but other issues of data sharing necessitate cooperative effort and innovative ideas. Data format is one of them. As the datasets related to water are collected by different organizations in different countries all around the world, how the data are recorded and expressed has not been identical. A very simple example is that even the expression of dates differs: the Chinese use 'yyyy-mm-dd'; Europeans use 'dd-mm-yyyy'; and in the US people use 'mm-dd-yyyy'. This issue with dates was tackled by ISO 8601, an international agreement to use 'yyyy-mm-dd' as the format for dates. Other issues may include, but are not limited to, the observation resolution, both temporal and spatial, the expression of missing values, the data processing methods, the units of the data, etc. As the characteristics of different datasets vary, the data should be clearly tagged with metadata, which is essential for the data user to carry out data analysis. Metadata is information about information, which describes, explains, locates, or otherwise makes it easier to retrieve, use or manage an information resource (Guenther & Radebaugh). The metadata should capture the basic characteristics of a data or information resource, including the who, what, when, where, why and how of the data resource. In the big data era, ad hoc data analysis for simple tasks may be time-consuming when the data size becomes extremely large. It can be worthwhile for the data provider to perform feature extraction offline and incorporate these features into the metadata, such as mean values, extremes, and general trends or patterns, prior to data release. Such pre-processing of the data can make it much easier for data users to find the data they need.
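A minimal sketch of such offline feature extraction is shown below; the file names and the JSON sidecar layout are hypothetical conventions for illustration, not an established metadata standard:

    import json
    import numpy as np

    # hypothetical daily rainfall series, one value per line
    rain = np.loadtxt("gauge_series.csv")

    meta = {
        "variable": "precipitation",
        "units": "mm/day",
        "n_records": int(rain.size),
        "mean": float(rain.mean()),          # summary features computed offline ...
        "max": float(rain.max()),
        "min": float(rain.min()),
    }
    with open("gauge_series.meta.json", "w") as f:
        json.dump(meta, f, indent=2)          # ... and shipped alongside the data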


Another challenging issue of integrated data usage is the variety of data formats, varying from simple binary or CSV formats to the advanced self-describing Network Common Data Form (netCDF), Hierarchical Data Format (HDF), GRIdded Binary (GRIB), Extensible Markup Language (XML), WaterML, etc. For satellite data, High Rate Information Transmission (HRIT), Low Rate Information Transmission (LRIT), High Rate Picture Transmission (HRPT) and Low Rate Picture Transmission (LRPT) are the CGMS standards agreed upon by satellite operators for the dissemination of digital data to users via direct broadcast. The difference is that HRIT and LRIT transmit data originating from geostationary satellites, while HRPT and LRPT transmit data originating from low earth orbit satellites. Also, as their names suggest, they operate at different data bandwidths. The World Meteorological Organization (WMO) has two binary data formats: the Binary Universal Form for the Representation of meteorological data (BUFR), to represent any meteorological dataset employing a continuous binary stream, and the GRIB format, to transmit large volumes of gridded data to automated centres over high-speed telecommunication lines using modern protocols. The Man computer Interactive Data Access System (McIDAS) goes beyond a simple data format to a set of tools for analysing and displaying meteorological data for research and education. NetCDF is a machine-independent, self-describing, binary data format standard and a set of software libraries for exchanging array-based scientific data; it features self-describing, portable, scalable, appendable, shareable and archivable data. HDF (HDF4 or HDF5) is a library and multi-object file format for the transfer of graphical and numerical data between computers, developed by NASA. HDF supports several different data models in a single file, including multidimensional arrays, raster images and tables, which each have their specific data type and API. XML is a general-purpose markup language, primarily used to share information via the Internet (WMO). WaterML2 is a variation of XML specified for water observation data, allowing data exchange across information systems (OGC). The Standard Hydrologic Exchange Format (SHEF) was created in the 1980s to store and exchange hydrometeorological data, and is readable by both humans and machines (Bissell et al.).
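To show what self-description buys the data user, here is a minimal sketch of reading a gridded precipitation file with the netCDF4 Python library; the file name and variable name are hypothetical:

    from netCDF4 import Dataset

    ds = Dataset("precip_monthly.nc")
    print(ds.variables.keys())        # the file describes its own contents
    pr = ds.variables["precip"]       # e.g., dimensions (time, lat, lon)
    print(pr.dimensions, pr.shape)    # structure travels with the data
    first_month = pr[0, :, :]         # read one time slice as a masked array
    ds.close()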

The variety of data formats may cost scientists much time dealing with different formats rather than working on scientific problems when utilizing multiple datasets from a variety of sources. To enhance the accessibility of hydrological data, GEOWOW (Global Earth Observation System of Systems (GEOSS) interoperability for Weather, Ocean and Water) contributes to international standardization processes within the Hydrology Domain Working Group, a joint working group of the Open Geospatial Consortium (OGC) and the WMO. GEOWOW developed, for the first time, a common global exchange infrastructure for hydrological data based on standardized formats and services. GEOWOW aims to evolve the GEOSS in the aspect of water, and is part of the GEOSS Common Infrastructure (GCI) (GEOWOW). In addition, a middleware that connects the data I/O scripts and the data analysis tools may be a feasible alternative featuring reusability. Middleware is the glue of software, and usually lies between the application layer and the system layer, or connects different software components. Data-based analysis necessitates such middleware to handle large datasets from different sources, in a variety of data formats and for many computational models, as well as being compatible with multiple programming languages. Open source development has to be applied to such middleware to enable the whole research community to contribute to and benefit from it.

Boosts from cloud computing

The tools developed in the big data era, such as Hadoop MapReduce and Apache Spark, can handle extremely large datasets within tolerable runtimes, but the knowledge and technique to set up and manage the tools are required. Commercial cloud computing services are available to scientists as an alternative, where data storage and processing can be done in the cloud, such as Microsoft Azure, Amazon Elastic Compute Cloud, Google Compute Engine, Rackspace, Verizon and GoGrid. The commercial cloud has a usage-based price policy, making the computing job more cost-effective than implementing local clusters. Cloud computing is scalable to suit the job, and does not require extensive knowledge of configuring local clusters. US NOAA has launched its Big Data Project in collaboration with Amazon Web Services, Google Cloud Platform, IBM, Microsoft and the Open Cloud Consortium (Department of Commerce). The NOAA data will be brought to the cloud platform, together with big data processing services such as Google BigQuery and Google Cloud Dataflow, to explore and to create new findings. NOAA's Big Data Project indicates a coming trend of combining the tremendous volume of high-quality data held by the government with private industry's vast infrastructure and technical capacity for data management and analysis.

BIG DATA FOR PRECIPITATION ESTIMATE

The available precipitation data

Although computer scientists have attempted to use newly emerged social network data to estimate rainfall, as mentioned in the previous section, such data are like a 'dessert'; the main data sources of rainfall measurement, rain gauges, weather radars and satellites, are the 'main course'.


The 'dessert' has some obvious shortcomings apart from its advantages of low data cost and quick response. The use of Twitter data to estimate rainfall or flood situations, as mentioned in the previous section, requires the prevalence of Twitter at a local level, e.g., a developed urban area with a large number of users and wide internet access, which implies that the spatial coverage and resolution of the data can be poor in less developed cities and rural areas. The temporal length of the Twitter data is significantly less than that of meteorological records, which can be traced back to 1861, whereas Twitter was launched in 2006. Despite the low cost and quick response of the new data sources foretelling a possible future direction, the existing data sources for rainfall measurement have accumulated a vast quantity of data which can substantially benefit from big data technology. Table 1 shows information on some widely used datasets containing precipitation data. The features of precipitation data from different sources vary significantly due to the different measuring mechanisms and processing algorithms.

Table 1 | Information on some datasets containing precipitation data

Dataset | Data source | Data size | Spatial and temporal coverage and resolution | Reference
GPCC Global Precipitation Climatology Centre monthly precipitation dataset | Gauge based | 4.2 GB | Monthly values from 1901/01; varies: 0.5°, 1.0° and 2.5° global grids | (Beck et al.)
The Next Generation Weather Radar (NEXRAD) | Radar | 73.1 TB | 160 sites throughout the US; 1° grid; 1-hour, 3-hour and total storm accumulated data since 1988 | (NCEI)
Global Historical Climatology Network Daily Database | Station record | 22 GB | Daily since 1861; records from over 80,000 stations in 180 countries and territories | (Menne et al.)
CPC Global Summary of Day/Month Observations | Station record | 13.7 GB | Approx. 8,900 actively reporting stations; global daily data since 1979 | (Climate Prediction Center, National Centers for Environmental Prediction, National Weather Service, NOAA, US Department of Commerce)
GPCP (Daily): Global Precipitation Climatology Project 1DD product | Geostationary infrared satellite | 0.78 GB | Daily rainfall accumulation globally on a 1° latitude/longitude grid, starting in October 1996 | (Pendergrass)
The Tropical Rainfall Measuring Mission (TRMM) | Satellite | 236 GB | 3-hourly from 1 January 1998 to mid-2017; 0.25° latitude/longitude grid over the domain 50°S–50°N | (NASA)
The Global Precipitation Measurement Mission (GPM) | Satellite | N/A | Half-hourly and monthly precipitation estimates on a 0.1° latitude/longitude grid over the domain 60°N–S | (NASA)
NCEP Climate Forecast System Reanalysis | Model reanalysis | 67 TB | 6-hourly from 1979; 0.1° latitude/longitude grid globally | (Saha et al.)

Data fusion

Hydrologists are pursuing fine and accurate estimates of precipitation in both space and time for drought and flood management.


measurements of rainfall on the ground, but are often sparse

previous data fusion studies indicate that data fusion pro-

in regions with complex landform, clustered in valleys or

cesses can generally improve data quality over the data

populated areas, and of poor temporal consistency. Thus,

from a single source. Nevertheless, there is an apparent

gauge data may not be able to provide sufficient information

limitation of the previous studies in that they only proposed

about the spatial extent and intensity of precipitation

the methodology and tested it with limited spatial and tem-

(Verdin et al. ). Estimating precipitation from satellites

poral coverage; in other words, the amount of data was

provides an alternative method for collecting rainfall data

limited, so they did not concern themselves much about

with the inherent advantage of detecting the spatial distri-

the efficiency of the algorithm which is the key factor in pro-

bution of the precipitation. They are different in the

cessing big data.

observation mechanism resulting in a substantial difference

In fact, applying data fusion technique to the existing

in the features of observation results. Satellite-based

terabytes of precipitation data is a tough issue for hydrolo-

measurement is intermittent, area-averaged observation,

gists as processing the huge amount of data generated

while rain gauge measurement is continuous and point

every day will be extremely time-consuming. It becomes

observation (Arkin & Ardanuy ). There is a trade-off

more problematic when dealing with the accumulated his-

of accuracy and spatial coverage between each data

torical data which are equally valuable. Owing to the

source. The observations of rain gauges and radar have the

development of big data, the ability of a cluster of computers

best measurement of actual rainfall but with the most lim-

to process large amounts of data has been greatly improved

ited spatial coverage. Geostationary satellites with infrared

primarily by implementing the idea of parallel computing,

sensors are less accurate but the coverage is broad and con-

which is to subdivide the job into small portions and to

tinuous. Between them are the microwave sensors on low

involve a cluster of computers to work simultaneously.

earth orbits which provide more reliable estimates of pre-

Thus, parallel computing posed a requirement on the

cipitation but with incomplete temporal sampling and

fusion algorithm that the data fusion process can be separ-

coarse spatial resolution (Gorenburg et al. ). In the big

ated into individual parts. Data fusion algorithms, e.g., the

data era, it is encouraged to make use of the joint data

Bayesian kriging method proposed by Verdin et al. (),

from various sources. It is promising to fuse the existing pre-

have the disadvantage of snapshot; in other words, are tem-

cipitation data from heterogeneous data sources. As

porally independent, while the time series of precipitation

heterogeneous data sources possess different advantages

are available for possible improvements in accuracy. How-

and disadvantages, they can complement each other in an

ever, this snapshot feature makes the fusion process easy

optimal way (Sander & Beyerer ).

to be separated temporally for parallel computing, which

Verdin et al. () used a Bayesian data fusion model

can effectively speed up the processing procedure.

with ordinary kriging to blend infrared precipitation data

Hadoop MapReduce was designed to handle textual

and gauge data on the Central and South American

data initially, and how it performs on processing high-

region. The method was applied to pentadal and monthly

volume remote sensing image data has been assessed by

total precipitation fields during 2009. This blending

only a few papers, of which the results were positive.

method significantly improves upon the satellite-derived

Almeer () researched the performance of eight pixel-

estimates and is also competitive in its ability. Wang et al.

In fact, applying data fusion techniques to the existing terabytes of precipitation data is a tough issue for hydrologists, as processing the huge amount of data generated every day is extremely time-consuming. It becomes even more problematic when dealing with the accumulated historical data, which are equally valuable. Owing to the development of big data, the ability of a cluster of computers to process large amounts of data has been greatly improved, primarily by implementing the idea of parallel computing: subdividing a job into small portions and having a cluster of computers work on them simultaneously. Parallel computing thus imposes a requirement on the fusion algorithm, namely that the data fusion process can be separated into independent parts. Data fusion algorithms such as the Bayesian kriging method proposed by Verdin et al. (2015) have the disadvantage of working on snapshots; in other words, they are temporally independent, even though the time series of precipitation is available for possible improvements in accuracy. However, this snapshot feature makes the fusion process easy to separate temporally for parallel computing, which can effectively speed up the processing.
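Because a snapshot-based algorithm fuses each time step independently, the workload partitions naturally along the time axis. The sketch below, a minimal illustration rather than a production design, farms hourly fusion tasks out to a pool of worker processes; the grid size, blending weights and process count are arbitrary assumptions standing in for a real fusion step.

```python
import numpy as np
from multiprocessing import Pool

def fuse_one_timestep(task):
    """Fuse the fields observed at a single time step.

    A snapshot-based algorithm uses no information from neighbouring
    time steps, so every call is independent and can run in parallel.
    """
    timestamp, sat, gauge = task
    fused = 0.3 * sat + 0.7 * gauge  # placeholder for the real fusion step
    return timestamp, fused

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # One (timestamp, satellite field, gauge-analysis field) task per hour.
    tasks = [(t, rng.gamma(2.0, 2.0, size=(64, 64)),
              rng.gamma(2.0, 2.0, size=(64, 64))) for t in range(24)]
    with Pool(processes=4) as pool:
        fused_by_hour = dict(pool.map(fuse_one_timestep, tasks))
    print(len(fused_by_hour), "hourly fields fused")
```

The same decomposition carries over to a cluster: each (timestamp, fields) task becomes an independent unit of work for a distributed framework instead of a local process pool.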


Hadoop MapReduce was initially designed to handle textual data, and its performance on high-volume remote sensing image data has been assessed by only a few papers, all of which reported positive results. Almeer (2012) investigated the performance of eight pixel-level image processing algorithms using Hadoop and found the approach scalable and efficient for processing the multiple large images typical of remote sensing applications, with a runtime significantly lower than that of a single PC. Lv et al. (2010) developed a parallel K-means clustering algorithm based on Hadoop MapReduce to process satellite remote sensing images, with acceptable results and a clear drop in runtime. It is therefore reasonable to believe that Hadoop MapReduce on a cluster of machines will work for the data fusion job, as the parallel computing model is not restricted by the type of data. Thus, the future of fusing precipitation data with the aid of big data techniques is promising: as discussed above, fusion of heterogeneous precipitation data sources can offer better results than data from each single source, and the fusion process can be accelerated by parallel computing.
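As a minimal illustration of how the MapReduce pattern might be applied to precipitation records, the two Hadoop Streaming scripts below (shown together, but run as separate files) accumulate a rainfall total per grid cell. The CSV layout 'timestamp,cell_id,rain_mm' is a hypothetical format assumed for the example; the studies cited above operate on binary image data instead.

```python
#!/usr/bin/env python
# mapper.py -- one of many parallel mapper instances; emits
# "cell_id<TAB>rain_mm" for each record of the assumed CSV layout
# "timestamp,cell_id,rain_mm".
import sys

for line in sys.stdin:
    try:
        _, cell_id, rain_mm = line.strip().split(",")
        print(f"{cell_id}\t{float(rain_mm)}")
    except ValueError:
        continue  # skip malformed records
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop sorts mapper output by key, so all records for a
# grid cell arrive consecutively; accumulate one rainfall total per cell.
import sys

current, total = None, 0.0
for line in sys.stdin:
    cell_id, value = line.rstrip("\n").split("\t")
    if cell_id != current:
        if current is not None:
            print(f"{current}\t{total:.1f}")
        current, total = cell_id, 0.0
    total += float(value)
if current is not None:
    print(f"{current}\t{total:.1f}")
```

A typical launch with the standard streaming jar would be along the lines of: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <input path> -output <output path>.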

CONCLUSIONS

The big data era is an upcoming trend that no one can escape from. Scientists are expected to embrace it rationally, without being blinded by the overwhelming hype. The concept of big data originated from the popularization of the internet, as the digitalization of information around the world has become much easier and cheaper for future data mining purposes. The commercial value behind the expanding datasets, e.g., precision marketing and data-based decision-making, has made the term 'big data' extremely trendy. The idea of big data is very adaptable, however, and can be valuable for academic purposes as well. Hydroinformatics can benefit from the expanding amount of data collected, generated and made available to the research community. Data from smart meters, smart sensors and smart services, remote sensing, earth observation systems, the Internet of Things, etc., will propel hydroinformatics into the inevitable big data era. The data usage can be categorized into three dimensions: the natural dimension, analysing climate change, flood and drought management and the global water cycle; the social dimension, focusing on the interaction between the water environment and human society; and the business dimension, using data-based decision-making systems to optimize water resource management and future water planning. Data processing tools such as parallel computing and distributed storage have been developed to help users handle datasets of hundreds of GBs or even TBs in a tolerable time, making real-time applications possible and interactive human–computer analysis feasible. Cloud computing platforms will make it unnecessary to download data to a local machine and run models locally, and will provide superior computing efficiency in the future cloud computing era.

The challenges of big data have also been discussed in this paper. Data sharing is one of them, as water-related datasets come in a variety of formats, generated with different observation methods by different organizations. Either a general standardized format for data exchange, or an open-source data management tool that glues together the scripts for reading and writing the different formats, would benefit the research community in handling these datasets. Many data portals based on web services are being created to support data exchange and to encourage data-based research. Contradiction is another challenge of big data: correlation between datasets is often more useful in practice than causation, yet causation is the purpose of scientific research. The correlations identified across a vast range of datasets ought to help researchers explore new potential causation between phenomena for further research, rather than merely replacing logic-based models. The real challenge in the near future is how to make the best use of the available data, as currently little is being done on big data relevant to hydroinformatics. Thus, the purpose of this paper is to encourage the research community to develop new ideas for the big data era.

Precipitation estimation is one possible area in which to make a start, as precipitation-related data are being collected from multiple sources such as rain gauges, weather radars and satellites. Global precipitation data collected from NEXRAD and GPM can reach tens of TBs, which is a big data problem. One promising direction is to fuse precipitation data from multiple sources – weather radar, satellite remote sensing, rain gauges and model reanalysis data – to generate a rainfall estimation product with better spatial and temporal resolution and minimized uncertainty. The parallel computing and distributed data storage paradigms and the cloud computing platforms developed during the explosion of information are essential to accelerate the data processing. The implementation of big data in precipitation data fusion and the parallel computing model are only the tip of the iceberg in the big data era. The utilization of available data is not limited to improving precipitation estimation: the future should rely on an 'all data revolution', in which innovative analytical ideas, utilizing data from all existing and new sources and providing a deeper, clearer understanding, will significantly shift how we recognize the world (Lazer et al. 2014).


REFERENCES

Abbott, M. B. Hydroinformatics: Information Technology and the Aquatic Environment. Avebury Technical, Aldershot, UK.
Abbott, M. Introducing hydroinformatics. J. Hydroinform. 1, 3–19.
Almeer, M. H. 2012 Hadoop MapReduce for remote sensing image analysis. Int. J. Emerg. Technol. Adv. Eng. 2, 443–451.
Apache. What Is Apache Hadoop? [Online]. Apache Hadoop. https://hadoop.apache.org/ (accessed 22 June 2015).
Arkin, P. A. & Ardanuy, P. E. 1989 Estimating climatic-scale precipitation from space: a review. J. Climate 2, 1229–1238.
Beck, C., Grieser, J. & Schneider, U. Global precipitation analysis products. Global Precipitation Climatology Centre (GPCC), DWD.
Bennett, J. & Lanning, S. The Netflix Prize. In: Proceedings of KDD Cup and Workshop, New York. ACM, 35.
Bissell, V. C., Pasteris, P. A. & Bennett, D. G. Standard hydrologic exchange format (SHEF). J. Water Resour. Plann. Manage. 110, 392–401.
Burn-Murdoch, J. Study: less than 1% of the world's data is analysed, over 80% is unprotected [Online]. The Guardian. http://www.theguardian.com/news/datablog/2012/dec/19/big-data-study-digital-universe-global-volume (accessed 18 June 2015).
Butler, D. When Google got flu wrong. Nature 494, 155–156.
Climate Prediction Center, National Centers for Environmental Prediction, National Weather Service, NOAA, U.S. Department of Commerce CPC Global Summary of Day/Month Observations, 1979–continuing. Research Data Archive at the National Center for Atmospheric Research, Computational and Information Systems Laboratory, Boulder, CO, USA.
Council, N. R. Frontiers in Massive Data Analysis. The National Academies Press, Washington, DC, USA.
Dahl, E. PC Drive Reaches 500GB [Online]. PCWorld. http://www.pcworld.com/article/120102/article.html (accessed 10 May 2015).
Dean, J. & Ghemawat, S. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 107–113.
De Mauro, A., Greco, M. & Grimaldi, M. What is Big Data? A consensual definition and a review of key research topics. In: 4th International Conference on Integrated Information, Madrid, Spain. doi. 2341.5048.
Department of Commerce U.S. Secretary of Commerce Penny Pritzker Announces New Collaboration to Unleash the Power of NOAA's Data [Online]. Department of Commerce. https://www.commerce.gov/news/press-releases/2015/04/us-secretary-commerce-penny-pritzker-announces-new-collaboration-unleash (accessed 2 December 2015).
EarthCube About EarthCube [Online]. Earthcube.org. http://earthcube.org/info/about (accessed 15 July 2015).
ECMWF ECMWF Climate Reanalysis [Online]. ECMWF. http://www.ecmwf.int/en/research/climate-reanalysis (accessed 16 July 2015).


Eilander, D. Twitter used to create real-time flood maps [Online]. https://www.deltares.nl/en/news/twitter-used-to-create-real-time-flood-maps/ (accessed 27 April 2015).
ESA Earth Explorers overview [Online]. European Space Agency. http://www.esa.int/Our_Activities/Observing_the_Earth/Earth_Explorers_an_overview (accessed 12 February 2016).
Gantz, J. & Reinsel, D. The digital universe in 2020: big data, bigger digital shadows, and biggest growth in the far east. IDC iView: IDC Analyze the Future 2007, 1–16.
GEOWOW GEOWOW Water [Online]. GEOWOW. http://www.geowow.eu/water.html (accessed 25 July 2015).
Ghemawat, S., Gobioff, H. & Leung, S.-T. The Google file system. ACM SIGOPS Operating Systems Review. ACM, 29–43.
Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S. & Brilliant, L. Detecting influenza epidemics using search engine query data. Nature 457, 1012–1014.
Gorenburg, I. P., McLaughlin, D. & Entekhabi, D. 2001 Scale-recursive assimilation of precipitation data. Adv. Water Resourc. 24, 941–953.
Guenther, R. & Radebaugh, J. Understanding Metadata. National Information Standards Organization (NISO) Press, Bethesda, MD, USA.
Hartin, E. & Watson, K. HGST Announces New Innovations that Set the Standard for Performance, Reliability, Capacity, Agility and Efficiency for Helping Companies Harness the Power of Data [Online]. http://www.hgst.com/press-room/press-releases/HGST-unveils-intelligent-dynamic-storage-solutions-to-transform-the-data-center (accessed 10 May 2015).
IBM IBM100 – Deep Thunder [Online]. IBM's 100 Icons of Progress: IBM. http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/deepthunder/ (accessed 14 December 2015).
ITU Internet of Things Global Standards Initiative [Online]. ITU. http://www.itu.int/en/ITU-T/gsi/iot/Pages/default.aspx (accessed 28 July 2015).
Kobayashi, S., Ota, Y. & Harada, Y. The JRA-55 reanalysis: general specifications and basic characteristics. J. Meteor. Soc. Japan 93, 5–48.
Lampos, V. & Cristianini, N. Nowcasting events from the social web with statistical learning. ACM Trans. Intell. Syst. Technol. (TIST) 3, 72.
Laney, D. 3D data management: controlling data volume, velocity and variety. META Group Res. Note 6, 70.
Lansdall-Welfare, T., Sudhahar, S., Veltri, G. & Cristianini, N. On the coverage of science in the media: a big data study on the impact of the Fukushima disaster. In: Big Data (Big Data), 2014 IEEE International Conference on. IEEE, pp. 60–66.
Lazer, D., Kennedy, R., King, G. & Vespignani, A. 2014 The parable of Google Flu: traps in big data analysis. Science 343, 1203–1205.
Leicester, U. O. Big data technology finds ideal river locations to generate hydro-power [Online]. ScienceDaily. http://www.sciencedaily.com/releases/2015/04/150413075144.htm (accessed 28 July 2015).
Lv, Z., Hu, Y., Zhong, H., Wu, J., Li, B. & Zhao, H. 2010 Parallel K-means clustering of remote sensing images based on MapReduce. In: Web Information Systems and Mining. Springer-Verlag, Berlin, Heidelberg, Germany.
Mayer-Schonberger, V. & Cukier, K. Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt, Boston, MA, USA.
Menne, M. J., Durre, I., Vose, R. S., Gleason, B. E. & Houston, T. G. An overview of the Global Historical Climatology Network-Daily database. J. Atmos. Ocean. Technol. 29, 897–910.
Moore, D. & McCabe, G. Introduction to the Practice of Statistics, 5th edn. W. H. Freeman, New York, USA.
NASA Global Precipitation Measurement (GPM) Mission Overview [Online]. Pmm.nasa.gov. http://pmm.nasa.gov/GPM (accessed 11 May 2015).
NASA Readme for TRMM Product 3B42 (V7) [Online]. GES DISC NASA. http://disc.sci.gsfc.nasa.gov/precipitation/documentation/TRMM_README/TRMM_3B42_readme.shtml (accessed 11 May 2015).
National Centers for Environmental Prediction, National Weather Service, NOAA, U.S. Department of Commerce NCEP/NCAR Global Reanalysis Products, 1948–continuing. Research Data Archive at the National Center for Atmospheric Research, Computational and Information Systems Laboratory, Boulder, CO, USA.
NCEI NEXRAD Products [Online]. http://www.ncdc.noaa.gov/data-access/radar-data/nexrad-products (accessed 15 May 2015).
OGC OGC@WaterML [Online]. Open Geospatial Consortium. http://www.opengeospatial.org/standards/waterml (accessed 4 July 2015).
OPENDEFINITION Defining Open in Open Data, Open Content and Open Knowledge [Online]. Opendefinition.org. http://opendefinition.org/od/ (accessed 5 July 2015).
ORACLE Time Capsule, 1956 Hard Disk [Online]. Oracle. http://www.oracle.com/technetwork/issue-archive/2014/14jul/o44timecapsule-2219543.html (accessed 12 May 2015).
Pendergrass, A. The Climate Data Guide: GPCP (Daily): Global Precipitation Climatology Project [Online]. https://climatedataguide.ucar.edu/climate-data/gpcp-daily-global-precipitation-climatology-project (accessed 15 May 2015).
Perenson, M. Hitachi Introduces 1-Terabyte Hard Drive [Online]. PCWorld. http://www.pcworld.com/article/128400/article.html (accessed 10 May 2015).
Pierson, L. Civil Engineer Turned Environmental Data Scientist Harnesses Big Environmental Data at UNESCO-IHE [Online]. statisticsviews.com. http://www.statisticsviews.com/details/feature/7136441/Civil-Engineer-Turned-Environmental-Data-Scientist-Harnesses-Big-Environmental-D.html (accessed 15 June 2015).
Rajaram, H., Bahr, J. M., Bloschl, G., Cai, X., Scott Mackay, D., Michalak, A. M., Montanari, A., Sanchez-Villa, X. & Sander, G. 2015 A reflection on the first 50 years of Water Resources Research. Water Resour. Res. 51, 7829–7837.
Saha, S., Moorthi, S., Pan, H.-L., Wu, X., Wang, J., Nadiga, S., Tripp, P., Kistler, R., Woollen, J. & Behringer, D. The NCEP Climate Forecast System Reanalysis. Bull. Am. Meteor. Soc. 91, 1015–1057.
Sander, J. & Beyerer, J. 2013 Bayesian fusion: modeling and application. In: Sensor Data Fusion: Trends, Solutions, Applications (SDF), 2013 Workshop on. IEEE, pp. 1–6.
Selding, P. B. D. U.S. Government-leased Satellite Capacity Going Unused [Online]. SpaceNews.com. http://spacenews.com/32581us-government-leased-satellite-capacity-going-unused/ (accessed 20 June 2015).
SMAP SMAP Overview [Online]. SMAP, JPL. http://smap.jpl.nasa.gov/observatory/overview/ (accessed 5 July 2015).
Snijders, C., Matzat, U. & Reips, U.-D. Big data: big gaps of knowledge in the field of internet science. Int. J. Internet Sci. 7, 1–5.
Srivastava, P. K., Han, D., Rico-Ramirez, M. A., Al-Shrafany, D. & Islam, T. 2013 Data fusion techniques for improving soil moisture deficit using SMOS satellite and WRF-NOAH land surface model. Water Resour. Manage. 27 (15), 5069–5087.
Stewart, R. A., Willis, R., Giurco, D., Panuwatwanich, K. & Capati, G. Web-based knowledge management system: linking smart metering to the future of urban water planning. Australian Planner 47, 66–74.
SWITCH-ON About SWITCH-ON [Online]. SWITCH-ON Project. http://www.project.water-switch-on.eu/ (accessed 15 June 2015).
Verdin, A., Rajagopalan, B., Kleiber, W. & Funk, C. 2015 A Bayesian kriging approach for blending satellite and ground precipitation observations. Water Resour. Res. 51, 908–921.
Wang, S., Liang, X. & Nan, Z. 2011 How much improvement can precipitation data fusion achieve with a multiscale Kalman smoother-based framework? Water Resour. Res. 47 (3).
WMO Satellite Data Formats and Standards [Online]. World Meteorological Organization. http://www.wmo.int/pages/prog/sat/formatsandstandards_en.php (accessed 6 July 2015).
Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S. & Stoica, I. Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. USENIX Association, Berkeley, CA, USA, 10.
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S. & Stoica, I. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, Berkeley, CA, USA, pp. 2–2.

First received 21 August 2015; accepted in revised form 22 January 2016. Available online 2 March 2016