Uncorrected Proof
© IWA Publishing 2016 | Journal of Hydroinformatics | in press | 2016
Big data and hydroinformatics
Yiheng Chen and Dawei Han
ABSTRACT

Big data is popular in the areas of computer science, commerce and bioinformatics, but is at an early stage in hydroinformatics. Big data originated from extremely large datasets that cannot be processed in a tolerable elapsed time with traditional data processing methods. Using an analogy from object-oriented programming, big data should be considered as objects encompassing the data, its characteristics and the processing methods. Hydroinformatics can benefit from big data technology, with newly emerged data, techniques and analytical tools to handle large datasets, from which creative ideas and new values could be mined. This paper provides a timely review of big data and its relevance to hydroinformatics. A further exploration of precipitation big data is discussed, because estimation of precipitation is an important part of hydrology for managing floods and droughts and for understanding the global water cycle. It is promising that fusion of precipitation data from remote sensing, weather radar, rain gauges and numerical weather modelling could be achieved by parallel computing and distributed data storage, which will trigger a leap in precipitation estimation, as the available data from multiple sources could be fused to generate a better product than those from single sources.

Key words | big data, data fusion, hydroinformatics, precipitation estimates

Yiheng Chen (corresponding author) and Dawei Han
Water and Environment Management Research Centre, Department of Civil Engineering, University of Bristol, Bristol BS8 1TR, UK
E-mail: [email protected]
INTRODUCTION

The inevitable trend of big data, along with the growing capability to handle huge datasets, is reshaping how we understand the world. According to Google Scholar, the number of publications containing the phrase 'big data' in the title and the number of publications about big data and water are shown in Figure 1, revealing that interest in big data has risen dramatically since 2010; however, research on big data in hydroinformatics is still at a very early stage. This is itself a very simple example of so-called big data analysis, as the result is based on searching a vast number of academic publications indexed by Google Scholar, which gives internet users a very efficient way to find academic literature. The value of the online search engine is its lightning-fast speed, which enables the user to retrieve results from the ocean of online information in mere milliseconds. Another application of big data is precision marketing: for example, the online movie subscription rental service provider Netflix bases its recommendation system on hundreds of millions of accumulated anonymous movie ratings, to improve the probability that users rent the movies it recommends (Bennett & Lanning).

Although the popularity of big data is related to its commercial value, we believe that the idea of big data can benefit hydroinformatics research for multiple reasons. First, big data analysis encourages the utilization of multiple datasets from various sources to discover big trends. Second, the computing tools developed for big data analysis, e.g., parallel computing and distributed data storage, can help tackle data-intensive jobs in the field of hydroinformatics. Third, the novel correlations found by mining various large datasets have the potential to lead to new scientific exploration. Apart from the companies in the internet industry working closely with data from the internet, scientists have collected substantial amounts of data for hydrology, meteorology and earth observation with a history much longer than that of
doi: 10.2166/hydro.2016.180
Figure 1 | Left: the number of publications about big data. Right: the number of publications about big data and water by Google Scholar.
the internet. The development of the internet and the open data movement significantly accelerate data sharing and improve the accessibility of archived data. The hydroinformatics community will benefit from the active combination of huge amounts of data and data processing technologies for knowledge discovery and management. Precipitation is one important part of the water cycle in hydrology. The accumulated precipitation datasets from heterogeneous sources, e.g., rain gauges, weather radars, satellite remote sensing and numerical weather models, have reached tens of terabytes in size, with different characteristics, i.e., spatial and temporal coverage, resolution and uncertainties. Data fusion is a possible method to utilize these accumulated datasets to produce a better result with enhanced resolution and minimized uncertainty.

This paper consists of three parts. The first part starts with an explanation of the concept of big data, then introduces the popular Apache Hadoop family for handling large amounts of data and seven classes of data analysis models, and discusses important ideas developed in the big data era. The second part discusses the impact of big data on hydroinformatics, with a focus on the issues of data sharing. The third part emphasizes the future of precipitation data fusion as one promising big data application in the area of hydroinformatics.

BACKGROUND

This section aims to introduce the popular term 'big data', starting with the example of the Google Flu Detector (GFD), followed by an explanation of the concept of 'big data'. Once we have huge amounts of data, how to physically store and process the data becomes tricky. The conflict between the boom of big data and the data storage hardware, where the I/O speed is limited by the physical mechanism of the hard disk, stimulated the development of parallel computing and distributed data storage. After discussing how to manage large datasets effectively, seven types of data modelling algorithms are summarized. Furthermore, when the correlation between datasets is successfully modelled, whether to only utilize the correlation or to discover more scientific knowledge is discussed.

Google Flu Detector

Currently, the concept of big data is popular in the analysis of sociology, public health, business and bioinformatics. The ever-expanding internet is attracting attention as one major data source. The data on the internet, especially in new media, are generated by individuals, reflecting their daily lives, emotions, shopping preferences, etc. Without doubt, these types of data can be readily utilized in the fields of online business, public health and sociology, as these topics mainly focus on individual behaviours. In fact, big data analysis opens a new way for researchers in these areas to find out what is actually happening from the recorded online behaviours of individuals.

Google developed a flu detector that monitors health-seeking behaviour in the form of online web search queries submitted by millions of users around the world every day. The methodology was to find the best matches among 50 million search terms to fit 1,152 flu data points from the Centers for Disease Control and Prevention (CDC). By analysing the large number of search queries, Google found 45 search terms that, when used in a mathematical model, were strongly correlated with the
percentage of physician visits for influenza-like symptoms, based on which the GFD estimates the level of weekly influenza activity with a one-day reporting lag (Ginsberg et al.). From the perspective of Google users, they tend to consult the readily accessible internet rather than immediately consulting a doctor when they feel a little ill. The GFD predicts influenza activity from user query logs, though with some noise, responding much faster than the CDC with its two-week reporting lag, which gives Google an advantage over the traditional disease surveillance method. However, the GFD does not always perform well. In 2009, its substantial underestimation of influenza-like illness (ILI) in the United States during the swine flu pandemic forced Google to modify its algorithm, as people's search behaviour had changed due to the exceptional nature of the pandemic. In December 2012, it overestimated doctor visits for ILI at more than double the level reported by the CDC (Butler).

Despite the quick response and reasonable accuracy of the GFD, the uncertainty from human search behaviour that led the model to depart from the CDC data cannot be ignored. This type of uncertainty is embedded within the mechanism of the analysis, and may only be overcome by an improved algorithm. Regardless of the weaknesses of the GFD, the point is that the apparent value of data may only be the tip of the iceberg. Google started its business by providing an online search service for internet users without any intention of predicting influenza outbreaks and threats, but the search query logs became extremely valuable after being accumulated for several years. The reason is that Google effectively collected the information that search engine users wanted to know at a certain time and a certain location. The big information pattern, contributed by millions of users around the world, revealed additional value behind the search query data. To summarize, the GFD illustrates two features of big data analysis: crowdsourcing and by-products.

What is big data

The fashionable term 'big data' is sometimes so hot that many people attempt to embrace it in this data-rich era without a clear understanding. The term 'big data' is simple but its meaning is ambiguous; it is commonly used to describe datasets whose quantity and complexity are beyond the capacity of normal computing tools to capture, curate, manage and process at a tolerable speed (Snijders et al.). Another explanation of big data refers to developing new insights or creating new values at a large scale instead of a smaller one (Mayer-Schönberger & Cukier). De Mauro et al. investigated 14 existing definitions of big data, and proposed a formal definition:

'Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value.'

This definition can be subdivided into three parts: the characteristics of the datasets; the specific technologies and analytical methods to manipulate the data; and the ideas for extracting insights from the data and creating new values. Therefore, big data is not just about massive amounts of data. In general, the goal of big data analysis is knowledge discovery from massive datasets, which is a challenging systematic problem. Data analysis systems should: utilize the existing hardware platform with distributed and parallel computing; accommodate a variety of data formats, models, loss functions and methods; be highly customizable for users to specify their data analysis goals through an expressive but simple language; provide useful visualizations of key components of the analysis; communicate with other computational platforms seamlessly; and provide many of the capabilities familiar from large-scale databases (Council).

The expanding data vs. the developing computing power

The typical big data characteristics include high volume (the quantity of data generated), high velocity (the speed of data collection) and high variety (the categories of data) (Laney). The concern is whether existing computing systems can handle the increasingly large data. An International Data Corporation (IDC) report estimated that the data size of the world will grow from 130 exabytes (10^18 bytes) in 2005 to 40 zettabytes (10^21 bytes) in 2020, at roughly a 40% annual increase (Gantz & Reinsel).
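As a quick sanity check (a back-of-the-envelope sketch in plain Python, not part of the cited report), the two endpoint figures imply a compound annual growth rate of roughly 46%, of the same order as the quoted ~40% per year:

```python
# Back-of-the-envelope check of the IDC growth figures cited above.
start_eb = 130.0      # exabytes in 2005
end_eb = 40_000.0     # 40 zettabytes = 40,000 exabytes in 2020
years = 2020 - 2005

# Implied compound annual growth rate: end = start * (1 + rate) ** years
rate = (end_eb / start_eb) ** (1 / years) - 1
print(f"Implied annual growth rate: {rate:.1%}")  # ~46%, close to the cited ~40%
```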
New datasets are continuously being collected from the internet, the Internet of Things, remote sensing networks, e-commerce, wearable devices, etc. Unfortunately, only 3% of all data are properly tagged and ready for use, and only 0.5% of data are analysed, which leaves a large potential market for data utilization (Burn-Murdoch). The actual data size needed depends on the data analysis task, which further scales down the size of data to be processed. On the other hand, data storage capacity has increased dramatically in the past decades. In 1956, IBM made the first commercial disk drive, with a capacity of 3.75 MB (Oracle). In 1980, the world's first gigabyte-capacity disk drive (2.52 GB), the IBM 3380, was the size of a refrigerator. After 25 years, the first 500 GB desktop hard drive was shipped (Dahl), followed by the 1 TB one in 2007 (Perenson). In 2014, Western Digital shipped an 8 TB hard drive and announced the world's first 10 TB hard drive (Hartin & Watson). The unit cost of data storage is projected to drop from $2.00 per GB in 2012 to $0.20 per GB in 2020 (Gantz & Reinsel). The storage of data should no longer be a big problem, owing to massive storage technologies such as Direct Attached Storage (DAS), Network Attached Storage (NAS) and Storage Area Networks (SAN), as well as cloud data storage.

The storage capacity of the hard disk keeps increasing; nevertheless, the I/O speed of the hard disk grows slowly due to the limitations of the hard disk mechanism. The solid state disk (SSD) has a much higher I/O rate and negligible seek time; however, its cost per unit of storage is much higher than that of the hard disk. Cost aside, a single SSD device also offers a lower storage capacity than a hard disk. The I/O speed of data storage devices, rather than the storage capacity, is the bottleneck of extremely large data processing.

The MapReduce parallel computing

An appropriate software system is essential for dealing with extremely large datasets, apart from the development of the hardware system. As the improvement of the I/O speed of the hardware did not keep up with the expansion of data storage, the time required to process data increased dramatically without an appropriate algorithm. Parallel computing and distributed storage were developed to counter this issue. MapReduce is a distributed programming model for processing and generating large datasets, developed by Google. The idea of MapReduce is to specify a Map and a Reduce function which are suitable for parallel computing, while the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. As the datasets in big data problems are extremely large, a cluster of machines connected in a network is used to overcome the limits of computing power and data storage of a single machine, but the network bandwidth becomes the bottleneck as it is a scarce resource. Thus, the MapReduce system is optimized to reduce data transfer across the network, by moving code to the machines where the data reside and writing intermediate data to local disk. The MapReduce system also minimizes the impact of slow machines, and can handle machine failures and data loss by redundant execution. The success of the MapReduce programming model relies on several things. First, the model automatically deals with the details of parallelization, fault tolerance, locality optimization and load balancing, which makes it easy to use even for programmers without experience of parallel and distributed computing. Second, the Map and Reduce functions are applicable to a wide variety of tasks, such as sorting, data mining and machine learning. Third, MapReduce can scale up to large clusters of thousands of commodity machines, which means that commodity computing resources can be harnessed for large jobs (Dean & Ghemawat). Hadoop is an open-source implementation of the MapReduce framework developed by Apache, freely available to the scientific community. Hadoop contains the Hadoop Distributed File System (HDFS), which works together with MapReduce and was developed after Google published the technical details of the Google File System (Ghemawat et al.). Apache Hadoop also contains Hadoop Common, the common utilities that support the other Hadoop modules, and Hadoop YARN, a framework for job scheduling and cluster resource management. There are many other Apache projects related to Hadoop, including HBase (a scalable, distributed database that supports structured data storage for large tables), Hive (a data warehouse infrastructure that provides data
summarization and ad hoc querying), Mahout (a scalable machine learning and data mining library), Pig (a high-level data-flow language and execution framework for parallel computation) and ZooKeeper (a high-performance coordination service for distributed applications) (Apache).

Hadoop MapReduce has a weakness in iterative data analysis, in that the intermediate datasets are stored on the local hard disk. As iterative data analysis requires multiple reads and writes of local intermediate data, this dramatically slows down the analysis. This affects most machine learning algorithms, e.g., gradient descent. Apache Spark is the latest programming model in the big data world, featuring lightning-fast data processing for iterative jobs (Zaharia et al.). Spark achieves its speed by implementing Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets the programmer perform in-memory computation (Zaharia et al.). Spark outperforms Hadoop by up to 20 times in speed by using RAM instead of the hard disk to store intermediate data.

Modelling big data

There are many data-based computational methods, and they can be classified as 'the seven computational giants of massive data analysis' (Council). Data-based computing is facing challenges due to the expansion of data volume and dimensionality. The first giant is basic statistics, including calculating the mean, variance and moments; estimating the number of distinct elements; counting and frequency analysis; and calculating order statistics such as the median. These tasks typically require O(N) calculations for N data points. The second computational giant is the generalized N-body problem, including nearly any problem involving distances, kernels or other similarities between pairs or higher-order n-tuples of data points. The computational complexity is typically O(N^2) or O(N^3). N-body problems are involved in range searches, nearest-neighbour search problems and nearest-neighbour classification. They also appear in nonlinear dimension reduction methods, also known as manifold learning methods. N-body problems are related to kernel computation, like kernel estimators – such as kernel density estimation, kernel regression methods, radial basis function neural networks and mean-shift tracking – and modern methods such as support vector machines and kernel principal components analysis (PCA). Other instances include k-means, mixtures-of-Gaussians clustering, hierarchical clustering, spatial statistics of various kinds, spatial joins, the Hausdorff set distance, etc.

Graph-theoretic computation is the third giant, including problems of graph traversal. The graph can be either the data itself or a statistical model in the form of a graph, depending on the nature of the problem. Common statistical computations include betweenness centrality and commute distances, which are used to identify nodes or communities of interest. Nevertheless, challenges arise when computing on large-scale, sparse graphs. When the statistical model takes the form of a graph, graph-search algorithms remain important, but there is also a need to compute marginal and conditional probabilities over graphs, operations generally referred to as 'inference' in the graphical models literature.

The fourth computational giant is linear algebraic computation, including linear systems, eigenvalue problems and inverses, underlying a large number of linear models, e.g., linear regression, PCA and many variants. Many of them are suitable for generic linear algebra approaches, but there are two important issues. One is that the optimization in statistical learning problems does not necessarily need to be solved to high accuracy, in order to avoid overfitting. Another important difference is that multivariate statistics has its own matrix form, that of a kernel (or Gram) matrix, while computational linear algebra involves techniques specialized to take advantage of certain matrix structures. In kernel methods, such as Gaussian process regression or kernel PCA, the kernel matrix can be too large to be stored explicitly, possibly requiring matrix-free algorithms.

Optimization is the fifth giant in massive data analysis. Linear algebraic computations are the main subroutine of second-order optimization algorithms. Non-trivial optimizations will become increasingly common as methods grow more sophisticated. Linear programming, quadratic programming and second-order cone programming are involved in support vector machines and recent classifiers, and semidefinite programming appears in manifold learning methods.
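To make the generalized N-body pattern concrete, the following minimal Python sketch (illustrative only; the function name is ours, not from the literature) computes all pairwise Euclidean distances naively, which is exactly the O(N^2) cost referred to above:

```python
import math

def pairwise_distances(points):
    """Naive all-pairs Euclidean distances: the O(N^2) pattern behind
    range searches, nearest-neighbour queries and kernel estimators."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # N*(N-1)/2 distinct pairs
            d = math.dist(points[i], points[j])
            dist[i][j] = dist[j][i] = d
    return dist

# A nearest-neighbour query then reduces to scanning a single row.
pts = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
d = pairwise_distances(pts)
print(d[0][1])  # 5.0
```

Tree-based and approximate methods exist precisely to avoid this quadratic cost on large N, which is why the N-body problem counts as a computational giant.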
Other standard types of optimization problems, e.g., geometric programming, are likely to be applied in data analysis in the near future. The sixth giant is the integration of functions, which is required for fully Bayesian inference, and also in non-Bayesian settings, most notably random effects models. The integrals that appear in statistics are often expectations, and the frontier is the high-dimensional integrals arising in Bayesian models for modern problems. Approaches to this problem include Markov chain Monte Carlo, or sequential Monte Carlo in some cases; approximate Bayesian computation (ABC), operating on summary data; and population Monte Carlo, a form of adaptive importance sampling. Alignment problems are the seventh giant, consisting of problems involving matchings between two or more data objects or datasets, such as data integration and data fusion. These fundamental alignment problems are usually solved before further data analysis is performed.

Correlation vs. causation

The most significant part of the big data concept is the fundamental and innovative ideas that change how people interact with the world. The enrichment of available data enables people to consider the entire system rather than taking a few samples, so that scientists can discover trends or phenomena that cannot be revealed with small data. The idea of big data always encourages thinking bigger, broadening the horizon to cover a large scope rather than focusing on a few small areas. Moreover, big data analysis focuses on correlation rather than causation, in that correlations between datasets do not necessarily imply causation, and making use of a correlation is sometimes more valuable than exploring the causation behind it (Mayer-Schönberger & Cukier). For simplicity, the associations between two variables can be classified into three types, i.e., causation, common response and confounding. Causation means a direct cause-and-effect connection between variables, revealing that they are strongly correlated. Common response means that the association between variables is in fact caused by another, lurking variable: the observed variables change in response to changes in the hidden variable, even though the observed variables have no direct causal link. Two variables are confounded when their effects on a response variable cannot be distinguished from each other; the confounded variables may be either explanatory variables or lurking variables (Moore & McCabe). If the association between just two variables is already this subtle, it is no wonder that practical problems with multiple variables are so complex to understand. Focusing on correlation makes practical data mining applications much easier, without much effort spent on causation. Machine learning, the typical artificial intelligence method behind big data analysis, is a 'black box' model: users feed inputs to the machine learning algorithm and get outputs from it without knowing what really happens in the training process. This is practically useful without necessarily understanding the causation behind it, but causation is what scientists are always seeking. For academic purposes, detecting potential correlations in the data is not the ultimate goal. Instead, big data should help the development of science in such a way that novel associations between big datasets can be detected to motivate further research into causation. From the control theory perspective, scientific exploration iteratively opens the 'black box' of the objective world. On the other hand, a scientific model developed from analysing large datasets can then be validated through the correlation of the datasets. Figure 2 gives a clear illustration of these ideas. The major difference is that science focuses on causation, either derived from correlation or validated through correlation, while big data analysis in industry focuses on the value of correlations found in the data.

Relevance to hydroinformatics

Hydroinformatics, which originated from computational hydraulics, comprises the application of information and communications technologies (ICTs) to the understanding and management of the waters of the world (Abbott), addressing the increasingly serious problems of the equitable and efficient use of water for different purposes. From its first definition, hydroinformatics has aimed to integrate artificial intelligence into numerical simulation and modelling, and to shift computation-intensive analysis towards information-based research. The two main lines of hydroinformatics, data mining for knowledge discovery and knowledge management (Abbott), are strongly dependent on information of which data, both textual or
Figure 2 | The relationship between the datasets, correlation and causation.
non-textual, are the major carrier. Data from smart meters, smart sensors and smart services, remote sensing, earth observation systems, etc., will push hydroinformatics into the inevitable big data era. The challenge of big data and data mining for environmental projects is among the most pressing in the near future (Pierson). One simple example of big data analysis is text mining. For the 50th anniversary of Water Resources Research, text mining was carried out on the highly cited papers of each decade of the journal to produce word clouds, shown in Figure 3, which provide a visual representation of the themes emphasized in each decade (Rajaram et al.).

Data for hydroinformatics

In general, water-related problems are quite complex due to the interrelationships between water-related environmental, social and business factors. The data generated and collected for hydroinformatics feature huge volumes and multiple types. For the purpose of simplification, and without loss of generality, the data sources for hydroinformatics can be classified into three dimensions, i.e., the natural dimension, the social dimension and the business dimension.

The natural dimension concerns water as one important component of the natural environment. Understanding the water cycle, the temporal and spatial distribution of water, and the interaction of water and the environment is part of the objectives of hydroinformatics for improving water resource management and flood and drought management. Water-related data include measurements of precipitation (rainfall, snow and hail), river flow, water quality, soil moisture, soil characteristics, groundwater conditions, air temperature and humidity, solar flux, etc. Observation methods have developed from local stations for point measurements to remote sensing by radar, satellites and drones. Earth observation satellites are generating huge volumes of data, including weather- and water-related information. ESA launched SMOS for soil moisture observation in 2009, and will launch ADM-Aeolus for atmospheric dynamics observation in 2017 (ESA). NASA launched SMAP to map soil moisture and determine the freeze or thaw state in 2015 (SMAP). The GPM mission, launched in 2014, aims to provide global rain and snow observations, building upon the success of TRMM launched in 1997 (NASA). EUMETSAT has two generations of active METEOSAT satellites in geostationary orbit and a series of three polar-orbiting METOP satellites for weather nowcasting, forecasting and understanding climate change. Without doubt, the increasing amount of earth observation data, including precipitation, soil moisture, wind speed, etc., will improve the understanding of the global water cycle and benefit weather forecasting and flood and drought prediction. Unfortunately, although many satellites have been launched or are to be launched, the huge amount of available data is rarely used; only 3–5% of the data is used on a daily average, while billions of dollars are invested annually (Selding). Apart from earth observation data, reanalysis data are another important information source with high data quality.
information source is not limited to the observation of the current situation and the archived past situation; the model generated data cannot be neglected. Reanalysis of archived observations is achieved by combining advanced forecast models and data assimilation systems to create global datasets of the atmosphere, land surface and oceans, as an operational analysis dataset will suffer from inconsistency due to the frequent improvements of the forecast models. The NCEP Climate Forecast System Reanalysis includes over 80 variables, goes back to 1948 and is continuing (National Centers for Environmental Prediction ). ECMWF has series of ERA projects for global atmospheric reanalysis that can be traced back to 1957 (ECMWF ). The Japan Meteorological Agency conducted the JRA-55 project for a high-quality homogeneous climate dataset covering the last half century (Kobayashi et al. ). The model generated data are four dimensional, three dimensions in space and one in time, and of high spatial and temporal coverage and resolution, resulting in huge volumes of data, which means that hydroinformatics is entering a data-intensive era. Utilization of the currently available data is challenging due to the uncertainties of the data, the challenges of processing and the lack of ideas of data utilization. In the big data era, it is encouraged to make the best of the huge amount of data with tolerance of the uncertainties. The processing of large amounts of datasets is becoming easier with the development of computing tools. The lack of creative ideas is the main limitation of the utilization of data. A frontier application example is a prototype software that automatically finds an ideal location for hydro-power based on over 30 freely remote sensing and environmental datasets in the UK (Leicester ). The social dimension is about the interaction of water environment and human society. 
With the digitalization of textual information available online and the explosion of social media, textual mining technologies enable a new research area: the public attitude towards certain issues. For instance, five million scientific articles have been analysed to explore the impact of the Fukushima disaster on the media attitude towards nuclear power (Lansdall-Welfare et al.). Similar ideas can be applied to discover water-related issues, e.g., the social attitude towards climate change, water saving, water policy, etc.

Figure 3 | Word clouds of highly cited papers from Water Resources Research in each decade as an example of big data related to water (Rajaram et al. 2015).
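As a toy sketch of this kind of text mining (the keyword list and the daily tweet/rainfall data below are invented for illustration, and this is not any published method), one can correlate the daily rate of weather-related words in tweets with gauge rainfall:

```python
import math

# Toy illustration: does the frequency of weather-related words in a day's
# tweets track observed rainfall? All data here are hypothetical.
WEATHER_TERMS = {"rain", "raining", "flood", "umbrella", "downpour", "wet"}

def weather_term_rate(tweets):
    """Fraction of tokens across a day's tweets that are weather-related."""
    tokens = [w.strip(".,!?").lower() for t in tweets for w in t.split()]
    return sum(w in WEATHER_TERMS for w in tokens) / max(len(tokens), 1)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: a day's tweets paired with gauge rainfall in mm.
days = [
    (["lovely sunny day", "picnic in the park"], 0.0),
    (["rain again", "forgot my umbrella, soaked"], 7.2),
    (["flood warning issued", "streets are wet, heavy rain"], 15.4),
    (["clear skies", "great weather for a run"], 0.1),
]
rates = [weather_term_rate(tweets) for tweets, _ in days]
rainfall = [mm for _, mm in days]
print(f"correlation = {pearson(rates, rainfall):.2f}")
```

Real systems such as that of Lampos & Cristianini learn the informative textual features statistically rather than from a hand-picked keyword list, but the underlying idea — correlating an online textual signal with a ground observation — is the same.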
Apart from the discovery of public attitude, the internet is logging the activities of internet users, which can be potentially valuable for discovering real-world situations, as demonstrated by the example of GFD mentioned in the previous section. Twitter data is now attracting many researchers to water environment-related research. Twitter content was found to infer daily rainfall rates in five UK cities, revealing that the online textual features in Twitter were strongly related to the topic with significant inference (Lampos & Cristianini). Two Dutch organisations, Deltares and Floodtags, have developed real-time flood-extent maps based on tweets about floods for Jakarta, Indonesia (Eilander). This method gives disaster management a real-time view of the situation with wide coverage. The enrichment of new media data on the internet enables a new model for scientific research. The new model gathers information from what internet users post online. The users are actually playing the role of information collectors, depositing information about what they observe about the environment to the internet. The internet is like a boundless ocean of data that records how internet users interact with it, and this data ocean has valuable potential for scientists to discover novel correlations with real-world situations. The fundamental data mining techniques behind big data applications such as GFD, estimating precipitation from Twitter, etc., are the same, i.e., to dig out the correlation between the information and the targeted result. The distinction of these analyses is that the social network data application is based on people's mental reaction to certain events, while natural scientific research is mainly based on physically interpretable models. As the behaviour of people is ambiguous to interpret and predict, the big data analysis of social network data is dominated by machine learning and statistical approaches.

The business dimension covers, but is not limited to, water extraction, water treatment, water supply, and waste water collection and treatment. IBM has been a pioneer in utilizing data and computing tools in collaboration with the National Oceanic and Atmospheric Administration (NOAA) to explore the business of weather. They built one of the first parallel processing supercomputers for weather modelling in 1995, named the Deep Thunder Project. Deep Thunder creates 24- to 48-hour forecasts at 1–2 km resolution with a lead time of 3 hours to 3 days, and combines them with other data customized for business purposes, such as to help a utility company prepare for the after-effects of a major storm, or to help airlines and airports manage weather-generated delays by rearranging or combining flights more efficiently (IBM). Another possibility is that, as inspired by the big data application in e-commerce that utilizes accumulated user activity logs for a recommendation system, smart metering data can be integrated with end-user water consumption data, wireless communication networks and information management systems in order to provide real-time information on how, when and where water is being consumed, for both the consumer and the utility (Stewart et al.). The information from the combination of data will be valuable to architects, developers and planners seeking to understand water consumption patterns for future water planning. Smart metering is one example of the ambitious ideas of the Internet of Things as a global infrastructure for the information society, enabling advanced services by interconnecting things based on existing and evolving interoperable ICTs (ITU). Furthermore, the operational data collected by companies in the water industry also have potential value for data mining, for optimizing the system and providing more information for decision-making.

The trend of open data

The increasing number of openly available data sources will benefit the research community, as data is the basic material for data-based research. Open data means data that can be freely used, modified, and shared by anyone for any purpose (Opendefinition). Open data is a further development of free data, in which data is freely licensed for limited purposes and certain users, while closed data is usually restricted by copyright, patents or other mechanisms. The goals of the open data movement are similar to those of other 'open' movements, such as open source, open hardware, open content and open access. The data owner may not have the appropriate ideas and techniques to produce extra value from the data, while, on the other hand, people with innovative ideas and the ability to process the data may find it difficult to find and access the data they need. The open data movement will activate the combination of data, data mining methods and new ideas to create additional value by removing the barrier between the data providers and the data users. Thus, the research data and its products
can achieve full value and accelerate future research only when being open. Multiple national governments have created websites for the open delivery of their data for transparency and accountability, e.g., Data.gov for the US government, Data.gov.uk for the UK government, the European Union Open Data Portal (http://open-data.europa.eu/) and Canada's Open Government portal (http://open.canada.ca/en), etc. For open data in science, the World Data System (WDS) of the International Council for Science was created in 2008, based on the legacy of the World Data Centres, to ensure universal and equitable access to quality-assured scientific data, data services, products and information. The National Climatic Data Center, containing huge amounts of environmental, meteorological and climate datasets, is the world's largest archive of weather data. SWITCH-ON is a European project that works towards sustainable use of water resources, a safe society and advancement of hydrological sciences based upon open data. The project aims to build the first one-stop-shop portal of open data, water information and its users in one place (SWITCH-ON). EarthCube is a project launched in 2011 that develops a common cyberinfrastructure for the purpose of collecting, accessing, analysing, sharing and visualizing all forms of data and related resources for understanding and predicting the complex and evolving solid Earth, hydrosphere, atmosphere and space environment systems, through the use of advanced technological and computational capabilities (EarthCube). The ongoing movement of open data can boost data-based research and data usage by removing the legal restrictions on data use. Many data portals are being created for data sharing through web services, with much more powerful data search tools where users can find data by location, time, data type, etc.

Issues of data sharing

The trend of open data will motivate data sharing and comprehensive utilization of data by removing the restrictions of patents and copyrights, but other issues of data sharing necessitate cooperative effort and innovative ideas. Data format is one of them. As the datasets related to water are collected by different organizations in different countries all around the world, how the data are recorded and expressed has not been identical. A very simple example is that even the expression of dates differs: Chinese convention uses 'yyyy-mm-dd'; Europeans use 'dd-mm-yyyy'; and in the US people use 'mm-dd-yyyy'. This issue with dates was tackled by ISO 8601, an international agreement to use 'yyyy-mm-dd' as the format for dates. Other issues may include, but are not limited to, the observation resolution, both temporal and spatial, the expression of missing values, the data processing methods, the units of the data, etc. As the characteristics of different datasets vary, the data should be clearly tagged with metadata, which is essential for the data user to carry out data analysis. Metadata is information about information: it describes, explains, locates, or otherwise makes it easier to retrieve, use or manage an information resource (Guenther & Radebaugh). The metadata should capture the basic characteristics of a data or information resource, including the who, what, when, where, why and how of the data resource. In the big data era, ad hoc data analysis for simple tasks may be time-consuming when the data size becomes extremely large. It can be worthwhile for the data provider to perform feature extraction offline and incorporate these features into the metadata, such as mean values, extremes, and the general trend or pattern, prior to the data release. Such pre-processing of data can make it much easier for data users to find the data they need.

Another challenging issue of integrated data usage is the variety of data formats, varying from simple binary or CSV formats to advanced self-describing formats such as the Network Common Data Form (netCDF), Hierarchical Data Format (HDF), GRIdded Binary (GRIB), Extensible Markup Language (XML), WaterML, etc. For satellite data, High Rate Information Transmission (HRIT), Low Rate Information Transmission (LRIT), High Rate Picture Transmission (HRPT) and Low Rate Picture Transmission (LRPT) are the CGMS standards agreed upon by satellite operators for the dissemination of digital data to users via direct broadcast. The difference is that HRIT and LRIT transmit data originating from geostationary satellites, while HRPT and LRPT transmit data originating from low earth orbit satellites. Also, as their names suggest, they operate at different data bandwidths.
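The provider-side feature extraction suggested above can be sketched as follows; the field names and the crude trend measure are illustrative assumptions rather than any metadata standard:

```python
import json
import statistics

# Sketch of provider-side feature extraction: summarise a (hypothetical)
# daily rainfall series offline and ship the summary as metadata, so users
# can judge a dataset without downloading it in full.
def build_metadata(name, values, units="mm/day"):
    n = len(values)
    half = n // 2
    return {
        "name": name,                       # what the resource is
        "units": units,
        "records": n,
        "mean": round(statistics.fmean(values), 2),
        "min": min(values),                 # extremes
        "max": max(values),
        # crude trend indicator: second-half mean minus first-half mean
        "trend": round(statistics.fmean(values[half:]) -
                       statistics.fmean(values[:half]), 2),
    }

series = [0.0, 3.2, 10.5, 0.4, 0.0, 7.8, 22.1, 1.3]
meta = build_metadata("gauge_0042_daily_rainfall", series)
print(json.dumps(meta, indent=2))
```

In practice such summaries would be embedded in a self-describing container (e.g., netCDF attributes) or a catalogue record rather than a bare JSON object, but the principle — compute once at the provider, search cheaply at the user — is the same.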
The World Meteorological Organization (WMO) has two binary data formats: the Binary Universal Form for the Representation of meteorological data (BUFR), to represent any meteorological dataset employing a continuous binary stream, and the GRIB format, to transmit large volumes of gridded data to automated centres over high-speed telecommunication lines using modern protocols. The Man computer Interactive Data Access System (McIDAS) goes beyond a simple data format to a set of tools for analysing and displaying meteorological data for research and education. NetCDF is a machine-independent, self-describing binary data format standard and a set of software libraries for exchanging array-based scientific data; it features self-describing, portable, scalable, appendable, shareable and archivable data. HDF (HDF4 or HDF5) is a library and multi-object file format for the transfer of graphical and numerical data between computers, developed at the US National Center for Supercomputing Applications (NCSA). HDF supports several different data models in a single file, including multidimensional arrays, raster images and tables, which each have their specific data type and API. XML is a general-purpose markup language, primarily used to share information via the Internet (WMO). WaterML2 is a variation of XML specified for water observation data, allowing data exchange across information systems (OGC). The Standard Hydrologic Exchange Format (SHEF) was created to store and exchange hydrometeorological data in the 1980s, and is readable by both humans and machines (Bissell et al.).

The variety of data formats may cost scientists much time dealing with different formats rather than working on scientific problems when utilizing multiple datasets from a variety of sources. To enhance the accessibility of hydrological data, GEOWOW (GEOSS interoperability for Weather, Ocean and Water) contributes to international standardization processes within the Hydrology Domain Working Group, a joint working group of the Open Geospatial Consortium (OGC) and the WMO. GEOWOW developed for the first time a common global exchange infrastructure for hydrological data based on standardized formats and services. GEOWOW aims to evolve GEOSS in the aspect of water, and is part of the GEOSS Common Infrastructure (GCI) (GEOWOW). In addition, a middleware layer that connects the data I/O scripts and the data analysis tools may be a feasible alternative featuring reusability. Middleware is the glue of software: it usually lies between the application layer and the system layer, or connects different software components. Data-based analysis necessitates such middleware to handle large datasets from different sources, in a variety of data formats, with many computational models, as well as to be compatible with multiple programming languages. Open source development has to be implemented for such middleware to enable the whole research community to contribute to and benefit from it.

Boosts from cloud computing

The tools developed in the big data era, such as Hadoop MapReduce and Apache Spark, can handle extremely large datasets within tolerable runtimes, but knowledge and techniques to set up and manage the tools are required. Commercial cloud computing services are available to scientists as an alternative, where data storage and processing can be done in the cloud, such as Microsoft Azure, Amazon Elastic Compute Cloud, Google Compute Engine, Rackspace, Verizon and GoGrid. The commercial cloud has a usage-based pricing policy, making the computing job more cost-effective than implementing local clusters. Cloud computing is scalable to suit the job, and does not require extensive knowledge of configuring local clusters. US NOAA has launched its Big Data Project in collaboration with Amazon Web Services, Google Cloud Platform, IBM, Microsoft and the Open Cloud Consortium (Department of Commerce). The NOAA data will be brought to the cloud platforms together with big data processing services, such as Google BigQuery and Google Cloud Dataflow, to explore and create new findings. NOAA's Big Data Project indicates a coming trend of combining the tremendous volume of high-quality data held by the government with private industry's vast infrastructure and technical capacity for data management and analysis.

BIG DATA FOR PRECIPITATION ESTIMATES

The available precipitation data

Although computer scientists have attempted to use newly emerged social network data to estimate rainfall, as mentioned in the previous section, it is like a 'dessert'; the
main data sources of rainfall measurement are rain gauges, weather radars and satellites, which are the 'main course'. The 'dessert' has some obvious shortcomings apart from its advantages of low data cost and quick response. The use of Twitter data to estimate rainfall or flood situations, as mentioned in the previous section, requires the prevalence of Twitter at a local level, e.g., a developed urban area with a large number of users and wide internet access, which implies that the spatial coverage and resolution of the data can be poor in less developed cities and rural areas. The temporal length of the Twitter data is also significantly less than that of meteorological records, which can be traced back to 1861, while Twitter was only launched in 2006. Although the low cost and quick response of the new data sources foretell a possible future direction, the existing data sources for rainfall measurement have accumulated a vast quantity of data which can substantially benefit from big data technology. Table 1 shows information on some widely used datasets containing precipitation data. The features of precipitation data from different sources vary significantly due to the different measuring mechanisms and processing algorithms.

Table 1 | Information of some datasets containing precipitation

Datasets | Data source | Data size | Spatial and temporal coverage and resolution | Reference
GPCC Global Precipitation Climatology Centre monthly precipitation dataset | Gauge based | 4.2 GB | Monthly values from 1901/01; varies: 0.5°, 1.0° and 2.5° global grids | (Beck et al.)
The Next Generation Weather Radar (NEXRAD) | Radar | 73.1 TB | Comprising 160 sites throughout the US; 1° grid; 1-hour, 3-hour and total-storm accumulated data since 1988 | (NCEI)
Global Historical Climatology Network Daily Database | Station record | 22 GB | Daily since 1861; contains records from over 80,000 stations in 180 countries and territories | (Menne et al.)
CPC Global Summary of Day/Month Observations | Station record | 13.7 GB | Approx. 8,900 actively reporting stations; global daily data since 1979 | (Climate Prediction Center, NCEP, National Weather Service, NOAA, U.S. Department of Commerce)
GPCP (Daily): Global Precipitation Climatology Project 1DD product | Geostationary infrared satellite | 0.78 GB | Daily rainfall accumulation globally on a 1° latitude/longitude grid, starting in October 1996 | (Pendergrass)
The Tropical Rainfall Measuring Mission (TRMM) | Satellite | 236 GB | 3-hourly from 1 January 1998 to mid-2017; 0.25° latitude/longitude grid over the domain 50°S–50°N | (NASA)
The Global Precipitation Measurement Mission (GPM) | Satellite | N/A | Half-hourly and monthly precipitation estimates on a 0.1° latitude/longitude grid over the domain 60°N–60°S | (NASA)
NCEP Climate Forecast System Reanalysis | Model reanalysis | 67 TB | 6-hourly from 1979; 0.1° latitude/longitude grid globally | (Saha et al.)

Data fusion

Hydrologists are pursuing fine and accurate estimates of precipitation data in both space and time for drought and flood management. Rain gauge observations are direct
measurements of rainfall on the ground, but are often sparse in regions with complex landforms, clustered in valleys or populated areas, and of poor temporal consistency. Thus, gauge data may not be able to provide sufficient information about the spatial extent and intensity of precipitation (Verdin et al.). Estimating precipitation from satellites provides an alternative method for collecting rainfall data, with the inherent advantage of detecting the spatial distribution of the precipitation. The sources differ in their observation mechanisms, resulting in substantial differences in the features of the observation results: satellite-based measurement is intermittent, area-averaged observation, while rain gauge measurement is continuous, point observation (Arkin & Ardanuy). There is a trade-off between accuracy and spatial coverage for each data source. The observations of rain gauges and radar give the best measurement of actual rainfall but with the most limited spatial coverage. Geostationary satellites with infrared sensors are less accurate, but their coverage is broad and continuous. Between them are the microwave sensors on low earth orbits, which provide more reliable estimates of precipitation but with incomplete temporal sampling and coarse spatial resolution (Gorenburg et al.). In the big data era, it is encouraged to make use of the joint data from various sources, and it is promising to fuse the existing precipitation data from heterogeneous data sources. As heterogeneous data sources possess different advantages and disadvantages, they can complement each other in an optimal way (Sander & Beyerer).

Verdin et al. used a Bayesian data fusion model with ordinary kriging to blend infrared precipitation data and gauge data over the Central and South American region. The method was applied to pentadal and monthly total precipitation fields during 2009; the blending significantly improved upon the satellite-derived estimates and is also competitive in its ability. Wang et al. assessed the performance of the multiscale Kalman smoother-based framework in precipitation fusion, testing the algorithm on 2003 hourly NEXRAD MPE precipitation data at two spatial resolutions, i.e., 1/8° and 1/32°, covering 152,175 km² in the US. Linear weighted algorithms, multiple linear regression and artificial neural networks have also been used to fuse remote sensing data with ground data (Srivastava et al.). All the previous data fusion studies indicate that data fusion processes can generally improve data quality over the data from a single source. Nevertheless, there is an apparent limitation of the previous studies in that they only proposed a methodology and tested it with limited spatial and temporal coverage; in other words, the amount of data was limited, so they did not concern themselves much with the efficiency of the algorithm, which is the key factor in processing big data.

In fact, applying data fusion techniques to the existing terabytes of precipitation data is a tough issue for hydrologists, as processing the huge amount of data generated every day will be extremely time-consuming. It becomes more problematic when dealing with the accumulated historical data, which are equally valuable. Owing to the development of big data, the ability of a cluster of computers to process large amounts of data has been greatly improved, primarily by implementing the idea of parallel computing, which is to subdivide the job into small portions and involve a cluster of computers working simultaneously. Thus, parallel computing poses a requirement on the fusion algorithm: the data fusion process must be separable into individual parts. Data fusion algorithms, e.g., the Bayesian kriging method proposed by Verdin et al., have the disadvantage of being snapshot methods; in other words, they are temporally independent, while the time series of precipitation are available for possible improvements in accuracy. However, this snapshot feature makes the fusion process easy to separate temporally for parallel computing, which can effectively speed up the processing procedure.

Hadoop MapReduce was initially designed to handle textual data, and how it performs on processing high-volume remote sensing image data has been assessed by only a few papers, of which the results were positive. Almeer researched the performance of eight pixel-level image processing algorithms using Hadoop, with the result that the method is scalable and efficient in processing multiple large images used mostly for remote sensing applications, and the Hadoop runtime is significantly lower than the runtime of a single PC. Lv et al. developed a parallel model of the K-means clustering algorithm based on Hadoop MapReduce to process satellite remote sensing images, for which the results are acceptable and the runtime drops.
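A minimal sketch of why snapshot-type fusion parallelises so naturally: each time step is fused independently, so the series can be split across workers. The inverse-error-variance weighting below is a stand-in for a real fusion scheme (e.g., Bayesian kriging), and all numbers are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# A "snapshot" (temporally independent) fusion step: each time step is
# fused on its own, so the time series can be split across workers.
def fuse_snapshot(obs):
    """Fuse one time step: obs maps source -> (estimate_mm, error_variance).
    Sources with smaller error variance get larger weight."""
    weights = {src: 1.0 / var for src, (_, var) in obs.items()}
    total = sum(weights.values())
    return sum(w * obs[src][0] for src, w in weights.items()) / total

# Hypothetical hourly estimates from three sources: (value, error variance).
time_series = [
    {"gauge": (5.0, 0.5), "radar": (6.0, 1.0), "satellite": (8.0, 4.0)},
    {"gauge": (0.0, 0.5), "radar": (0.4, 1.0), "satellite": (1.5, 4.0)},
    {"gauge": (12.0, 0.5), "radar": (10.5, 1.0), "satellite": (9.0, 4.0)},
]

# Because each snapshot is independent, the map over time steps can be
# farmed out to a pool of workers (or, at scale, a MapReduce/Spark cluster).
with ThreadPoolExecutor(max_workers=4) as pool:
    fused = list(pool.map(fuse_snapshot, time_series))
print([round(v, 2) for v in fused])
```

The thread pool here stands in for a cluster: swapping `pool.map` for a distributed map over years of hourly snapshots is exactly the decomposition that makes snapshot fusion attractive for parallel computing.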
It is reasonable to believe that Hadoop MapReduce on a cluster of machines will work for the data fusion job, as parallel computing is not restricted by the type of data. Thus, the future of fusing precipitation data with the aid of big data techniques is promising. The reasons are, as mentioned above, that the fusion of heterogeneous precipitation data sources can offer better results than data from each single source, and that the fusion process can be accelerated by parallel computing.

CONCLUSIONS

The big data era is an upcoming trend that no one can escape from. Scientists are expected to embrace the big data era rationally, without being blinded by the overwhelming trend. The concept of big data originated from the popularization of the internet, as the digitalization of information around the world has become much easier and cheaper for future data mining purposes. The commercial value, e.g., precision marketing and data-based decision-making, behind the expanding datasets makes the term 'big data' extremely trendy. The idea of big data is very adaptable, and can be valuable for academic purposes as well. Hydroinformatics can benefit from the expanding amount of data collected, generated and made available to the research community. Data from smart meters, smart sensors and smart services, remote sensing, earth observation systems, the Internet of Things, etc., will propel hydroinformatics into the inevitable big data era. The data usage can be categorized into three dimensions: the natural dimension, analysing climate change, flood and drought management and the global water cycle; the social dimension, focusing on the interaction between the water environment and human society; and the business dimension, using data-based decision-making systems for optimizing water resource management systems and future water planning. Data processing tools like parallel computing and distributed storage have been developed to help users handle large datasets of hundreds of GBs or even TBs in tolerable time, making real-time applications possible and interactive human–computer analysis feasible. Cloud computing platforms will make it unnecessary to download the data to a local machine and run the model locally, and will provide superior computing efficiency in the future cloud computing era.

The challenges of big data have also been covered in this paper. Data sharing is one of them, as water-related datasets have a variety of formats, with different observation methods, generated by different organizations. Either a general standardized format for data exchange, or an open-sourced data management tool that glues together all relevant scripts for reading and writing different data formats, would benefit the research community in handling datasets. Many data portals based on web services are being created for data exchange, encouraging data-based research. Contradiction is another challenge of big data, in that the correlation between datasets is practically more useful than the causation between datasets, while causation is the purpose of scientific research. The correlations identified from a vast range of datasets ought to help researchers explore new potential causation between phenomena for further research, instead of only replacing logic-based models. The real challenge in the near future is how to make the best use of the available data, as currently there is little being done about big data relevant to hydroinformatics. Thus, the purpose of this paper is to encourage the research community to develop new ideas for the big data era.

Precipitation estimation is one possible area in which to make a start, as data related to precipitation are being collected from multiple sources, such as rain gauges, weather radars and satellites. Global precipitation data collected from NEXRAD and GPM can reach tens of TBs, which is a big data problem. One promising future is to fuse precipitation data from multiple sources – weather radar, satellite remote sensing, rain gauge and model reanalysis data – to generate a rainfall estimation product with better spatial and temporal resolution and minimized uncertainty. The parallel computing, distributed data storage paradigms and cloud computing platforms developed during the explosion of information are essential to accelerate the data processing procedure. The implementation of big data in precipitation data fusion and the parallel computing model are the tip of the iceberg in the big data era. The utilization of available data is not limited to improving precipitation estimation. The future should rely on an 'all data revolution', in which innovative analytical ideas, utilizing data from all existing and new sources and providing a deeper, clearer understanding, will significantly shift how we recognize the world (Lazer et al. 2014).
REFERENCES

Abbott, M. B. Hydroinformatics: Information Technology and the Aquatic Environment. Avebury Technical, Aldershot, UK.
Abbott, M. Introducing hydroinformatics. J. Hydroinform. 1, 3–19.
Almeer, M. H. Hadoop MapReduce for remote sensing image analysis. Int. J. Emerg. Technol. Adv. Eng. 2, 443–451.
Apache What Is Apache Hadoop? [Online]. Apache Hadoop. https://hadoop.apache.org/ (accessed 22 June 2015).
Arkin, P. A. & Ardanuy, P. E. Estimating climatic-scale precipitation from space: a review. J. Climate 2, 1229–1238.
Beck, C., Grieser, J. & Schneider, U. Global precipitation analysis products. Global Precipitation Climatology Centre (GPCC), DWD.
Bennett, J. & Lanning, S. The Netflix Prize. In: Proceedings of KDD Cup and Workshop, New York. ACM, 35.
Bissell, V. C., Pasteris, P. A. & Bennett, D. G. Standard hydrologic exchange format (SHEF). J. Water Resour. Plann. Manage. 110, 392–401.
Burn-Murdoch, J. Study: less than 1% of the world's data is analysed, over 80% is unprotected [Online]. The Guardian. http://www.theguardian.com/news/datablog/2012/dec/19/big-data-study-digital-universe-global-volume (accessed 18 June 2015).
Butler, D. When Google got flu wrong. Nature 494, 155–156.
Climate Prediction Center, National Centers for Environmental Prediction, National Weather Service, NOAA, U.S. Department of Commerce CPC Global Summary of Day/Month Observations, 1979-continuing. Research Data Archive at the National Center for Atmospheric Research, Computational and Information Systems Laboratory, Boulder, CO, USA.
Council, N. R. Frontiers in Massive Data Analysis. The National Academies Press, Washington, DC, USA.
Dahl, E. PC Drive Reaches 500GB [Online]. PCWorld. http://www.pcworld.com/article/120102/article.html (accessed 10 May 2015).
Dean, J. & Ghemawat, S. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 107–113.
De Mauro, A., Greco, M. & Grimaldi, M. What is big data? A consensual definition and a review of key research topics. In: 4th International Conference on Integrated Information, Madrid, Spain. doi: 2341.5048.
Department of Commerce U.S. Secretary of Commerce Penny Pritzker Announces New Collaboration to Unleash the Power of NOAA's Data [Online]. Department of Commerce. https://www.commerce.gov/news/press-releases/2015/04/us-secretary-commerce-penny-pritzker-announces-new-collaboration-unleash (accessed 2 December 2015).
EarthCube About EarthCube [Online]. Earthcube.org. http://earthcube.org/info/about (accessed 15 July 2015).
ECMWF ECMWF Climate Reanalysis [Online]. ECMWF. http://www.ecmwf.int/en/research/climate-reanalysis (accessed 16 July 2015).
Eilander, D. Twitter used to create real-time flood maps [Online]. https://www.deltares.nl/en/news/twitter-used-to-create-real-time-flood-maps/ (accessed 27 April 2015).
ESA Earth Explorers overview [Online]. European Space Agency. http://www.esa.int/Our_Activities/Observing_the_Earth/Earth_Explorers_an_overview (accessed 12 February 2016).
Gantz, J. & Reinsel, D. The digital universe in 2020: big data, bigger digital shadows, and biggest growth in the far east. IDC iView: IDC Analyze the Future 2007, 1–16.
GEOWOW GEOWOW Water [Online]. GEOWOW. http://www.geowow.eu/water.html (accessed 25 July 2015).
Ghemawat, S., Gobioff, H. & Leung, S.-T. The Google file system. ACM SIGOPS Operating Systems Review. ACM, 29–43.
Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S. & Brilliant, L. Detecting influenza epidemics using search engine query data. Nature 457, 1012–1014.
Gorenburg, I. P., McLaughlin, D. & Entekhabi, D. Scale-recursive assimilation of precipitation data. Adv. Water Resourc. 24, 941–953.
Guenther, R. & Radebaugh, J. Understanding Metadata. National Information Standards Organization (NISO) Press, Bethesda, MD, USA.
Hartin, E. & Watson, K. HGST Announces New Innovations that Set the Standard for Performance, Reliability, Capacity, Agility and Efficiency for Helping Companies Harness the Power of Data [Online]. http://www.hgst.com/press-room/press-releases/HGST-unveils-intelligent-dynamic-storage-solutions-to-transform-the-data-center (accessed 10 May 2015).
IBM IBM100 – Deep Thunder [Online]. IBM's 100 Icons of Progress: IBM. http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/deepthunder/ (accessed 14 December 2015).
ITU Internet of Things Global Standards Initiative [Online]. ITU. http://www.itu.int/en/ITU-T/gsi/iot/Pages/default.aspx (accessed 28 July 2015).
Kobayashi, S., Ota, Y. & Harada, Y. The JRA-55 reanalysis: general specifications and basic characteristics. J. Meteor. Soc. Japan 93, 5–48.
Lampos, V. & Cristianini, N. Nowcasting events from the social web with statistical learning. ACM Trans. Intell. Syst. Technol. (TIST) 3, 72.
Laney, D. 3D data management: controlling data volume, velocity and variety. META Group Res. Note 6, 70.
Lansdall-Welfare, T., Sudhahar, S., Veltri, G. & Cristianini, N. On the coverage of science in the media: a big data study on the impact of the Fukushima disaster. In: 2014 IEEE International Conference on Big Data. IEEE, pp. 60–66.
Leicester, U. O. Big data technology finds ideal river locations to generate hydro-power [Online]. ScienceDaily. http://www.sciencedaily.com/releases/2015/04/150413075144.htm (accessed 28 July 2015).
Uncorrected Proof 16
Y. Chen & D. Han
|
Big data and hydroinformatics
mapreduce. Web Information Systems and Mining. SpringerVerlag, Berlin, Heidelberg, Germany. Mayer-Schonberger, V. & Cukier, K. Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt, Boston, MA, USA. Menne, M. J., Durre, I., Vose, R. S., Gleason, B. E. & Houston, T. G. An overview of the global historical climatology network-daily database. J. Atmos. Ocean. Technol. 29, 897– 910. Moore, D. & McCabe, G. Introduction to the Practice of Statistics, 5th edn. W. H. Freeman, New York, USA. NASA Global Precipitation Measurement (GPM) Mission Overview [Online]. Pmm.nasa.gov. http://pmm.nasa.gov/ GPM (accessed 11 May 2015). NASA Readme for TRMM Product 3B42 (V7) [Online]. GES DISC NASA. http://disc.sci.gsfc.nasa.gov/precipitation/ documentation/TRMM_README/TRMM_3B42_readme. shtml (accessed 11 May 2015). National Centers For Environmental Prediction, N. W. S. N. U. S. D. O. C. NCEP/NCAR Global Reanalysis Products, 1948-continuing. Research Data Archive at the National Center for Atmospheric Research, Computational and Information Systems Laboratory, Boulder, CO, USA. NCEI NEXRAD Products [Online]. http://www.ncdc.noaa. gov/data-access/radar-data/nexrad-products (accessed 15 May 2015). OGC OGC@WaterML [Online]. Open Geospatial Consortium. http://www.opengeospatial.org/standards/ waterml (accessed 4 July 2015). OPENDEFINITION Defining Open in Open Data, Open Content and Open Knowledge [Online]. Opendefinition.org. http://opendefinition.org/od/ (accessed 5 July 2015). ORACLE. Time Capsule, 1956 Hard Disk [Online]. Oracle. http://www.oracle.com/technetwork/issue-archive/2014/14jul/o44timecapsule-2219543.html (accessed 12 May 2015). Pendergrass, A. The Climate Data Guide: GPCP (Daily): Global Precipitation Climatology Project [Online]. https:// climatedataguide.ucar.edu/climate-data/gpcp-daily-globalprecipitation-climatology-project (accessed 15 May 2015). Perenson, M. Hitachi Introduces 1-Terabyte Hard Drive [Online]. PCWorld. 
http://www.pcworld.com/article/ 128400/article.html (accessed 10 May 2015). Pierson, L. Civil Engineer Turned Environmental Data Scientist Harnesses Big Environmental Data at UNESCOIHE [Online]. statisticsviews.com. http://www. statisticsviews.com/details/feature/7136441/Civil-EngineerTurned-Environmental-Data-Scientist-Harnesses-BigEnvironmental-D.html (accessed 15 June 2015). Rajaram, H., Bahr, J. M., Bloschl, G., Cai, X., Scott Mackay, D., Michalak, A. M., Montanari, A., Sanchez-Villa, X. & Sander,
Journal of Hydroinformatics
|
in press
|
2016
G. A reflection on the first 50 years of Water Resources Research. Water Resour. Res. 51, 7829–7837. Saha, S., Moorthi, S., Pan, H.-L., Wu, X., Wang, J., Nadiga, S., Tripp, P., Kistler, R., Woollen, J. & Behringer, D. The NCEP climate forecast system reanalysis. Bull. Am. Meteor. Soc. 91, 1015–1057. Sander, J. & Beyerer, J. Bayesian fusion: modeling and application. In Sensor Data Fusion: Trends, Solutions, Applications (SDF), 2013 Workshop on. IEEE, pp. 1–6. Selding, P. B. D. U.S. Government-leased Satellite Capacity Going Unused [Online]. SpaceNews.com. http://spacenews. com/32581us-government-leased-satellite-capacity-goingunused/ (accessed 20 June 2015). SMAP SMAP Overview [Online]. SMAP, JPL. http://smap.jpl. nasa.gov/observatory/overview/ (accessed 5 July 2015). Snijders, C., Matzat, U. & Reips, U.-D. Big data: Big gaps of knowledge in the field of internet science. Int. J. Internet Sci. 7, 1–5. Srivastava, P. K., Han, D., Rico-Ramirez, M. A., Al-Shrafany, D. & Islam, T. Data fusion techniques for improving soil moisture deficit using SMOS satellite and WRF-NOAH land surface model. Water Resour. Manage. 27 (15), 5069–5087. Stewart, R. A., Willis, R., Giurco, D., Panuwatwanich, K. & Capati, G. Web-based knowledge management system: linking smart metering to the future of urban water planning. Australian Planner 47, 66–74. SWITCH-ON About SWITCH-ON [Online]. SWITCH-ON Project. http://www.project.water-switch-on.eu/ (accessed 15 June 2015). Verdin, A., Rajagopalan, B., Kleiber, W. & Funk, C. A Bayesian kriging approach for blending satellite and ground precipitation observations. Water Resour. Res. 51, 908–921. Wang, S., Liang, X. & Nan, Z. How much improvement can precipitation data fusion achieve with a Multiscale Kalman Smoother-based framework? Water Resour. Res. 47 (3). WMO Satellite Data Formats and Standards [Online]. World Meteorological Organization. http://www.wmo.int/pages/ prog/sat/formatsandstandards_en.php (accessed 6 July 2015). Zaharia, A. 
M., Chowdhury, M., Franklin, M. J., Shenker, S. & Stoica, I. Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. USENIX Association, Berkeley, CA, USA, 10. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S. & Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for inmemory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, Berkeley, CA, USA, pp. 2–2.
First received 21 August 2015; accepted in revised form 22 January 2016. Available online 2 March 2016