Building a Front End Interface for a Sensor Data Cloud
Ian Rolewicz, Michele Catasta, Hoyoung Jeung, Zoltan Miklos, and Karl Aberer
Swiss Federal Institute of Technology (EPFL)
Outline
Background
Frontend of TimeCloud
Experiments
Conclusions
Background
Microsoft SensorMap – web-based visualization of real-time sensor data
Planetary Skin – NASA-Cisco climate change monitoring platform – $39 billion – an online collaborative platform to process data from satellite, airborne, and sea- and land-based sensors around the globe
Swiss Experiment – collaborative environmental research project
Large-Scale Sensor Data Management
[Figure: sensors s1–s3 stream time-stamped readings (e.g., 5.9, 5.8, 6.1) over the internet into a sensors-by-time table; some readings are missing (?) and some are erroneous outliers (e.g., 36.2).]
Systems are typically distributed, federated
Pitfalls
Users are generally not computer geeks – who manages the servers?
Distributed systems are difficult to upgrade – how to apply a patch or a new version?
Capacity is deterministic and inflexible – what if there are more users or data this month?
Complex processing over distributed data is hard – servers must keep running just to obtain the data!
Cloud-Based Sensor Data Management
No maintenance cost
High availability of data
Fast complex data processing – centralized environment
Elastic, easy to scale up
Easy to patch and to roll out new versions
TimeCloud
A cloud system for massive time-series management
– Being developed at the Distributed Information Systems Laboratory, EPFL
– Consists of a frontend and a backend
Basic functionalities – tables, graphs, password-protected and group-based data sharing
Advanced built-in support (ongoing) – detecting/notifying dead sensors, data cleaning – dynamic metadata creation/join – user-subscribed R/MATLAB execution
Third-party software (ongoing) – SensorMap, SwissEx Wiki
Backend
Scalable, fault-tolerant – built upon Hadoop, HBase, and GSN
Adaptive data storage – partition-and-cluster (PaC) store
Model-based cache – minimizes data transmission
Model-coding join – fast distributed join using bitmaps
Frontend
Goals
Simple, intuitive, easy to use
Going beyond just displaying data
Minimize backend workload
Minimize data transmission
Key Approach: Model-Based Processing
• Probabilistic processing • Error estimation • Data cleaning • Prediction • Interpolation • Compression • Fault tolerance …
Example queries:
Continuous moving queries – give an (in-car) pollution update every 30 minutes
Aggregate queries – COx emitted yesterday in the Lausanne center
[Figure: layered architecture – mobile sensor data (pollution values: incomplete, inaccurate, correlated sensor readings) feeds a model-based middle layer with user-defined models, sitting on top of a DBMS that stores the raw sensor values.]
Models
• Regression models (e.g., linear)
• Approximation models (e.g., Chebyshev)
• Correlation models (e.g., GAMPS)
• Probabilistic models (e.g., HMM)
• Interpolation models (e.g., Kriging)
• Signal processing (e.g., DFT)
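To make the simplest of these concrete, here is a minimal sketch (not TimeCloud's actual code) of fitting a linear regression model to a window of readings and reconstructing approximate values from just two parameters; numpy and all names are assumptions for this illustration.

```python
# Minimal sketch: approximating a window of sensor readings with a
# linear regression model. Illustrative only; TimeCloud's model code
# is not shown in the deck.
import numpy as np

def fit_linear_model(timestamps, values):
    """Fit value ~ a*t + b and return the two model parameters."""
    a, b = np.polyfit(timestamps, values, deg=1)
    return a, b

def approximate(params, timestamps):
    """Reconstruct approximate readings from the model parameters."""
    a, b = params
    return a * np.asarray(timestamps) + b

# Example: six readings compress to two floats.
t = np.array([1, 2, 3, 4, 5, 6])
v = np.array([5.9, 6.1, 5.8, 6.0, 6.1, 6.2])
params = fit_linear_model(t, v)
print(approximate(params, t))                     # approximated series
print(np.abs(v - approximate(params, t)).max())   # max approximation error
```

Once the two parameters are stored, the same model can interpolate missing readings and flag outliers by their distance from the fitted line.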
Model-Based Views in DBs – MauveDB [SIGMOD’06], FunctionDB [SIGMOD’08]
Challenges in MSD
Model-Based Processing in the Frontend
Model-based views – show approximate results first, instead of actual data; fetch the actual data only when users ask for it (e.g., via a button in the GUI) – less data transmission, fast visualization
Model caching – cache the model parameters and reuse them when switching from table visualization to graph visualization, and vice versa
Incremental visualization – bring only what you see (see the sketch after this list)
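A minimal sketch of how such a model cache could behave, assuming a hypothetical backend exposing fetch_model_params and fetch_raw_values calls (neither is TimeCloud's real API):

```python
# Sketch of the frontend's model cache: model parameters are fetched
# once and reused across table and graph views; raw values are fetched
# only on explicit request. All names are assumptions.
class ModelCache:
    def __init__(self, backend):
        self.backend = backend
        self.params = {}  # (sensor_id, start, end) -> model parameters

    def approximate(self, sensor_id, start, end):
        """Serve approximate values from cached model parameters."""
        key = (sensor_id, start, end)
        if key not in self.params:                    # one backend trip
            self.params[key] = self.backend.fetch_model_params(*key)
        a, b = self.params[key]                       # e.g., linear model
        return [a * t + b for t in range(start, end)]

    def full_precision(self, sensor_id, start, end):
        """Fetch raw readings only when the user explicitly asks."""
        return self.backend.fetch_raw_values(sensor_id, start, end)
```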
Implementation
Web-based interface displaying tables and graphs – visualizations implemented with Protovis and its visualization-zoo library for plotting graphs
Server side written in Python with the Django framework and the YUI 2 library
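To give the Django side a concrete shape, a hedged sketch of a view that hands (approximated) readings to the plots as JSON; the view name, query parameters, JsonResponse usage, and the model_cache wiring are assumptions for this example, not the deck's actual code:

```python
# Illustrative Django view: returns a sensor's approximated readings as
# JSON for the client-side Protovis plots.
from django.http import JsonResponse

# model_cache is an instance of the ModelCache sketched above, wired to
# the backend at startup (placeholder wiring for this example).
model_cache = ModelCache(backend=None)

def sensor_readings(request, sensor_id):
    """Serve approximate readings for the requested time window."""
    start = int(request.GET.get("start", 0))
    end = int(request.GET.get("end", 100))
    values = model_cache.approximate(sensor_id, start, end)
    return JsonResponse({"sensor": sensor_id, "values": values})
```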
Backend Data Model
NULLs are not stored in HBase → better for sparse data
Column families are stored in separate files
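To illustrate why this layout suits sparse readings, a sketch of writing a sensors-by-time table through the happybase client (the client choice, table name, and row keys are assumptions; the values come from the earlier figure):

```python
# Sketch of a sensors-by-time layout in HBase: row key = timestamp, one
# column per sensor. A missing reading is simply never written, and
# HBase stores nothing for absent cells, which is what makes the layout
# cheap for sparse data; each column family lives in its own files.
import happybase

conn = happybase.Connection("localhost")   # HBase Thrift gateway
table = conn.table("timecloud_readings")   # hypothetical table name

# At t1, s1 and s2 reported but s3 did not -> no cell is stored for s3.
table.put(b"t1", {b"raw:s1": b"5.9", b"raw:s2": b"5.8"})
table.put(b"t2", {b"raw:s1": b"6.1", b"raw:s2": b"36.2", b"raw:s3": b"6.0"})

row = table.row(b"t1")   # {b"raw:s1": b"5.9", b"raw:s2": b"5.8"}
```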
[Frontend screenshots: model-based approximated data; full precision; model-cached graph plotting; other graph plotting.]
Experiments
Performance Measure
Settings – testbed on a cluster of 13 Amazon EC2 servers, each with:
• 15 GB memory
• 8 EC2 compute units
• 1.7 TB storage
• 64-bit platform
– One of them: HBase Master + frontend
– 12 others: HBase region servers
Run – 1000 random reads over real sensor data stored in TimeCloud
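A hedged sketch of how such a random-read measurement could be driven; client.get and the key list stand in for whatever read path TimeCloud actually exercises:

```python
# Sketch of a random-read benchmark: sample keys, time n reads, report
# mean per-read latency. All names are illustrative assumptions.
import random
import time

def benchmark(client, keys, n=1000):
    sample = [random.choice(keys) for _ in range(n)]
    start = time.time()
    for key in sample:
        client.get(key)            # one random read per iteration
    elapsed = time.time() - start
    return elapsed / n             # mean latency per read, in seconds
```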
Query Processing Times
[Figure: query processing time results]
Network Usage

Graph #   KB transferred (original)   KB transferred (approximated)
1         112.3                       23.3
2         124.5                       28.0
3         126.6                       25.9
4         120.2                       25.1
5         119.9                       26.8
6         124.4                       27.7
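Across all six graphs, the model-based approximation transfers roughly a fifth of the original data (e.g., 23.3 KB versus 112.3 KB for graph 1), a 4–5× reduction in network usage.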
Conclusions
Introduced an advanced frontend for TimeCloud – simple, intuitive, and easy to use, yet going beyond just displaying data
Model-based processing – minimizes data transmission over the network and the backend workload
Future work – support for further models, design of additional visualizations
Thank you