DIGITAL FILTER IMPLEMENTATION IN HADOOP DATA MINING SYSTEM
Dariusz Czerwiński
Lublin University of Technology (POLAND)
Agenda
• Introduction
• Aim of the work
• Filter implementation
• Conclusions
Introduction
• Data mining is very important in the IT industry and in science
• Data mining applications in DSP:
– classification
– clustering
– segmentation
– mining and sequential analysis
– multi-dimensional visualization
Cause of research (1)
• Commercial NI DIAdem measurement data mining system
– superb post-processing capabilities
– limit of 2 billion data points
Cause of research (2)
• Quench measurement system for 2G HTS tape
Cause of research (3)
• Long-duration measurements
– 15 seconds: 6 million measured data points
– 12 minutes: above 2^31 (about 2 billion) data points, over 7 GB CSV output file
Aim of the work
• Implement a smoothing filter (Savitzky-Golay) in a Hadoop data mining system
• R environment? Options shown in the diagram:
– RevoScaleR → Amazon EC2
– RHadoop
– R Enterprise → Oracle DB
– Renjin (SaaS) → Google App, Amazon Beanstalk, Heroku
– Omegahat RAmazonS3 → Amazon S3
Testbed
• Host machine: PC with AMD Athlon X2 240 CPU (2 cores), 6 GB RAM, 250 GB SATA HDD; host system Windows Professional x64; VMware Player v.7.0 with VMware Tools installed
• Guest OS: Cloudera CDH 5.3.0.0 (CentOS 6.4 x64) with 4 GB RAM, 2 cores, 40 GB
RHadoop idea
• rhdfs - basic connectivity to the HDFS file system
• rhbase - basic connectivity to HBase
• plyrmr - common data manipulation operations, as found in popular packages such as plyr and reshape2, on very large data sets stored on Hadoop
• rmr2 - statistical analysis in R via Hadoop MapReduce functionality on a Hadoop cluster
• ravro - reading and writing Avro files from local and HDFS file systems
MapReduce idea
Source: http://developer.yahoo.com
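The map, shuffle and reduce steps can be imitated in plain R without any Hadoop installation. The sketch below is only an in-memory illustration: the names chunks, map_fn and reduce_fn are hypothetical, and the chunk size of 3 is an arbitrary assumption standing in for HDFS input splits.

```r
# Toy, in-memory imitation of the MapReduce flow (no Hadoop involved).
# All names here (chunks, map_fn, reduce_fn) are hypothetical.
samples <- c(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)

# "Input splits": cut the data into fixed-size chunks.
chunks <- split(samples, ceiling(seq_along(samples) / 3))

# Map: each chunk independently emits a (key, value) pair
# holding its partial sum and its number of samples.
map_fn <- function(chunk) {
  list(key = "sum_n", value = c(sum(chunk), length(chunk)))
}
mapped <- lapply(chunks, map_fn)

# Shuffle: group the emitted values by key.
keys <- vapply(mapped, function(kv) kv$key, character(1))
grouped <- split(lapply(mapped, function(kv) kv$value), keys)

# Reduce: combine all values that share a key into one result.
reduce_fn <- function(values) Reduce(`+`, values)
reduced <- lapply(grouped, reduce_fn)

# Total sum divided by total count gives the mean of all samples.
overall_mean <- reduced$sum_n[1] / reduced$sum_n[2]
```

Each chunk is processed without knowledge of the others, which is what lets the map stage run in parallel on a cluster.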
R environment setup
• R environment additions
• > install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "caTools"))
• R Hadoop connectors
• > install.packages("/home/cloudera/Downloads/rhdfs_1.0.8.tar.gz", repos = NULL, type = "source")
• > install.packages("/home/cloudera/Downloads/rmr2_3.3.0.tar.gz", repos = NULL, type = "source")
• > install.packages("/home/cloudera/Downloads/signal_0.7-4.tar.gz", repos = NULL, type = "source")
Filter implementation
• > library(rhdfs); library(rmr2); library(signal)
• > hdfs.init()
• > my.data = read.csv("/home/cloudera/filter/sample.csv")
• > I = my.data[, "I0"]
• > I.index = to.dfs(I)
• > sg = values(from.dfs(mapreduce(input = I.index, map = function(k, v) sgolayfilt(v))))
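What a Savitzky-Golay smoother computes inside the map function can be sketched in base R. This is an illustration under stated assumptions, not the code of the signal package: sg_coefficients and sg_smooth are hypothetical helper names, and the polynomial order 3 with window length 5 is assumed here because, as far as we know, these are the defaults of sgolayfilt. Unlike sgolayfilt, the sketch leaves the first and last two samples as NA instead of fitting the edges separately.

```r
# Savitzky-Golay smoothing weights from a least-squares polynomial fit.
# sg_coefficients() and sg_smooth() are hypothetical helper names.
sg_coefficients <- function(p = 3, n = 5) {
  stopifnot(n %% 2 == 1, n > p)
  half <- (n - 1) / 2
  # Design matrix: powers 0..p of the window offsets -half..half.
  A <- outer(-half:half, 0:p, `^`)
  # The first row of the pseudo-inverse holds the weights that give the
  # value of the fitted polynomial at the window centre.
  drop(solve(t(A) %*% A, t(A))[1, ])
}

sg_smooth <- function(x, p = 3, n = 5) {
  h <- sg_coefficients(p, n)
  # Centred weighted moving sum; edge samples are returned as NA.
  as.numeric(stats::filter(x, rev(h), sides = 2))
}

h <- sg_coefficients(3, 5)   # equals c(-3, 12, 17, 12, -3) / 35

# A cubic signal passes through an order-3 smoother unchanged.
cubic <- (1:20)^3
smoothed <- sg_smooth(cubic)
```

The weights depend only on the polynomial order and the window length, so they can be computed once and reused for every chunk of data.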
Experimental results
• Figure: Measured data of instantaneous current
• Figure: Filters comparison
Conclusions
• The filter implemented in the test environment gave very good results and allows big data handling
• Experimental results showed that it is possible to build a digital filter in a data mining system using the R programming language and environment
• Work is ongoing to implement the Savitzky-Golay digital filter using the reduce stage to introduce the filter equation and convolution coefficients, and to compare the results with the earlier ones
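One way to apply fixed convolution coefficients to data that is processed in chunks can be sketched in base R. This is not the ongoing implementation mentioned above, only an assumption-laden illustration: smooth_whole and smooth_chunked are hypothetical names, the 5-point weights are the standard cubic Savitzky-Golay smoothing weights, and the chunk size of 10 is arbitrary. Each chunk is extended by two samples on both sides so that its result matches filtering the whole signal at once.

```r
# Hypothetical sketch: chunked smoothing with overlapping windows.
h <- c(-3, 12, 17, 12, -3) / 35   # 5-point cubic smoothing weights
half <- (length(h) - 1) / 2

# Reference: filter the whole signal in one pass (edges become NA).
smooth_whole <- function(x) {
  as.numeric(stats::filter(x, rev(h), sides = 2))
}

smooth_chunked <- function(x, chunk_size) {
  starts <- seq(1, length(x), by = chunk_size)
  pieces <- lapply(starts, function(s) {
    e <- min(s + chunk_size - 1, length(x))
    # Extend the chunk by `half` samples on each side where available.
    lo <- max(1, s - half)
    hi <- min(length(x), e + half)
    y <- smooth_whole(x[lo:hi])
    # Keep only the samples that belong to this chunk.
    y[(s - lo + 1):(e - lo + 1)]
  })
  unlist(pieces)
}

set.seed(1)
signal_in <- sin(seq(0, 4 * pi, length.out = 50)) + rnorm(50, sd = 0.1)
whole   <- smooth_whole(signal_in)
chunked <- smooth_chunked(signal_in, chunk_size = 10)
```

Without the overlap, every chunk boundary would lose two samples on each side, so the overlap width has to follow the filter length.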
Contact
Dariusz Czerwiński
[email protected]
Institute of Computer Science
Lublin University of Technology
Nadbystrzycka 36B
20-618 Lublin
Poland
• Thank you for your attention!