big data.pdf - Google Drive

Overview
Big Data is the term for very large data sets that can be analyzed computationally to reveal patterns and trends. The IT industry is restructuring the way it maintains its databases to handle data at this scale. The data can be anything: email IDs, employee counts, client records, patients' blood groups, or a worldwide collection of driving license numbers. Put simply, Big Data is a set of techniques for managing large, scattered data sets and analyzing them. It is a fast-growing area of technology, and it is expected to create many jobs and opportunities to start a business. According to IBM, we create 2.5 quintillion bytes of data every day, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is Big Data.

Prerequisite
The workshop consists of an approximately equal mixture of lecture and hands-on lab sessions and runs for a minimum of one to two days. Participants should have at least a moderate knowledge of basic C programming. Recommendation: bring your own laptop on which you can install and run programs, so that you can do the optional hands-on experiments and exercises after the workshop sessions.

Introduction to Big Data
• What is Big Data
• Big Data opportunities
• Big Data challenges
• Characteristics of Big Data

Introduction to Hadoop
• Hadoop Distributed File System
• Industries using Hadoop
• Data Locality
• Hadoop Architecture
• Map Reduce & HDFS
• Using the Hadoop single node image (Clone)

The Hadoop Distributed File System (HDFS)
• HDFS Design & Concepts
• Blocks, Name nodes and Data nodes
• HDFS High-Availability and HDFS Federation
• Hadoop DFS: The Command-Line Interface
• Anatomy of File Read
• Anatomy of File Write
• Block Placement Policy and Modes
• More detailed explanation about Configuration files
• Metadata, FS image, Edit log, Secondary Name Node and Safe Mode
• How to add a new Data Node dynamically
• How to decommission a Data Node dynamically (without stopping the cluster)
• FSCK Utility (Block report)
• How to override default configuration at system level and programming level
• HDFS Federation
• ZooKeeper Leader Election Algorithm
• Exercise and small use case on HDFS
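
As a small warm-up for the HDFS exercise, the arithmetic behind blocks and replication can be sketched in a few lines of Python. This is an illustrative calculation only, not part of the workshop material: the function name `hdfs_block_usage` is hypothetical, and the 128 MB block size and replication factor of 3 are assumed to be the cluster defaults (they are the stock defaults in Hadoop 2.x and later, but any cluster may override them).

```python
import math

BLOCK_SIZE_MB = 128  # assumed default block size (dfs.blocksize)
REPLICATION = 3      # assumed default replication factor (dfs.replication)


def hdfs_block_usage(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                     replication=REPLICATION):
    """Estimate how a file of the given size is split into HDFS blocks.

    Hypothetical helper for illustration; it does not talk to a cluster.
    """
    if file_size_mb <= 0:
        raise ValueError("file_size_mb must be positive")
    # A file is cut into fixed-size blocks; the last one may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # The final block only occupies as much space as the data it holds.
    last_block_mb = file_size_mb - (blocks - 1) * block_size_mb
    return {
        "blocks": blocks,
        "last_block_mb": last_block_mb,
        # Every block is stored `replication` times across Data Nodes.
        "block_replicas": blocks * replication,
        "raw_storage_mb": file_size_mb * replication,
    }


if __name__ == "__main__":
    print(hdfs_block_usage(500))
```

Under these assumptions, a 500 MB file occupies four blocks (three full blocks and one of 116 MB), stored as twelve block replicas that take up 1500 MB of raw disk space across the cluster.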

Map Reduce
• Functional Programming Basics
• Map and Reduce Basics
• How Map Reduce Works
• Anatomy of a Map Reduce Job Run
• Legacy Architecture -> Job Submission, Job Initialization, Task Assignment, Task Execution, Progress and Status Updates
• Job Completion, Failures
• Shuffling and Sorting
• Splits, Record reader, Partition, Types of partitions & Combiner
• Optimization Techniques -> Speculative Execution, JVM Reuse and No. of Slots
• Types of Schedulers and Counters
• Comparisons between Old and New API at code and architecture level
• Getting the data from RDBMS into HDFS using custom data types
• Distributed Cache and Hadoop Streaming (Python, Ruby and R)
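
To give a flavour of the map, shuffle-and-sort, and reduce steps listed above, here is a minimal single-process word count in plain Python. It is a sketch of the idea only and does not use Hadoop: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real job would run the same logic in parallel across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle and sort: group the values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
    # prints {'big': 2, 'cluster': 1, 'data': 2, 'node': 1}
```

The three functions are kept separate on purpose: in a real Map Reduce job the map and reduce steps are the parts a programmer writes, while the framework performs the grouping in between.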

Introduction to R
• History of R
• An Insight into R
• Data Structure and Data Type

Data Management and Data Cleaning
• Missing Value Treatment
• Outlier Treatment
• Sorting Datasets
• Merging Datasets
• Creating new variables
• Binning variables
• Reading datasets from other environments into R (importing)
• Writing datasets from the R environment to other environments (exporting)

Data Visualization in R
• Bar Chart
• Dot Plot
• Scatter Plot (3D)
• Spinning Scatter Plots
• Pie Chart
• Histogram (3D), including colorful ones
• Overlapping Histograms
• Boxplot
• Plotting with Base and Lattice Graphics
• Plotting and Coloring
• Geo Charts
• Motion Charts
• Case Study with Data Management

Duration: The workshop runs for two consecutive days with an eight-hour session each day, sixteen hours in total, divided between theory and hands-on sessions.

Certification Policy:
• Certificate of Merit for all workshop participants from Engineer Indya.
• At the end of the workshop, a small competition will be organized among the participating students, and the winners will be awarded a 'Certificate of Excellence'.
• Certificate of Coordination for the coordinators of the campus workshops.

Eligibility: This is a basic-level workshop, so there are no prerequisites. Anyone interested can join.

Fee: Rs. 1200/- (inclusive of all taxes) per participant.