Proceedings of RK University’s First International Conference on Research & Entrepreneurship (Jan. 5th & Jan. 6th, 2016)

ISBN: 978-93-5254-061-7 (Proceedings available for download at rku.ac.in/icre)


A Survey on ETL Strategy for Unstructured Data in Data Warehouse Using Big Data Analytics

Hemali Saradava 1,*, Aakash Patel 2, and Rajanikanth Aluvalu 3

1,2 Research Scholar, School of Engineering, RK University, Rajkot-Bhavnagar Highway, Rajkot-360020, Gujarat, India.
3 Dept. of Computer Engineering, RK University, Rajkot-Bhavnagar Highway, Rajkot-360020, Gujarat, India.
* Corresponding author: Hemali Saradava ([email protected])

ABSTRACT

In today's world, digital data is of three types: structured, semi-structured and unstructured. Standard techniques and tools are available to handle structured data, yet about 90% of the data generated today is either semi-structured or unstructured. In the early days of data analysis, users converted unstructured data into structured data and then performed analysis; nowadays, tools and techniques are being developed to handle unstructured data directly. Unstructured data is closely associated with the term Big Data, which refers to very large data sets that are difficult to analyze with traditional tools. Because unstructured data is typically large, dirty and noisy, it requires more computing power. We propose to improve the ETL process for unstructured data in data warehousing using the map-reduce paradigm. The proposed method processes the data in parallel as small chunks on distributed clusters and aggregates the partial results to obtain the final processed data.

SUMMARY

Improving the ETL process for loading unstructured data into a data warehouse for Big Data Analytics.

Keywords: HDFS, Map-reduce, Pig Latin, Hadoop, Data Cleansing, Profiling


INTRODUCTION

Nowadays a bulk of data is generated through the use of the internet. That data needs to be managed properly so that it can be used further for research purposes. A main challenge of big data is that some of the data is of no interest, so it requires more filtering time. There is also data generated automatically by some systems, which is difficult to manage and record. We need to sort this data based on various requirements so that it can be reused for analysis.(1)

Digital data is of three types: structured, semi-structured and unstructured. Among these, unstructured data is data that does not follow any data model; it has little or no structure, and it is difficult to extract information from this type of data and store it in a data warehouse. Unstructured data is generated everywhere: online forms, Word documents, PowerPoint presentations, images, videos, company records and all social media generate unstructured data. Semi-structured data also has no data model but has some kind of structure; examples are emails, zipped files, HR records and XML data. Structured data is fully organized and can be used directly for business analytics.(2) Table 1 shows a comparative study of the three types of digital data.

Big data analytics provides the means for storing and processing large data sets and for working with distributed data sets with faster processing. A data warehouse is used for storing data for further analysis. The data in the warehouse is read-only, and updates or refreshes occur only on a periodic basis.(3) A data warehouse has four main characteristics: subject-oriented, integrated, non-volatile and time-variant. Subject-oriented means data is collected according to subjects. Integrated means that the data must be consistent. Time-variant says that generally historical data is taken into account for data warehouse applications, so they allow access to more detailed information as required. Non-volatile states that data warehouses are static.(4)

In daily usage, about 80-90% of the data is unstructured and cannot be processed directly, so we need to convert unstructured data to structured data for business analysis. To deal with data from various sources, the ETL process is used.(4) ETL is a three-stage process, namely Extract, Transform and Load, that allows integration and analysis of data stored in different sources.(5) As shown in Figure 1, the extraction phase collects data from multiple source databases; this process is used to build the data warehouse. Next is the transformation phase, in which the data is reformatted and cleansed to detect and rectify errors and to meet the information needs, since data from the various sources arrives in different formats. Finally, the loading phase sorts the data and loads it into the final target database.(6) Many ETL tools are available to manage a data warehouse, each with its own capabilities and advantages.

As data grows exponentially, ETL must scale, so we use map-reduce to perform distributed computing. The map-reduce framework has two phases: the map phase splits the data into small parts and processes all the parts in parallel; the reduce phase then shuffles and sorts the data according to the requirements. The output of this phase is stored in an HDFS file.
As shown in Figure 2, the map-reduce framework uses a JobTracker and TaskTrackers to complete this task. Map-reduce is used for searching, indexing and tokenization, and it is chosen for its scalability in a distributed environment.(7) HDFS is a Java-based file system used to store data and access it in a scalable and fault-tolerant manner. HDFS is the data management layer of Hadoop; it is a distributed file system used to store bulk data. HDFS uses a NameNode and DataNodes to store metadata and application data, respectively. Clients first contact the NameNode and are then directed to the appropriate DataNode to read the required content.(8)
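As a minimal sketch of the two phases described above, the Java classes below implement a simple word-frequency job in the Hadoop MapReduce API; the class names and the tokenization logic are illustrative assumptions, not taken from the cited works.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each input split is processed in parallel; every line is
// tokenized and emitted as (word, 1) key-value pairs.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken().toLowerCase());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: the framework shuffles and sorts by key, then each reducer
// aggregates the counts for one word and writes the result back to HDFS.
class TokenReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

The same pattern generalizes to other key-value transformations: the mapper decides what becomes the key, and the framework guarantees that all values for one key reach a single reducer.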

RELATED STUDY

Research shows that the volume of generated data is increasing day by day. About 90% of this data is unstructured and needs to be managed properly for the required needs. Older systems are not sufficient for such huge data, so map-reduce with Hadoop is used to process the data as key-value pairs. With regard to big data, the three main challenges faced are volume, velocity and variety. It is very difficult to perform operations directly on unstructured data, so the unstructured data is structured and processed using the map-reduce technique, and collaborative filtering is used to generate recommendations based on user preferences. A sentiment analysis technique is used to analyze the sentiments of a user based on text analysis. The resulting data set is structured in a particular order according to the user requirements.(3)

A data warehouse differs from an operational database in how it handles large amounts of data. To deal with this data we have the ETL process, supported by some framework. The ETL process maps data from different data sources and loads it into the data warehouse: briefly, it extracts data from various sources, transforms it, cleans it and then loads it into a data warehouse or data mart. Because of these difficulties and the lack of a formal model, a new ETL model with advanced features has been proposed that represents all activities and can be understood by data warehouse designers in any environment.(4)

The ETL process extracts the data from multiple sources, transforms it to fit the analytical needs, and loads it into a data warehouse for further analysis. Apache Hadoop has become the primary standard for managing big data. When the source data sets are large, fast and unstructured, traditional ETL can become the bottleneck, because it becomes too complex to develop, too expensive to operate and too slow to execute. According to one study, 80% of the development effort in a big data project goes into data integration, whereas only 20% goes towards data analysis.(7)

ETL is responsible for extracting the data, cleaning it and then loading it into the desired target. Building ETL processes is expensive in terms of time, money and effort. ETL tools extract data from several sources such as database tables, flat files, ERP systems, the internet, etc., and then apply complex transformations to it. Finally, the data is loaded into the target database; in the context of the data warehouse, these targets are either fact tables or dimension tables. Two types of ETL tools are available: on one side there are commercial (paid) tools such as DataStage and Informatica, while on the other side there are open-source ETL tools available free of cost.(10)


Hadoop, provided by the Apache Foundation, satisfies most of the goals of Big Data Analytics and supports HDFS (Hadoop Distributed File System) as its file system. Using the map-reduce paradigm it deals with large data sets. The components of Hadoop, such as HDFS, Pig and Hive, are all available from the Apache Foundation under an open-source license. Metadata is stored on a dedicated server termed the NameNode, and the application data is stored on other servers known as DataNodes. Besides the NameNode and DataNodes there are a CheckpointNode and a BackupNode, which are used while performing operations on files. HDFS can be accessed by a user application through the HDFS client; the supported operations are reading, writing, deleting and updating files. To perform these operations HDFS uses a pipeline that connects the participating nodes. For balancing the clusters, a balancer is used, which works with a predefined threshold input. HDFS stores data in blocks with unique IDs allocated by the NameNode, which also specifies the list of DataNodes that replicate each block. Blocks are placed so that bandwidth is utilized efficiently, and block scanning is performed by the block scanner. The chance of losing data is very low when HDFS is used, as it replicates each block of data three times and stores the copies on different nodes.(11)
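To make the client interaction concrete, here is a brief sketch of writing and reading a file through the HDFS FileSystem API; the NameNode address and file path are placeholders, not values from the paper.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        // The client first contacts the NameNode (via fs.defaultFS) to resolve
        // block locations; the data itself is streamed to/from DataNodes.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // placeholder address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/sample.txt"); // placeholder path

            // Write: a pipeline of DataNodes replicates each block (3 copies by default).
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the client is directed to a DataNode holding the block.
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }
        }
    }
}
```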

RESEARCH OBJECTIVE

Nowadays an ever larger amount of data is generated through the use of technology, and it is becoming complex to handle such volumes with an RDBMS. We therefore use a data warehouse for storing the data, and since most of it is unstructured, it is time consuming for a specific user to retrieve the required data from the data warehouse. The ETL process is used along with map-reduce and HDFS; using this approach we can process data in a parallel and fast manner for further business analysis.

PROPOSED MODEL

With the growing use of the internet and its services, a large amount of log data is generated, and it is very complex to handle that data, so we perform the ETL process on it. As shown in Figure 3, the extract phase performs data cleansing and profiling. The transform phase is then carried out with map-reduce, so that the data is partitioned and processed in parallel and quickly; a node failure does not affect the executing task. The result of the transform phase is stored in HDFS. For further processing, Pig Latin is used for querying the data, and the output is loaded into the data warehouse for business analysis.
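As a rough illustration of the transform step in this pipeline, the driver below configures a MapReduce job whose input is extracted raw data in HDFS and whose output, also in HDFS, would subsequently be queried with Pig Latin before loading into the warehouse. The mapper/reducer classes are the sketch classes from the introduction, and the paths are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TransformJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "log-transform");
        job.setJarByClass(TransformJobDriver.class);

        // Map phase partitions and processes the extracted records in parallel;
        // reduce phase aggregates them. TokenMapper/TokenReducer stand in for
        // real cleansing and restructuring logic.
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(TokenReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input: data produced by the extract phase; output: transformed records
        // kept in HDFS for Pig Latin queries and final loading into the warehouse.
        FileInputFormat.addInputPath(job, new Path("/etl/extracted"));     // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/etl/transformed")); // placeholder

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```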

CONCLUSION

The research work discussed focuses on the problems of handling large volumes of data for Big Data Analytics. The proposed model helps in overcoming the overhead of converting unstructured data into structured data. In the proposed model, the ETL process is performed with HDFS and MapReduce; MapReduce, being a parallel programming model, optimizes the performance of the ETL process. Pig Latin scripts are then run on the resulting data to perform the required Big Data Analytics. The proposed model addresses a pressing need of data-intensive organizations.


FIGURES

Figure 1. Three phases of the ETL process [7]

Figure 2. MapReduce procedure


Figure 3. Proposed System


TABLES

Table 1: Comparative study of digital data

Structured data
  Characteristics: Data is stored in the form of rows and columns; conforms to a data model; attributes in a group are the same.
  Sources: Databases, spreadsheets, SQL, OLTP systems, etc.
  Challenges faced: Limited storage; contains only homogeneous data.

Semi-structured data
  Characteristics: Does not conform to any data model but contains tags and elements (metadata); attributes in a group may not be the same; similar entities are grouped.
  Sources: E-mail, XML, zipped files, mark-up languages, etc.
  Challenges faced: Storage cost; limited tools available; no ready tool available for querying; data heterogeneity.

Unstructured data
  Characteristics: Not in any particular format or sequence; does not conform to any data model; not easily usable by a program; does not follow any rules or semantics.
  Sources: Web pages, PowerPoint presentations, videos, images, reports, surveys, etc.
  Challenges faced: Indexing and searching; security (varied sources of data); retrieving information; lack of technical expertise.


REFERENCES

[1] P. Saravana Kumar, M. Athigopal, S. Vetrivel, "Extract Transform and Load Strategy for Unstructured Data into Data Warehouse Using Map Reduce Paradigm and Big Data Analytics," IJIRCCE, December 2014.
[2] "Challenges and Opportunities with Big Data," community white paper developed by leading researchers across the United States.
[3] V. Subramaniyaswamy, V. Vijayakumar, R. Logesh, V. Indragandhi, "Unstructured Data Analysis on Big Data using Map Reduce," ScienceDirect.
[4] Shaker H. Ali El-Sappagh, Abdeltawab M. Ahmed Hendawi, Ali Hamed El Bastawissy, "A proposed model for data warehouse ETL processes," Journal of King Saud University, 2011.
[5] Sweety Patel (Department of Computer Science, Fairleigh Dickinson University, USA) and Mrudang D. Pandya (Ganpat University, Ganpat Vidyanagar, Mehsana, Gujarat), "How is Extraction important in ETL process?"
[6] Satkaur, Anuj Mehta, "A Review Paper on Scope of ETL in Retail Domain," International Journal of Advanced Research in Computer Science and Software Engineering.
[7] White paper, "Extract, Transform, and Load Big Data with Apache Hadoop," Big Data Analytics.
[8] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, "The Hadoop Distributed File System," Yahoo!, Sunnyvale, California, USA.
[9] Ramesh Nair and Andy Narayanan, "Benefiting from Big Data: Leveraging Unstructured Data Capabilities for Competitive Advantage."
[10] N. Nataraj, R.V. Nataraj, "Analysis of ETL Process in Data Warehouse," Bannari Amman Institute of Technology, Sathyamangalam.
[11] Web reference: http://www.dcs.bbk.ac.uk/~ptw/teaching/ssd/notes.html
