Implementation of Change Data Capture in ETL Process for Data Warehouse Using HDFS and Apache Spark

Denny∗, I Putu Medagia Atmaja†, Ari Saptawijaya‡, and Siti Aminah§
∗†‡§Faculty of Computer Science, Universitas Indonesia, Depok, Indonesia
Email: [email protected]

Abstract—This study aims to increase ETL process efficiency and reduce processing time by applying the Change Data Capture (CDC) method in a distributed system, using the Hadoop Distributed File System (HDFS) and Apache Spark, in the data warehouse of the Learning Analytics system of Universitas Indonesia. Usually, an increase in the number of records in the data source results in an increase in the ETL processing time for the data warehouse system. This condition is the result of an inefficient ETL process that uses the full load method. With the full load method, the ETL has to process the same number of records as there are in the data sources. The proposed ETL model design, applying the CDC method with HDFS and Apache Spark, can reduce the amount of data handled in the ETL process. Consequently, the process becomes more efficient and the ETL processing time is reduced by approximately 53% on average.

Index Terms—change data capture, data warehouse, distributed system, big data, extract transform load


[Figure 1 diagram: raw data from the SCELE log, the Apache server log, and SIAK is extracted, transformed, and loaded into the Data Warehouse, which feeds the Learning Analytics dashboard.]

Fig. 1. ETL process in Data Warehouse for Learning Analytics at Universitas Indonesia.

I. INTRODUCTION
The learning analytics systems at Universitas Indonesia use a data warehouse as a single repository to analyze learning activities in online learning environments. These systems employ the data warehouse to cluster learning activity patterns in learning management systems and to predict high-risk students. Data from various sources are processed before they are moved into the data warehouse, as shown in Figure 1. First, data from sources such as the academic system (known as SIAK at Universitas Indonesia - UI), the Moodle-based learning management systems (known as SCELE at UI), and authentication providers are integrated into a data warehouse system; at the moment, there are four instances of SCELE running at Universitas Indonesia. Then, the data warehouse integrates the data from these sources, which come in different formats, into a single view and a uniform format. Lastly, the data in the proper format are loaded into the data warehouse repository. These processes are commonly known as Extract, Transform, and Load (ETL). The ETL process involves extracting data from multiple, heterogeneous data sources, transforming and cleansing the data, and ultimately loading it into a data warehouse (Jorg and DeBloch, 2008).

Since the size of the data in our systems continues to grow, the ETL process is taking longer over time. This increasing usage of computing resources is caused by the full load approach implemented in our ETL processes. Using the full load method to transfer processed data into the data warehouse leads to longer ETL processing times, because the ETL has to process the same amount of data as is present in the source. The process becomes inefficient because all of the data in the warehouse has to be reloaded every time the ETL process runs, even if the data has been processed before; the same data thus repeatedly goes through the ETL process. The Change Data Capture (CDC) method can be used to deal with these problems. CDC can replace the inefficient full load method, especially when the ETL process is run periodically. Nevertheless, applying CDC to ETL processing can still increase the processing time, because several CDC approaches include steps that process as much data as the entire data source.

One technique that can reduce the processing time of the CDC method is to apply distributed processing. The use of Apache Spark for parallel processing and the Hadoop Distributed File System (HDFS) for distributed storage can enhance the system's capacity to process large amounts of data. Apache Spark can process large amounts of data using a relational scheme that can be manipulated to achieve maximum performance. This differs from MapReduce, which requires manual and declarative optimization to achieve maximum performance (Armbrust et al., 2015). In this study, Apache Hadoop was used to provide distributed storage so that parallel processing could be run. MapReduce programming was used for simple processes such as the transfer of data from the database to HDFS using Apache Sqoop, an add-on tool for Hadoop that transfers data from a data source into the Hadoop environment.

The main purpose of this research is to reduce the ETL processing time by using CDC with a distributed approach. The CDC technique can reduce the amount of data that has to be processed in the ETL and can restrict processing to the updated data, so the ETL process becomes more efficient. Meanwhile, Apache Spark and HDFS are used to run the CDC technique on a distributed system to increase its performance.

This paper is organized as follows. Sections II and III discuss ETL and CDC, respectively. Then, our design and implementation of distributed ETL are discussed in Section IV. The results of our experiments are discussed in Section V. The comparison between the existing model and the proposed model is discussed in Section VI.

II. ETL PROCESS IN DATA WAREHOUSE
The data from the operational systems come in different types and structures, so they cannot be used directly in the data warehouse. Thus, the data from the operational systems need to be processed prior to entry into the data warehouse. This processing aims to integrate and clean the data and to change it into a predetermined structure. The processing of operational data prior to its use in the data warehouse is known as Extract, Transform, Load (ETL). Extraction is the initial process in ETL, in which data from the source are transferred into the ETL environment for processing. Transformation is the task of changing the data structures into the predetermined format as well as improving data quality. Load refers to the transfer of the transformed data into the data warehouse or repository. Load is also known as delivering; according to Kimball and Caserta (2004), delivering is the process of transferring transformed data into a dimensional model accessed by the user. For a data warehouse structured with the star schema, the load process can be classified into two processes: the load process for fact tables and the one for dimension tables.

The ETL process as a backend needs to fulfill three important roles, viz., to deliver the data effectively to the data warehouse user, to enhance the value of the data during the cleaning and conforming phase, and to protect and document data lineage (Kimball and Ross, 2013). These three important roles of ETL in the data warehouse system take up most of the overall development time. Kimball and Ross (2013) stated that 70% of the time and energy is spent just transforming data from the source to the data warehouse.
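To make these three stages concrete, the following is a minimal sketch of an extract-transform-load pass in PySpark; the paths, column names, and output layout are illustrative assumptions rather than the schema actually used in the system.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw activity records staged on HDFS (hypothetical path and schema).
raw = spark.read.json("hdfs:///staging/scele_log/")

# Transform: clean the records and conform them to the target structure.
activity = (
    raw.filter(F.col("userid").isNotNull())
       .withColumn("event_date", F.to_date(F.from_unixtime(F.col("time"))))
       .select("userid", "course", "action", "event_date")
)

# Load (deliver): write the conformed data into the warehouse area,
# partitioned the way a fact table would typically be queried.
(activity.write
    .mode("append")
    .partitionBy("event_date")
    .parquet("hdfs:///warehouse/fact_activity/"))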

Fig. 2. CDC methods for immediate data extraction (adapted from Ponniah (2010)).

III. CHANGE DATA CAPTURE (CDC)
As previously mentioned, Change Data Capture (CDC) methods can be used to improve the ETL process. CDC has several definitions. One is that CDC is a method to integrate data based on the identification, capture, and delivery of only the changes made to data in the operational system (Tank et al., 2010). Another is that CDC is a technique to monitor the operational data source with a focus on data changes (Kimball and Caserta, 2004). Based on these two definitions, it can be concluded that CDC is a method to determine and detect changes in the data that occur during transactions in the operational system. In general, CDC can be used to support the ETL system. The goal is to reduce the amount of data processed by the ETL system. The ETL process can run more efficiently because it only processes data that have been changed. This also enables more frequent updates from the operational databases to the data warehouse.

A. CDC Methods
In general, applications of CDC can be categorized into immediate data extraction and deferred data extraction (Ponniah, 2010). Immediate data extraction allows extraction in real time when a change occurs in the data source. In the deferred data extraction approach, the extraction process is performed periodically (at specified intervals), so the data extracted are those that have changed since the last extraction.

There are three methods of CDC for immediate data extraction. Figure 2 depicts the three methods, which use the transaction log, database triggers, and capture in the source application, respectively. The transaction log method utilizes the log recorded by the RDBMS of a data source that is in the form of a database.

Fig. 3. CDC methods for deferred data extraction (adapted from Ponniah (2010)).

This method works by utilizing every Create, Update, Delete (CRUD) event that is recorded by the RDBMS in a log file. Systems that use this method look for data that have been changed or added by reading the contents of this log file. The second method utilizes database triggers. A database trigger is a procedural function, generally provided by the RDBMS, that takes an action when a CRUD operation has been performed on particular data. This function can be utilized to propagate updates when there are changes in the data. The third method is capture in the source application. This method relies on the application or system that owns the data source having the capacity to apply CDC itself. It is quite effective at reducing the load in the ETL process, especially during the extraction phase. However, it is limited to applications that can be modified, so it is not applicable to proprietary applications.

CDC for deferred data extraction is generally classified into two types of methods, shown in Figure 3. One examines the data based on timestamps, and the other compares files. The first method utilizes the time column that most source data have. In several cases, the time column provides more detailed information, such as when the data were entered and changed, and this information can easily be utilized to detect data changes. Unlike the first method, the second method uses a more flexible technique because it does not depend on the attributes of the data source. It compares data from previous extractions with the current data, attribute by attribute, in order to detect changed data.
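As a concrete illustration of the timestamp-based deferred approach, the sketch below uses Spark's JDBC source to read only the rows changed since the last extraction. The connection URL, table name, and time_modified column are assumptions made for illustration; the actual source schema may differ, and the MySQL JDBC driver is assumed to be available on the Spark classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("timestamp-cdc").getOrCreate()

# Timestamp of the previous extraction, e.g. loaded from a metadata table or file.
last_extraction = "2017-09-30 00:00:00"

# Push the filter down to the source database so that only changed rows are transferred.
changed = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://dbhost:3306/scele")
    .option("dbtable",
            "(SELECT * FROM mdl_log WHERE time_modified > '{}') AS delta".format(last_extraction))
    .option("user", "etl_user")
    .option("password", "etl_password")
    .load()
)

changed.write.mode("append").parquet("hdfs:///staging/mdl_log_delta/")

Pushing the timestamp predicate into the source query keeps the amount of transferred data proportional to the number of changes rather than to the size of the table.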

B. CDC Limitations
The use of a CDC method is similar to the extraction process in that it needs to be adjusted to the characteristics of the data source. As a consequence, the same CDC method may be applied in different ways. Several of the CDC methods mentioned above cannot be applied to all types of data. Thus, each method has limits that influence the CDC process.

One of the limitations in applying a CDC method is its dependence on the data source system. Several CDC methods require configuration on the side of the operational system; these include the use of timestamps, triggers, and transaction logs. CDC using the timestamp method can only be applied if the data carry a time record for every change that occurs. If not, the data structure has to be changed to add a timestamp column, which is one of the weaknesses of this method (Mekterovic and Brkic, 2015). As with timestamps, trigger-based CDC requires access to the data source system to create a function that can detect changes in the data. In principle, this method relies on the trigger facility that most RDBMSs provide. Aside from the drawback of having to create such a function, this method is also limited in the choice of data, as the data have to reside in a database managed by an RDBMS. The other method utilizes the RDBMS transaction log, using the database log to record each change to the database. The drawback of this method is that the RDBMS has to be monitored to ensure that it keeps logging every transaction, in order to prevent the loss of transactions (Mekterovic and Brkic, 2015).

To overcome the limitations of these methods, changes can be detected by comparing data, an approach commonly referred to as the snapshot differential. This process initially accesses the data with a full extract of its entire state, so it requires a large amount of resources and affects the performance and running time of the CDC process. As the amount of data increases, more time and computation are required to conduct the CDC process.

Other approaches are the delta view and parallel processing. The delta view is applied to data sources that are in the form of a database. The principle of this approach is to create a view that holds the keys of the records involved in the ETL process. The delta view stores the keys of the updated, deleted, and inserted records, which are then used as change information in the ETL process (Mekterovic and Brkic, 2015). Even though this approach still requires access to the data source, it does not change the data structure, so it is easier to apply. On the other hand, parallel processing can overcome the problem of processing large amounts of data in the CDC method. The principle is to reduce the load on the system by splitting the process across several resources for simultaneous processing, so that the CDC process can be performed in less time.

C. CDC in Distributed Systems
A distributed system is a group of subsystem components connected in a network that can communicate with each other. In a distributed system, hardware and software components communicate with each other and coordinate their processes through message passing (Coulouris et al., 2012). The main goal in creating a distributed system is to divide resources to enhance system performance.

In practice, there are two measures that can be taken in a distributed system: the use of a distributed file system to increase storage capacity, and parallel processing to increase throughput. Parallel processing can increase the performance of the CDC process by dividing a large task into smaller tasks that are executed simultaneously on different computers (nodes) in a cluster. Each node can run its part without waiting for the completion of processes on other nodes, which reduces the time needed to complete the overall task. In addition to saving time, parallel processing also saves resources, as no single large task is performed on one computer.

There are two ways to implement parallel processing in the CDC method: the MapReduce programming paradigm with Apache Hadoop, and Spark SQL from Apache Spark. MapReduce is a programming paradigm that performs parallel computation using two main functions: map and reduce. In essence, the principle of MapReduce is the same as that of parallel computation in general. It begins by dividing the input data into several parts, processing each part, and finally combining the results into one final result (Coulouris et al., 2012). Figure 4 illustrates this process: the map function takes key-value pairs as input to be processed, and the reduce function receives the map results and processes them to produce the output. The CDC method can be implemented with MapReduce by adopting the divide-and-conquer principle, similar to the study by Bala et al. (2016). The data are divided into several parts, each processed separately; each processed part then enters the reduce phase, which detects the changes in the data.

As an alternative to MapReduce, parallel processing in the CDC method can be implemented with Spark SQL. Spark SQL is a module of Apache Spark that integrates relational processing with the Apache Spark API (Armbrust et al., 2015). Spark SQL can run queries in much the same way as data processing in a database. It can be used to run the CDC method with common operations such as JOIN, FILTER, and OUTER JOIN, so CDC processing with Spark SQL is easier to implement than with MapReduce.

The use of MapReduce and Spark SQL for parallel processing cannot be done without distributed storage. Each process on a node needs to be able to access the data being processed, so the data have to be available and accessible on every node. Distributed storage keeps the data by replicating it across several nodes in a cluster, so the data can be accessed from any node. A commonly used platform for distributed storage is the Hadoop Distributed File System (HDFS). A CDC method using parallel processing can greatly reduce the data processing time needed to detect changes, but it requires considerable configuration and preparation before it is ready to be used. In this study, the process was implemented on a distributed system infrastructure (HDFS and Apache Spark), using the Spark SQL library from Apache Spark and the MapReduce programming paradigm from Apache Hadoop.
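A minimal sketch of this divide-and-conquer formulation, written in the style of a Hadoop Streaming mapper and reducer in Python, is shown below. The record layout (a snapshot tag, a key, and the remaining attributes separated by tabs) is an assumption made for illustration, not the format used in this study.

#!/usr/bin/env python3
# Snapshot comparison expressed as map and reduce steps.
# Input lines: <tag>TAB<key>TAB<attributes>, where tag is 'old' (previous
# snapshot) or 'new' (current extract).
import sys

def mapper():
    # Re-key each record by its business key so that both versions of a record
    # end up in the same reduce group.
    for line in sys.stdin:
        tag, key, attrs = line.rstrip("\n").split("\t", 2)
        print("{}\t{}\t{}".format(key, tag, attrs))

def reducer():
    # Input arrives sorted by key; compare the old and new version of each record.
    def emit(key, versions):
        if "old" not in versions:
            print("INSERT\t{}\t{}".format(key, versions["new"]))
        elif "new" not in versions:
            print("DELETE\t{}".format(key))
        elif versions["old"] != versions["new"]:
            print("UPDATE\t{}\t{}".format(key, versions["new"]))

    current_key, versions = None, {}
    for line in sys.stdin:
        key, tag, attrs = line.rstrip("\n").split("\t", 2)
        if key != current_key:
            if current_key is not None:
                emit(current_key, versions)
            current_key, versions = key, {}
        versions[tag] = attrs
    if current_key is not None:
        emit(current_key, versions)

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

The same logic can be checked locally with a pipeline such as cat old.tsv new.tsv | python3 cdc_mr.py map | sort | python3 cdc_mr.py reduce, and the two commands can be handed to Hadoop Streaming as the mapper and reducer.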

[Figure 4 diagram: the input data is split among map tasks, whose intermediary results are combined by reduce tasks into the final result.]

Fig. 4. MapReduce framework (Coulouris et al., 2012).

Fig. 5. The previous ETL process (above) and proposed incremental ETL process using distributed system (below).

IV. DESIGN AND IMPLEMENTATION
The proposed ETL process uses an incremental extraction method and only processes changed data. As shown in Figure 5, the current ETL process performs a full extraction from the databases and then performs the transformation and loading on the whole extracted data; the whole ETL process is performed using Pentaho Data Integration (PDI). Meanwhile, the proposed ETL process extracts new and changed data using the MapReduce framework and Apache Sqoop, while the transformation and loading are still performed using PDI. This section elaborates our big data cluster environment and the implementation of CDC.

A. Server Configuration
The distributed system implemented was peer-to-peer. Peer-to-peer is an architectural design in which each process has the same role, and nodes interact without differentiating between client and server or between the computers where an activity runs (Coulouris et al., 2012). The main reason for using this design is to maximize resources, because with this architecture all nodes undertake the processing simultaneously. Figure 6 displays the hostname and IP address of each server.
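As a concrete illustration of the extraction step mentioned at the beginning of this section, a Sqoop import of one source table into HDFS might look like the following; the connection string, credentials, table name, and target directory are placeholders rather than the project's actual configuration.

sqoop import \
  --connect jdbc:mysql://dbhost:3306/scele \
  --username etl_user \
  --password-file /user/etl/db.password \
  --table mdl_log \
  --target-dir /staging/mdl_log \
  --num-mappers 4

Sqoop turns this command into a MapReduce job, so the table is pulled in parallel by the configured number of mappers and written directly into HDFS.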

Fig. 6. Configuration of distributed system.

Fig. 7. Spark Architecture (Apache Spark, 2013).

B. Apache Hadoop Configuration
Hadoop is a framework that provides facilities for the distributed storage and processing of large amounts of data using the MapReduce programming paradigm. One of the important characteristics of Hadoop is that data partitioning and computation are carried out on several hosts, and Hadoop can run applications in parallel (Shvachko et al., 2010). Hadoop consists of two main components, viz., the Hadoop Distributed File System (HDFS), which provides distributed storage, and MapReduce, which supports parallel computation. In this study, the most actively used component was HDFS, because the parallel processing used Apache Spark and Sqoop.

HDFS consists of two parts, the namenode and the datanodes. The role of the namenode is to maintain the tree structure of the file system and its metadata (White, 2015). In addition, the namenode maintains the block locations of each file allocated on the datanodes. Unlike the namenode, a datanode is a block storage site managed by the namenode. The datanodes store the data and report to the namenode on the blocks where the data are stored. Thus, the data stored in the datanodes can only be accessed through the namenode. The namenode is generally located on a separate server from the datanodes. However, in this study, the cluster used a peer-to-peer approach in order to maximize the available resources, so all servers were configured as datanodes and one of them also as the namenode, as shown in Figure 6.

C. Apache Spark Configuration
Spark is a platform used for processing conducted in a cluster. When processing large amounts of data, speed becomes a priority, so Spark was designed for rapid data processing. One of Spark's features is its capacity for in-memory data processing (Karau et al., 2015). Aside from that, Spark can be used with several programming languages such as Java, Scala, Python, and R. In this study, the language and library used was Python version 3.0. Spark's architecture is generally classified into three parts: the Spark core, the cluster manager, and supporting components (MLlib, Spark SQL, Spark Streaming, and graph processing), as illustrated in Figure 7. The Spark core is the part of Spark with basic functions, such as task scheduling and memory management.

In Apache Spark there are two server roles that have to be configured: master and worker, as shown in Figure 6. The master manages the processes and allocates resources to the worker servers, while the workers execute the processes. In this experiment, every server acts as a worker, and one server serves as both the master and a worker. As a consequence, all resources in the experiment are used and every server is at the same level. A cluster manager is needed by Spark to maximize flexibility in cluster management. Spark can work with a third-party cluster manager such as YARN or Mesos, and it also has its own cluster manager, the standalone scheduler. In this study, Spark was configured with the standalone cluster manager.
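As a small illustration, a PySpark session attached to this standalone cluster could be created as follows; the master hostname is a placeholder, and 7077 is simply the default port of the standalone master.

from pyspark.sql import SparkSession

# Attach the application to the standalone cluster manager (hypothetical hostname).
spark = (SparkSession.builder
         .appName("cdc-etl")
         .master("spark://master-host:7077")
         .getOrCreate())

# The master URL in use can be checked from the underlying SparkContext.
print(spark.sparkContext.master)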

D. Implementation
In this study, the method used was to compare data and find the differences between two datasets. The CDC method using the snapshot difference was divided into several stages, as illustrated in Figure 8. The first stage was to take the data from the source. The data were taken by full load, which takes the entire data from its initial to its final state. The result of the extraction was entered into HDFS. The data were then processed by the program created to run the CDC process using the snapshot difference technique. The snapshot difference was implemented using the outer-join function from Spark SQL. The program works by looking at the null values in the outer-join result of the two records, which serve as the indicator of the newest data. If damage to the data source was detected during the process, the program could automatically change the reference data used for comparison and store the old data as a snapshot.
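A minimal sketch of this snapshot-difference step with Spark SQL is shown below. The key column (id) and the compared attributes are illustrative assumptions, and the snapshot-repair behaviour described above is omitted; the sketch only shows how the null sides of the full outer join mark inserted, deleted, and updated records.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("snapshot-diff").getOrCreate()

# Previous snapshot and current full extract, both kept in HDFS (hypothetical paths).
old = spark.read.parquet("hdfs:///snapshots/mdl_log_previous/")
new = spark.read.parquet("hdfs:///staging/mdl_log_current/")

joined = old.alias("o").join(new.alias("n"), F.col("o.id") == F.col("n.id"), "full_outer")

# A null key on one side of the outer join marks an inserted or a deleted record.
inserted = joined.filter(F.col("o.id").isNull()).select("n.*")
deleted = joined.filter(F.col("n.id").isNull()).select("o.*")

# Rows present on both sides are updates when any compared attribute differs.
updated = joined.filter(
    F.col("o.id").isNotNull() & F.col("n.id").isNotNull()
    & ((F.col("o.action") != F.col("n.action")) | (F.col("o.time") != F.col("n.time")))
).select("n.*")

# Only the detected changes continue to the transformation and load stages;
# deletions could be propagated to the warehouse separately.
inserted.union(updated).write.mode("overwrite").parquet("hdfs:///staging/mdl_log_changed/")

# The current extract becomes the reference snapshot for the next run.
new.write.mode("overwrite").parquet("hdfs:///snapshots/mdl_log_previous/")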

Fig. 8. Our CDC method with snapshot difference approach.

Fig. 9. Running time of the ETL process with snapshot difference in the experiment with the source database increased by 5,000 records (above) and daily (below). [Plots of extract_time, transform_time, and load_time in seconds against the number of records and the day number.]

Fig. 10. The number of records extracted compared to the ones processed in the loading stage with snapshot differential. [Plot of data_source and data_processed record counts against the trial number.]

Fig. 11. Comparison of the running time of the existing ETL method and the CDC-based ETL method. [Plot of TIME_WITH_EXISTING_MODEL and TIME_SNAPSHOT in seconds against the day number.]

V. EVALUATION AND ANALYSIS
The ETL model using the CDC method was tested in the same way as the existing ETL method. Testing was done using a script with data from the SCELE log database and the Apache web server log files, which consist of 1,430,000 rows of data. The script simulated additions to the database server by iteration as well as by day, just as in the testing of the existing ETL model.

A. Running Time of ETL Process
Based on the testing, the extraction, transformation, and load processes in the first and second experiments did not incur a significant increase in running time. This is because less data was processed than without CDC. Figure 9 shows the graphs of the first and second trials. In the first trial, the graph shows relatively constant extraction, transformation, and load times due to a steady amount of data increase. This differs from the second trial, which shows more fluctuation, albeit not to a significant extent, due to an irregular amount of data increase. The test results of the CDC method using the snapshot difference were similar to the testing using the first approach. Figure 10 shows that the extraction process also tends to increase along with the increase in the amount of data, while the processing time for transformation and load stays relatively constant.

B. Evaluation on the Number of Records Processed in the Loading Stage
Unlike the testing of the existing ETL model, the testing of the ETL model with the CDC method showed neither an increase nor a reduction in the amount of data processed, even though the number of records in the data source increased. The graphs in Figures 9 and 10 display the comparison between the number of records in the data source and the number processed during the transformation phase for the two CDC method approaches in the first and second trials. The graphs demonstrate that, in the two experiments, the growth of the data source did not influence the amount of data processed.

VI. COMPARISON OF EXISTING ETL METHOD AND CDC-BASED ETL METHOD
Our experiments show that the ETL process proposed in this study improves the running time significantly compared to the previous ETL process. This can be seen in the graph in Figure 11, which shows that the growth of the running time of the previous ETL process is much higher than that of the CDC-based ETL method. When the amount of data was considerably small, the running time of the previous ETL process was faster than that of the CDC-based ETL process. Nevertheless, once the amount of data from the source increased, the growth rate of the running time of the previous ETL process was much higher.

When the data from the sources grew to 1,430,000 rows, the current ETL process took approximately 457 seconds, while the CDC-based ETL process required only approximately 133 seconds. The proposed method thus reduces the running time by 324 seconds. Overall, the ETL processing time was reduced by approximately 53% on average. This difference would be even larger with the larger datasets we expect in the future.

VII. CONCLUSIONS
Applying the CDC method with HDFS and Apache Spark to the ETL model reduces the growth of the running time of ETL processing in the Learning Analytics data warehouse at Universitas Indonesia. The cluster configuration with HDFS and Apache Spark, together with the ETL model design in this study, reduces the total ETL processing time and the number of records processed in the transformation and load stages. This makes the ETL process more efficient in terms of data processing, because it only processes new and changed data. To handle the computation required to detect changes in large datasets, distributed storage and computation servers were used in the proposed method. Moreover, this approach does not require changes to the existing systems, such as implementing database triggers. Furthermore, the data stored in the distributed file system and the detected change data used by the proposed ETL method can be utilized as backup data and as an audit trail, respectively.

However, the proposed ETL model requires a more complicated implementation, especially for the CDC method, because the chosen CDC approach has to be adjusted to the structure of the data to be processed and the conditions of the operational system. If the approach does not fit, the ETL process will not run efficiently. This research continues with acquiring data from more applications, such as web access logs, and the frequency of data updates will be increased towards real-time access.

REFERENCES
Apache Spark (2013). Spark overview. https://spark.apache.org/docs/latest/.
Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., Meng, X., Kaftan, T., Franklin, M. J., Ghodsi, A., and Zaharia, M. (2015). Spark SQL: Relational data processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 1383–1394, New York, NY, USA. ACM.
Bala, M., Boussaid, O., and Alimazighi, Z. (2016). BigETL: Extracting-transforming-loading approach for big data. Int'l Conf. Par. and Dist. Proc. Tech. and Appl., 8(4):50–69.
Coulouris, G., Dollimore, J., and Kindberg, T. (2012). Distributed Systems: Concepts and Design, Fifth Edition. Addison-Wesley.
Jorg, T. and DeBloch, S. (2008). Towards generating ETL processes for incremental loading. In IDEAS '08: Proceedings of the 2008 International Symposium on Database Engineering & Applications, pages 101–110.

Karau, H., Konwinski, A., Wendell, P., and Zaharia, M. (2015). Learning Spark. O'Reilly Media, Inc.
Kimball, R. and Caserta, J. (2004). The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Wiley Publishing, Inc.
Kimball, R. and Ross, M. (2013). The Data Warehouse Toolkit, 3rd Edition: The Definitive Guide to Dimensional Modeling. John Wiley & Sons, Inc.
Mekterovic, I. and Brkic, L. (2015). Delta view generation for incremental loading of large dimensions in a data warehouse. In 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pages 1417–1422.
Ponniah, P. (2010). Data Warehousing Fundamentals for IT Professionals, Second Edition. John Wiley & Sons, Inc.
Shvachko, K., Kuang, H., Radia, S., and Chansler, R. (2010). The Hadoop Distributed File System. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pages 1–10.
Tank, D. M., Ganatra, A., Kosta, Y. P., and Bhensdadia, C. K. (2010). Speeding ETL processing in data warehouses using high-performance joins for changed data capture (CDC). Pages 365–368.
White, T. (2015). Hadoop: The Definitive Guide, Fourth Edition. O'Reilly Media, Inc.
