INTEGRATION OF HETEROGENEOUS DATABASES IN ACADEMIC ENVIRONMENT USING OPEN SOURCE ETL TOOLS

Azwa A. Aziz, Abdul Hafiz Abdul Wahid, Nazirah Abd. Hamid, Azilawati Rozaimee
Fakulti Informatik, Universiti Sultan Zainal Abidin (UniSZA), 21300 Kuala Terengganu, Terengganu, Malaysia
{azwaaziz,nazirah,azila}@unisza.edu.my
[email protected]
ABSTRACT

A data warehouse (DW) can be considered the most significant tool for strategic decision making. In the academic environment, however, the components of a data warehouse still have not been fully utilized. The aim of this paper is to design and implement ETL processes for the integration of heterogeneous databases in an academic environment using Talend Open Studio (TOS). Using TOS to design and semi-automatically implement ETL tasks in Java enables fast adaptation to new data sources. We have managed to integrate various heterogeneous academic data samples and populate them into one central repository using TOS. It is important to establish that the capabilities of open source ETL tools are equal to those of commercial products, as this will help in implementing DW projects at lower cost.
KEYWORDS

Extraction, transformation, and loading (ETL); Business Intelligence (BI); Data Warehouse (DW); Heterogeneous DBMS; Talend Open Studio (TOS)
1 INTRODUCTION
A data warehouse (DW) can be considered the most significant tool for strategic decision making in business. A well-developed DW can dramatically improve an organization's decision-making capabilities. In the early years, the cost of developing a DW was very high. Lately, however, the costs of developing and maintaining a DW have fallen significantly, so it has progressively become a practical repository of information supporting managerial decision making [1], [2], [3], [4].

The academic environment, however, has only recently shown interest in integrating a DW into decision-making processes. Academic institutions are still exploring the possibilities and benefits of data warehousing; therefore, in the development of decision support systems, the components of a data warehouse have not yet been fully utilized [5]. The factors affecting the optimal management of an institution, especially in decision making, are the same factors involved in business processes, so the management of an academic institution can be considered as critical as the management of a large business company [6].

To achieve optimal management of the institution, a data warehouse can be integrated with Business Intelligence (BI). The goal is not only to utilize BI but to do so effectively. BI refers to a variety of software applications that can be used to analyze an organization's raw data. It comprises several related activities, including data mining, online analytical processing, and querying and reporting [7]. The main goal of BI is to produce correct and accurate information for effective decisions. BI gives users the ability to transform data into usable information, turning apparently useless data into valuable information.
The aim of this paper is to explicate the design and implementation of extraction, transformation, and loading (ETL) processes, during the initial design and deployment stage, through the integration of heterogeneous databases in an academic environment using open source tools. In the academic environment, ETL is a valuable process because institutional information comes from many dissimilar sources, such as academic systems, co-curriculum systems, hostel systems and more. In this research, ETL tools are used to extract data from the different sources, then clean the data and make it uniform for the transformation process. The output of the transformation process is loaded into a data mart. Merging the data into the data mart gives decision makers the power to look through data from different locations and increases their ability to filter it [8].

The paper is organized as follows. Section 2 describes related research on the components of a data warehouse. Section 3 explains the design of the proposed system, and the experimental design is described in Section 4. We present the result analysis and discussion in Section 5, and we conclude in Section 6.

2 LITERATURE REVIEW
Much research has been done to discuss and explain the practices, tools and standards involved in ETL, DW, BI and related technologies, and many strategies can be applied in the deployment of a DW. In [9], the authors proposed a framework for the design, development and deployment of a DW. The framework combined a meta-model with an ontology, and its main outcome was improved interoperability in ETL processes.

Reed et al. [10] proposed a robust yet economical means of combining different databases into a single data repository using Pentaho tools for ETL. This repository could be used to view and examine domestic violence victim and offender data across organizations and to provide reports. Using Pentaho, the authors also intended to remove data conflicts and to generate a demographic profile throughout the criminal justice system. The final data mart represented integrated and reliable information from the different data sources.

Dell'Aquila et al. [6] explained the practices in designing and modeling an academic DW. The objectives of this academic DW were to provide a dedicated structure of analysis and reporting for administrative units, such as departments serving the students, and to supply real-time data to external agencies. The outcome was a DW that provided a centralized source of information accessible across diverse academic units.

Piedade and Santos [11] discussed the concepts, practices and architectures of a Student Relationship Management (SRM) system. The main objective was to provide a technical tool to support higher education institutions in gaining the knowledge vital to the decision-making process. To validate the proposed concepts and activities, they adopted a research methodology involving a set of interviews. The results involved two stages.
The first stage verified that no suitable technology existed to support the SRM concepts and practices. The second stage proved that the proposed framework permitted the definition of the SRM system's architecture and its main functionalities.

Sahay and Mehta [12] developed a system to support higher education institutions in evaluating and predicting critical matters related to student success. The objective was to use data mining techniques for classification, categorization, estimation and visualization, and to use predictive models to determine a prioritized list of the factors critical to student issues.

Thomsen and Pedersen conducted two investigations of ETL tools, in 2005 and 2008 [13], [14]. They found many existing open source ETL tools, such as OpenSrcETL, OpenETL, CloverETL, KETL, Kettle, Octopus and Talend. Most of the tools could meet the fundamental requirements of data processing, such as extracting data from heterogeneous data sources and loading the data into a ROLAP or MOLAP system. The authors stated that most of these open source ETL tools were not very powerful, with the exception of Talend Open Studio (TOS). Talend operates on an open source model, where services and ancillary features are offered on a subscription basis. From an affordability standpoint, Talend opens up the transformation and integration marketplace to all customers, regardless of size and data integration needs [15].

3 CONCEPTUAL FRAMEWORK
Our proposed framework is based on previous work discussing the concepts and practices of DW and ETL.
In a DW, it is common practice to separate the back room from the front room. The back room holds and handles the data, while the front room allows the data to be accessed. In an academic environment, the back room can be regarded as the data management or data preparation section for the related material. In the existing applications, several Database Management Systems (DBMS) were used to support transaction systems, including MySQL, Informix, Oracle and Microsoft Access. Integrating these heterogeneous DBMS is a complex task, especially using 3GL languages, and forcing system developers to use a single DBMS for all applications is not an option, since each DBMS was chosen for a specific purpose. Thus, ETL plays a vital role in every integration project, speeding up development and helping to achieve good results. For the ETL tasks, an open source application known as Talend Open Studio (TOS) was used to build jobs that transferred data from multiple external heterogeneous data sources, then transformed, cleaned, and loaded the data into the application's repository [16].

In this environment, the front room enables a user or client application to access the data held in the warehouse. The key task of the front room is to map the heterogeneous low-level data stored in the DW to other forms [17]. The front room manages the queries issued from outside and then schedules and plans them to deliver the results with acceptable performance; this can be referred to as Business Intelligence. Golfarelli et al. [18] described Business Intelligence as the process of turning data into information and then into knowledge. The front room may offer data mining, text mining or classical statistical methods that can be
performed on Data Marts (DM) and multidimensional cubes.
Figure 1: The main components of the proposed framework.
The proposed ETL framework consists of several main features. One of these is the conversion of given file or database formats, which must be fitted to the structure required by the loading processes that store the data in the repository. Figure 1 shows the ETL process embedded in the environment, excluding the front room. In the data source stage, data comes from multiple sources, such as student personal details and academic records. This information arrives in different formats, including simple flat files, more complex XML files, and databases such as Microsoft Excel and Access, MySQL and IBM Informix. The loading section makes use of Java code, creating ETL jobs with the TOS tool to provide a simple means of importing data from the particular columns identified during data preparation. These jobs can later be reused in the ETL module to read, customize the alteration of, and store the data. Data conversion consumes about 70-80% of the time used to build a DW [19], so the conversion and transformation steps, including the Java classes created by TOS, are the first software components to be considered and planned, enabling early user interaction in establishing the warehouse.
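To make this back-room flow concrete, the following is a minimal, hand-written Java sketch of the extract-transform-load cycle that a TOS-generated job automates. It is an illustration, not actual TOS output; the JDBC URLs, credentials, and the trivial cleaning step are placeholder assumptions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SimpleEtlJob {
        public static void main(String[] args) throws Exception {
            // Extract from a source transaction system and load into the warehouse
            // (placeholder URLs and credentials).
            try (Connection src = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/rps", "user", "password");
                 Connection tgt = DriverManager.getConnection(
                     "jdbc:oracle:thin:@localhost:1521:dw", "user", "password")) {

                try (Statement st = src.createStatement();
                     ResultSet rs = st.executeQuery("SELECT matricNo, name, gender FROM r1");
                     PreparedStatement load = tgt.prepareStatement(
                         "INSERT INTO dimPro (matricNo, name, gender) VALUES (?, ?, ?)")) {

                    while (rs.next()) {
                        // Transform: a trivial cleaning step; real jobs normalize
                        // codes, dates, and the like.
                        String gender = rs.getString("gender").trim().toUpperCase();

                        // Load: write the cleaned row into the warehouse dimension table.
                        load.setString(1, rs.getString("matricNo"));
                        load.setString(2, rs.getString("name"));
                        load.setString(3, gender);
                        load.executeUpdate();
                    }
                }
            }
        }
    }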
4 SIMULATION DESIGN

The purpose of this simulation is to show that TOS is as capable of performing ETL jobs as any commercial ETL tool. To test the proposed framework, we use dummy simulation data from two different DBMS, Oracle and MySQL, to exercise the ETL process in an academic environment. In addition, a text file in Microsoft Excel format has been added as a source to be integrated with both DBMS. Oracle was chosen as the target database because most enterprise companies use Oracle in their enterprise applications.

In this simulation, we have designed a multidimensional schema consisting of fact and dimension tables. The aim of the design is to support analysis of student results based on ongoing assessments, together with demographic analysis, using BI tools. A fact table containing student results, specifically for programming subjects, is created in the target database and is known as fctStu. Two dimension tables are connected to fctStu: dimPro and dimAsses. dimPro consists of student personal information, such as name and gender, as well as geographic information, while dimAsses contains the detailed assessment results of particular subjects. Figure 2 shows the Entity Relationship Diagram (ERD) of the target system.

The source design involves Oracle and MySQL. In MySQL, a table known as r1 is created to store records of students' results, with information such as student names and gender. In Oracle, a table known as stuinfo is created to store students' personal information, such as address, state and parental income. Meanwhile, the Microsoft
Excel file contains the students' assessment marks for the whole semester.
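The target star schema can be expressed as DDL executed over JDBC, as in the sketch below. The table names (fctStu, dimPro, dimAsses) come from the design above, but the column names, types and surrogate keys are illustrative assumptions; the authoritative design is the ERD in Figure 2.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CreateTargetSchema {
        public static void main(String[] args) throws Exception {
            try (Connection tgt = DriverManager.getConnection(
                     "jdbc:oracle:thin:@localhost:1521:dw", "user", "password");
                 Statement st = tgt.createStatement()) {

                // Dimension: student profile (personal and geographic information).
                st.executeUpdate("CREATE TABLE dimPro ("
                    + "proKey NUMBER PRIMARY KEY, matricNo VARCHAR2(10), "
                    + "name VARCHAR2(100), gender CHAR(1), state VARCHAR2(30))");

                // Dimension: detailed assessment results for a subject.
                st.executeUpdate("CREATE TABLE dimAsses ("
                    + "assesKey NUMBER PRIMARY KEY, matricNo VARCHAR2(10), "
                    + "sectionA NUMBER, sectionB NUMBER, sectionC NUMBER)");

                // Fact: student results for programming subjects, keyed to both dimensions.
                st.executeUpdate("CREATE TABLE fctStu ("
                    + "proKey NUMBER REFERENCES dimPro(proKey), "
                    + "assesKey NUMBER REFERENCES dimAsses(assesKey), "
                    + "subjectCode VARCHAR2(10), gred VARCHAR2(2))");
            }
        }
    }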
Table 1: Source system attributes

DBMS         DB NAME          TABLE      ATTRIBUTES
MYSQL        RPS              r1         name, matricNo, gender, semester, subjectCode, gred
ORACLE       STUPRO           stuinfo    matricNo, name, gender, address, state, spmres, parinc
Excel file   Assessment Mark  finalmark  matricNo, Section A, Section B, Section C
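The flat-file source can be read with plain Java. The sketch below assumes the Excel-format file has been exported as comma-separated text with the finalmark columns listed in Table 1 (matricNo, Section A, Section B, Section C); in the actual jobs, a TOS file-input component plays this role, and the file name is a placeholder.

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class ReadFinalMarks {
        public static void main(String[] args) throws Exception {
            // Assumes finalmark was saved as CSV: matricNo,sectionA,sectionB,sectionC
            try (BufferedReader in = new BufferedReader(new FileReader("finalmark.csv"))) {
                in.readLine();                             // skip the header row
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split(",");
                    String matricNo = f[0].trim();
                    double total = Double.parseDouble(f[1])   // Section A
                                 + Double.parseDouble(f[2])   // Section B
                                 + Double.parseDouble(f[3]);  // Section C
                    System.out.printf("%s total assessment mark: %.1f%n", matricNo, total);
                }
            }
        }
    }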
Figure 2: Multidimensional design for target system (student results fact table).

These three heterogeneous data sources are then populated into the target database's star schema using TOS. Table 1 shows the detail of the source columns in each DBMS and in the text file. The first step in developing the ETL jobs is to ensure that a successful connection has been established between TOS and the respective databases. A GUI interface helps in performing this task, as shown in Figure 3. Then, using the Structured Query Language (SQL) Builder, the connection can be tested by viewing data through TOS. The SQL statement can be manipulated to choose entities with particular attributes for analysis. Figure 4 shows the result of an SQL statement issued against the stuinfo table in the Oracle DBMS; it lists all the data in the table.

Figure 3: Creating a connection to ORACLE and MYSQL DBMS

Figure 5 shows the SQL Builder interface generating the result of a query made to the MYSQL DBMS; it shows the detail of the r1 records from the source system.
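What the TOS metadata connections and SQL Builder do interactively can be pictured as the following JDBC sketch, which opens both sources and previews the same tables shown in Figures 4 and 5. Host names, ports, service names and credentials are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class TestConnections {
        public static void main(String[] args) throws Exception {
            // Test the Oracle source and preview stuinfo, as in Figure 4.
            try (Connection ora = DriverManager.getConnection(
                     "jdbc:oracle:thin:@localhost:1521:stupro", "user", "password");
                 Statement st = ora.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT matricNo, name, state, parinc FROM stuinfo")) {
                while (rs.next()) {
                    System.out.println(rs.getString("matricNo") + " " + rs.getString("name"));
                }
            }

            // Test the MySQL source and preview r1, as in Figure 5.
            try (Connection my = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/rps", "user", "password");
                 Statement st = my.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT matricNo, subjectCode, gred FROM r1")) {
                while (rs.next()) {
                    System.out.println(rs.getString("matricNo") + " " + rs.getString("gred"));
                }
            }
        }
    }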
Figure 4: Result from stuinfo (ORACLE DBMS)
Once the connections have been successfully established, ETL jobs can be designed using the ETL jobs menu. This menu provides a friendly interface in which data can simply be dragged and dropped to map it from the data sources to the target database/DW, as shown in Figure 6.
Figure 6: ETL job from source to target

TOS provides several functionalities for performing data extraction and transformation. One of the basic functionalities is tMap, which is used to develop a simple mapping from source to target. tMap also provides a feature to perform data transformation while extracting data from the sources. Figure 7 shows the tMap interface when populating the ETL jobs in this academic-environment simulation.

Figure 7: Mapping using tMap
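Conceptually, tMap joins a main flow with lookup flows and maps the combined columns onto the target schema. A rough Java equivalent of such a lookup join is sketched below; the connection details are placeholders, and the in-memory map stands in for tMap's lookup handling.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.HashMap;
    import java.util.Map;

    public class TMapStyleJoin {
        public static void main(String[] args) throws Exception {
            Map<String, String> stateByMatric = new HashMap<>();

            // Lookup flow: cache the Oracle source in memory, keyed by matricNo.
            try (Connection ora = DriverManager.getConnection(
                     "jdbc:oracle:thin:@localhost:1521:stupro", "user", "password");
                 Statement st = ora.createStatement();
                 ResultSet rs = st.executeQuery("SELECT matricNo, state FROM stuinfo")) {
                while (rs.next()) {
                    stateByMatric.put(rs.getString("matricNo"), rs.getString("state"));
                }
            }

            // Main flow: stream the MySQL results and enrich each row via the lookup,
            // which is what an inner join between the two flows in tMap produces.
            try (Connection my = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/rps", "user", "password");
                 Statement st = my.createStatement();
                 ResultSet rs = st.executeQuery("SELECT matricNo, subjectCode, gred FROM r1")) {
                while (rs.next()) {
                    String matricNo = rs.getString("matricNo");
                    String state = stateByMatric.get(matricNo);   // join on matricNo
                    if (state == null) continue;                  // inner join: drop unmatched rows
                    System.out.printf("%s %s %s %s%n",
                        matricNo, rs.getString("subjectCode"), rs.getString("gred"), state);
                }
            }
        }
    }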
The final process is to compile and execute the jobs. Log files provide feedback on whether a running job succeeded or failed. In this simulation, all data from the heterogeneous sources were successfully extracted and loaded into the respective DW tables in the target database.
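Once the jobs have run, the loaded star schema supports the kind of demographic analysis named as the aim of the design. As a sketch, and assuming the illustrative fctStu/dimPro columns used earlier, a BI-style query might count programming-subject results by state:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DemographicAnalysis {
        public static void main(String[] args) throws Exception {
            // Count of programming-subject results by state, joining the fact
            // table to the student-profile dimension.
            String sql = "SELECT p.state, COUNT(*) AS results "
                       + "FROM fctStu f JOIN dimPro p ON f.proKey = p.proKey "
                       + "GROUP BY p.state ORDER BY results DESC";
            try (Connection dw = DriverManager.getConnection(
                     "jdbc:oracle:thin:@localhost:1521:dw", "user", "password");
                 Statement st = dw.createStatement();
                 ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.println(rs.getString("state") + ": " + rs.getInt("results"));
                }
            }
        }
    }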
5 EXPECTED OUTCOME

The implementation of a DW is crucial when dealing with various application systems, each with its own characteristics. Nowadays, most organizations need one central repository able to summarize all transaction data as a basis for decision making. Successful DW implementations have been demonstrated in various sectors.

However, the main challenge of DW projects is the cost involved in implementing the technology. Open source ETL is an option for reducing the cost of DW projects.
Using TOS to design and semi-automatically implement ETL tasks in Java enables fast adaptation to new data sources. We have managed to integrate various heterogeneous academic data samples and populate them into one central repository using TOS, and we expect the same results when real-life data is used. It is important to establish that the capabilities of open source ETL tools are equal to those of commercial products; consequently, this will help in implementing DW projects at lower cost.

6 CONCLUSION & FUTURE WORK

This paper explained the design and implementation of an open source ETL tool for the integration of heterogeneous databases. Our contribution is a DW architecture built on open source technologies in an academic environment. We expect the architecture to evolve as the project matures, which should help fit open source technologies into the data warehouse. The main challenge in completing this research is to implement the framework in real-life cases and to accommodate additional practical problems, including BI.

7 REFERENCES

1. Inmon, W. H.: Building the Data Warehouse. John Wiley & Sons, 1996.
2. Chaudhuri, S., Dayal, U. and Ganti, V.: Database Technology for Decision Support Systems. IEEE Computer, Vol. 34, No. 12, 2001.
3. Jarke, M., Lenzerini, M., Vassiliou, Y. and Vassiliadis, P.: Fundamentals of Data Warehouses. Springer-Verlag, 2003.
4. Kimball, R. and Ross, M.: The Data Warehouse Toolkit, 2nd edition. John Wiley & Sons, 2002.
5. Wierschem, D., McMillen, J. and McBroom, R.: What Academia Can Gain from Building a Data Warehouse. EDUCAUSE Quarterly, Vol. 26, No. 1, 2003.
6. Dell'Aquila, C., Di Tria, F., Lefons, E. and Tangorra, F.: An Academic Data Warehouse. In: Proceedings of the 7th WSEAS International Conference on Applied Informatics and Communications, Athens, Greece, August 24-26, 2007.
7. Mulcahy, R.: Business Intelligence Definition and Solutions. CIO.com, accessed 10 Oct 2011.
8. Kimball, R., Reeves, L., Ross, M. and Thornthwaite, W.: The Data Warehouse Lifecycle Toolkit. Wiley, New York, 1998.
9. Hoang, A. T. and Nguyen, B.: An Integrated Use of CWM and Ontological Modeling Approaches towards ETL Processes. In: IEEE International Conference on e-Business Engineering, 2008.
10. Reed, S. E., Na, D. Y., Mayo, T. C., Shapiro, L. W., Joseph, Duty, B., Conklin, J. H. and Brown, D. E.: Implementing and Analyzing a Data Mart for the Arlington County Initiative to Manage Domestic Violence Offenders. In: Proceedings of the 2010 IEEE Systems and Information Engineering Design Symposium, University of Virginia, Charlottesville, VA, USA, April 23, 2010.
11. Piedade, M. B. and Santos, M. Y.: Student Relationship Management: Concept, Practice and Technological Support. IEEE, 2008.
12. Sahay, A. and Mehta, K.: Assisting Higher Education in Assessing, Predicting, and Managing Issues Related to Student Success: A Web-based Software using Data Mining and Quality Function Deployment. Academic and Business Research Institute Conference, Las Vegas, 2010.
13. Thomsen, C. and Pedersen, T. B.: A Survey of Open Source Tools for Business Intelligence. International Journal of Data Warehousing and Mining, 2009.
14. Thomsen, C., Pedersen, T. B. and Lehner, W.: RiTE: Providing On-demand Data for Right-time Data Warehousing. In: Proceedings of ICDE, 2008.
15. Inmon, W. H.: The Evolution of Integration. White paper, 2007.
16. Talend Open Studio. [Online]: http://www.talend.com.
17. Fayyad, U. M., Piatetsky-Shapiro, G. and Smyth, P.: From Data Mining to Knowledge Discovery: An Overview. In: Advances in Knowledge Discovery and Data Mining, AAAI Press, 1996.
18. Golfarelli, M., Rizzi, S. and Cella, I.: Beyond Data Warehousing: What's Next in Business Intelligence? In: Proceedings of the 7th ACM International Workshop on Data Warehousing and OLAP, November 2004.
19. Schönbach, C., Kowalski-Saunders, P. and Brusic, V.: Data Warehousing in Molecular Biology. Briefings in Bioinformatics, Vol. 1, No. 1, pp. 190-198, May 2000.