Integrating Legacy System into Big Data Solutions: Time to Make the Change

Sanjay Jha, CQUniversity, Sydney, NSW 2000, Australia
[email protected]
Abstract—Storing, analyzing and accessing data is a growing problem for organizations. Competitive pressures and new regulations require organizations to efficiently handle increasing volumes and varieties of data, but this does not come cheap. As the demands of Big Data exceed the constraints of traditional relational databases, evaluating legacy infrastructure and assessing new technology has become a necessity for most organizations, not only to gain competitive advantage but also for compliance purposes. The challenge is how well an organization's legacy infrastructure can integrate Big Data; one way or another, Big Data must be accommodated by legacy systems. Legacy systems contain significant and invaluable business logic that organizations cannot afford to throw away or replace. These systems are assets of the organization: their encoded business logic represents many years of coding, development, real-life experience, enhancement, modification and debugging. Most legacy systems were developed without the process models or data models now needed to support and integrate Big Data. To integrate Big Data into a legacy system, modernization of the legacy system is required. There are many approaches to the modernization of legacy systems, but none of them focuses on integrating Big Data. Legacy systems also hold valuable data that is too important to be lost in the process of modernization. Addressing the issues and scope related to incorporating Big Data with legacy systems allows mature legacy systems to become part of this groundswell of change. Many areas of the integration of Big Data into legacy systems remain unaddressed: incorporating data from new sources, specifically "live" sources, into existing legacy systems is a technical challenge, and the sheer volume of Big Data can be daunting. This paper presents the scope of integrating Big Data into the modernization of legacy systems.

Keywords—Big Data; Legacy System; Software Modernization
Liam O'Brien, Geoscience Australia, Canberra, ACT 2609, Australia
[email protected]

Meena Jha, CQUniversity, Sydney, NSW 2000, Australia
[email protected]

Marilyn Wells, CQUniversity, Rockhampton, QLD 4701, Australia
[email protected]

I. INTRODUCTION

A legacy system is an old method, technology, computer system, or application program that continues to be used, typically because it still meets users' needs, even though newer technology or more efficient methods of performing the task are now available. In theory, it would be great to have immediate access to the most advanced technology. In reality, however, most organizations rely on legacy systems to some extent. A legacy system may be problematic due to compatibility issues, obsolescence, or a lack of security support [1].
Legacy systems are mostly written in 3GL programming languages such as COBOL, RPG, PL/1, FORTRAN, BASIC, PASCAL, C, etc. Changing technology is pushing the modernization of legacy systems in several ways. One reason the situation is changing so rapidly is the emergence of integrating infrastructures. With improved integration we have seen the World Wide Web (the Web) and electronic commerce flourish; where once information systems were isolated and difficult to access, they can now be reached using the Web and interfacing software. A great deal of data and information is generated, and organizations are capturing and sharing more data from more sources than ever before. As a result, all organizations face the challenge of managing high-volume and high-velocity data streams quickly and analytically [2]. The emergence of changes through new technologies, applications, and social phenomena creates novel business requirements and system complexities. Some of these changes create new driving business forces and new organizational structures, and they also force business to be conducted in different ways; examples of such emerging technological changes are Facebook, LinkedIn, Google and Twitter [2]. As phenomenal growth takes place in data processing power, data visualization, data storage, network speeds, mobility, access to real-time data, and higher semantic capabilities, legacy systems need to integrate these technological changes, which software developers could not have predicted at the time of development, if organizations are to maintain their competitive advantage. One example of such technological change is social networking and online blogging, which were not predicted to become mainstream activities a decade ago; yet these changes now define the evolution of technologies, infrastructures, applications, users, communities, societies and knowledge creation for software developers [3]. Big Data can be stored, acquired, processed, and analysed in many different ways. Every Big Data source has different characteristics, including the frequency, volume, velocity, type, and veracity of the data. When Big Data is processed and stored, additional dimensions come into play, such as security, policies, structure, and governance. Choosing an architecture and building an appropriate Big Data solution is challenging because so many factors have to be considered. It becomes even more challenging when legacy systems are to be modernized in a way that allows them to feed into a Big Data solution. There are many reasons for this: legacy systems are resistant to change, difficult to maintain, and often have no documentation and no available software architecture description. Traditional solutions do not meet the new demands regarding complexity [4].
As organizations grow and expand in more than one direction, through mergers, acquisitions, identified problems, or requirement generation, they are also likely to acquire technologies that duplicate existing capabilities, workflows in need of significant overhaul, and legacy systems whose contribution to business value is still critical. Organizations in such situations must address the issue of integrating Big Data (live data, data from internal sources, data from external sources), which can drive the alignment between the business's operating needs and the processes, applications, data and infrastructure required to support ever more dynamic requirements. As the demand for managing information increases, organizations need to focus their efforts on integrating business processes and data. The term modernization of legacy systems reflects the capability to integrate a variety of different system functionalities such as business processes and data [6].
Because of globalization an organization may have a number of centres at different geographical locations. Most organizations today maintain many billions of lines of code associated with thousands of heterogeneous information systems at these centres. While modernizing legacy systems and integrating Big Data, organizations face two problems:
- There are often functionally duplicative systems, and the cost of operating these systems consumes an enormous portion of total spending on information systems.
- Despite this spending, organizations are often unable to obtain the correct information from the data stored in the various existing databases, due to a lack of standardized data and data structures across systems.

There are also technological reasons which require attention when integrating Big Data into legacy systems:
- The generation and storage of data continue to grow exponentially.
- Data is found in various forms everywhere.

There are many System Development Life Cycle (SDLC) methodologies, such as waterfall, spiral, JAD and RAD. These SDLCs do not address the issue of integrating Big Data into legacy systems. Many approaches to modernize legacy systems have been developed, but to date most of them do not explicitly address how to integrate Big Data into legacy systems. The list of related work would be too long to reproduce here; the state of the art may be found in several publications [5, 6, 7, 8]. The current situation in legacy system modernization can be summarized as follows:
- Knowledge Based Software Reuse (KBSR) Process and Repository for systematic legacy system modernization [5].
- Reusing code for modernization [6].
- Most of the database reverse engineering literature examines solutions for the migration of relational databases [7].
- Database reverse engineering is sufficiently mature to be applied in practice [7].
- There is a lack of literature on successful modernization processes; many modernization projects fail, as outlined by the Standish Group [8].
- Redevelopment approaches are considered risky for most organizations [8, 9].
- The reverse engineering of procedural components of a large application is still unsolved [10, 1].
- Wrapping solutions are short-term solutions that can complicate legacy system maintenance and management over time [9].

Over the years legacy systems have been continuously modified to implement changing needs, including functional requirements, business rules, and data architectures. These legacy systems contain invaluable organizational information. Jha and O'Brien [5] have recognized the value of legacy systems and identified the need to use modernization to recover and reuse functional requirements, software artefacts, and system components. There is also a need to develop an information base for data migration planning which can incorporate Big Data. In this paper we discuss the scope of integrating Big Data into legacy systems and outline a methodology for this integration. We first need to address what is required for the integration of Big Data into the modernization of legacy systems, and then we can focus on how this can be achieved within the identified scope. For this we need to address the impact of legacy system complexity on Big Data, give an overview of existing legacy modernization approaches, describe data categories and how Big Data differs from them, identify the data administration strategies that need to be addressed while modernizing legacy systems, and define the scope of integrating legacy systems into Big Data solutions. The remainder of the paper is structured as follows. Section 2 describes the impact of legacy system complexity on Big Data. Section 3 gives an overview of existing legacy modernization approaches. Section 4 describes data categories and Big Data. Section 5 outlines the data administration strategies that need to be addressed while modernizing legacy systems and describes integrating legacy systems into Big Data solutions. Finally, Section 6 concludes the paper.

II. IMPACT OF LEGACY SYSTEM COMPLEXITY ON BIG DATA

A. Operational Complexity

Because of mergers and acquisitions, organizations have acquired legacy systems which contain invaluable data. An organization may have a number of information systems, such as a personal data system, pay system, health care system, and procurement system. Organizations often have separate subordinate organizations maintaining separate, functionally duplicative systems supporting operational requirements. To collect, analyze or process data at higher organizational levels, management collects it from the lower levels. The collection process depends on subordinate organizations feeding information upward, often manually, and the format of the upward feed must be meticulously specified at each level. Outside of limited mission-critical instances, the consistency and accuracy of these data flows has been practically impossible to maintain and control. Such data cannot readily be integrated into Big Data solutions, and this hampers managers and executives who want to make decisions based on the available data. The impact of legacy systems on Big Data integration includes the following:
The inventory of physical evidence is difficult to collect and analyze. Even if documentation exists, it is often outdated or of poor quality. In addition, personnel with the required knowledge are often no longer available.
The existing interface data elements defined for transferring data instances do not completely represent the identified, sharable data structures and semantic requirements among systems. In order to obtain a complete picture of the data sharing requirements, modernization of a legacy system must determine not only how the interface data elements were generated and used among systems, but also identify other non-interface data elements which are synonymous among systems and are therefore sharable; a minimal sketch of such a mapping is given below.
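To make the idea of synonymous, sharable data elements concrete, the following minimal sketch (all system names, field names and mappings are invented for illustration) shows one way such elements could be renamed onto a shared vocabulary before records from different legacy systems are combined.

```python
# Illustrative sketch: reconciling synonymous data elements across two
# legacy systems onto a shared (canonical) vocabulary. All names here are
# hypothetical; a real mapping would come from data administration work.

CANONICAL_MAP = {
    "pay_system":    {"emp_no": "employee_id", "sal": "annual_salary"},
    "health_system": {"staff_id": "employee_id", "base_pay": "annual_salary"},
}

def to_canonical(system: str, record: dict) -> dict:
    """Rename system-specific fields to their shared names; keep the rest."""
    mapping = CANONICAL_MAP[system]
    return {mapping.get(field, field): value for field, value in record.items()}

if __name__ == "__main__":
    pay = {"emp_no": "E042", "sal": 78000}
    health = {"staff_id": "E042", "base_pay": 78000, "plan": "basic"}
    print(to_canonical("pay_system", pay))
    print(to_canonical("health_system", health))
    # Both records now expose 'employee_id' and 'annual_salary', so the two
    # systems' data can be matched and shared.
```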
B. Technical Complexity

From one perspective, legacy systems are time-tested, having value proven by long use; they represent decades of effort and customization and have become reliable parts of an overall IT strategy along the way. These entrenched software systems often resist evolution because their ability to adapt has diminished through factors not exclusively related to their functionality. According to Lehman's first law [12], software must be continually adapted or it will become progressively less satisfactory in real-world environments. This is due to the continuous change of user requirements and of the technical environment. Many legacy systems have been very large investments for organizations, and they contain invaluable business logic and knowledge. Koskinen et al. have carried out an empirical study on software modernization decision criteria and found that the feasibility of a legacy system to be evolved, maintained, and integrated with other systems is improved by modernization [13]. The ending of technological support and the expected system lifetime make system modernization mandatory [13]. The data handling system may be an outdated database system, e.g. a home-grown database management system or a flat-file system such as FILEMAN. Program-managed memory overlays were an innovative use of flat-file technology, using a table-driven approach to separate process from data as database technology has done. However, a fixed record length limits the number of fields available to accommodate growing user requirements, and the same field can be used for multiple purposes with different meanings depending on the user group. Allowing individual user discretion rather than enterprise standards permitted a restrictive physical data limit to serve more customers, but this approach is reaching its design and operation limit, and such data cannot be integrated into Big Data solutions.

III. OVERVIEW OF LEGACY MODERNIZATION APPROACHES

Legacy modernization is the practice of updating aging applications and systems to interact with newer technologies. While the scope of a modernization effort is not always fixed or well defined, software engineers typically strive for greater application agility so they can rapidly respond to business requests for change [14]. When planning a modernization effort, it should be carefully considered how best to leverage existing assets. It must also be considered how best to support future initiatives, about which the organization may yet know very little. Software modernization is more challenging than most software engineers suspect. We identified several approaches to the modernization of legacy systems. Modernization can be done at different levels. At lower levels, modernization takes the form of transforming the code from one language into another. At higher levels, the structure of the system may be changed as well, for instance to make it more object-oriented. At still higher levels, the global architecture of the system may be changed as part of the modernization process. Although design patterns and generic programming have been very successful in new software development, we are not aware of any work from other research groups that has studied the effectiveness of these techniques in modernizing legacy systems for Big Data solutions. The following gives an overview of the various modernization approaches.
Errickson-Connor [15] proposed the steps of a software modernization process in which legacy code is transformed to new languages and new environments. She suggests that in the first stage legacy code needs to be cleaned up by removing program anomalies before it can be transformed. The second stage involves software restructuring tasks such as identifying, isolating and extracting business rules as reusable services. When the code corresponding to a business rule has been extracted, it is ready for transformation into components in stage three. The fourth stage manages these reusable components in a software environment. Zhang Li et al. [16] have provided a modernization process called Tollgate Model Transformation, with which a legacy system can be adapted to a service-oriented architecture (SOA) system whose granularity is changeable; the Tollgate Model Transformation process is based on the wrapping technique. The Aberdeen Group [17] surveyed legacy application modernization and found that companies are looking to the SOA approach to create distributed applications that help them both modernize their legacy applications and make their composite applications more flexible, thereby giving their businesses more agility. Some companies, however, are simply looking to get rid of legacy applications on mainframes and UNIX servers in order to get rid of the legacy problem. Fuhr et al. [18] have described an approach using model-driven techniques to extend IBM's SOMA method towards migrating legacy systems into Service-Oriented Architectures (SOA). The approach was applied to the migration of functionality of GanttProject towards a Service-Oriented Architecture.
As a result, fully functional Web Services were generated whose business functionality was implemented by transforming legacy code. The approach addresses the semi-automatic migration of legacy software systems to Service-Oriented Architectures, based on model-driven techniques and code transformation.
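As a purely illustrative complement to these service-migration approaches (not the SOMA-based method itself), the sketch below shows the kind of end state they target: a business rule recovered from legacy code exposed as a small web service. The rule, URL and parameters are hypothetical.

```python
# Minimal sketch of exposing extracted legacy business logic as a web
# service. The pricing rule and the request form are invented for this
# example; real migrations involve far more analysis and tooling.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

def net_price(gross: float, discount_code: str) -> float:
    """Stand-in for a business rule recovered from legacy code."""
    discounts = {"GOLD": 0.10, "SILVER": 0.05}
    return round(gross * (1.0 - discounts.get(discount_code, 0.0)), 2)

class PricingService(BaseHTTPRequestHandler):
    def do_GET(self):
        # Hypothetical request form: /net-price?gross=100&code=GOLD
        params = parse_qs(urlparse(self.path).query)
        gross = float(params.get("gross", ["0"])[0])
        code = params.get("code", [""])[0]
        body = json.dumps({"net_price": net_price(gross, code)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Serves the wrapped rule on localhost:8080 until interrupted.
    HTTPServer(("localhost", 8080), PricingService).serve_forever()
```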
Other legacy system modernization approaches that have been developed by researchers and software practitioners include the black-box approach such as wrapping, the white-box approach which requires program understanding and reverse engineering, screen scraping, data wrapping, legacy integration using the Common Gateway Interface (CGI), data contextualization, architecture-driven modernization, COTS-based modernization, etc. Some of the concepts of one modernization approach overlap with those of another; for example, wrapping is also called the black-box approach, while the white-box approach requires program understanding. There is no clear and concise classification of these modernization approaches in the existing literature that addresses the integration of Big Data. Some of the identified legacy modernization approaches can be categorized as follows.

A. Black-Box Modernization Approach

The black-box modernization approach provides a new interface to a legacy component. Black-box modernization does not require an understanding of the system and treats a running system as a black box. A new interface is designed so that the functionality of the legacy system can be accessed through it. Black-box modernization includes techniques such as screen scraping, database gateways, XML integration, CGI integration, and object-oriented wrapping of legacy systems [19]. This approach can be used in the absence of legacy system knowledge, and it suits a system that is already very stable and only needs to be interoperable with another external system. Security and reliability are other quality attributes which can be improved using black-box modernization. Reliability is the ability of a system to remain operational over time; it is measured as the probability that a system will not fail to perform its intended functions over a specified time interval. Maintainability and reusability are not objectives of this modernization approach.

B. White-Box Modernization Approach

The white-box modernization approach requires an understanding of legacy system internals. If this understanding is unavailable, some work needs to be done to understand the internals of the legacy system. White-box modernization includes source code restructuring, which keeps the external behavior of the system intact while improving the maintainability and performance of the system [8]. White-box modernization requires program understanding. Maintainability and reusability are the main objectives of this modernization approach. Interoperability is another quality attribute which can be improved by the white-box approach. Interoperability is the ability of a system to operate successfully by communicating and exchanging information with external systems written and run by external parties. An interoperable system makes it easier to exchange and reuse information internally as well as externally.

C. Wrapping

The wrapping approach is also called black-box modernization. It provides a new interface to a legacy component; in other words, wrapping removes mismatches between the interface exported by a software artefact and the interfaces required by current integration practices. Wrapping involves surrounding existing data, individual programs, application systems, and interfaces to give a legacy system a "new and improved" look or to improve operations [20, 21]. The wrapped component acts as a server, performing some function required by an external client, which does not need to know how the service is implemented [22]. Wrapping permits reusing components and leveraging the massive investment made in the legacy system over many years. This approach enhances security and interoperability but has no effect on maintainability or reusability.

D. Migration

Legacy migration allows legacy systems to be moved to new environments in which they can be easily maintained and adapted to new business requirements, while retaining the functionality and data of the original legacy systems without having to completely redevelop them. Ganti and Brayman [23] propose general guidelines for migrating legacy systems to a distributed environment: the business is first examined and the business processes found are re-engineered as required. Migration of a legacy system includes system migration and component migration; component migration involves migrating small components to new platforms, whereas system migration migrates the complete system to a new platform [24]. The few successful migration reports found in the literature [25, 26] describe ad-hoc solutions to the problem at hand. Migration involves complete understanding of the legacy system, its interfaces and its legacy data [27]. Migrating legacy systems to services enables both the reuse of already established and proven software components and the integration with new services, including their orchestration to support changing business needs. In order to gain the most benefit from a migration, a comprehensive approach supporting the migration process and enabling the reuse of legacy code is required [18]. One of the objectives of migrating a legacy system to a newer platform is to improve interoperability, that is, the ability of a system to operate successfully by communicating and exchanging information with external systems written and run by external parties. An interoperable system makes it easier to exchange and reuse information internally as well as externally. Communication protocols, interfaces, and data formats are the key considerations for interoperability, and standardization is also an important aspect when designing an interoperable system. Understanding the legacy system, its interfaces and its legacy data are the main migration challenges.
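As a minimal illustration of data or component migration, the sketch below moves records from a hypothetical legacy fixed-width flat file into a relational table; the record layout and field names are invented for this example, and a real migration would also carry over constraints, history and interfaces.

```python
# Illustrative data migration sketch: records from a legacy fixed-width
# flat file are parsed and loaded into a relational table (SQLite here,
# purely for demonstration). The record layout is invented.
import sqlite3

# (field name, start column, end column) of the hypothetical flat file.
LAYOUT = [("employee_id", 0, 6), ("name", 6, 26), ("annual_salary", 26, 34)]

def parse_line(line: str) -> dict:
    record = {name: line[start:end].strip() for name, start, end in LAYOUT}
    record["annual_salary"] = int(record["annual_salary"])
    return record

def migrate(lines, db_path=":memory:"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS employee "
                 "(employee_id TEXT PRIMARY KEY, name TEXT, annual_salary INTEGER)")
    for line in lines:
        rec = parse_line(line)
        conn.execute("INSERT OR REPLACE INTO employee VALUES (?, ?, ?)",
                     (rec["employee_id"], rec["name"], rec["annual_salary"]))
    conn.commit()
    return conn

if __name__ == "__main__":
    # Two sample fixed-width records, built to match LAYOUT above.
    legacy_lines = [
        f"{'E00042':<6}{'Jane Citizen':<20}{78000:08d}",
        f"{'E00043':<6}{'Sam Le':<20}{64500:08d}",
    ]
    conn = migrate(legacy_lines)
    print(conn.execute("SELECT * FROM employee").fetchall())
```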
E. Screen Scraping

Carr [28] has suggested a modernization technique called screen scraping. Screen scraping consists of wrapping old, text-based interfaces with new graphical interfaces. The old interface is often a set of text screens running on a dumb terminal; the new interface can be a PC-based graphical user interface (GUI), or even a hypertext markup language (HTML) light client running in a Web browser. This technique can be extended easily, enabling one new user interface to wrap a number of legacy systems. From the perspective of the legacy system, the new graphical interface is indistinguishable from an end user entering text on a screen. From the end user's point of view, the modernization has been successful, as the new system now provides a modern, usable graphical interface. However, from the IT department's perspective, the new system is as inflexible and difficult to maintain as the legacy system. Screen scraping is basically a "makeover" for legacy systems. This kind of modernization can be effective for stable systems where the principal objective is to improve usability rather than maintainability.
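The following toy sketch illustrates the screen-scraping idea: a captured text screen is sliced at fixed positions to recover field values that a new graphical front end could display. The screen layout and field positions are invented for this example.

```python
# Illustrative screen-scraping sketch: a text screen captured from a
# hypothetical terminal session is sliced at fixed positions to recover
# field values for a new front end.
SCREEN = (
    "ACCOUNT ENQUIRY                                   \n"
    "ACCOUNT NO: 00123456   STATUS: ACTIVE             \n"
    "BALANCE   : 001024.50  LIMIT : 005000.00          \n"
)

# (row, start column, end column) positions, specific to this mock screen.
FIELDS = {
    "account_no": (1, 12, 20),
    "status":     (1, 31, 40),
    "balance":    (2, 12, 21),
}

def scrape(screen: str) -> dict:
    rows = screen.splitlines()
    values = {name: rows[r][c1:c2].strip() for name, (r, c1, c2) in FIELDS.items()}
    values["balance"] = float(values["balance"])
    return values

if __name__ == "__main__":
    print(scrape(SCREEN))
    # e.g. {'account_no': '00123456', 'status': 'ACTIVE', 'balance': 1024.5}
```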
F. Data Wrapping

Data wrapping improves connectivity and allows the integration of legacy data into modern infrastructures. Legacy systems often exchange information developed on different systems, where sources and receivers have implicit, preconceived assumptions about the meaning of data. It is thus not uncommon for system A and system B to use different terms to define the same thing. However, in order to achieve a useful exchange of data, the individual systems must agree on the meanings of the exchanged data; in other words, the legacy systems must ensure interoperability. Altman et al. [29] have suggested data wrapping, which enables accessing legacy data using a different interface or protocol than the one for which the data was designed initially. Data wrapping improves the interoperability of the system.

G. Legacy Integration Using CGI

Eichmann [30] has suggested legacy integration using the Common Gateway Interface (CGI). The CGI is a standard for interfacing external applications with information servers, such as HTTP or Web servers. Legacy integration using CGI is often used to provide fast web access to existing assets, including legacy systems on mainframes and transaction monitors. The GUI communicates directly with the core business logic or data of the legacy system instead of wrapping it as in screen scraping. This approach adds value to interoperability but does not integrate Big Data, and it does not improve the maintainability of the legacy system.

H. Data Contextualization

Data contextualization is a technique which can be used in a modernization approach when the reverse engineering stage is being carried out. Systems that attempt to integrate and analyze data from multiple data sources are greatly aided by the addition of specific semantic and metadata context that explicitly describes what a data value means. Perez-Castillo et al. [31] proposed the data contextualization technique. This technique recovers the linkages between pieces of legacy source code and the fragments of database schemas used by those pieces, in order to place the data in context and provide detailed metadata to integration systems. The context of a piece of data includes its semantics ("To what specific concept does this piece of data refer?"), its syntax ("How is this piece of data structured?"), and other related metadata such as information about the quality of the data. Data contextualization standardizes data formats. This helps in the exchange of information but does not address the integration of legacy systems into Big Data solutions.

I. Architecture-Driven Modernization

In June 2003, the Object Management Group (OMG) formed a Task Force on modeling in the context of legacy software systems. Initially the group was called the Legacy Transformation Task Force, but the name was then unanimously changed to the Architecture-Driven Modernization (ADM) Task Force. Reengineering and MDA (Model-Driven Architecture) have converged on ADM. ADM is the concept of modernizing existing systems with a focus on all aspects of the current system's architecture and the ability to transform the current architecture into a target architecture [32]. ADM is the process of understanding and evolving existing software assets for the purpose of software improvement, modification, interoperability, refactoring, restructuring, reuse, porting, migration, translation into another language, and enterprise application integration [33]. ADM usually involves one or more components of the IT architecture, and each component of an IT portfolio has its own trajectory of evolution from the as-is state to the to-be state (i.e. an element of the existing solution evolves into an element of the target solution), reflecting transformations within architectural perspectives. The increasing cost of maintaining legacy systems, together with the need to preserve business knowledge, has turned modernization of legacy systems into an important research field. ADM provides several benefits, such as return on investment (ROI) improvements on existing information systems, reduced development and maintenance cost, an extended life cycle for the legacy systems, and easier integration with other systems. The work of the ADM Task Force in the OMG has led to the development of several standards. The cornerstone within this set of standards is the Knowledge Discovery Metamodel (KDM). KDM allows standardized representation of knowledge extracted from legacy systems by means of reverse engineering [34], and it provides a common repository structure that makes possible the exchange of information about existing software assets in legacy systems.
J. COTS Based Modernization Kotonya, et al., [35] have described a COTS based modernization approach called COMPOSE which is a component–based approach to extending legacy systems. The COMPOSE method embodies a cyclical development process that integrates verification into every part of the process to ensure that there is an acceptable match between components and the system being built. It also includes negotiation in each cycle as an explicit recognition of the need to trade-off and accept compromise in successful component-based system development. This ensures that even the earliest stages of system development are carried out in a context of off-the shelf component availability, system requirements and critical architectural concerns. The lack of practical methods and tools has hampered more widespread use of COTS in modernizing legacy systems [36, 37].
K. Knowledge Based Software Reuse (KBSR) Process and Repository for Systematic Legacy System Modernization

The KBSR Process [5] involves two software reuse phases to help software engineers develop or modernize a software system with reuse. These phases are:
- developing the KBSR Repository (for reuse), and
- using the KBSR Repository in the modernization of a system (with reuse).

The KBSR Process is based on software reuse and makes software reuse an integral phase in software development and in legacy system modernization. It suggests that all reusable software artefacts, components, assets, etc. should be made easily available to software engineers. A reuse repository stores the knowledge base of reusable software artefacts, reusable components, previous software development experiences, and so on. The KBSR Repository aims to give software engineers easy access to reusable software artefacts and reusable components. The knowledge used for software development is categorized and saved in the KBSR Repository for reuse, and the knowledge extracted from legacy systems is also categorized and saved in the KBSR Repository for modernization with software reuse. The KBSR Repository contains all categories of reusable software artefacts and reusable components and hence provides software reusable assets. Reuse repositories are one critical element of successful software reuse processes [6]. Software engineers and developers can access reusable software assets from the KBSR Repository.

IV. DATA CATEGORIES AND BIG DATA

Big Data refers to large datasets that are challenging to store, search, share, visualize, and analyze. At first glance, their orders of magnitude outstrip conventional data processing and the largest of data warehouses. It is often said that data volume, velocity, and variety define Big Data, but a further distinguishing characteristic of Big Data is the manner in which value is discovered. Big Data is unlike conventional business intelligence, where the simple summing of a known value reveals a result, such as order sales becoming year-to-date sales.
Data Category          | Structure                                     | Volume/Velocity/Complexity | Examples
Master Data            | Structured                                    | Low                        | Employees, customers, offices, products, assets
Transactional Data     | Structured and semi-structured                | Medium-High                | Sales orders, shipping documents, credit card payments
Reference Data         | Structured and semi-structured                | Low-Medium                 | Code lists, status codes, market data
Metadata               | Structured                                    | Low                        | Data names, definitions of data entities
Analytical Data        | Structured                                    | Medium-High                | Data in data warehouses, data marts
Documents and Content  | Unstructured                                  | Medium-High                | Medical images, maps, video, medical records
Historical Data        | Structured                                    | Medium-High                | Point-in-time reports, database snapshots, version information
Temporary Data         | Structured                                    | Low-Medium                 | A copy of a table created during a processing session to speed up lookups
Big Data               | Structured, semi-structured and unstructured  | High                       | Machine/user-generated content: social media, web and software logs, cameras, etc.

Table 1: Data Categories
The growth of Big Data is a result of the increasing channels and variety of data in today's world. Some of the new data sources are user-generated content through social media, web and software logs, cameras, information-sensing mobile devices, aerial sensory technologies, genomics, and medical records. Organizations have realized that there is a competitive advantage in this information and that now is the time to put this data to work.
IT strategists, planners, developers and architects have been trying to find relevant information in unstructured data for many years. Big Data differs from the other data categories in many dimensions; Table 1 shows how Big Data differs from the other categories of data, where data categories are groupings of data with common characteristics. When integrating Big Data capabilities into legacy systems we need to address the issues related to Big Data: as Table 1 shows, Big Data can be structured, semi-structured or unstructured, and has high volume, velocity and complexity.

V. DATA ADMINISTRATION STRATEGIES FOR INTEGRATING LEGACY SYSTEM INTO BIG DATA SOLUTIONS

While integrating Big Data into legacy system modernization, the challenge is to create a target system which incorporates Big Data alongside other data sources to meet rapid-use and rapid data interpretation requirements. The modernized system should have all the functional and data requirements of the legacy systems. A legacy system becomes a modernized system by:
- extracting and adding the functional requirements from legacy systems, removing duplicated processes;
- separating data from processes; and
- incorporating standard structures for data reuse and data sharing.

Most legacy systems were developed without the process models or data models which are now required to support data standardization. The organization requires that modernized systems use logical data models to represent data requirements. Data models must be developed to represent the policies, strategies, and Big Data issues. Under this framework the development of the models includes business functions, policies, rules, and data elements. This approach ensures that all data structures, including Big Data, can be identified and linked to the processes they support. The modernized system should integrate data administration strategies with the legacy system functional areas. In an organization there may be a number of legacy systems, such as a payment system, a health care system, and an employees system, as shown in Figure 1. There is a need to employ data administration strategies such as data integration considerations, data modeling, data standardization, data migration and data architecture planning. Figure 1 shows that the modernized system should reuse the functionality provided by the legacy systems together with these data administration strategies.

Figure 1: Modernized system using legacy system functional areas and data administration strategies

To employ the data administration strategies we also need to understand a well-formed logical architecture for structured data. Structured data collected from different sources is moved, using integration techniques such as ELT/ETL and Change Data Capture, into a DBMS data warehouse or operational data store, over which a wide variety of analytical capabilities can then be offered. Some of these analytical capabilities include dashboards, reporting, EPM/BI applications, summary and statistical query, semantic interpretation of textual data, and visualization tools for high-density data. By applying data administration strategies, standardization across the different application systems in an organization becomes possible.

Figure 2: Traditional data processing architecture capabilities (structured data, enterprise integration, data warehouse, analytical capabilities)

Big Data is measured at the scale of terabytes (1024 GB = 1 terabyte) and petabytes (1024 terabytes = 1 petabyte); Google, for example, processes almost 20 petabytes of data every day. Traditional data processing architecture capabilities cannot keep up with data at this scale. The processing capabilities of a Big Data architecture must meet the volume, velocity, variety, and value requirements, and there are different technology strategies for real-time and batch processing requirements. For real time, key-value data stores, such as NoSQL databases, allow high-performance, index-based retrieval. For batch processing, a technique known as MapReduce [38] filters data according to a specific data discovery strategy. After the filtered data is discovered, it can be analyzed directly, loaded into other unstructured databases, sent to mobile devices, or merged into a traditional data warehousing environment and correlated with structured data, as shown in Figure 3. MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster [38]. Figure 4 depicts the various functions that must be completed to integrate legacy systems into Big Data solutions; the legacy system needs to be modernized in order to exploit these Big Data architecture capabilities.
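To make the MapReduce model concrete, the following toy, single-process sketch shows the map, shuffle and reduce phases over a handful of invented web-log lines; real MapReduce implementations [38] run these phases in parallel across a cluster and over far larger volumes.

```python
# Toy, single-process sketch of the MapReduce programming model: map emits
# (key, value) pairs, pairs are grouped by key (shuffle), and reduce
# aggregates each group. Input data and the filtering strategy are invented.
from collections import defaultdict

def map_phase(log_line):
    # Hypothetical web-log discovery strategy: emit one count per HTTP status.
    status = log_line.split()[-1]
    yield (status, 1)

def reduce_phase(key, values):
    return key, sum(values)

def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)
    for record in records:                     # map and shuffle
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())   # reduce

if __name__ == "__main__":
    logs = ["GET /home 200", "GET /order 500", "GET /home 200", "POST /pay 200"]
    print(map_reduce(logs, map_phase, reduce_phase))   # {'200': 3, '500': 1}
```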
Figure 3: Big Data architecture capabilities (unstructured data flows through key-value data stores and MapReduce into the data warehouse as structured data for analytical capabilities)

The Big Data architecture challenge is to meet the rapid-use and rapid data interpretation requirements while at the same time correlating Big Data with other data. At present the integration of traditional data and Big Data is a challenge.

We require reverse engineering and forward engineering to accomplish the task of modernizing a legacy system so that Big Data can be incorporated. The activities involved are reverse engineering, forward engineering, and system integration and data administration.

Reverse Engineering: Reverse engineering supports the integrated analysis and redesign/development activities required to modernize the selected legacy system. Due to the massive and complex nature of the software, modernization must be conducted in multiple phases. Reverse engineering can be divided into a white-box approach and a black-box approach [39]. In the white-box approach, the business logic is extracted through analysis of the legacy system. This approach can be separated into two domains: Database Reverse Engineering (DBRE) and Procedure Reverse Engineering (PRE). Database Reverse Engineering is the part of system maintenance work that produces a sufficient understanding of an existing database system and its application domain to allow appropriate changes to be made. It deals with a subset of the problems addressed by software reverse engineering, recovering the domain semantics of an existing database and representing them as a conceptual schema that corresponds to the most likely design specification of the database. These design specifications are required when integrating Big Data into the application software. While the first domain, DBRE, seems mature enough for the development of DBRE tools, the second, PRE, is still an unsolved problem [40, 41]. Procedure Reverse Engineering deals with analysing and understanding the old code, which is a difficult task. Some architecture reconstruction tools have been developed to aid in understanding such code, but these tools are human-interactive and interpretive [42, 43]. Jha and O'Brien have used a software architecture reconstruction tool to document the software architecture of a legacy system [44].

Forward Engineering: Forward engineering is the traditional process of moving from high-level abstractions and logical, implementation-independent designs to the physical implementation of a system. Forward engineering follows a sequence from requirements through design to implementation, and all software development life cycles are based on it. The traditional software development life cycle has five phases: analysis, design, coding, implementation, and maintenance. One of the newer and more effective approaches is agile development. The basic philosophy of agile development is that neither the team members nor the users completely understand the problems and complexities of a new system, so the project plan and the execution of the project must be responsive to unanticipated issues; it must be agile and flexible [45].

System Integration and Data Administration: System integration is the process of bringing together the component subsystems into one system and ensuring that the subsystems function together as a system. It is also the process of linking together different computing systems and software applications, physically or functionally, to act as a coordinated whole. There are different methods of integration, such as vertical integration, horizontal integration, star integration, and common data format integration. Vertical integration is the process of integrating subsystems according to their functionality by creating functional entities, also referred to as silos. The benefit of this method is that the integration is performed quickly and involves only the necessary vendors; therefore, it is cheaper in the short term [46]. On the other hand, the cost of ownership can be substantially higher than with other methods, since in the case of new or enhanced functionality the only way to scale the system is to implement another silo [47]. Horizontal integration, also called an Enterprise Service Bus (ESB), is an integration method in which a specialized subsystem is dedicated to communication between the other subsystems. This cuts the number of connections (interfaces) to only one per subsystem, each connecting directly to the ESB, which is capable of translating one interface into another. This reduces the cost of integration and provides great flexibility: with systems integrated using this method, it is possible to completely replace one subsystem with another subsystem which provides similar functionality but exports different interfaces, completely transparently to the rest of the subsystems; the only action required is to implement the new interface between the ESB and the new subsystem. The horizontal scheme can be misleading, however, if it is thought that the cost of intermediate data transformation or the cost of shifting responsibility over business logic can be avoided. Star integration [46], also known as spaghetti integration, is a process of integration in which each subsystem is interconnected with each of the remaining subsystems. When observed from the perspective of the subsystem being integrated, the connections are reminiscent of a star, but when the overall diagram of the system is presented, the connections look like spaghetti, hence the name. The cost varies with the interfaces that the subsystems export; where subsystems export heterogeneous or proprietary interfaces, the integration cost can rise substantially, and the time and cost needed to integrate the systems increase exponentially when adding further subsystems. From a feature perspective this method often seems preferable, due to the great flexibility in the reuse of functionality. A common data format is an integration method that avoids every adapter having to convert data to and from every other application's format; enterprise application integration (EAI) systems usually stipulate an application-independent (or common) data format and provide a data transformation service to help convert between application-specific and common formats. This is done in two steps: the adapter converts information from the application's format into the bus's common format, and then semantic transformations are applied (converting zip codes to city names, splitting or merging objects from one application into objects in the other applications, and so on).
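The two-step common-data-format conversion described above can be illustrated with a small sketch: an adapter maps an application-specific record into a common format, and a semantic transformation (here a postcode-to-city lookup) is then applied. Field names and the lookup table are invented for this example.

```python
# Illustrative sketch of common-data-format integration: step 1 converts an
# application-specific record into the bus's common format; step 2 applies a
# semantic transformation on the common-format record. All names are
# hypothetical.
def pay_system_adapter(record: dict) -> dict:
    """Step 1: application-specific format -> common format."""
    return {
        "customer_id": record["custno"],
        "postal_code": record["zip"],
        "amount": float(record["amt"]),
    }

ZIP_TO_CITY = {"2000": "Sydney", "4701": "Rockhampton"}   # hypothetical lookup

def enrich(common_record: dict) -> dict:
    """Step 2: semantic transformation (postcode -> city name)."""
    enriched = dict(common_record)
    enriched["city"] = ZIP_TO_CITY.get(common_record["postal_code"], "UNKNOWN")
    return enriched

if __name__ == "__main__":
    legacy_record = {"custno": "C-981", "zip": "2000", "amt": "149.95"}
    print(enrich(pay_system_adapter(legacy_record)))
```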
Figure 4: Scope of modernization for integrating Big Data (reverse engineering of the legacy software and legacy data yields extracted data, business logic and to-be requirements; newly generated requirements are added; forward engineering produces the modernized software, modernized data and their architectures; system integration and data administration then combine shared data and Big Data into the integrated enterprise application software)

For many use cases, Big Data needs to capture data that is continuously changing and unpredictable, and analyzing this data requires an architecture that supports such analyses. In retail, a good example is capturing the real-time inflow of customers with the intent of delivering store promotions. To track the effectiveness of displays and promotions, customer movements and behavior must be interactively explored with visualization and query tools. This shows that visualization and query tools must be part of the architecture, or must be used as an extension to the existing modernization approach.

In other cases, the analysis cannot be completed until the structured data is correlated with unstructured data. In the example of consumer sentiment analysis, capturing a positive or negative social media comment has some value, but associating it with the most or least profitable customer makes it far more valuable. So the capability needed with Big Data is context and understanding, and this must be represented in the Big Data architecture.

VI. CONCLUSION AND FUTURE WORK

With the advent of the technological change called Big Data, the way data is processed today is itself becoming legacy. The Big Data movement is fueling business transformation. Walmart is famous for its use of data to transform its business model, developing software that could track consumer behavior in real time from the bar codes read at Walmart's checkout counters [28]. Big Data refers to many sources and types of data, both structured and unstructured, and deals with the pace at which data flows in from sources such as business processes, machines, networks and human interaction with things like social media, Facebook, Twitter, web sites, mobile devices and cameras. The ending of technological support and the expected system lifetime make system modernization, and the integration of Big Data into legacy systems for decision making, mandatory. This paper has discussed the scope of integrating Big Data into the modernization of legacy systems by identifying what is required for this integration. Using reverse engineering, forward engineering, and system integration and data administration, Big Data needs to be integrated into legacy systems to make them evolvable. Legacy system modernization is an inevitable process due to software evolution. The scope of a modernization is not always fixed or well defined; software engineers typically strive for greater application agility so they can rapidly respond to business requests for change. Having established the scope of modernization for integrating Big Data, we will be developing a methodology for how this can be achieved.

REFERENCES
[1] J. Bisbal, D. Lawless, B. Wu, and J. Grimson, "Legacy Information Systems: Issues and Directions", IEEE Software, vol. 16, no. 5, pp. 103-111, 1999.
[2] J. Liebowitz, Big Data and Business Analytics, CRC Press, Boca Raton, 2013.
[3] B. Schmarzo, Big Data: Understanding How Data Powers Big Business, Wiley, Hoboken, 2013.
[4] J. Bloem, M. van Doorn, S. Duivestein, T. van Manen, E. van Ommeren, and S. Sachdeva, "No More Secrets with Big Data Analytics", The Sogeti Trend Lab VINT, 2013.
[5] M. Jha and L. O'Brien, "Comparison of Modernization Approaches: With and Without the Knowledge Based Software Reuse Process", The Second International Conference on Advances in Computer Science and Engineering (CSE 2013), Los Angeles, CA, USA, July 1-2, 2013.
[6] M. Jha and P. Maheshwari, "Reusing Code for Modernization of Legacy Systems", Proceedings of the IEEE Conference on Software Technology and Engineering Practice (STEP 2005), Budapest, Hungary, 24-25 September 2005.
[7] J. L. Hainaut, "Database Reverse Engineering", Doctoral Dissertation, University of Namur - Institut d'Informatique, Namur, Belgium, 1998.
[8] M. Jha and L. O'Brien, "Re-engineering Legacy Systems for Modernization: The Role of Software Reuse", The Second International Conference on Advances in Computer Sciences and Electronics Engineering, New Delhi, India, 23-24 February 2013.
[9] B. K. Kang and J. Bieman, "Using Design Abstractions to Visualize, Quantify, and Restructure Software", The Journal of Systems and Software, vol. 42, no. 2, pp. 172-187, 1998.
[10] A. van Deursen, P. Klint, and C. Verhoef, "Research Issues in Software Renovation", in Proceedings of Fundamental Approaches to Software Engineering (FASE'99), Berlin, 1999.
[11] M. Rahgozar and F. Oroumchian, "An Effective Strategy for Legacy System Evolution", Journal of Software Maintenance and Evolution: Research and Practice, vol. 15, no. 5, pp. 325-344, 2003.
[12] W. Schafer, R. Prieto-Diaz, and M. Masao, Historical Overview: Software Reusability, Ellis Horwood, 1994.
[13] J. Koskinen, J. J. Ahonen, and H. Sivula, "Software Modernization Decision Criteria: An Empirical Study", in Proceedings of the Ninth European Conference on Software Maintenance and Reengineering (CSMR'05), Washington, DC, USA, 2005.
[14] "Legacy Modernization Survey", Attachmate, September 11, 2009.
[15] B. Erricson-Connor, "Truth and Consequences", zJournal, pp. 38-43, August/September 2003.
[16] Z. Li, X. Anming, Z. Naiyue, H. Jianbin, and C. Zhong, "A SOA Modernization Method Based on Tollgate Model", in 2009 International Symposium on Information Engineering and Electronic Commerce, Ternopil, Ukraine, 16-17 May 2009.
[17] Z. Li, X. Anming, Z. Naiyue, H. Jianbin, and C. Zhong, "A SOA Modernization Method Based on Tollgate Model", in 2009 International Symposium on Information Engineering and Electronic Commerce, Ternopil, Ukraine, 16-17 May 2009.
[18] A. Fuhr, T. Horn, V. Riediger, and A. Winter, "Model-Driven Software Migration into Service-Oriented Architectures", Computer Science - Research and Development, vol. 28, no. 1, pp. 65-84, 2013.
[19] S. Comella-Dorda, K. Wallnau, R. Seacord, and J. Robert, "A Survey of Legacy System Modernization Approaches", Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, April 2000.
[20] N. Weiderman, L. Northrop, D. Smith, S. Tilley, and K. Wallnau, "Implications of Distributed Object Technology for Reengineering", Technical Report CMU/SEI-97-TR-005, Carnegie Mellon University, Pittsburgh, June 1997.
[21] P. Winsberg, "Legacy Code: Don't Bag It, Wrap It", Datamation, vol. 41, no. 9, pp. 36-41, 1995.
[22] H. M. Sneed, "Encapsulating Legacy Software for Use in Client/Server Systems", in 3rd Working Conference on Reverse Engineering, Monterey, California, USA, pp. 104-119, November 1996.
[23] N. Ganti and W. Brayman, Transition of Legacy Systems to a Distributed Architecture, John Wiley & Sons, 1995.
[24] M. L. Brodie and M. Stonebraker, Migrating Legacy Systems: Gateways, Interfaces & the Incremental Approach, Morgan Kaufmann, 2007.
[25] D. Aebi, "Data Re-engineering - A Case Study", in 1st East-European Symposium on Advances in Databases and Information Systems (ADBIS'97), St. Petersburg, Russia, September 1997.
[26] A. J. O'Callaghan, Practical Experiences of Object Technology, Cheltenham: Stanley Thornes in association with UNICOM, 1996.
[27] J. Bisbal, D. Lawless, B. Wu, and J. Grimson, "Legacy Information System Migration: A Brief Review of Problems, Solutions and Research Issues", Trinity College Dublin, Ireland, 1999.
[28] D. F. Carr, "Web-Enabling Legacy Data When Resources Are Tight", Internet World, August 10, 1998.
[29] R. Altman, Y. Natis, J. Hill, J. Klein, B. Lheureux, M. Pezzini, R. Schulte, and S. Varma, "Middleware: The Glue for Modern Applications", Gartner Group, Strategic Analysis Report, 26 July 1999.
[30] D. Eichmann, "Application Architectures for Web Based Data Access", in Proceedings of the Workshop on Web Access to Legacy Data, Fourth International WWW Conference, Boston, Massachusetts, USA, 11-14 December 1995.
[31] R. Perez-Castillo, I. Garcia-Rodriguez de Guzman, M. Piattini, and O. Avila-Garcia, "On the Use of ADM to Contextualize Data on Legacy Source Code for Software Modernization", in 16th Working Conference on Reverse Engineering, Antwerp, Belgium, 15-18 October 2008.
[32] OMG, ADM Task Force, 2007 (accessed 09/06/2009).
[33] P. Newcomb, "Architecture-Driven Modernization (ADM)", in Proceedings of the 12th Working Conference on Reverse Engineering (WCRE'05), Washington, DC, USA, 2005.
[34] OMG, Architecture-Driven Modernization (ADM): Knowledge Discovery Meta-Model (KDM), v1.1.
[35] G. Kotonya and J. Hutchinson, "A COTS-Based Approach for Evolving Legacy Systems", presented at the Sixth International Conference on Commercial-Off-The-Shelf (COTS)-Based Software Systems (ICCBSS'07), Alberta, Canada, February 26 - March 2, 2007.
[36] G. Kotonya and J. Hutchinson, "Managing Change in COTS-Based Systems", in Proceedings of the 21st IEEE International Conference on Software Maintenance (ICSM), Budapest, Hungary, 25-30 September 2005.
[37] J. M. Voas, "The Challenges of Using COTS Software in Component-Based Development", Computer, vol. 44, pp. 31-37, 1998.
[38] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", in Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2004), San Francisco, CA, December 2004.
[39] M. Jha, L. O'Brien, and P. Maheshwari, "Identify Issues and Concerns in Software Reuse", Journal of Information Processing, 2008.
[40] S. Latha and A. S. Thanamani, "Service Oriented Architecture - Technologies, Approaches for Integration and Automation of Legacy Systems in Heterogeneous Environments Using Reusability Techniques", Journal of Computing, vol. 2, no. 12, pp. 64-70, December 2010.
[41] J. L. Hainaut, "Database Reverse Engineering", Doctoral Dissertation, University of Namur - Institut d'Informatique, Namur, Belgium, 1998.
[42] A. Cimitile, A. De Lucia, G. A. Di Lucca, and A. R. Fasolino, "Identifying Objects in Legacy Systems", in 5th International Workshop on Program Comprehension (WPC '97), Dearborn, MI, USA, May 1997, pp. 138-147.
[43] M. Rahgozar and F. Oroumchian, "A Practical Approach for Modernization of Legacy Systems", in First EurAsian Conference on Advances in Information and Communication Technology (ICT 2002), Vienna, 2002, pp. 149-153.
[44] M. Jha, P. Maheshwari, and T. K. A. Phan, "Comparison of Four Architecture Reconstruction Toolkits", Technical Report UNSW-TR-0435, UNSW, 2004.
[45] J. W. Satzinger, R. B. Jackson, and S. D. Burd, Systems Analysis and Design in a Changing World, Sixth Edition, Course Technology, Cengage Learning, 2011.
[46] B. Gold-Bernstein and W. A. Ruh, Enterprise Integration: The Essential Guide to Integration Solutions, Addison Wesley, 2005, ISBN 0-321-22390-X.
[47] E. Lau, "Multi-channel Service Delivery", OECD e-Government Studies: e-Government for Better Government, Paris: OECD, p. 52, 2005.