SAS Global Forum 2007

Data Warehousing, Management and Quality

Paper 102-2007

Performance Driven Data Integration: An Optimized Approach

Vijay Venugopal, sanofi-aventis U.S., Bridgewater, NJ
Jim Fang, sanofi-aventis U.S., Bridgewater, NJ
Srinivas Bhat, sanofi-aventis U.S., Bridgewater, NJ

ABSTRACT
Characteristically, the data warehouse is a central repository of vast amounts of disparate data. The process of cleansing, extracting, transforming and organizing this data into an enterprise intelligence platform (data mart) presents unique challenges. An optimized approach to performance-driven data integration should eliminate redundancy and incorporate accuracy, timeliness, automation and quality. Such a process brings greater agility, better decision-making, and reduced cost and risk to insightful statistical analysis and reporting. This paper presents an optimized approach to creating SAS data sets directly from a traditional RDBMS data warehouse. The authors compare multiple data loading strategies: loading from flat files, loading from denormalized and normalized data sources, and loading from materialized and non-materialized views. Benchmark numbers and significant performance gains of up to 50% are highlighted in conjunction with these strategies. Further, the paper discusses the merits of performing QC on non-traditional data sources such as views. Lastly, several optimization techniques enabling rapid data transfers are suggested.

INTRODUCTION
Today's proliferating systems require organizations to improve their approaches to data warehousing, standardization, cleansing and integration before meaningful statistical analyses can be performed. These environments serve as an enterprise intelligence platform, bringing greater agility, better decision-making, and reduced cost and risk to insightful analysis and reporting; data should therefore be integrated efficiently, with attention to both time and space utilization and to data quality. Historically, integration was performed through a handshake of flat files exchanged between multiple data warehouses and data marts. This method exposed major inefficiencies:
1. Redundant data was maintained in multiple environments.
2. Performance suffered due to lengthy and often manual ETL processes.
3. An updated snapshot of the data was not always cascaded to the downstream systems.
4. Ultimately, data quality suffered from this daisy-chain effect.
When investigated, a few different strategies showed significant performance enhancements over the traditional data integration methodologies. Integration performed with views proved to be the most optimized approach to performance-driven data integration.

DON'T THINK YOU ARE ON THE RIGHT ROAD JUST BECAUSE IT'S A WELL-TRODDEN PATH
The paradigm for relational data in the data warehouse is to store data at a low level of granularity in third normal form. The data can then be denormalized and integrated with data marts, where users can depend on its accuracy and timeliness. Such an architecture serves as an enterprise intelligence platform bringing greater agility, better decision-making, and reduced cost and risk to insightful analysis and reporting. In stark contrast to data that exists solely in historical reports, which provide hindsight but limited insight, data in this environment is compatible with and translatable to other operations, chief among them data mining and perceptive analyses. Depending on how data integration is performed, enterprises invariably end up with data consolidated in the data warehouse and held redundantly in data marts. While this implicit two-step throughput model is the conventional approach to data integration, as shown in Figure 1, the question is whether the enterprise really needs both the redundant ETL development and the redundant data storage. Furthermore, conventional data integration relies on the handshake mechanism of extracting and exchanging flat files between the data warehouse and the data mart. This adds unnecessary complexity to the handshake process and also underscores the inflexibility of performing QC on the flat files. This was the impetus for researching an optimized approach to performance-driven data integration that eliminates redundancy and incorporates accuracy, timeliness, automation and quality.


Figure 1: Conventional Data Integration (source data and other sources in the data warehouse are extracted to flat files (.txt, CSV); the SAS data mart then cleanses, applies business rules, summarizes, QCs and transposes/pivots the data and creates flat files and reports for users and applications)

DON'T COMPLAIN…UNLESS YOU ARE WORKING ON CHANGING IT

ELIMINATING REDUNDANCY

In order for the analytical data marts to analyze the data available in the data warehouse, fact and dimension data was extracted and exchanged using flat files. This manual step was prone to errors. It also led to redundancy, with the same data existing in multiple places across the enterprise: the data warehouse, flat files and data marts. Change – the data extract and exchange mechanism needed to be streamlined. Goodbye flat files, hello views. The following code snippet demonstrates the data load.

/* Sample program using SAS/ACCESS to load data directly from the
   Oracle server into the SAS data mart */
libname prodserv oracle user=XXX pw=XXX path=111.111.111.111 schema=staging;

data lib1.output_data;
   set prodserv.input_data;
run;
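Because the library points directly at the warehouse, subsetting and column selection can also be specified on the SET statement and passed to Oracle rather than filtering the data after transfer. A minimal sketch, assuming hypothetical column names (region_cd and the tdol/tun measures):

/* Hedged sketch: hypothetical columns; the WHERE= and KEEP= data set options
   allow SAS/ACCESS to push the subsetting and column selection to Oracle,
   reducing the volume of data moved across the network */
data lib1.output_subset;
   set prodserv.input_data (keep=region_cd week_ending_date tdol01 tun01
                            where=(region_cd in (1, 2, 3)));
run;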

INCORPORATING ACCURACY AND PERFORMANCE

The entity-relationship data model in a traditional data warehouse puts the attributes of one entity into separate tables that are associated by duplicate key values in each table. Although this technique favors the data warehouse, it is detrimental to an analytical data mart. The data mart architecture usually comprises tables with shorter rows, which yield better performance than longer rows because disk I/O is performed in pages, not in rows. The shorter the rows of a table, the more rows fit on a page; the more rows per page, the fewer I/O operations it takes to read the table sequentially, and the more likely it is that a non-sequential access can be satisfied from a buffer. Change – the data warehouse, which stores data in third normal form, does not additionally have to denormalize it just to integrate with data marts. Data views can flexibly address the needs of the data mart, as depicted in Figure 2 below.
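On the warehouse side, such a view can be defined once and then read by the data mart through the same SAS/ACCESS library. A minimal sketch, assuming hypothetical table, column and view names (weekly_sales_fact, customer_dim, weekly_sales_v) and using SQL pass-through to create the view in Oracle:

/* Hedged sketch: a tall, normalized view joining fact and dimension data in
   the warehouse, so the data mart reads one source without any flat file */
proc sql;
   connect to oracle (user=XXX password=XXX path='111.111.111.111');
   execute (
      create or replace view staging.weekly_sales_v as
      select f.cust_id, d.region_cd, f.week_ending_date, f.sales_amt
      from   staging.weekly_sales_fact f,
             staging.customer_dim      d
      where  f.cust_id = d.cust_id
   ) by oracle;
   disconnect from oracle;
quit;

The data mart then loads prodserv.weekly_sales_v exactly as in the earlier snippet.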


ENSURING AUTOMATION AND QUALITY

Data integration across multiple systems presents automation challenges. Manual intervention is required to ascertain the quality of data extracted to flat files, and technological limitations further inhibit thorough and efficient QC of the data in these files. This often shifts the burden of QC to the downstream data marts, which then implement resource-intensive steps, late in the process, to ascertain the quality of the data. Change – the data warehouse can efficiently QC views before releasing data to the downstream data marts. Issue resolution is expedited, quality is ensured and data integration is performance driven (Figure 2).
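A minimal sketch of such an upstream check, reusing the hypothetical weekly_sales_v view and sales_amt column from the sketch above: row counts and a control total are taken directly on the warehouse view, so discrepancies surface before anything is released downstream.

/* Hedged sketch: row count and control total computed directly against the
   Oracle view through the SAS/ACCESS library */
proc sql;
   create table qc_view_totals as
   select count(*)      as row_count,
          sum(sales_amt) as tot_sales
   from   prodserv.weekly_sales_v;
quit;

proc print data=qc_view_totals;
   title 'QC control totals taken on the warehouse view';
run;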

Figure 2: Performance Driven Data Integration (source data and other sources in the data warehouse are exposed as database views; the SAS data mart pivots/transposes, summarizes, applies business rules and QCs the results for users, reports and applications)

TRUE GENIUS RESIDES IN THE CAPACITY FOR EVALUATION OF UNCERTAIN, HAZARDOUS, AND CONFLICTING INFORMATION
A basic lack of flexibility is at the heart of the weakness of data integration via flat files, and the benchmark numbers confirm the rigidity of this resource-intensive approach. The other techniques evaluated were:

LOADING FROM DE-NORMALIZED VS. NORMALIZED DATA VIEWS

Benchmark numbers indicate that using the power of an analytical data mart like SAS to transform and present data from diverse data sources in accordance with established business rules significantly outperformed having the data warehouse perform those functions. The exhibit below highlights the performance enhancement gained when the data warehouse provided normalized views and the data mart transposed and denormalized the data before it was loaded.

/* Sample program to denormalize weekly data from tall format to wide format */
data weekly_data;
   set lib1.weekly_data;
   length nweek 8;
   nweek   = input(week_ending_date, mmddyy10.);
   wending = put(input(week_ending_date, mmddyy10.), mmddyy10.);
   call symput("wending", wending);
run;

/* Sorting and removing duplicates */
proc sort nodupkey data=weekly_data(keep=nweek week_ending_date) out=uniqweek;
   by descending nweek;
run;

/* Sequencing the rows */
data uniqweek;
   set uniqweek(drop=week_ending_date);
   new_week = _N_;
run;

/* Sorting the rows in descending order */
proc sort data=weekly_data(drop=week_ending_date) out=weekly_data;
   by descending nweek;
run;

proc sort data=uniqweek out=uniqweek;
   by descending nweek;
run;

/* Merging and ordering the rows */
proc sql;
   create table out as
   select a.*, b.new_week
   from weekly_data a, uniqweek b
   where a.nweek = b.nweek
   order by id1, id2, id3;
quit;

/* Transposing the merged rows from tall format to wide format */
proc transpose data=out (rename=(new_week=week_ending_date))
               out=transnorm1(drop=labelvar)
               prefix=prefix name=namevar label=labelvar;
   by id1 id2 id3;
   id week_ending_date;
   var var1;
   idlabel week_ending_date;
run;
/* End of Program */

LOADING FROM MATERIALIZED VS. NON-MATERIALIZED DATA VIEWS

Benchmark numbers indicate that the overall time taken to integrate data through materialized views created in the data warehouse was equivalent to the overall time taken for integration via flat files: the creation of the materialized views offset the potential performance enhancement. Figure 3 below highlights the performance enhancement gained when the data warehouse provided non-materialized views.

Data warehouse data type         Records / cols    .txt extract (min)   FTP (min)   SAS load + QC (min)   Total .txt load (min)   View load (min)   .txt vs. tall view load (min)
Dimension: Customer Profile      2.6 Mil / 47       40.00                1.00        1.11                   42.11                   38.00              4.11
Dimension: Customer Address      13.5 Mil / 28      90.00                1.00        6.21                   97.21                   88.26              8.95
Product                          52 K / 39           1.30                1.00        0.07                    2.37                    0.31              2.06
Sales Fact 1 (Plantrak)          12.3 Mil / 102    195.00                1.00        9.49                  205.49                  170.00             35.49
Sales Fact 2 (Hospital Sales)    77 K / 217         10.00                1.00        0.10                   11.10                   32.20            -21.10

Materialized-view comparison (where measured):

Data warehouse data type         DW mat. view creation (min)   SAS mat. view load (min)   View vs. mat. view (min)
Dimension: Customer Profile      70.00                           8.16                      -40.16
Dimension: Customer Address      70.00                          21.32                       -3.06

The exhibit additionally reports DW tall materialized view vs. extract .txt timings of 94.00, 5.23, 111.49 and 5.87 minutes.

Figure 3: Performance Comparison
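For reference, a materialized view has to be built and stored in the warehouse before the data mart can read it, which is where the creation cost in the comparison above is incurred. A minimal sketch, assuming Oracle as the warehouse and reusing the hypothetical view definition from earlier, via SQL pass-through:

/* Hedged sketch: building a materialized view in Oracle; the build itself
   takes time in the warehouse before the SAS data mart can load from it */
proc sql;
   connect to oracle (user=XXX password=XXX path='111.111.111.111');
   execute (
      create materialized view staging.weekly_sales_mv
         build immediate
         refresh complete on demand
      as
      select f.cust_id, d.region_cd, f.week_ending_date, f.sales_amt
      from   staging.weekly_sales_fact f,
             staging.customer_dim      d
      where  f.cust_id = d.cust_id
   ) by oracle;
   disconnect from oracle;
quit;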


CONSOLIDATING QC UPSTREAM AND VERIFYING COUNTS USING PROC SUMMARY IN THE DATA MART VS. ROW COUNT VERIFICATION

Data integration is complete only when the quality of the data integrated across systems has been verified. Routine QC steps check row counts, but data integrated via normalized views renders this conventional process ineffective. The following code snippet uses PROC SUMMARY to QC the integrated data.

/* Sample program to QC the data set using PROC SUMMARY */
options mprint nosymbolgen nomlogic;

/* Macro which summarizes data across columns and rows */
%macro control_tot(mkt=, mkt_id=);
   /* Invoking PROC SUMMARY */
   proc summary data=libname.dataset_name nway;
      var tdol: tun:;
      output out=row_sum sum=;
   run;

   /* Summing columns */
   data column_row_sum;
      set row_sum;
      sm_tdol = sum(of tdol01-tdol106);
      sm_tun  = sum(of tun01-tun106);
   run;

   proc print data=column_row_sum;
   run;
%mend;

/* Invoking the macro */
%control_tot(mkt=market_code, mkt_id=4);

CONVENTIONAL ETL VS. MULTI-THREADED SAS: OPTIMIZATION TECHNIQUES ENABLE RAPID DATA TRANSFERS

Beginning with SAS 9.0, threaded reads enable jobs to complete in substantially less time than when each task is handled sequentially. Threaded reads divide resource-intensive tasks into multiple independent units of work and execute those units in parallel. We used the DBSLICE= data set option, which specifies user-supplied WHERE clauses to partition a DBMS query into component queries for threaded reads. We also used the DBSLICEPARM= data set option and the DBSLICEPARM= LIBNAME statement option, which control the scope of DBMS threaded reads and the number of threads.
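A minimal sketch of these options, assuming a hypothetical sales_fact table with a region_cd column; the partitioning clauses and thread count are illustrative, not the values used in our benchmarks:

/* Hedged sketch: DBSLICEPARM= on the LIBNAME enables threaded reads (here up
   to 4 threads); DBSLICE= on a specific table supplies the partitioning WHERE
   clauses and takes precedence for that read */
libname prodserv oracle user=XXX pw=XXX path=111.111.111.111 schema=staging
        dbsliceparm=(all, 4);

data lib1.sales_fact;
   set prodserv.sales_fact (dbslice=("region_cd <= 25"
                                     "region_cd between 26 and 50"
                                     "region_cd between 51 and 75"
                                     "region_cd > 75"));
run;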

OVERALL COMPARISON – MATERIALIZED VS. NON-MATERIALIZED, DE-NORMALIZED VS. NORMALIZED

Figure 4 below provides a comparison of the time and resource utilization across the multiple data integration techniques.

Figure 4: Comparison of time and resource utilization (Time and Resource Comparison Chart: load times in minutes for .txt load, view load, tall view load and materialized view load, plotted by subject area)


WHEN PERFORMANCE EXCEEDS AMBITION, THE OVERLAP IS CALLED SUCCESS
Below are highlights of the performance driven data integration approach.
1. Improved performance and reduced network traffic, due to the ability to pass database queries, joins and functions to the target data source for processing (see the sketch after this list).
2. Optimized reading and extraction of data from parallel DBMS servers with enhanced multi-threaded read support.
3. Faster load times with support for native bulk load utilities.
4. Space savings, since flat files no longer have to be created and maintained.
5. Elimination of manual steps such as extract / transform / load / FTP.
6. The power of SAS software to cleanse, transform, analyze and present data from diverse data sources in accordance with established business rules.
7. Reduced ramp-up time for new data sources, since the load script conforms to the source format, which does not have to change.
8. Robust QC and cleaner data, since QC is performed on data warehouse views before the data is loaded into the SAS data mart.
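As a hedged illustration of item 1, with hypothetical table and column names (sales_fact, customer_dim, cust_id, region_cd, sales_amt): when both tables are referenced through the same Oracle library, PROC SQL can pass the join and aggregation to the database, so only the summarized result crosses the network.

/* Hedged sketch: implicit SQL pass-through; the join, WHERE clause and
   GROUP BY can be processed by Oracle, and only the summarized result
   returns to the SAS data mart */
proc sql;
   create table lib1.sales_by_region as
   select d.region_cd,
          sum(f.sales_amt) as tot_sales
   from   prodserv.sales_fact   f,
          prodserv.customer_dim d
   where  f.cust_id = d.cust_id
   group by d.region_cd;
quit;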

CONCLUSION
Our investigation was performed on data sources ranging into the gigabytes (>10MM rows and >20 columns). A few different strategies showed significant performance enhancements over traditional data integration methodologies. Integration performed through views, with the data then de-normalized in the data mart, highlighted significant performance gains of up to 50% in comparison with loading from flat files or materialized views. This strategy also afforded space and time savings and consolidated QC upstream (by verifying counts using PROC SUMMARY in the data mart rather than conventional ETL row-count verification). Multi-threading and SAS optimization techniques further enabled rapid data transfers.


ACKNOWLEDGMENTS
The authors acknowledge the leadership team at sanofi-aventis for their encouragement while the investigation was underway and support in completing the paper.

RECOMMENDED READING
Delwiche, Lora D., and Susan J. Slaughter. 2000. The Little SAS® Book. Cary, NC: SAS Institute Inc.
Cody, Ron. 1999. Cody's Data Cleaning Techniques Using SAS® Software. Cary, NC: SAS Institute Inc.

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the authors at:

Vijay Venugopal
sanofi-aventis
55 Corporate Drive
Bridgewater, NJ 08807
Work Phone: (908) 981-6115
E-mail: [email protected]

Jim Fang
sanofi-aventis
55 Corporate Drive
Bridgewater, NJ 08807
Work Phone: (908) 981-6080
E-mail: [email protected]

Srinivas Bhat
sanofi-aventis
55 Corporate Drive
Bridgewater, NJ 08807
Work Phone: (908) 981-6087
E-mail: [email protected]

All results were obtained on a multi-user, multi-CPU HP UNIX server running SAS 9.1.3. Actual numbers may vary depending on server load.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
