Data Warehouse Architecture

Data Warehouse Architecture Lecture #3 IMRAN KHAN IBA

What & Why Architecture?
"An architecture is a set of rules to adhere to when building something"
- Because a data warehouse can become quite large and complex, using an architecture is essential for success
- Strict rules for how to architect a data warehouse do not exist
- Over the last 15 years a few common architectures have emerged

Imran Khan IBA-FCS City Campus

What & Why Architecture?
According to research conducted in 2006 by The Data Warehousing Institute (TDWI), there are five possible ways to architect a data warehouse:
1. Independent data marts: Each data mart is built and loaded individually; there is no common or shared metadata. This is also called a stovepipe solution.
2. Data mart bus: The Kimball solution with conformed dimensions.
3. Hub and spoke (corporate information factory): The Inmon solution with a centralized data warehouse and dependent data marts.
4. Centralized data warehouse: Similar to hub and spoke, but without the spokes; i.e. all end-user access is directly targeted at the data warehouse.
5. Federated: An architecture where multiple data marts or data warehouses already exist and are integrated afterwards. A common approach is to build a virtual data warehouse where all data still resides in the original source systems and is logically integrated using special software solutions.

The Big Debate: Inmon Versus Kimball
- In the beginning there were basically two approaches to modeling the data warehouse.
- Inmon popularized the term data warehouse
  - Strong proponent of a centralized and normalized approach
- Kimball took a different perspective with his Data Marts and Conformed Dimensions.

Differences between the Inmon and Kimball approach
1. Data warehouse versus data marts with conformed dimensions
2. Centralized approach versus iterative/decentralized approach
3. Normalized data model versus dimensional data model

Conceptual DW Architectures
- Direct data mart
  - Short term, quick results
- Architected, enterprise data warehouse
  - Long term foundation for future development

Data Mart
- A subset of data (from the data warehouse) designed to answer specific business questions
- Also called:
  - Departmental Data Warehouse (Silverston & Graziano)
  - Dimensional Data Warehouse (Kimball)

Direct Data Mart
[Diagram: Source 1, Source 2, and Source 3 are loaded through transformation routines (ETL) into the Sales, Financial, and Customer Service data marts.]

Direct Data Marts
- Pros:
  - Build individual data marts faster
  - Good for prototyping

Direct Data Marts
- Cons:
  - Requires redundant coding
    - Must transform each source multiple times
      - Once for each data mart
      - New data marts require a new transform for each source
      - New sources require multiple transformations
  - If business rules change, must change code in multiple routines
  - Increased number of routines may require more processing power
  - Multiple points of failure for ETL
    - Data marts can get out of sync

Architected Data Warehouse
- Core enterprise data warehouse design
  - Based on corporate (logical) data model
  - May include an operational data store (ODS)
- Specific departmental data marts
  - Based on business needs

Architected Data Warehouse
[Diagram: Source 1, Source 2, and Source 3 feed the Enterprise Data Warehouse, which in turn feeds the Sales, Financial, and Customer Service data marts.]

Architected Data Warehouse
- Pros:
  - Reduced long term maintenance
    - Complex source transformations occur once
      - From source to staging area (or ODS)
      - If business rules change, code changes are required in only one place
  - Reduces points of failure
  - Second set of ETL routines handles simple aggregations and data segmentation
    - Easier to create new data marts
  - Enterprise DW becomes source of historical data

Architected Data Warehouse
- Cons:
  - Requires more disk space
  - Requires 2 sets of ETL routines
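The trade-off between the two conceptual architectures can be made concrete by counting ETL routines. The sketch below is only an illustration under simplifying assumptions that are not in the slides: every data mart needs every source, the direct approach needs one routine per source and mart pair, and the architected approach needs one routine per source into the enterprise warehouse plus one routine per data mart out of it. The function names are hypothetical.

```python
def direct_routines(num_sources: int, num_marts: int) -> int:
    """Direct data marts: each source is transformed once for each data mart
    (assumes every mart draws on every source)."""
    return num_sources * num_marts


def architected_routines(num_sources: int, num_marts: int) -> int:
    """Architected warehouse: one routine per source into the enterprise
    warehouse, plus one simpler routine per data mart out of it (assumed)."""
    return num_sources + num_marts


if __name__ == "__main__":
    # Columns: sources, marts, direct routine count, architected routine count.
    for sources, marts in [(3, 3), (5, 4), (10, 8)]:
        print(sources, marts,
              direct_routines(sources, marts),
              architected_routines(sources, marts))
```

Under these assumptions the direct count grows with the product of sources and marts, while the architected count grows with their sum, which is one way to read the "reduced long term maintenance" claim against the cost of a second set of ETL routines.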

Corporate Information Factory
[Diagram: the Corporate Information Factory (CIF). Recoverable labels: Operational Systems (External, ERP, Internet, Legacy, Other) connected through APIs; Data Acquisition; CIF Data Management (Data Warehouse, Operational Data Store with TrI); Data Delivery to the Exploration Warehouse, Data Mining Warehouse, OLAP Data Mart, and Oper Mart (each with DSI); Information Workshop (Library & Toolbox, Workbench); Information Feedback; Systems Management, Meta Data Management, Data Acquisition Management, Operation & Administration, Service Management, and Change Management.]

Multi-Tiered Architecture
[Diagram: four tiers from left to right. Data Sources (operational DBs and other sources) are extracted, transformed, loaded, and refreshed into Data Storage (data warehouse, data marts, metadata, monitor & integrator), which serves the OLAP Engine (OLAP server), which in turn serves the Front-End Tools (analysis, query, reports, data mining).]

Source Data Component
- Production Data
  - Data comes from the various operational systems of the enterprise, e.g. financial systems, manufacturing systems, systems along the supply chain, and customer relationship management systems.
- Internal Data
  - "Private" spreadsheets, documents, customer profiles, and sometimes even departmental databases.
- Archived Data
  - Some data is archived after a year. Sometimes data is left in the operational system databases for as long as five years.
- External Data
  - For example, the data warehouse of a car rental company contains data on the current production schedules of the leading automobile manufacturers. This external data in the data warehouse helps the car rental company plan for its fleet management.

Data Staging Component
- Data Extraction
- Data Transformation
- Data Loading
- Data staging provides a place and an area with a set of functions to
  - Clean
  - Change
  - Combine
  - Convert
  - Deduplicate
  - Prepare source data for storage and use in the data warehouse.
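A minimal sketch of the staging functions listed above (clean, combine, convert, deduplicate). The record layout, field names, date format, and sample values are all hypothetical and chosen only for illustration; a real staging area would sit between actual source extracts and the warehouse load.

```python
from datetime import datetime


def clean(record: dict) -> dict:
    """Clean: trim whitespace and normalise the casing of the name."""
    return {**record, "name": record["name"].strip().title()}


def convert(record: dict) -> dict:
    """Convert: parse the date string and turn the amount into a float."""
    return {
        **record,
        "order_date": datetime.strptime(record["order_date"], "%d/%m/%Y").date(),
        "amount": float(record["amount"]),
    }


def deduplicate(records: list) -> list:
    """Deduplicate: keep the first record seen for each (customer, order) key."""
    seen = set()
    unique = []
    for r in records:
        key = (r["customer_id"], r["order_id"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def stage(*sources: list) -> list:
    """Combine all sources, then clean, convert, and deduplicate them."""
    combined = [r for source in sources for r in source]
    prepared = [convert(clean(r)) for r in combined]
    return deduplicate(prepared)


# Two hypothetical source extracts that overlap on one order.
source_1 = [{"customer_id": 1, "order_id": 10, "name": "  alice smith ",
             "order_date": "05/01/2024", "amount": "19.90"}]
source_2 = [{"customer_id": 1, "order_id": 10, "name": "ALICE SMITH",
             "order_date": "05/01/2024", "amount": "19.90"},
            {"customer_id": 2, "order_id": 11, "name": "bob jones",
             "order_date": "06/01/2024", "amount": "5"}]

staged = stage(source_1, source_2)
for row in staged:
    print(row)
```

Keeping each function small and separate mirrors the list on the slide; the order in which they run (combine first, deduplicate last) is a design choice of this sketch, not something the slides prescribe.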

Types of Metadata
- Operational metadata
- Extraction & Transformation metadata
- End-user metadata

Why is metadata especially important in a data warehouse?
- First, it acts as the glue that connects all parts of the data warehouse.
- Next, it provides information about the contents and structures to the developers.
- Finally, it opens the door to the end-users and makes the contents recognizable in their own terms.

OLAP Server Architectures
- Relational OLAP (ROLAP)
  - Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware to support missing pieces
  - Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  - Greater scalability
- Multidimensional OLAP (MOLAP)
  - Array-based multidimensional storage engine (sparse matrix techniques)
  - Fast indexing to pre-computed summarized data
- Hybrid OLAP (HOLAP)
  - User flexibility, e.g., low level: relational, high level: array
- Specialized SQL servers
  - Specialized support for SQL queries over star/snowflake schemas
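The difference between the ROLAP and MOLAP styles can be sketched in a few lines. This is a toy illustration, not how any particular OLAP server works: the fact rows and names are hypothetical, the "relational" side is just a scan over tuples, and the array is dense rather than using the sparse matrix techniques a real MOLAP engine would apply.

```python
# Hypothetical fact rows: (city_size, region, sales)
FACTS = [
    ("LARGE", "SOUTH", 40.0),
    ("LARGE", "WEST", 25.0),
    ("SMALL", "SOUTH", 10.0),
    ("SMALL", "WEST", 5.0),
    ("LARGE", "SOUTH", 20.0),
]


def rolap_total(facts, city_size):
    """ROLAP style: scan the relational rows and aggregate at query time."""
    return sum(sales for size, _region, sales in facts if size == city_size)


def build_molap_cube(facts):
    """MOLAP style: load cells into an array and pre-compute the summaries.

    The last row and last column of the array hold the totals.
    """
    sizes = sorted({f[0] for f in facts})
    regions = sorted({f[1] for f in facts})
    cube = [[0.0] * (len(regions) + 1) for _ in range(len(sizes) + 1)]
    for size, region, sales in facts:
        i, j = sizes.index(size), regions.index(region)
        cube[i][j] += sales     # individual cell
        cube[i][-1] += sales    # row total
        cube[-1][j] += sales    # column total
        cube[-1][-1] += sales   # grand total
    return sizes, regions, cube


sizes, regions, cube = build_molap_cube(FACTS)

# Same answer either way; the cube answers it with a single array lookup.
rolap_large = rolap_total(FACTS, "LARGE")
molap_large = cube[sizes.index("LARGE")][-1]
print(rolap_large, molap_large)
```

The cube pays its cost up front when it is built and then answers summary queries by indexing, which corresponds to the "fast indexing to pre-computed summarized data" point; the row scan does no preparation but touches every fact at query time.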

OLAP: On-Line Analytical Processing
- An environment for the analysis of multi-dimensional data
  - dice
  - rotate
  - drill-down
  - rollup
- OLAP provides advanced database support involving attribute selection, attribute encoding, row sampling, data cleansing and allows the use of multiple different search engines
  - Easy-to-use user interface
  - Open system architecture using local processing power

Roll-up, Drill-down, Slicing, Dicing

Roll-up (pop92)

         | state
         | NOR_EAS  NOR_CEN  SOUTH    WEST     Total
---------+---------------------------------------------
LAR_CITY |  3.62%    8.59%   15.68%   13.28%   41.17%
MED_CITY |  3.35%    5.36%    5.18%    7.02%   20.91%
SMA_CITY |  2.58%    5.66%    4.85%    5.16%   18.25%
SUP_CITY |  8.30%    3.54%    2.54%    5.29%   19.67%
---------+---------------------------------------------
Total    | 17.84%   23.15%   28.25%   30.75%  100.00%

Drill-Down (pop92)

      | state
      | E_N_CEN  E_SO_CE  MID_ATL  ...
------+--------------------------------
LAR_C |  5.46%    2.76%    2.09%   ...
MED_C |  3.84%    0.44%    1.38%   ...
SM_C  |  4.12%    0.92%    1.49%   ...
SUP_C |  3.54%    0.00%    8.30%   ...
------+--------------------------------
Total | 16.96%    4.12%   13.26%   ...

Dicing (pop92)

            | state
            | MID_ATL  NEW_ENG  NOR_EAS
------------+---------------------------
50000~60000 | 12.26%   13.69%   25.96%
60000~70000 | 10.93%    7.13%   18.05%
70000~80000 | 10.52%   14.83%   25.35%
80000~90000 |  4.89%    9.56%   14.45%
90000~99999 |  2.79%   13.40%   16.19%
------------+---------------------------
MED_CITY    | 41.39%   58.61%  100.00%

Slicing (pop92)

      | state
      | MID_ATL  NEW_ENG  NOR_EAS
------+---------------------------
LAR_C | 11.72%    8.56%   20.28%
MED_C |  7.76%   10.99%   18.75%
SM_C  |  8.34%    6.11%   14.45%
SUP_C | 46.52%    0.00%   46.52%
------+---------------------------
Total | 74.34%   25.66%  100.00%
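The four operations can be sketched on a tiny in-memory fact list. The rows, the S_ATL division code, and the population values below are hypothetical and are not the pop92 data behind the tables; the function names are also illustrative. Roll-up and drill-down are modelled as the same aggregation at a coarser or finer level of the region/division hierarchy, slicing fixes one dimension to a single value, and dicing restricts several dimensions to sets of values.

```python
from collections import defaultdict

# Hypothetical facts: (city_size, region, division, population)
FACTS = [
    ("LAR_CITY", "NOR_EAS", "MID_ATL", 300),
    ("LAR_CITY", "NOR_EAS", "NEW_ENG", 200),
    ("MED_CITY", "NOR_EAS", "MID_ATL", 100),
    ("MED_CITY", "SOUTH", "S_ATL", 150),
    ("LAR_CITY", "SOUTH", "S_ATL", 250),
]
COLUMNS = ("city_size", "region", "division", "population")


def aggregate(facts, *dims):
    """Sum population grouped by the given dimension names."""
    idx = [COLUMNS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)


def slice_(facts, dim, value):
    """Slice: fix one dimension to a single value."""
    i = COLUMNS.index(dim)
    return [row for row in facts if row[i] == value]


def dice(facts, **allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return [row for row in facts
            if all(row[COLUMNS.index(d)] in vals for d, vals in allowed.items())]


# Roll-up: aggregate at the coarser region level.
rollup = aggregate(FACTS, "city_size", "region")
# Drill-down: the same measure at the finer division level.
drilldown = aggregate(FACTS, "city_size", "region", "division")
# Slice: only the NOR_EAS region.
north_east = slice_(FACTS, "region", "NOR_EAS")
# Dice: medium cities in two chosen divisions.
med_ne = dice(FACTS, city_size={"MED_CITY"}, division={"MID_ATL", "NEW_ENG"})

print(rollup)
print(drilldown)
print(north_east)
print(med_ne)
```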

Data Warehouse Design Process
- Top-down, bottom-up approaches or a combination of both
  - Top-down: Starts with overall design and planning (mature)
  - Bottom-up: Starts with experiments and prototypes (rapid)
- From a software engineering point of view
  - Waterfall: structured and systematic analysis at each step before proceeding to the next
  - Spiral: rapid generation of increasingly functional systems, short turnaround time
- Typical data warehouse design process
  - Choose a business process to model, e.g., orders, invoices, etc.
  - Choose the grain (atomic level of data) of the business process
  - Choose the dimensions that will apply to each fact table record
  - Choose the measure that will populate each fact table record
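The four design choices in the typical process (business process, grain, dimensions, measures) map directly onto a star schema. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the table names, column names, and sample rows are hypothetical and chosen only to illustrate the steps for an "orders" process.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Step 3: the dimensions that apply to each fact table record.
    CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT);
    CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);

    -- Step 1: business process = orders.
    -- Step 2: grain = one row per order line.
    CREATE TABLE fact_order_line (
        date_key     INTEGER REFERENCES dim_date(date_key),
        product_key  INTEGER REFERENCES dim_product(product_key),
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        quantity     INTEGER,  -- Step 4: measures
        amount       REAL
    );
""")

# Hypothetical sample rows.
conn.execute("INSERT INTO dim_date VALUES (20240105, '2024-01-05')")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "Widget"), (2, "Gadget")])
conn.execute("INSERT INTO dim_customer VALUES (1, 'Alice')")
conn.executemany(
    "INSERT INTO fact_order_line VALUES (?, ?, ?, ?, ?)",
    [(20240105, 1, 1, 2, 20.0),
     (20240105, 2, 1, 1, 15.0),
     (20240105, 1, 1, 3, 30.0)],
)

# A typical dimensional query: join the fact table to a dimension and aggregate.
totals = conn.execute("""
    SELECT p.name, SUM(f.quantity), SUM(f.amount)
    FROM fact_order_line AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(totals)
```

Choosing the grain first matters because it fixes what one fact row means; the dimensions and measures then have to be valid at that level.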

A practical approach (blend of top-down & bottom-up)
The steps in this practical approach are as follows:
1. Plan and define requirements at the overall corporate level
2. Create a surrounding architecture for a complete warehouse
3. Conform and standardize the data content
4. Implement the data warehouse as a series of supermarts, one at a time