
Efficiency Evaluation of Data Warehouse Operations

Michael V. Mannino
The Business School, University of Colorado at Denver and Health Sciences Center

Sa Neung Hong
College of Business, Seoul City University

Injun Choi
Department of Industrial Engineering, Pohang Institute of Science and Technology

Abstract

We present an efficiency model for data warehouse operations and analyze a data set to evaluate the model. The model contains salient variables to evaluate the efficiency of an organization’s data warehouse operations for refresh processing and query production. The variables in the model include resource consumption (labor usage and computing budgets), system usage measures (users and queries), quality measures (data age and availability), and a size measure (change data amount). We report on the evaluation of a data set collected from USA-based and non-USA-based (mostly Korean) organizations using input-oriented Data Envelopment Analysis. The analysis indicates wide dispersions in reported data, large differences in labor budgets between efficient and inefficient firms, few organizations that are efficient in both refresh processing and query production, and difficulty in providing some data (computing budgets and flexibility ratio). Follow-up interviews with selected organizations provided insights about the value of efficiency comparisons of information technology organizations and suggestions to improve the model.

Keywords: Data warehouse, data envelopment analysis, data quality, operation efficiency

1. Introduction

Data warehouse, a term coined by William Inmon in 1990, refers to a central data repository where data from operational databases and other sources are integrated, cleaned, and archived to support decision-making. A data warehouse provides management with convenient access to large volumes of internal and external data. Because of the potential benefits, most medium to large organizations operate data warehouses. Many of these organizations have operated data warehouses for five years or more with continuing development to increase the size and scope of the data warehouses. As data warehouse technology and deployment mature, efficient operation becomes a priority. Due to complex data requests, large volumes of data, incompatibilities among data sources, and other complicating factors, operating a data warehouse may involve high costs for complex hardware/software architectures and significant labor support. To measure and improve efficiency, organizations should compare their efficiency in delivering information products with that of peer organizations.

In this paper, we present a model to evaluate the relative efficiency of organizations operating significant data warehouses. The model contains salient variables to evaluate the relative efficiency of IT organizations providing refresh processing and query production, two major operations for data warehouses. The variables in the model include traditional resource consumption (labor usage and computing budgets), system usage measures (users and queries), data quality measures (timeliness and availability), and a size measure (change data amount). To assess the model, we analyze a data set using Data Envelopment Analysis (DEA), comparing efficiencies for significant subsets of the model. In addition to the formal analysis, we also provide anecdotal evidence about the difficulty of evaluating the efficiency of information technology operations at the sub-firm level.
As far as we are aware, this paper proposes the first efficiency models for data warehouse operations along with analysis of a significant data set. DEA has been used to study information technology investment efficiency [Banker et al., 1990; Lin and Shao, 2000; Shafer and Byrd, 2000; Wang et al., 1997] and software production efficiency [Paradi et al. 1997] but not information technology operations. The results of this study have important implications for quantitative evaluation of IT service organizations, particularly those providing complex data products. The quantitative evaluation supports tradeoffs between quality levels and cost, extending evaluations that consider quality levels only.

2. Measuring Efficiency of Data Warehouse Operations

Our model emphasizes measuring the efficiency of an IT organization providing a significant information service, usually for internal usage. The observational unit is the IT organization, not the data warehouse. Although the ideal data warehouse architecture involves a single data warehouse, many IT organizations operate a small number of data warehouses rather than one large data warehouse. Thus, an important element of the efficiency model is to account for economies of scale involved in managing one or more data warehouses. In cases in which separate IT organizations manage data warehouses, each IT organization is a separate observation.

The efficiency model involves two major processes of data warehouse operation. Refresh processing involves the periodic extraction of change data from data sources (operational databases or external data sources), significant data transformations (cleaning, integrating, and standardizing), and loading the transformed data into a data warehouse. The query production process involves extracting data from a data warehouse, data marts, data cubes, or other storage structures to support user queries. The query production process includes both the monitoring and setup of query executions as well as help desk activities to support interpretation of data and formulation of queries. Since the focus of the efficiency model is operations, new software development to extend data warehouse capabilities is excluded from the model.

The efficiency model involves significant variables about the refresh and query production processes as shown in Table 1. All variables are measured over monthly periods except for data age and availability, which are easier to measure on a daily basis.

Table 1: Explanation of Efficiency Model Variables

| Variable | Meaning | Units of Measure |
|---|---|---|
| Labor budget (input) | Labor to support data warehouse operations | Monthly direct budget ($) |
| Computing budget (input) | Sum of software, hardware, and communication budgets to support data warehouse operations | Monthly budget ($) |
| Data age (output) | Indicates the daily refresh interval for the data warehouse | Weighted daily refresh interval in hours |
| Change data (output) | Amount of change data as extracted from data sources before transformation | GB per month |
| Availability (output) | Hours of service for user queries | Hours per day |
| Queries (output) | Number of data requests either directly through ad hoc queries or indirectly through execution of planned reports | Number of queries per month |
| Flexibility ratio (output) | Indicates the relative number of ad hoc queries to scheduled queries | Ratio of unplanned to planned queries per month |
| Users (output) | Users who log in to a data warehouse site at least once per month | Number of active users per month |
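Two of the outputs in Table 1 are derived rather than observed directly: data age is a weighted refresh interval, and the flexibility ratio divides unplanned by planned queries. The following is a minimal sketch of both calculations, not the procedure used in the study. The text does not state how the refresh batches are weighted, so weighting by change data volume is an assumption, and the names `RefreshBatch`, `weighted_data_age`, `inverse_data_age`, and `flexibility_ratio` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RefreshBatch:
    """One refresh process (hypothetical record, not taken from the paper)."""
    age_hours: float       # lag between recording an event and availability to users
    change_data_gb: float  # change data loaded by this batch


def weighted_data_age(batches):
    """Volume-weighted average data age in hours (the weighting scheme is an assumption)."""
    total_gb = sum(b.change_data_gb for b in batches)
    if total_gb <= 0:
        raise ValueError("total change data must be positive")
    return sum(b.age_hours * b.change_data_gb for b in batches) / total_gb


def inverse_data_age(age_hours):
    """Output form for DEA: a smaller data age yields a larger output value."""
    if age_hours <= 0:
        raise ValueError("data age must be positive")
    return 1.0 / age_hours


def flexibility_ratio(ad_hoc_queries, planned_queries):
    """Ratio of unplanned (ad hoc) to planned queries per month."""
    if planned_queries <= 0:
        raise ValueError("planned queries must be positive")
    return ad_hoc_queries / planned_queries


# Made-up values for illustration only; these are not the study's data.
batches = [RefreshBatch(age_hours=24.0, change_data_gb=30.0),
           RefreshBatch(age_hours=4.0, change_data_gb=10.0)]
age = weighted_data_age(batches)
print(age, inverse_data_age(age), flexibility_ratio(500, 2000))
```

With these made-up batches, the large nightly batch dominates the weighted age, which is the behavior a volume-weighted average is meant to capture.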

Some of the variables in the model require elaboration beyond Table 1. Data age is a measure of data staleness reflecting the time lag between the time an event is recorded in an operational database and the time it is available for data warehouse users. If there are multiple refresh processes, a weighted data age should be computed reflecting the data age of each batch of change data. In the DEA model, the inverse of data age is used instead of the raw data age because DEA models maximize outputs. Availability is used rather than the time to complete refresh processing because longer refresh processing does not necessarily imply unavailability. With specialized hardware architectures, refresh processing can occur while the warehouse (or at least part of the warehouse) is available for user queries. The flexibility ratio variable indicates the extent to which ad hoc queries are supported and used. Planned queries are easier to accommodate than ad hoc queries.

The variables in the efficiency model were determined as a result of a previous field study of data warehouse refreshment policies, careful review of the literature, and a pilot study to assess the feasibility of data collection. The pilot study indicated the difficulty of collecting variables for more detailed models. Initially, we had three different models with a number of additional variables. As a result of difficulties in data collection, we combined the different models into a single model and reduced the number of variables. Still, some variables remained difficult to estimate. The most difficult variable to estimate was the computing budget because organizations had shared computing centers and isolating data warehouse costs was difficult. In some cases, the number of queries was difficult to estimate because business intelligence tools may generate additional queries beyond the queries submitted directly by users.

In addition to studying the overall efficiency of data warehouse processing, we investigate the efficiency of three alternative models. Two alternative models involve refresh processing and query production separately. The input variables in both models remain the labor and computing budgets, but the output variables are changed to reflect the different orientation of refresh processing (data age, change data, and availability) and query production (queries, flexibility ratio, and users). Separating the two models supports identification of inefficiencies and improved understanding of data warehouse processing among the organizations in the study.
The fourth model, a reduced model, involves the most reliable input (labor budget) and the most essential outputs of refresh processing (change data) and query production (number of queries). Given the small sample size (42), the reduced efficiency model may provide more reliable results.

Efficiency was assessed using Data Envelopment Analysis (DEA) [Charnes et al. 1978; Banker et al. 1984], a linear programming technique for measuring the relative efficiency of decision making units (DMUs) with multiple inputs and outputs. Two major aspects of DEA usage are orientation (input or output) and returns to scale (constant or variable). We chose input-oriented DEA because the outputs of operating a data warehouse are often fixed by organizational constraints. Returns to scale refers to whether efficiency increases or decreases with size. We chose variable returns to scale (VRS) because it better suits the heterogeneous organizations in our study, which differ in size, industry, and data warehouse characteristics.
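For reference, the input-oriented DEA program with variable returns to scale can be written in its standard envelopment form from the DEA literature [Banker et al. 1984]. The paper does not print the program, so the notation below is introduced here as an assumption: $x_{ij}$ is input $i$ of DMU $j$, $y_{rj}$ is output $r$ of DMU $j$, $o$ indexes the DMU being evaluated, and $\lambda_j$ are the peer weights.

```latex
\[
\begin{aligned}
\min_{\theta,\ \lambda}\quad & \theta \\
\text{s.t.}\quad
& \sum_{j=1}^{n} \lambda_j x_{ij} \le \theta\, x_{io}, && i = 1, \dots, m \\
& \sum_{j=1}^{n} \lambda_j y_{rj} \ge y_{ro}, && r = 1, \dots, s \\
& \sum_{j=1}^{n} \lambda_j = 1 \\
& \lambda_j \ge 0, && j = 1, \dots, n
\end{aligned}
\]
```

The optimal $\theta$ is the efficiency score of DMU $o$: a value of 1 means no convex combination of peers produces at least the same outputs with proportionally smaller inputs. The constraint $\sum_j \lambda_j = 1$ is what imposes variable returns to scale; dropping it gives the constant-returns model. In this study $n = 42$, with $m = 2$ inputs and $s = 6$ outputs in the full model and $m = 1$, $s = 2$ in the reduced model.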

3. Analysis of Operational Efficiency

Data was solicited from three groups in the second half of 2005. The Center for Information Technology Innovation (CITI) at the University of Colorado at Denver is a group of Chief Information Officers from organizations with significant operations in the Denver, Colorado area. Surveys were sent to this group in June 2005 with a response rate of 10 out of 20 organizations. The BI Network (www.BEYE-Network.com) posted a survey announcement in its semi-monthly newsletter several times in the fourth quarter of 2005. Twelve usable responses were received as a result of the newsletter announcement; 15 other responses were missing information and were not usable. In Korea, responses were solicited from organizations providing banking, insurance, and financial services. Twenty-one responses were received.

To summarize the characteristics, we divide the organizations by region and industry as shown in Table 2. The Korean group is more homogeneous than the other two groups, with all organizations providing financial services, banking, or insurance. In the other two groups, there is no dominant industry. The USA-Europe group has somewhat larger organizations managing older and more data warehouses per organization as compared to the other two groups.

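Table 2: Summary of DMU Characteristics

| Region | Industry | Count | Employees | Revenue | DW Age | DWs | Tables | Size |
|---|---|---|---|---|---|---|---|---|
| USA-Europe | Aerospace | 1 | 10,000 | $10,000 | 5.0 | 1.0 | 800 | 1,000 |
| USA-Europe | Banking | 2 | 5,250 | $562 | 4.3 | 2.0 | 2,525 | 800 |
| USA-Europe | Comp. Services | 1 | 300 | $175 | 1.5 | 6.0 | 600 | 40,000 |
| USA-Europe | Energy | 1 | 300 | $15,000 | 1.5 | 5.0 | 100 | 500 |
| USA-Europe | Fin. Services | 4 | 7,575 | $3,000 | 5.0 | 4.3 | 8,444 | 2,931 |
| USA-Europe | Government | 3 | 5,933 | $558 | 2.7 | 8.3 | 180 | 1,593 |
| USA-Europe | Health Care | 1 | 3,000 | $375 | 5.0 | 1.0 | 600 | 75 |
| USA-Europe | Manufacturing | 2 | 8,750 | $9,000 | 4.3 | 8.0 | 2,525 | 5,555 |
| USA-Europe | Retail | 2 | 10,000 | $3,000 | 5.0 | 3.0 | 140 | 1,550 |
| USA-Europe | Subtotal | 17 | 6,453 | $3,785 | 4.0 | 4.8 | 2,753 | 4,311 |
| Korea | Banking | 10 | 4,780 | $1,129 | 3.8 | 1.5 | 4,314 | 4,076 |
| Korea | Fin. Services | 4 | 4,125 | $1,694 | 4.1 | 1.0 | 3,024 | 6,250 |
| Korea | Insurance | 7 | 3,579 | $1,650 | 3.5 | 1.3 | 1,320 | 7,588 |
| Korea | Subtotal | 21 | 4,255 | $1,410 | 3.8 | 1.3 | 3,070 | 5,661 |
| Other | Banking | 1 | 3,000 | $750 | 5.0 | 1.0 | 400 | 800 |
| Other | Consulting | 1 | 100 | $25 | 5.0 | 3.0 | 500 | 5,000 |
| Other | Pack. Goods | 1 | 3,000 | $3,000 | 5.0 | 1.0 | 350 | 600 |
| Other | Telecom | 1 | 3,000 | $7,500 | 1.5 | 1.0 | 150 | 24,000 |
| Other | Subtotal | 4 | 2,275 | $2,819 | 4.1 | 1.5 | 350 | 7,600 |
| Total | | 42 | 4,956 | $2,506 | 3.9 | 2.7 | 2,683 | 5,299 |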

At an aggregate level, the efficiency scores for the four models show a sharp distinction between efficient and inefficient DMUs, as seen in Table 3. The full model has the most efficient DMUs, as expected because of the size of the model. The reduced model contains few efficient DMUs and a large number of highly inefficient DMUs. In all models, few DMUs fall between the efficient and very inefficient rankings in Table 3. The Korea-Other subset dominates the USA-Europe subset, with a higher proportion of efficient DMUs in all four models.

Table 3: Efficiency Rankings by Model and Region (see footnote 1)

| Ranking | Full | Refresh | Query | Reduced |
|---|---|---|---|---|
| Efficient (= 1) | UE:6, KO:12 | UE:3, KO:11 | UE:5, KO:8 | UE:1, KO:3 |
| Near efficient (> 0.90) | UE:0, KO:2 | UE:0, KO:2 | UE:0, KO:1 | UE:0, KO:0 |
| Somewhat inefficient (0.75 to 0.90) | UE:0, KO:1 | UE:0, KO:1 | UE:0, KO:1 | UE:0, KO:1 |
| Inefficient (0.50 to 0.75) | UE:2, KO:3 | UE:1, KO:1 | UE:2, KO:3 | UE:2, KO:4 |
| Very inefficient (< 0.50) | UE:8, KO:8 | UE:12, KO:11 | UE:9, KO:13 | UE:13, KO:18 |
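The ranking bands of Table 3 can be applied to a list of DEA scores with a few comparisons. The sketch below is illustrative only: the table does not say which band a score of exactly 0.90, 0.75, or 0.50 belongs to, so the boundary handling is an assumption, as is the small tolerance used to treat solver output such as 0.9999999999 as efficient. The function names `ranking` and `tabulate` are hypothetical.

```python
from collections import Counter


def ranking(score, tol=1e-9):
    """Map a DEA efficiency score in (0, 1] to a Table 3 ranking band."""
    if not 0.0 < score <= 1.0 + tol:
        raise ValueError("an input-oriented DEA score must lie in (0, 1]")
    if score >= 1.0 - tol:
        return "Efficient"
    if score > 0.90:
        return "Near efficient"
    if score >= 0.75:   # assumption: 0.75 and 0.90 both fall in this band
        return "Somewhat inefficient"
    if score >= 0.50:   # assumption: 0.50 falls in this band
        return "Inefficient"
    return "Very inefficient"


def tabulate(scores_by_region):
    """Count DMUs per (ranking, region), mirroring the cell layout of Table 3."""
    counts = Counter()
    for region, scores in scores_by_region.items():
        for score in scores:
            counts[(ranking(score), region)] += 1
    return counts


# Made-up scores for illustration only; these are not the study's data.
example = {"UE": [1.0, 0.42, 0.61], "KO": [1.0, 0.93, 0.80, 0.30]}
for (band, region), n in sorted(tabulate(example).items()):
    print(band, region, n)
```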

Statistics on model variables provide another perspective about differences between efficient and inefficient DMUs, as shown in Table 4. In the full and query production models, efficient DMUs have larger average inputs but much larger output values than inefficient DMUs. For example, in the full model, the average labor budget and computing budget of efficient DMUs are much larger than those of inefficient DMUs, but the number of users and number of queries of efficient DMUs are also much larger. For the refresh and reduced models, the pattern is reversed in the input values. In the output variables, the efficient DMUs dominate with much larger amounts of change data.

Footnote 1 (Table 3): Each cell contains the count of USA-Europe DMUs (UE) and Korea-Other DMUs (KO).

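Table 4: Statistics by Model and Efficiency Status (see footnote 2)

LB = labor budget, CB = computing budget, DA = data age, Av = availability, CD = change data, NQ = number of queries, FR = flexibility ratio, NU = number of users; (I) = input, (O) = output. Blank cells are variables not included in the model.

| Model-Stat | LB (I) | CB (I) | DA (O) | Av (O) | CD (O) | NQ (O) | FR (O) | NU (O) |
|---|---|---|---|---|---|---|---|---|
| Full-Eff Average | $229,409 | $403,531 | 53.1 | 20.9 | 3,055 | 476,324 | 167.0 | 1,343.3 |
| Full-Eff Std. Dev. | 901,502 | 1,505,871 | 91.1 | 3.1 | 10,828 | 1,092,538 | 445.6 | 1,935.6 |
| Full-Ineff Average | $110,747 | $92,562 | 74.2 | 18.1 | 194 | 120,922 | 17.3 | 408.9 |
| Full-Ineff Std. Dev. | 161,606 | 203,056 | 160.6 | 5.2 | 238 | 229,934 | 45.8 | 597.6 |
| Refresh-Eff Average | $19,480 | $54,695 | 60.6 | 21.6 | 3,759 | | | |
| Refresh-Eff Std. Dev. | 14,752 | 84,908 | 100.2 | 2.8 | 12,003 | | | |
| Refresh-Ineff Average | $258,190 | $355,072 | 66.3 | 18.1 | 5 | | | |
| Refresh-Ineff Std. Dev. | 794,412 | 1,333,937 | 148.9 | 4.9 | 201 | | | |
| Query-Eff Average | $300,530 | $523,075 | | | | 607,292 | 221.7 | 1,715.7 |
| Query-Eff Std. Dev. | 1,031,192 | 1,722,217 | | | | 1,230,671 | 502.8 | 2,105.0 |
| Query-Ineff Average | $93,210 | $83,736 | | | | 113,977 | 14.6 | 375.0 |
| Query-Ineff Std. Dev. | 150,479 | 184,731 | | | | 218,937 | 41.7 | 549.1 |
| Reduced-Eff Average | $25,416 | | | | 10,912 | 1,436,194 | | |
| Reduced-Eff Std. Dev. | 31,986 | | | | 19,601 | 1,784,396 | | |
| Reduced-Ineff Average | $186,420 | | | | 292.5 | 135,292 | | |
| Reduced-Ineff Std. Dev. | 674,976 | | | | 624.0 | 284,009 | | |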
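The Eff and Ineff rows of Table 4 split the DMUs at an efficiency score of 0.90 and report an average and a standard deviation per variable. A minimal sketch of that grouping follows. The paper does not say whether the sample or the population standard deviation was used, so the sample form (`statistics.stdev`) is an assumption, and the function name `split_stats` is hypothetical.

```python
from statistics import mean, stdev


def split_stats(scores, values, threshold=0.90):
    """Average and standard deviation of one variable for Eff and Ineff DMUs.

    A DMU is labeled Eff when its score is at least `threshold`, else Ineff.
    """
    groups = {"Eff": [], "Ineff": []}
    for score, value in zip(scores, values):
        groups["Eff" if score >= threshold else "Ineff"].append(value)
    stats = {}
    for label, vals in groups.items():
        avg = mean(vals) if vals else None
        sd = stdev(vals) if len(vals) > 1 else None  # sample std. dev. (assumption)
        stats[label] = (avg, sd)
    return stats


# Made-up scores and labor budgets for illustration only; not the study's data.
scores = [1.0, 0.95, 0.40, 0.60, 0.30]
labor_budget = [20000.0, 30000.0, 100000.0, 200000.0, 300000.0]
print(split_stats(scores, labor_budget))
```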

To obtain insights about the utility of the study, we provided feedback to each participating organization and invited qualitative feedback. Based on responses from eight organizations in the study, we have the following insights:

• No respondents measure efficiency for internal data warehouse operations or other IT operations. All of the respondents use some measures that are monitored over time for trends.
• All respondents expressed interest in having peer comparisons about operational efficiency. The respondents indicated that efficiency comparisons would be most useful if linked to data already provided to IT consulting firms. Efficiency comparison trends would be more useful than single period scores.
• The respondents indicated that peer selection should be based on more than industry and firm characteristics. Peer selection should include the architecture and tool sets used in data warehouse operations.
• Respondents had different interests in the efficiency models. Some organizations were more interested in the query production side as it is more aligned with user satisfaction. Others were more focused on refresh processing to ensure reasonable operations in comparison to peers.
• Low efficiency scores were received positively when the reasons for the scores were explained. For example, one organization had a low score because of the flexibility ratio. The organization was pleased that its flexibility ratio was much lower than that of its peers because its goal was to standardize query production by eliminating ad hoc queries. Two organizations reported that they do not track some important kinds of ad hoc queries. The study results reminded these organizations of the need to track ad hoc queries more carefully.

Footnote 2 (Table 4): The Eff label includes DMUs with an efficiency score greater than or equal to 0.90. The Ineff label includes all DMUs with a DEA efficiency score less than 0.90.

5. Conclusion

We presented an efficiency model for evaluating data warehouse operations and evaluated the relative efficiencies of a preliminary set of data warehouses. The efficiency model supports evaluation of IT organizations with significant data warehouse operations for refresh and query production processes. The variables in the models include traditional resource consumption (labor usage and computing budgets), system usage measures (number of queries, number of users, and flexibility ratio), data quality measures (data age and availability), and a size measure (change data amounts). We analyzed the refresh efficiency model for an international data set using input-oriented DEA with variable returns to scale. This work is part of our continuing effort to quantitatively evaluate the organizational performance of data warehouse operations. Future work involves linking IT organizational efficiency to balanced scorecard reporting and working with a commercial firm to collect a larger and more standardized set of data.

References

Banker, R., Charnes, A., and Cooper, W. “Some models for estimating technical and scale inefficiencies in DEA,” Management Science 30 (9), 1984, 1078-1092.
Banker, R., Kauffman, R., and Morey, R. “Measuring Gains in Operational Efficiency from Information Technology: A Study of Positran Deployment at Hardee’s Inc.,” JMIS 7 (2), 1990, 29-54.
Charnes, A., Cooper, W., and Rhodes, E. “Measuring the efficiency of decision making units,” European Journal of Operational Research 2, 1978, 429-444.
Lin, W. and Shao, B. “Relative Sizes of Information Technology Investments and Productive Efficiency: Their Linkage and Empirical Evidence,” Journal of the AIS 1, 2000.
Paradi, J.C., Reese, D.N., and Rosen, D. “Applications of DEA to measure efficiency of software production at two large Canadian banks,” Annals of Operations Research 73, 1997, 91-115.
Shafer, S. and Byrd, T. “A Framework for Measuring the Efficiency of Organizational Investments in Information Technology using Data Envelopment Analysis,” Omega 28, 2000, 125-141.
Wang, C., Gopal, R., and Zionts, S. “Use of Data Envelopment Analysis in Assessing Information Technology Impact on Firm Performance,” Annals of Operations Research 73, 1997, 191-213.