Expert Systems with Applications 38 (2011) 640–651
Establishing expert system for prediction based on the project-oriented data warehouse
Jui-Sheng Chou a,*, Hsien-Cheng Tseng b
a Department of Construction Engineering, National Taiwan University of Science and Technology, 43 Sec. 4, Keelung Rd., Taipei 106, Taiwan
b Yulon Nissan Motor Co., Ltd, Taiwan
Keywords: Data warehouse; Multivariate analysis; Prediction; Expert system; Project management

Abstract
Many studies have generated cost estimating relationships (CERs) for transportation projects via data analysis. Some studies collected data from databases, while others sourced data from conventional paper-based formats. When cost data were not in a consistent format, many studies failed to discuss the streamlining of pattern recognition, ranging from generating a problem statement, data warehouse and prediction modeling to information management. This study adopts a standard procedure for identifying CERs for transportation projects. For the proposed dimensional data warehouse, a pavement maintenance and rehabilitation project was selected as a case study for extracting data and concealed prediction rules. Linear and log-linear statistical approaches were adopted to create the most advantageous models, defined by their explanatory power and mean absolute prediction error. The resulting favorable estimation models created from the proposed cost data warehouse were integrated into an expert system to facilitate information management and generate preliminary budgets for transportation agencies. © 2010 Elsevier Ltd. All rights reserved.
1. Introduction
Construction of public infrastructure is critical in helping nations increase their global competitiveness and regional economic development. When the density of a transportation network reaches saturation, few new roadways are constructed, and in-service infrastructure must be maintained and rehabilitated to keep roadways safe and operating at a desirable level of service. The Michigan State Transportation Commission neglected road and infrastructure maintenance and repair for many years and, consequently, road repair costs exploded, costing the commission US $3 billion annually for the period 2004–2007 (ACP, 2008). Regular maintenance of roadways and bridges ensures the safety and efficiency of transportation infrastructure. Maintenance operations account for a major portion of activities performed by highway agencies. As Taiwan’s roadway system ages, the Taiwan Ministry of Transportation and Communications (MOTC) must be proactive in generating the financial budgets necessary for maintenance operations. This study presents a novel cost data warehouse system for maintenance projects and establishes cost predictive models for an automated cost-estimating system for transportation agencies.
* Corresponding author. Tel.: +886 2 2737 6321; fax: +886 2 2737 6606. E-mail addresses: [email protected], [email protected] (J.-S. Chou). 0957-4174/$ - see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2010.07.015
Maintenance operations have become increasingly complex due to rapid traffic growth, funding limitations, and a shortage of skilled workers such as experienced estimators. Government agencies have difficulty hiring and retaining qualified employees. To address these challenges, agencies urgently need to document cost-estimation practices, examine maintenance policies, and develop innovative budgeting schemes that improve preliminary cost estimates and control project costs. Forecasting project costs more accurately is a priority concern for the MOTC as it faces pressure from cost-conscious legislators, tight budgets, new financing arrangements, and increased workload. With growing emphasis placed on taxpayer awareness of government image, the MOTC requires reliable cost-estimation systems that provide quality and safety assurances to local residents and communities. This research proposes an effective framework that establishes the processes required for daily operations in the data warehouse, and transforms the wealth of data and experience of transportation agencies into effective models.

2. Research background and objectives
Statistical modeling of cost prediction has been developed and applied particularly to transportation construction projects over the last few decades (Bell & Bozai, 1987; Chengalur-Smith, Ballou, & Pazer, 1997; Chou et al., 2006; Federle & Pigneri, 1993; Harbuck, 2002; Herbsman, 1983; Saito, Sinha, & Anderson, 1991; Sanders,
Maxwell, & Glagola, 1992; Sthapit & Mori, 1994; Williams, 2005; Wilmot & Cheng, 2003). Empirical results derived from implementing this approach indicate that the multiple regression technique has great potential in forecasting construction costs based on historical data, but is imperfect due to insufficient knowledge of the statistical modeling process and incomplete data sources available to practitioners. This study presents a clear data mining approach evolved from a collection of paper-based raw data and constructs a project data warehouse for successive project cost prediction to improve information management. According to MOTC statistics for 2006, routine highway maintenance and subgrade treatment have been applied to over 350 million square meters annually (total highway length, 20,000 km). As roadway density in Taiwan is higher than that in other countries, maintenance operations have become extremely important to the MOTC. Routine maintenance of highways and bridges encompasses resurfacing, widening culverts and guard fences, stabilizing slopes, and improving landscaping within the existing right of way. These tasks are carried out to provide comfortable and safe driving conditions for road users. Various budget problems have plagued successful projects. Ways to get around an infeasible estimate include redesigning or re-scoping a project, requesting additional funding, allowing contractors to bid for alternate projects, or rescheduling a project. Poor and inaccurate estimates typically hinder highway maintenance operations. Such estimates can be due to a lack of information, inexperienced personnel, arbitrary changes made by managers in the estimation process, project delays, and design errors. One frequently unrecognized consequence of underestimating a budget in the preliminary project phase is that the benefit to cost ratio of the project is overstated, which can result in poor project feasibility as other options are overlooked. 
Additionally, inaccurate estimates expose managers to further scrutiny from both the public and politicians, potentially affecting investments needed to solve transportation problems. To ensure that the estimation process for highway maintenance operations flows smoothly without compromising accuracy, accessibility of project information to generate reliable cost estimates is an important managerial issue. Hence, this study develops prediction models for the early stages of a project with readily available parameters to facilitate estimation procedures with a reference estimate. Due to the nature and uncertainty in construction projects, available information during the early stages is typically scarce. Estimators frequently refer to similar projects, apply guesswork and rely on personal experience when generating preliminary cost estimates. These practices often cause significant differences between a proposed budget and actual bid price and, therefore, result in delays. Furthermore, the missed estimate may impact the effectiveness of resource allocations and reduce the quality of the infrastructure built. Additionally, based on site visits and interviews, this study determined that most project data was stored in physical archives in paper formats. Therefore, this study presents a novel customized dimensional data warehouse management system that accommodates paper-based cost records and provides an implicit knowledge base for subsequent data mining to extract non-trivial patterns and engineering experiences. This study also presents a modeling technique for comparing linear and log-linear predictive models with a set of explanatory and response variables. 
To establish a reliable method as the basis of project estimates and provide a paper trail of documentation for feedback and continuous improvement of the developed database management system and models, the research tasks are as follows: (1) elucidate development flow of a customized data warehouse for storing paper-based documents in efficient and reusable ways; (2) improve estimate accuracy in the early stages of a project by generalizing
predictive models from the data mart in the data warehouse using parametric modeling techniques; and (3) implement an estimation system by integrating graphical user interfaces and embedded models to assist transportation agencies in generating budgets early. The methodologies presented here can be bundled as a systematic approach for similar projects. The rest of this paper is organized as follows. Section 3 presents a literature review of techniques utilized in project cost estimation in the construction industry. Section 4 then discusses the research methodology employed. Section 5 offers the architecture of the dimensional data warehouse. Next, Section 6 describes the parametric modeling process and exhibits analytical and validation results. Section 7 implements the cost-estimation system based on the developed models. Conclusions are finally drawn in Section 8, along with recommendations for future research.
3. Literature review
Accurate and timely estimation of project costs is vital to successful project delivery. Estimate accuracy varies widely in the beginning stages of a project. The differences should be gradually narrowed down as a project progresses. If the initial gap between estimated and actual costs is reduced during the initiation stage, a project should proceed smoothly as additional input information is acquired. Many cost estimation methods have been developed to increase the accuracy of project budgeting. Existing cost estimation methods can be categorized as analogous cost estimating, bottom-up estimating, computerized tools and artificial intelligence, and parametric modeling (An, Kim, & Kang, 2007; Anderson, Molenaar, & Schexnayder, 2007; Ben-Arieh & Qian, 2003; Emsley et al., 2002; Hegazy & Ayed, 1998; Kwak & Watson, 2005; Lowe, Emsley, & Harding, 2006; Niazi et al., 2006; PMI, 2004). The tools and techniques in these methods can be used for appropriate project phases (Table 1).
Analogous cost estimating (ACE) (AACE, 1997), also referred to as top-down estimating, uses the values of parameters such as project length, size, duration, cost, and construction complexity from other similar projects as the basis for estimating the cost of a new project. This approach is typically utilized to generate estimates early in a project’s lifecycle when little information is available. However, this method requires sufficient knowledge and expert judgment (PMI, 2004). ACE is relatively less costly but more time consuming and less accurate. For example, even when activities in two projects are similar, different backgrounds and environments, including the technology, materials, or equipment used, can easily result in estimates that differ considerably. Bottom-up estimates facilitate evaluation of project cost by decomposing work into manageable tasks, operations, or activities (PMI, 2004).
Conversely, generic estimation techniques, such as activity-based costing (ABC), calculate costs incurred for activities associated with manufacturing a product (Niazi et al., 2006). Ben-Arieh and Qian (2003) presented a methodology using ABC to evaluate design and development costs for machined parts produced in a controlled manufacturing facility. Computerized tools and artificial intelligence have an innovative role in cost estimating processes; however, considerable effort is required to attain a certain level of effectiveness by validating computing outcomes. Artificial neural networks (ANNs) are a feasible alternative for early cost prediction based on their ability to model complex systems given a minimal amount of data (Al-Tabtabai, Alex, & Tantash, 1999; Emsley et al., 2002; Hegazy & Ayed, 1998). Another common form of artificial intelligence is case-based reasoning (CBR), which is based on the characterization of projects using a number of essential attributes. These attributes are then utilized with weighted values to match similar cases (Mendes,
Table 1. Appropriate cost estimates at various project stages.
Project stages (columns): Preliminary stage; Planning; Preliminary design; Detailed design; Contracting and construction.
Primary estimating techniques (rows):
1. Analogous cost estimating: Expert judgment
2. Bottom-up estimates: Activity-based costing (ABC)
3. Computer-aided tools and artificial intelligence: Case-based reasoning (CBR); Artificial neural networks (ANNs); Simulation for cost estimation; Customized cost-estimating system
4. Parametric modeling: Parametric estimating; Feature-based costing (FBC)
Mosley, & Counsell, 2002). Notably, CBR requires a series of systematic procedures for extracting relevant knowledge from experience, integrating a case into an established knowledge structure, indexing the case for subsequent matching, adapting a new case to similar cases, and saving this information in a knowledge base (Toussaint & Cheng, 2006). Numerous studies have identified CBR as a particularly appropriate application in the conceptual stage of domain knowledge (Chantaraskul & Cuthbert, 2005; Gang, 2005; Thomasson, Ratcliffe, & Thomas, 2006). Some researchers determined that cost estimates should be represented as a range rather than as a single value. In this sense, Monte Carlo simulation can be employed to design a probabilistic cost estimation model under various project conditions (Diekmann, 1983; Touran & Suphot, 1997). A range of cost estimates can then be developed for various scenarios; however, this requires a greater amount of statistical data and complex mathematical algorithms compared with other methods. The parametric method estimates construction cost based on such parameters as project square footage or, say, number of beds in a hospital. This technique uses validated relationships between known technical and cost characteristics of a project obtained from historical data. Many studies have investigated the non-linear relationships between cost and project characteristics and have attained high levels of accuracy; however, the estimation accuracy depends on the quality of underlying data and sophisticated statistical techniques employed to construct the model (Harbuck et al., 2002; Kim, An, & Kang, 2004; Lowe et al., 2006; Phaobunjong & Popescu, 2003; Trost & Oberlender, 2003; Yu, 2006). 
A national cooperative research report indicated that most US departments of transportation (DOTs) used two parameters, i.e., number of lanes and project length, to estimate highway project cost during early project stages (Anderson et al., 2007; Chou et al., 2006; Schexnayder, Weber, & Fiori, 2003). This approach is known as the lane-mile approach. Chou et al. (2006) proposed a quantity-based approach to estimate highway project costs. The advantages of this approach include segregation of unit price from an estimation in the initial stage, which reduces uncertainties due to market conditions and time inflation, and early introduction of semi-detailed quantity estimates to continuously track quantity changes as the project proceeds to subsequent phases.
4. Research methodology
Transportation agencies typically generate large amounts of detailed cost data over a project lifecycle, and these data are generally distributed across functional systems or saved in conventional paper-based formats. With timely access to and reuse of data for previous
projects, one can create useful rules and apply these rules to future projects and decision-making processes (Mallach, 2000; Turban & Aronson, 2005). The knowledge discovered from historical data is considered paramount intangible business intelligence (BI) for an enterprise. The Data Warehouse Institute defines BI as processes, tools, and technologies required to transform data into information, and then into knowledge and effective business plans (Rujirayanyonga & Shi, 2006). A data warehouse is defined as a subject-oriented, integrated, non-volatile, and time-variant collection of relevant data that ensures that historical data is consistent and easy to retrieve. A data warehouse also facilitates convenient access to and reuse of data to support management decisions (Inmon, 2002; Rujirayanyonga & Shi, 2006). The proposed research roadmap was divided into the following three phases (Fig. 1): establishment of a dimensional data warehouse (steps 1–3); statistical prediction modeling (steps 4–8); and development of the prototype expert system (steps 9–12). First, to construct a dimensional data warehouse that accommodates project data for long-term use, sample data was gathered from the five maintenance offices of the MOTC, which are geographically distributed in Taiwan. The exemplary data warehouse for transportation construction projects contains such information as project type and description, project location, project length, number of lanes, traffic volume, terrain type, quantity calculation, unit price, estimated cost, actual cost, and completion time. The data warehouse has the flexibility to scale up depending on the project information that must be stored. In the second stage, one selects one of the data marts and generates query tables of interest. The most favorable statistical estimation models can thus be created by exploring non-volatile and cost- and subject-oriented data marts.
Specifically, a data mart is a subset of a data warehouse covering a particular subject or department data for a specific purpose or personal query, visual presentation, and data mining. Fig. 2 presents the framework for the data-mining process, which includes external data sources, data preprocessing and storage, and visual presentation. For the final development stage of the proposed system, this study adopts pavement maintenance and rehabilitation projects as case studies to demonstrate the application of graphical user interfaces with embedded forecasting models. A parametric estimating technique was used to establish the relationships between project parameters (i.e., characteristics, functions and features) and engineering quantity of work items. Item cost can then be derived by multiplying predicted quantity with the corresponding item unit price, which is readily available from the Taiwan Public Construction Commission (TPCC) database. All of these functions are integrated into the proposed expert system.
Fig. 1. Data warehouse, modeling, and system development flow chart. [Flow chart with three panels: dimensional data warehouse (steps 1–3: data collection from the Taiwan Ministry of Transportation and Communications; clean, extract, transform, and load; data marts by project type such as pavement, drainage, and side slope), statistical prediction models (steps 4–8: selection of one project type, identification of standard work items, development of multivariate regression methods, examination of goodness of fit and statistical testing with statistical assumptions revisited, and project cost and quantity validation), and expert system with application (steps 9–12: system analysis and design, development of the user interface, information retrieval interface with a database of work item unit prices, and system maintenance and model updates).]
During project initiation, estimating the costs of all work items is unnecessary as changes can occur during a project’s lifecycle. Therefore, based on observations of descriptive statistics of the extracted data, high-frequency and high-cost work items in previous projects are first identified and serve as the primary object in model development. Total project cost (TPC) can be derived via the summation of direct costs and indirect costs, as in Eq. (1). Indirect costs, such as business taxes, contractor and management overheads, are typically a fixed percentage of direct costs in public transportation projects.
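As a rough illustration of the roll-up in Eq. (1), the following Python sketch scales the cost of the standard work items by their cumulative cost percentage to recover direct cost, then adds indirect cost and contingency. The function name, argument names, and sample figures are hypothetical. Treating both indirect cost and contingency as fixed fractions of direct cost is an assumption made here for brevity; the text states this only for indirect cost.

```python
def total_project_cost(item_qtys, unit_prices, ccp, indirect_rate, contingency_rate):
    """Preliminary total project cost in the spirit of Eq. (1).

    item_qtys, unit_prices: quantities and unit prices of the standard work items.
    ccp: cumulative cost percentage of the standard items, as a fraction in (0, 1].
    indirect_rate, contingency_rate: fractions of direct cost (an assumption).
    """
    if len(item_qtys) != len(unit_prices):
        raise ValueError("each quantity needs a matching unit price")
    if not 0.0 < ccp <= 1.0:
        raise ValueError("ccp must be a fraction in (0, 1]")

    # Cost of the standard work items only.
    standard_item_cost = sum(q * p for q, p in zip(item_qtys, unit_prices))
    # Standard items cover a share `ccp` of direct cost, so divide to scale up.
    direct_cost = standard_item_cost / ccp
    indirect_cost = indirect_rate * direct_cost
    contingency = contingency_rate * direct_cost
    return direct_cost + indirect_cost + contingency


# Made-up figures for illustration only; they are not taken from the study.
tpc = total_project_cost(
    item_qtys=[1000.0, 500.0],
    unit_prices=[90.0, 20.0],
    ccp=0.8,
    indirect_rate=0.10,
    contingency_rate=0.05,
)
print(f"Preliminary TPC: {tpc:,.0f}")
```

Dividing by the cumulative cost percentage is what allows a handful of major work items to stand in for the full bill of quantities at the conceptual stage.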
$$\mathrm{TPC}_j = \mathrm{DC}_j + \mathrm{IC}_j + \mathrm{Contingency}_j = \frac{\sum_{i=1}^{n} \mathrm{ItemQty}_i \times \mathrm{UnitPrice}_i}{\mathrm{CCPs}_j} + \mathrm{IC}_j + \mathrm{Contingency}_j \tag{1}$$

where $\mathrm{TPC}_j$ is the total project cost of the jth project; $\mathrm{DC}_j$ is the direct cost of the jth project; $\mathrm{IC}_j$ is the indirect cost of the jth project; $\mathrm{Contingency}_j$ is the contingency of the jth project, represented as a percentage; $\mathrm{CCPs}_j$ is the cumulative cost percentage for standard work items of the jth project, i.e., (cumulative standard work item costs of the jth project)/(direct costs of the jth project); $\mathrm{ItemQty}_i$ is the quantity of the ith standard work item of the jth project; and $\mathrm{UnitPrice}_i$ is the unit price of the ith standard work item of the jth project.

5. Architecture of the dimensional model-based data warehouse
Project documentation is generally stored in either electronic or paper-based files; however, these files are often stored in an unorganized and fragmented fashion (Weiser & Morrison, 1998). Based on interviews with experienced engineers working in Taiwan’s transportation agencies, such storage practices are common in the construction industry. Consequently, attempts to reuse stored information and generate beneficial rules are hindered. To extract valuable information from documentation efficiently, establishing an effective e-storage database and expediting information flow
Fig. 2. Framework of the data-mining process. [Diagram: external sources (data in paper-based format; operational database or spreadsheet) are cleaned, extracted, transformed, and loaded into data storage (the dimensional data warehouse and data marts such as pavement, drainage, and side slope), which supports data analysis (prediction and examination) and query and visualization.]
are important tasks. The database allows project estimators and managers to transform unprocessed data, which incorporates implicit intelligence, into explicit knowledge. A dimensional model-based data warehouse is proposed that performs such functions and accommodates mixed cost data related to transportation projects (Fig. 3). The dimensional data model uses two types of tables, namely, fact tables and dimension tables. Fact tables store primary project-level and item-level data such as project duration, roadway length, roadway width, daily traffic volume, engineering quantity of work item, and work item quantity at completion. Dimension tables store
descriptions of work items, terrain, roadway direction and roadway type, construction division in charge, historical item prices, and project type. All fact and dimensional tables are scalable and can be expanded to accommodate additional data fields of interest. The primary functions of the dimensional data model are to increase query efficiency and avoid data redundancy as the database expands. Table 2 shows the tables of the dimensional data warehouse. Data fields in the proposed data warehouse contain general project information. Most of the conceptual information can be acquired from an illustration of a typical roadway section (Fig. 4).
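The fact/dimension split described above can be sketched with Python's built-in sqlite3 module. The table names follow Table 2; every column name, the helper function names, and all sample rows are hypothetical and serve only to show how a project-type data mart might be pulled from the fact tables by joining against the dimension tables.

```python
import sqlite3


def build_demo_warehouse():
    """Create a tiny in-memory warehouse with two fact and two dimension tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE DimTbl_ProjTypeDescrip (
            proj_type_id INTEGER PRIMARY KEY,
            proj_type    TEXT NOT NULL
        );
        CREATE TABLE DimTbl_ItemDescrip (
            item_id   INTEGER PRIMARY KEY,
            item_name TEXT NOT NULL,
            unit      TEXT NOT NULL
        );
        CREATE TABLE FactTbl_Project (
            project_id       INTEGER PRIMARY KEY,
            proj_type_id     INTEGER NOT NULL REFERENCES DimTbl_ProjTypeDescrip(proj_type_id),
            roadway_length_m REAL,
            lane_count       INTEGER
        );
        CREATE TABLE FactTbl_Item (
            project_id    INTEGER NOT NULL REFERENCES FactTbl_Project(project_id),
            item_id       INTEGER NOT NULL REFERENCES DimTbl_ItemDescrip(item_id),
            estimated_qty REAL,
            final_qty     REAL,
            PRIMARY KEY (project_id, item_id)
        );
    """)
    # Made-up rows for illustration only.
    conn.executemany("INSERT INTO DimTbl_ProjTypeDescrip VALUES (?, ?)",
                     [(1, "Pavement"), (2, "Drainage")])
    conn.executemany("INSERT INTO DimTbl_ItemDescrip VALUES (?, ?, ?)",
                     [(1, "Asphalt concrete pavement", "m2"),
                      (2, "Pavement markings", "m")])
    conn.executemany("INSERT INTO FactTbl_Project VALUES (?, ?, ?, ?)",
                     [(101, 1, 2500.0, 4), (102, 2, 800.0, 2)])
    conn.executemany("INSERT INTO FactTbl_Item VALUES (?, ?, ?, ?)",
                     [(101, 1, 9000.0, 9200.0),
                      (101, 2, 5000.0, 5100.0),
                      (102, 1, 300.0, 310.0)])
    return conn


def extract_data_mart(conn, proj_type):
    """Return item-level rows for one project type (a minimal data mart)."""
    query = """
        SELECT p.project_id, p.roadway_length_m, i.item_name, f.final_qty
        FROM FactTbl_Item AS f
        JOIN FactTbl_Project AS p ON p.project_id = f.project_id
        JOIN DimTbl_ProjTypeDescrip AS t ON t.proj_type_id = p.proj_type_id
        JOIN DimTbl_ItemDescrip AS i ON i.item_id = f.item_id
        WHERE t.proj_type = ?
        ORDER BY p.project_id, i.item_id
    """
    return conn.execute(query, (proj_type,)).fetchall()


for row in extract_data_mart(build_demo_warehouse(), "Pavement"):
    print(row)
```

Keeping descriptive text in dimension tables means each work item name or project type is stored once, which is the redundancy reduction the dimensional model is meant to provide.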
Fig. 3. Fact constellation schema for transportation project data.
Table 2. Data tables of the dimensional data warehouse.

| Type | Dimension | Description | Objectives | Table name |
| --- | --- | --- | --- | --- |
| Fact table | 1. Project | Parameters in the project | Data storage | FactTbl_Project |
| Fact table | 2. Item | Numerical data of the work item, including estimation and final accounts of work item quantities | Data storage | FactTbl_Item |
| Dimension table | 1. Work item | Description of work items and their units | Reduce redundancy | DimTbl_ItemDescrip |
| Dimension table | 2. Price | Price listing of work items | Query and analysis | DimTbl_ItemPrice |
| Dimension table | 3. Project type | Including retaining walls, pavement, bridge, tunnel, drainage, traffic control devices, landscape planting, maintenance machine, and others | Query and analysis | DimTbl_ProjTypeDescrip |
| Dimension table | 4. Project type-work item | Work items listing with a specific project type | Query and analysis | DimTbl_ProjItem |
| Dimension table | 5. Roadway function | Including national expressways, provincial highways, county highways, rural highways and exclusive highways | Query and analysis | DimTbl_SysTypeDescrip |
| Dimension table | 6. Division | Contact information and location of division | Reduce redundancy | DimTbl_SectionDescrip |
| Dimension table | 7. Terrain | Including terrain plain, hill and mountainous | Query and analysis | DimTbl_TerrainDescrip |
| Dimension table | 8. Roadway direction | The description of roadway direction | Query and analysis | DimTbl_DirectionDescrip |

Data from the visual graph can be categorized as basic design parameters (i.e., lane number, roadway width, and work area), environmental factors (i.e., terrain type, level of precipitation, and number of rainy days), and traffic volumes (i.e., traffic volume of various vehicle types). All data are easily accessible by estimators or engineers during the early project stages. Based on the dimensional data warehouse schema, the relationships among data sets can be drilled down or rolled up within a data cube. Dimensions of a work item, project type and roadway type, for instance, can be obtained from a populated database into a data mart (Fig. 5). A data mart with increased dimensions can be constructed when necessary. In the following section, a particular project type
is extracted into a sample data mart for subsequent statistical parametric modeling.

6. Data analysis and model evaluation
This section describes a case study using a pavement maintenance project data mart extracted from a data warehouse for model development. Following data preprocessing, linear and log-linear regression models were constructed after carefully considering various measured response variables to establish an enhanced approach. Meaningful and interesting patterns were then verified through a hold-out sample.
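The comparison of linear and log-linear models against a hold-out sample can be sketched as follows. This is a simplified illustration, not the study's procedure: it uses a single predictor instead of the multivariate models, assumes the log-linear form takes logarithms of both the response and the predictor, scores models by mean absolute percentage error, and runs on synthetic power-law data generated purely for demonstration. All function and variable names are hypothetical.

```python
import math


def fit_ols(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def compare_models(train, holdout):
    """Fit both model forms on `train` and score them on `holdout`."""
    xs, ys = zip(*train)
    hx, hy = zip(*holdout)

    # Linear model: y = b0 + b1 * x
    b0, b1 = fit_ols(xs, ys)
    linear_pred = [b0 + b1 * x for x in hx]

    # Log-linear model: ln(y) = a0 + a1 * ln(x), back-transformed with exp().
    # No retransformation bias correction is applied in this sketch.
    a0, a1 = fit_ols([math.log(x) for x in xs], [math.log(y) for y in ys])
    loglinear_pred = [math.exp(a0 + a1 * math.log(x)) for x in hx]

    return {"linear": mape(hy, linear_pred), "log-linear": mape(hy, loglinear_pred)}


# Synthetic data for illustration only: quantity = 2 * area^1.2, with no noise.
train = [(x, 2.0 * x ** 1.2) for x in (100.0, 500.0, 1000.0, 5000.0, 10000.0, 50000.0)]
holdout = [(x, 2.0 * x ** 1.2) for x in (2000.0, 20000.0)]

errors = compare_models(train, holdout)
for name, err in errors.items():
    print(f"{name}: hold-out MAPE = {err:.2f}%")
```

Because the synthetic data follow a power law, the log-linear form fits them almost exactly while the straight line does not; real project data would show a less clear-cut difference.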
Fig. 4. Accessible information for a typical pavement maintenance project. [Illustration of a typical roadway section annotated with data fields. Project design: lane number (PI_FTLnNo, PI_LaneNo; lanes), PI_ProjLength (meter), and project width (PI_STLnWidth, PI_ShldWidth, PI_ProjWidth; meter). Working area: PI_SquareM, PI_LaneMeter. Environmental factors: PI_TerrainPlain, PI_TerrainHill, PI_TerrainMountain (nominal), PI_Precip (mm), PI_RainDay (day). Traffic volumes: PI_PCUperday, PI_PCUtop (vehicle), PI_DirectIndex (index), PI_CarNo, PI_BusNo, PI_TruckNo, PI_JointTruckNo, PI_MotorNo (vehicle).]
Fig. 5. Project dimensional data cube.
6.1. Data description
The raw data were collected from the 1st to 5th engineering districts of Taiwan’s Directorate General of Highways. In total, 937 rehabilitation projects were performed on existing roadways, including repair of road bases, slopes, and drainage, and pavement maintenance, landscaping and planting. Although obtaining a complete collection of data for all past projects is unlikely, the primary goal is to ensure that the data warehouse fits project data in practice, and can serve as a centralized data management system for the transportation agency. This study utilized the data mart for pavement maintenance operations (PMOs), which account for the largest number of projects and outnumber other project types by two or three times, underscoring the importance and priority of PMOs in recent years.
6.2. Parametric formulation
The purpose of parametric modeling is to derive an empirical formula for estimating engineering quantity for major work items (response variables) with a set of significant parameters (predictor variables). A preliminary total project cost can be derived using Eq. (1) by multiplying item quantity by available item unit prices from the TPCC database. Based on literature findings, interviews with experienced estimators and availability of significant factors, 19 parameters from each project were obtained as possible input predictors for regression analysis. Descriptive statistics of the original data showed that all data had skewed patterns. Therefore, the performance of the log-linear model was compared with that of the linear model. Furthermore, to assess the suitability of response variables, raw engineering quantity (REQ), engineering quantity per lane meter (EQLM), engineering quantity per project length in meters (EQPL), and engineering quantity per construction area in square meters (EQSM) were used to develop the linear and log-linear models. As this study focuses on the early project phase, estimating the cost of all work items to determine total project cost (TPC) is unnecessary for the following reasons: (1) an overview of project estimation is the key focus of this study; (2) emphasis should be placed on major work items (e.g., items accounting for over 1% of TPC), thereby increasing management effectiveness and reducing the estimating burden at the conceptual stage; and (3) detailed project and work information at this stage may be unavailable or difficult to obtain. The following four major work items were identified out of 22 for a generic PMO project based on high cost percentage and high
frequency rate (c%, f%): pavement of reclaimed densely graded asphalt concrete (65.7%, 92%); salvaging, hauling, and stockpiling reclaimable asphalt pavement (13.3%, 83%); reflective pavement markings (5.4%, 91%); and sprinkling of the tack coat with emulsified asphalt (4.9%, 98%). Statistics show that the average cost percentage and frequency rate for a generic project for the remaining 18 minor work items were less than 1% and 35%, respectively. The four major work items account for roughly 90% of the cumulative cost percentage (CCP) for a typical project, and have an average frequency rate of 91%. Hence, the major work items are considered standard items and are estimated first. Notably, the CCP is not a fixed percentage for all projects. Therefore, we assume that CCP varies with the input parameter set. Moreover, unlike other general construction projects, the cost of a PMO project is frequently discounted, whereby the profit from reclaimed asphalt concrete is deducted from the project cost by the contractor. This deductible item was also modeled in linear and log-linear forms alongside the four major work items and CCP. As the functional relationship between response and predictor variables is imposed, the general form of the parametric formulation can be expressed as follows:
$$Y_p = \beta_0 + \sum_{i=1}^{N} \beta_i x_i + \varepsilon_p \tag{2}$$

where $Y_p$ is the response variable for the engineering quantity of the pth work item, measured as REQ, EQLM, EQPL, or EQSM; $\beta_i$ are the regression coefficients; $x_i$ are the predictor variables; $N$ is the number of covariates (predictors); and $\beta_0$ is the intercept. The generalized linear model reduces to the familiar multiple linear regression, in which the random component $\varepsilon_p$ is normally distributed with mean zero and variance $\sigma^2$, i.e., $\varepsilon_p \sim N(0, \sigma^2)$. With the inclusion of explanatory parameters and associated known values of the response variable in the previous PMO data mart, the multivariate regression model can determine coefficients $\beta_i$ that produce the best fit for historical data with the largest variance explanation.

6.3. Model performance and validation
Nineteen data fields were extracted from the project data warehouse; most were identified as essential factors when estimating engineering costs in numerous studies (Akinci & Fischer, 1998; Akintoye, 1998; Al-Tabtabai et al., 1999; Baloi & Price, 2003; Bell & Bozai, 1987; Chou et al., 2006; Sanders et al., 1992). The selected parameters can be categorized as geometrical configurations of a roadway (roadway length, roadway width, roadway lane number,
647
J.-S. Chou, H.-C. Tseng / Expert Systems with Applications 38 (2011) 640–651 Table 3 Significant parameters of the developed models. Category
Basic project configuration
Parameter description
Roadway geometry
Working area Traffic volume
Range
Mean
Median
Std. dev.
Statistical significance (p 6 0.05) Linear model
Log-linear model
No of lanes (LN) No of construction segments (SEG) Roadway length (RL, m) Roadway width (RW, m) Fast track lane width (FTLW, m/line) Slow track lane width (STLW, m/line) Shoulder width (SW, m/line) Construction area (CA, m2) Lane meter (LM, m)
4–10 1–6 100–80,000 6–40 2.6–4.3 0–3.6 0–5.1 1920–1,688,000 600–555,750
5.73 1.6 7138.4 16.85 3.6 0.6 1.6 117,792 423,821
6.00 1.0 2736 18 3.6 0.0 1.4 48,535 15,600
1.47 0.9 12,451 6.85 0.2 1.1 1.1 228,514 79,893
s s s s s – – s s
– – – U U U U U U
Direction index (DI, East/West or North/South of roadway direction) No of trucks (TN, veh/day) No of joint-trucks (JTN, veh/day)
East/West (1), North/South (0) 9–1224 0–927
–
–
–
s
–
264 163
231 91.5
229 175
– s
U U
factors. Previous study suggested that prices of estimated engineering quantities using their mean unit costs (MUCs) should be collectively published on the Web and regulated by the transportation agency (Chou et al., 2006). The quantity estimation problem was modeled using multivariate linear and log-linear regressions. Tables 4 and 5 present the resulting best models. For linear models, the best standardized response variable for L1 and L3 was REQ, whereas that for L2, L4, and L5 was EQSM. Statistical violation of multicollinearity occurred in linear modeling, inferred by curvilinear trends in data. However, the problem was alleviated after applying a logarithmic transformation to response and explanatory variables. Analytical results indicate that the log-linear regression model had comparable predictive power and better calibration than the linear regression model. The non-standardized REQ of the log-linear models produces the best fit for data. Model results imply
and construction area in lane meters and square meters), traffic volume (passenger car unit (PCU) and average daily traffic, including motorcycles, cars, and heavy trucks) and environmental factors (terrain, precipitation and average rainfall days near the construction area). Table 3 shows the 11 remaining important factors that have significant levels of p < 0.05 and their descriptive statistics in the developed models. Additionally, a simple analysis of the interaction between two independent variables revealed nothing of significance. Based on analytical results, environmental factors that normally have effects on project cost change had no significant influence on the PMO engineering quantity estimation. This fact inferentially indicates that separating estimates of quantity and unit price eliminates uncertainties and reduces complexity when estimating project costs as item unit price typically fluctuates according to market conditions, inflation, time and environmental Table 4 Linear models with standardized regression coefficients. Item description
Measurement unit
N
Response variable
Parametric equation
Adj R2
L1. Pavement of reclaimed dense-graded asphalt concrete L2. Salvaging, hauling, and stockpiling reclaimable asphalt pavement L3. Reflectorized pavement markings L4. Sprinkling of tack coat
M2 M2
132 121
1.282CA 0.742LM + 0.190FTLW 1.487RL 0.421RW + 1.036CA + 0.148DI
0.444 0.373
Ton M2
131 141
1.305CA 0.727LM + 0.166FTLW 0.715LM 0.467RW 1.101RL 0.220SEG
0.472 0.425
L5. Reclaimed asphalt concrete deductible
Ton
95
0.432RW 0.376RL
0.322
L6. CCP: Cumulative cost percentage of standard work items (L1–L4 above)
%
ItemQty ItemQty/ M2 ItemQty ItemQty/ M2 ItemQty/ M2 Ratio
0.419RL 0.431JTN + 0.523RW 0.369LN + 0.182FTLW
0.393
Average
0.405
142
Table 5 Log-linear models with standardized regression coefficients. Item description
Measurement unit
N
Response variable
Parametric equation
Adj R2
LG1. Pavement of reclaimed dense-graded asphalt concrete LG2. Salvaging, hauling, and stockpiling reclaimable asphalt pavement LG3. Reflectorized pavement markings LG4. Sprinkling of tack coat LG5. Reclaimed asphalt concrete deductible LG6. CCP: Cumulative cost percentage of standard work items (LG1–LG4 above)
M2 M2
132 121
ItemQty ItemQty
CA0.814 CA0.790 STLW0.207
0.661 0.633
Ton M2 Ton %
131 141 95 142
ItemQty ItemQty ItemQty Ratio/M2
CA0.831 CA0.792 STLW0.192 LM0.861 LM0.966 RW0.118 SW0.064 FTLW0.032 JTN0.027
0.689 0.622 0.737 0.984
Average
0.721
648
J.-S. Chou, H.-C. Tseng / Expert Systems with Applications 38 (2011) 640–651
Table 6 Randomly selected project samples for cost validation. Case No
1 2 3 4 5 6 7 8 9 10
Project name
Cost prediction error (%)
County highways no.102 30 K + 100 31 K + 333 Provincial highways no.4 0 K + 000 2 K + 140 Provincial highways no.1 16 K + 380 16 K + 760 and 20 K + 260 20 K + 940 Provincial highways no.3 75 K + 600 76 K + 800 Provincial highways no.1 51 K + 937 53 K + 000 Provincial highways no.3 36 K + 000 36 K + 500 and 40 K + 200 40 K + 600 Provincial highways no.7 14 K + 500 17 K + 020 Provincial highways no.7 11 K + 590 11 K + 920 and 14 K + 560 14 K + 690 Provincial highways no.3 415 K + 093 421 K + 372 County highways no.114 6 K + 750 7 K + 000 and 8 K + 500 9 K + 800
Sum of standard work item costs
Total project cost via Eq. (1)
22.0 13.8 8.2 2.3 9.1 11.0 17.3 10.6 3.9 1.1
16.0 9.0 10.6 1.9 6.5 1.5 5.7 19.1 25.9 8.1
9.9
10.4
Mean absolute prediction error
the predictors and response variables have a non-linear relationship, which can also be observed in advance based on the positive skewed distribution of variates. Log transformation alleviates nonlinearity and multicollinearity of predictors, and improves explanatory power by 32% in average. Furthermore, mean absolute quantity prediction error of log-linear models was 10.9%, which was 24.2% less than that for the linear model using ten random hold-out test data samples. Cost prediction accuracy is therefore evaluated using the loglinear quantity models (Table 5) by multiplying their MUCs and then comparing these predicted costs to actual costs. The mean absolute cost prediction errors of the summation of standard work item costs and total project cost (Table 6) are 9.9% and 10.4%, respectively, which in practice is satisfactory.
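The validation procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the data are synthetic, and the function names, the single predictor (construction area), the generating exponent of 0.8, and the mean unit cost `MUC` are all hypothetical assumptions chosen for illustration. The sketch fits a log-linear model by ordinary least squares on log-transformed variables, back-transforms predictions to original units, and computes the mean absolute prediction error on ten hold-out samples.

```python
import numpy as np


def fit_log_linear(X, y):
    """OLS fit of ln(y) = b0 + sum_i b_i * ln(x_i); returns [b0, b1, ...]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), np.log(X)])
    coef, *_ = np.linalg.lstsq(design, np.log(y), rcond=None)
    return coef


def predict_log_linear(coef, X):
    """Back-transform to original units: y = exp(b0) * prod_i x_i ** b_i."""
    X = np.asarray(X, dtype=float)
    return np.exp(coef[0] + np.log(X) @ coef[1:])


def mean_abs_pct_error(actual, predicted):
    """Mean absolute prediction error, in percent of the actual value."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(predicted - actual) / actual) * 100.0)


# Synthetic, hypothetical data: construction area (m2) versus item quantity,
# generated from a power law with multiplicative noise.
rng = np.random.default_rng(0)
ca = rng.uniform(2_000, 1_000_000, size=60)
qty = 1.5 * ca**0.8 * np.exp(rng.normal(0.0, 0.1, size=60))
X = ca.reshape(-1, 1)

# Hold out ten samples for validation, mirroring the ten test projects.
train, test = slice(0, 50), slice(50, 60)
coef = fit_log_linear(X[train], qty[train])
pred_qty = predict_log_linear(coef, X[test])
mape_qty = mean_abs_pct_error(qty[test], pred_qty)

# Cost = quantity x mean unit cost (MUC is a hypothetical value).
MUC = 12.0
mape_cost = mean_abs_pct_error(qty[test] * MUC, pred_qty * MUC)

print("fitted exponent:", coef[1])
print("quantity error (%):", mape_qty)
print("cost error (%):", mape_cost)
```

With a single work item and a fixed unit cost, the cost error equals the quantity error. In a multi-item estimate the two differ, because item costs are summed before the comparison, which is consistent with the two columns of Table 6 reporting different values.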
7. The prototype project expert system

The design aim of the prototype project expert system is to automate the estimation process and to store and retrieve historical project data from the structured schema. The following section describes the development of the graphical user interface using Microsoft Access®, Visual Basic®, and structured query language (SQL).

7.1. System development and estimating flow

The prototype data warehouse was built using rapid application development (RAD), a process that emphasizes immediate deployment of a system while simultaneously maintaining quality and reducing development costs. This approach offers a series of techniques for compressing the analysis, design, construction, and test phases into a series of short, iterative development cycles (Fig. 6). After the model and prototype requirements were defined using feedback from experienced engineers, the system was generated by defining the scope and analyzing requirements, followed by iterative rapid deployment. Fig. 7 presents a data flow diagram depicting the sequence of processes, functions and flow of data in the proposed parametric cost-estimating system.

Fig. 6. RAD approach for the project data warehouse. [Figure not reproduced: an iterative cycle of problem statement and scope definition, problem and requirement analysis, logical and physical design, construction, testing with technical feedback, delivery of a version, and system operation and maintenance for the Parametric Cost Estimating System (PCES) within the project data warehouse.]

Fig. 7. Data flow diagram for the logical cost estimation process. [Figure not reproduced: processes 1.0 Enter parameters at project initiation; 2.0 System calculates item quantity and project cost automatically; 3.0 Adjust work items and project cost according to special design; 4.0 Project estimating report and information; 5.0 Refresh parametric estimation models. Participants are the project estimator, maintenance personnel who update the estimation models, and the project screening and feasibility analysis department; the data store is the dimensional project cost data warehouse.]

7.2. Implementation

Before operating the system, a user has to determine whether the unit prices in the system are the most recent prices. These data can be updated via the TPCC database. Fig. 8 shows sample windows for querying using multiple criteria (right side) and updating the most recent unit prices (left side). The historical project-tracking button retrieves a case according to the primary key, which in this case is the project ID. Fig. 9 shows project inputs, automatic estimation, and output windows. The implementation phase for the full system involves application testing, user acceptance testing, and system installation (Marakas, 2006). After the proposed system has been tested against the aforementioned schema, it is ready to be used by experienced engineers. The final system installation stage includes system conversion, final documentation, end user training, and maintenance.

Fig. 8. Project information flow and sample window shots (1/2). [Figure not reproduced: the PCES main page offers About this system, New project estimation, Renew item price, Historical project tracking, Data warehouse query, and Exit.]

Fig. 9. Project information flow and sample window shots (2/2).

8. Conclusions

This paper presents a practical data-mining process that spans data collection, preprocessing, and construction of a novel project cost data warehouse. A parametric prediction technique is applied to establish useful estimation models. Although the data warehouse design, predictive modeling and system deployment are tailored to PMO projects, this research provides systematic procedures and a general framework that can be applied to similar
transportation projects. Additionally, this study provides researchers and practitioners with further insight into the relationships between various engineering quantity measurements and conceptual information (features, functions and characteristics) during the early stages of a project. The derived response functions are related to four dependent variables: REQ, EQLM, EQPL, and EQSM. These variables were modeled in linear and log-linear forms with a set of predictors. Comparison of modeling results shows that the best estimation model is the natural logarithm of the quantity model with the original measurement unit (REQ); this model had a mean adjusted R-square of 0.668 and a mean absolute quantity prediction error of 10.9%. Validation results suggest that the natural logarithmic transformation reduces the mean absolute quantity prediction error by 24.2% compared with the linear models. For total project cost estimation, the transformed models generated a mean absolute project cost prediction error of 10.4%, which is within the generally acceptable level of accuracy (15–35%) during the early stages of a project. Six explanatory variables were statistically significant, suggesting that they are key factors in predicting engineering quantity associated with pavement maintenance and repair projects. These variables can be classified into two groups. One group is related to the project geometrical configuration, including project width, fast or slow traffic lane width, shoulder width, construction area, and lane meters. The other group is related to traffic capacity, namely, volumes of trucks, tractors and trailers. The proposed prototype cost data warehouse comprised ten tables of data collected from Taiwan’s MOTC. Dimension tables were created for storing general information and facts that capture project characteristics.
A user can automatically generate new estimates based on the relationships constructed in the data warehouse and retrieve specific project cost information through structured queries. Furthermore, to increase model estimation efficiency, an expert system embedded with the developed models and enhanced with a graphical user interface facilitates usage by transportation agencies when generating early budgets. Future research should focus on model refinement for full-scale system implementation and on an automatic link to the most recent unit price database to continuously improve estimation accuracy.

Acknowledgment

The authors would like to thank the National Science Council of the Republic of China, Taiwan, for financially supporting this research under Contract No. NSC 95-2221-E-194-001.

References

AACE (1997). Recommended practice no. 17R-97 (1997): Cost estimate classification system. AACE, Inc.
ACP (2008). Road repair costs are exploding [Cited 2008 July 8].
Akinci, B., & Fischer, M. (1998). Factors affecting contractors’ risk of cost overburden. Journal of Management in Engineering, 14(1), 67–76.
Akintoye, A. (1998). Analysis of factors influencing project cost estimating practice. Construction Management and Economics, 18, 77–89.
Al-Tabtabai, H., Alex, A. P., & Tantash, M. (1999). Preliminary cost estimation of highway construction using neural networks. Cost Engineering, 41(3), 19–24.
An, S.-H., Kim, G.-H., & Kang, K.-I. (2007). A case-based reasoning cost estimating model using experience by analytic hierarchy process. Building and Environment, 42(7), 2573–2579.
Anderson, S., Molenaar, K., & Schexnayder, C. (2007). Guidance for cost estimation and management for highway projects during planning, programming, and preconstruction. Washington, DC: National Cooperative Highway Research Program, Transportation Research Board.
Baloi, D., & Price, A. D. F. (2003). Modelling global risk factors affecting construction cost performance. International Journal of Project Management, 21, 261–269.
Bell, L. C., & Bozai, G. A. (1987). Preliminary cost estimating for highway construction projects. In 1987 AACE transactions.
Ben-Arieh, D., & Qian, L. (2003). Activity-based cost management for design and development stage. International Journal of Production Economics, 83(2), 169–183.
Chantaraskul, S., & Cuthbert, L. G. (2005). Congestion pattern matching in case-based reasoning control for 3G networks. In IEEE international conference on wireless and mobile computing, networking and communications, WiMob’2005 (Vol. 2, pp. 134–141).
Chengalur-Smith, I. N., Ballou, D. P., & Pazer, H. L. (1997). Modeling the costs of bridge rehabilitation. Transportation Research Part A – Policy and Practice, 31(4), 281–293.
Chou, J.-S. et al. (2006). Quantity-based approach to preliminary cost estimates for highway projects. Transportation Research Record, 1946, 22–30.
Diekmann, J. E. (1983). Probabilistic estimating: Mathematics and applications. Journal of Construction Engineering and Management, 109(3), 297–308.
Emsley, M. W. et al. (2002). Data modeling and the application of a neural network approach to the prediction of total construction costs. Construction Management and Economics, 20(6), 465–472.
Federle, M. O., & Pigneri, S. C. (1993). Predictive model of cost overruns. In 1993 AACE transactions.
Gang, K. W. (2005). Research planning approach using knowledge map integrated with case-based reasoning. In IEEE international engineering management conference (Vol. 2, pp. 897–901).
Harbuck, R. H. (2002). Using models in parametric estimating for transportation projects. AACE international transactions, EST.05 (ES51) (pp. EST.05.1–EST.05.09).
Hegazy, T., & Ayed, A. (1998). Neural network model for parametric cost estimation of highway projects. Journal of Construction Engineering and Management, 124(3), 210–218.
Herbsman, Z. (1983). Long-range forecasting highway construction costs. Journal of Construction Engineering and Management, 109(4), 423–435.
Inmon, W. H. (2002). Building the data warehouse (3rd ed.). Wiley Computer Publishing.
Kim, G.-H., An, S.-H., & Kang, K.-I. (2004). Comparison of construction cost estimating models based on regression analysis, neural networks, and case-based reasoning. Building and Environment, 39(10), 1235–1242.
Kwak, Y. H., & Watson, R. J. (2005). Conceptual estimating tool for technology-driven projects: Exploring parametric estimating technique. Technovation, 25, 1430–1436.
Lowe, D. J., Emsley, M. W., & Harding, A. (2006). Predicting construction cost using multiple regression techniques. Journal of Construction Engineering and Management, 132(7), 750–758.
Mallach, E. G. (2000). Decision support and data warehouse systems. Boston: Irwin/McGraw-Hill.
Marakas, G. M. (2006). Systems analysis & design. McGraw-Hill.
Mendes, E., Mosley, N., & Counsell, S. (2002). The application of case-based reasoning to early web project cost estimation. In Proceedings – IEEE computer society’s international computer software and applications conference. Oxford, United Kingdom: Institute of Electrical and Electronics Engineers Computer Society.
Niazi, A. et al. (2006). Product cost estimation: Technique classification and methodology review. Journal of Manufacturing Science and Engineering, Transactions of the ASME, 128(2), 563.
Phaobunjong, K., & Popescu, C. M. (2003). Parametric cost estimating model for buildings. AACE International Transaction (pp. EST.13.1–EST.13.11).
PMI (2004). A guide to the project management body of knowledge (PMBOK guide) (3rd ed.). Project Management Institute.
Rujirayanyonga, T., & Shi, J. J. (2006). A project-oriented data warehouse for construction. Automation in Construction, 15(6), 800–807.
Saito, M., Sinha, K. C., & Anderson, V. L. (1991). Statistical models for the estimation of bridge replacement costs. Transportation Research Part A – Policy and Practice, 25A(6), 339–350.
Sanders, S. R., Maxwell, R. R., & Glagola, C. R. (1992). Preliminary estimating models for infrastructure projects. Cost Engineering, 34(8), 7–13.
Schexnayder, C. J., Weber, S. L., & Fiori, C. (2003). Project cost estimating: a synthesis of highway practice. National Cooperative Research Program, Transportation Research Board.
Sthapit, N., & Mori, H. (1994). Model to estimate highway earthwork cost in Nepal. Journal of Transportation Engineering, 120(3), 498–504.
Thomasson, B. J., Ratcliffe, M. B., & Thomas, L. A. (2006). Improving the tutoring of software design using case-based reasoning. Advanced Engineering Informatics, 20(4), 351–362.
Touran, A., & Suphot, L. (1997). Rank correlation in simulating construction costs. Journal of Construction Engineering and Management, 123, 297–301.
Toussaint, J., & Cheng, K. (2006). Web-based CBR (case-based reasoning) as a tool with the application to tooling selection. International Journal of Advanced Manufacturing Technology, 29(1–2), 24–34.
Trost, S. M., & Oberlender, G. D. (2003). Predicting accuracy of early cost estimates using factor analysis and multivariate regression. Journal of Construction Engineering and Management, 129(2), 198–204.
Turban, E., & Aronson, J. E. (2005). Decision support systems and intelligent systems. Upper Saddle River, NJ: Prentice-Hall.
Weiser, M., & Morrison, J. (1998). Project memory: Information management for project teams. Journal of Management Information Systems, 14(4), 149–166.
Williams, T. P. (2005). Bidding ratios to predict highway project costs. Engineering, Construction and Architectural Management, 12(1), 38–51.
Wilmot, C. G., & Cheng, G. (2003). Estimating future highway construction costs. Journal of Construction Engineering and Management, 129(3), 272–279.
Yu, W.-D. (2006). PIREM: A new model for conceptual cost estimation. Construction Management and Economics, 24(3), 259–270.