Piecewise Surface Regression Modeling in Intelligent Decision Guidance System

Juan Luo and Alexander Brodsky

Abstract. An intelligent decision guidance system composed of data collection, learning, optimization, and prediction is proposed in this paper. Built on a traditional relational database management system, it incorporates regression learning capability. The Expectation Maximization Multi-Step Piecewise Surface Regression Learning (EMMPSR) algorithm is proposed to solve the piecewise surface regression problem, and is shown to outperform several widely used regression learning packages. Optimization and prediction are integrated into the system based on the learning outcome.

1 Introduction

An increasing number of applications require predicting the behavior of a complex system and making decisions that move the system towards desirable outcomes, such as finding the best course of action in emergencies or creating public policies aimed at the most positive outcomes. In such applications, predictions and decisions must be made in the presence of large amounts of dynamically collected data and learned uncertainty models. Large amounts of data are usually stored in a database system, e.g., a Relational Database Management System (RDBMS) such as Oracle. The Structured Query Language (SQL) is intuitive and broadly used. However, it does not support decision optimization or statistical learning, which are often necessary for building decision-guidance applications.

Juan Luo, Department of Computer Science, George Mason University, Fairfax, VA 22030, e-mail: [email protected]
Alexander Brodsky, Department of Computer Science, George Mason University, Fairfax, VA 22030, e-mail: [email protected]

J. Watada et al. (Eds.): Intelligent Decision Technologies, SIST 10, pp. 223–235. © Springer-Verlag Berlin Heidelberg 2011


Relational databases have traditionally taken the view that the data they store is a set of discrete observations. This is clearly reasonable when storing individual facts, such as the salary of an employee or the description of a product. However, when representing time- or space-varying data, such as a series of temperature observations or a history of stock prices over time, a set of discrete points is often neither the most intuitive nor the most compact representation. For researchers in many fields, such as biology [1] and finance [2], a common first step in understanding a set of data points is to model those points as a collection of curves, typically generated using some form of regression (curve fitting). Regression, a form of modeling, helps smooth over errors and gaps in raw data points (noisy or with missing values), yields a compact and more accurate representation of those points as a few parameters, and provides insight into the data by revealing trends and outliers.

Regression analysis attempts to build a model based on the relationship of several independent variables and a dependent variable [3]. Given as input to regression learning are a parametric functional form, e.g., f(x1, x2, x3) = p1 x1 + p2 x2 + p3 x3, and a set of training examples, e.g., tuples of the form (x1, x2, x3, f), where f is an experimental observation of the function value for an input (x1, x2, x3). Intuitively, the problem of regression analysis is to find the unknown parameters, e.g., p1, p2, p3, which best approximate the training set. For example, the national housing price can be modeled as a function of such determinants as the age of the house, its floor area, neighborhood attributes, and location attributes. This functional form may have unknown parameters, each reflecting the relationship between the house price and a particular attribute of the house.

In realistic situations, a single parametric functional form, e.g., f(x1, x2, x3) = p1 x1 + p2 x2 + p3 x3, may not be able to express the relationship between the dependent and explanatory variables when the relationship changes according to the value interval in which the explanatory variable resides. For example, housing prices show different behavior in response to the age or floor area of the house based on geographical location. So, instead of the conventional and stationary model, for example, f(x) = p0 + p1 x in the case of linear regression with a single explanatory variable, the piecewise linear regression model can be expressed as

$$f(p, x) = \begin{cases} f_1(p_1, x) & x < b_1 \\ f_2(p_2, x) & b_1 \le x < b_2 \\ \quad\vdots & \\ f_k(p_k, x) & b_{k-1} \le x \end{cases} \tag{1}$$

In the piecewise regression (PWR) expression in Equation (1), for different value intervals of the explanatory variable, a specific functional form fi represents a "segment" of the line in the overall problem. The problem is called piecewise surface regression when there is more than one explanatory variable. The algorithm to tackle this specific type of regression learning is one focus of our paper.
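To make Equation (1) concrete, the following is a minimal sketch in Python that fits a one-dimensional piecewise linear model when the breakpoint b1 is already known, running an independent least-squares fit on each interval. All data and the breakpoint are invented for illustration; discovering the regions themselves is the harder problem addressed in Section 5.

import numpy as np

# Synthetic data from a two-piece linear function with breakpoint b1 = 0.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = np.where(x < 0, 1 + 2 * x, 1 - 3 * x) + rng.normal(0, 0.1, 200)

breakpoints = [0.0]  # assumed known here; EMMPSR must learn them instead
edges = [-np.inf] + breakpoints + [np.inf]

pieces = []
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (x >= lo) & (x < hi)
    # Least-squares fit of f_i(p_i, x) = p0 + p1 * x on this interval.
    A = np.column_stack([np.ones(mask.sum()), x[mask]])
    p, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
    pieces.append(p)

print(pieces)  # approximately [1, 2] and [1, -3]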


A decision-guidance management system (DGMS) that supports a closed loop of data acquisition, learning, prediction and decision optimization was proposed in [4]. Mathematical and Constraint Programming (MP and CP), used for decision optimization, i.e., finding values for control variables that maximize or minimize an objective within given constraints, are involved in the decision-guidance management system. The syntax of decision optimization in DGMS is adopted in our paper to find the best solution, given the outcome of piecewise regression learning. The contributions of the paper are summarized as follows:

• We extend the RDBMS with piecewise regression learning ability. The functional forms are represented as database tables, and the regression learning process is implemented as stored procedures.
• We propose an EM-based Multi-Step Piecewise Surface Regression Learning algorithm (EMMPSR) to solve the piecewise surface regression problem. The multiple steps involved are clustering, local regression, classification, and regression learning for each individual surface.
• We describe a case study of the decision optimization process based on the learning outcome of the EMMPSR algorithm.

The rest of the paper is organized as follows. In section 2, we briefly discuss state-of-the-art intelligent decision systems and related work on piecewise regression learning. In section 3, we give a motivating example, which also serves as the running example for the rest of the paper. We present the extension of regression learning to the standard relational database using the running example in section 4. In section 5, the comprehensive EMMPSR algorithm is described. We present our experimental results in section 6. The decision optimization process is described in section 7, and the paper is concluded in section 8.

2 Related Research

Mathematical and scientific packages like MATLAB [9] and R [10] do support creating regression models. However, these tools lack support for declarative or relational queries: queries typically need to be implemented as custom scripts in MATLAB or in languages like Perl. A related concern is that tools like MATLAB do not provide a seamless way to interact with data already stored in a DBMS. Data from a relational table needs to be manually imported into MATLAB in order to fit a regression model to it. Once a model has been fit to the data, it can be used to make predictions or compute the interpolated value of a function at specific points from within MATLAB, but this code lives in custom scripts, which do not provide any of the benefits of storing the data within the RDBMS.

Existing commercial DBMSs provide some support for fitting models in the form of modeling tools and application add-ons. For example, IBM's Intelligent Miner [11] supports creating models using PMML (Predictive Model Markup Language). Models are viewed as standalone black boxes with specialized interfaces for fitting and visualization. A typical use of PMML involving regression is to first fit a set of points to functions using an external tool, load those functions into the database, and then use the functions to predict the value of some other set of points by plugging them into the functions, typically using a stored procedure. However, the piecewise surface regression problem is not supported by these add-ons.


Constraint query languages, proposed in the context of querying geometric regions and formalized by [5], represent and query infinite regions as systems of constraints. There have been prototype implementations of constraint database systems for solving spatial queries and interpolating spatial data [6, 7]. Our methodology is simpler and specifically restricted to regression models, whereas constraint databases have focused mainly on linear constraints to keep query processing tractable. The focus of our work is on efficient query processing for regression models, while work on constraint query languages and databases has traditionally focused on the complexity of supporting large numbers of constraints (e.g., for linear programming applications).

A decision guidance query language (DGQL) framework [8] was proposed as the implementation of a decision guidance management system; it uses SQL-like syntax but allows optimization and learning. It annotates existing SQL queries to precisely express the optimization semantics, and then translates the annotated queries into an equivalent mathematical programming (MP) formulation that can be solved efficiently. Although the regression learning problem can be formulated in DGQL, the piecewise nature of the functional forms makes the corresponding reduced MP models very expensive and inefficient to solve. That is why we propose a different piecewise regression learning methodology but still adopt the optimization syntax and semantics of DGQL to express the decision optimization process in our case study.

3 A Motivating Example

To make our discussion more concrete, consider an example of decision making to support a logistics transportation network, in which two different products are to be shipped from several origins (suppliers) to several destinations at a minimum cost. What is to be decided in this network is the exact amount of each product to be transported from each supplier to each destination, while at the same time minimizing the total cost of the transportation network. Figure 1 illustrates the network.

Fig. 1 A transportation network


The transportation problem is a classical example of a mathematical optimization problem in Operations Research. The unit cost of shipment from each supplier to each destination is usually given in advance and is fixed. In this case, the problem can easily be formed as a mathematical constraint problem and solved by any mathematical solver, such as CPLEX [12]. Four populated tables have been created in the database instance for the transportation network. The Supplier table contains the information related to suppliers, and the Destination table contains the information related to destinations. The Shipping Rate table lists the shipping rate for each supplier-destination pair and each product. Table 4, the Transportation Amount table, is the table to be solved in the decision problem: the amount of each product to be shipped is marked with a special type of attribute, 'TBD', which means the attribute is to be decided by optimization.

Table 1 Supplier
SID  SNAME  SPAMT1  SPAMT2
s1   DET    60      90
s2   LIN    80      80
...  ...    ...     ...

Table 2 Destination
DID  DNAME  DPAMT1  DPAMT2
d1   ABC    40      30
d2   XYZ    50      70
...  ...    ...     ...

Table 3 Shipping Rate
SID  DID  UNIT RATE1  UNIT RATE2
s1   d1   5           5
s2   d2   2.5         2.5
s3   d3   6           6
s4   d4   5.5         5.5
...  ...  ...         ...

Table 4 Transportation Amount
SID  DID  TAMT1  TAMT2
s1   d1   TBD    TBD
s2   d2   TBD    TBD
s3   d3   TBD    TBD
s4   d4   TBD    TBD
...  ...  ...    ...

More realistically, however, the unit cost of shipment may not always be fixed; it may change according to the amounts of both product 1 and product 2 to be shipped. The most favorable unit cost may be available for only a limited number of units, and shipments beyond this limit pay higher rates. As an example, suppose three cost rate levels are specified for each supplier-destination pair. Correspondingly, the total cost of shipments along each pair increases with the amount shipped in a piecewise-linear style.
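To illustrate (with made-up rates and level limits, not figures from the paper), such a total-cost schedule can be sketched as a small Python function in which the cheapest rate applies to the first units and later units pay progressively more:

def shipment_cost(amount, levels=((10, 2.0), (20, 3.5), (float("inf"), 5.0))):
    # `levels` is a sequence of (cumulative upper limit, unit rate) pairs
    # with increasing rates; all numbers here are illustrative assumptions.
    cost, prev_limit = 0.0, 0.0
    for limit, rate in levels:
        band = min(amount, limit) - prev_limit
        if band <= 0:
            break
        cost += band * rate
        prev_limit = limit
    return cost

# First 10 units at 2.0, next 10 at 3.5, remaining 5 at 5.0:
print(shipment_cost(25))  # 10*2.0 + 10*3.5 + 5*5.0 = 80.0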

4 Predictive Modeling in RDBMS

Piecewise regression learning is proposed as an extension to the RDBMS (see Figure 2). Given a historical data table as input, it outputs a model that predicts future values of a designated target column based on the designated explanatory columns of the table.

Fig. 2 Piecewise regression extension to RDBMS

The predictive model is generated by issuing a call to the predefined stored function PWLearning in SQL. A table called FUNCTION NAME COLLECTION has been created in advance to contain the predictive models for the pool of piecewise functions. The schema of this table has four attributes: FUNC ID (primary key), FUNC TABLE, BOUNDARY TABLE and CASE FUNC TABLE. FUNC TABLE represents the functional form of each piecewise surface; its attributes are the coefficients of the function. BOUNDARY TABLE represents the boundary constraints for each surface; its attributes are the coefficients of the boundary surfaces. CASE FUNC TABLE represents the connection between each surface and its boundary constraints; each surface usually has more than one boundary constraint. Instances of these tables are listed at the end of section 6.

The stored function PWLearning takes a few parameters as inputs: the name of the historical data table / view, the number of piecewise surfaces involved, and the dimension of the explanatory variables. The call to PWLearning

• firstly inserts a new row / predictive-model description into the FUNCTION NAME COLLECTION table; the row is composed of a function identifier and values for the attributes FUNC TABLE, BOUNDARY TABLE and CASE FUNC TABLE;
• secondly generates the predictive model and stores the model information in three tables named by the values of the most recently inserted row;
• thirdly returns the assigned function identifier representing the predictive model.
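Assuming the stored functions are installed in the database, a client session could drive learning and prediction as follows. This is only a sketch: the driver (any DB-API 2.0 module; psycopg2 shown), the connection string, and the historical view name HIST_COST_S1D1 are all hypothetical.

import psycopg2  # any DB-API 2.0 driver exposing cursor.execute() works alike

conn = psycopg2.connect("dbname=transport")  # assumed connection details
cur = conn.cursor()

# Learn a piecewise model from the (hypothetical) historical cost view for
# pair (s1, d1): 3 surfaces, 2 explanatory variables (PAMOUNT1, PAMOUNT2).
cur.execute("SELECT PWLearning('HIST_COST_S1D1', 3, 2)")
func_id = cur.fetchone()[0]

# Predict the cost of shipping (15, 20) units of the two products.
cur.execute("SELECT PREDICT(%s, 15, 20)", (func_id,))
print(cur.fetchone()[0])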


The predictive model can then be applied within an SQL query that invokes another stored function, PREDICT(FUNC ID, X1, X2, ..., XN). This function takes the function identifier FUNC ID and the values of the explanatory variables as inputs and returns the target value to the SQL query. The call to PREDICT

• firstly queries the FUNCTION NAME COLLECTION table by FUNC ID and returns the names of the three tables FUNC TABLE, BOUNDARY TABLE and CASE FUNC TABLE;
• secondly queries BOUNDARY TABLE to see which rows of the table are satisfied by the values of the explanatory variables (X1, ..., XN);
• thirdly joins BOUNDARY TABLE and CASE FUNC TABLE based on the query result of the second step and selects the identifier of the piecewise surface to which the values of the explanatory variables belong;
• finally queries the FUNC TABLE; the target value is calculated using both the coefficients in the table and the values of the explanatory variables, and is returned to the calling statement.

In the transportation network, the historical cost set is prepared as a view, shown in Table 5. For each supplier-destination pair, the functional form of the cost can be expressed by a piecewise linear function with two explanatory variables, PAMOUNT1 and PAMOUNT2 (the amounts of product 1 and product 2 shipped between the pair). The target variable is the COST, which is noisy or has missing values.

Table 5 Historical cost table
SID  DID  PAMOUNT1  PAMOUNT2  COST
s1   d1   10        10        21.98
s2   d2   10        30        56.98
s3   d3   22        25        94.36
s4   d4   38        9         88.67
...  ...  ...       ...       ...

There are 12 supplier-destination pairs in total in the transportation network. Correspondingly, twelve rows are inserted into the FUNCTION NAME COLLECTION table.

Table 6 FUNCTION NAME COLLECTION
FUNC ID  FUNC TABLE  BOUNDARY TABLE  CASE FUNC TABLE
S1D1     FUNC S1D1   BOUNDARY S1D1   CASE FUNC S1D1
S1D2     FUNC S1D2   BOUNDARY S1D2   CASE FUNC S1D2
...      ...         ...             ...
S3D4     FUNC S3D4   BOUNDARY S3D4   CASE FUNC S3D4

5 Piecewise Surface Regression Learning Algorithm

5.1 Formal Definition

Regression analysis attempts to build a model based on the relationship of several independent (explanatory) variables and a dependent variable [3]. Let x1, ..., xn be independent variables and y be the dependent variable, all ranging over the set R. The latter is a random variable defined over the underlying distribution of sample tuples in I^n = R × ... × R. Suppose the learning set contains m tuples; denote such a tuple as x^h = (x^h_1, ..., x^h_n) for h = 1, ..., m. The collection of data c = (x^h, y^h), h = 1, ..., m, represents the available training data to estimate the values of the random variable y = f(x^h, β) + N, where N is random noise. We assume that N is distributed as a Gaussian with zero mean and variance σ, such that E(y) = E(f(x^h, β) + N) = E(f(x^h, β)) = f(x^h, β), where E is the expected value. The standard least squares method is used to find the coefficients β of f that minimize σ.

Applications can be found that lie on the borderline between classification and regression; these occur when the input space X can be subdivided into disjoint regions X_i characterized by different behaviors of the function f to be reconstructed. One of the simplest situations of this kind is piecewise surface regression: in this case X is a polyhedron in the n-dimensional space R^n and {X_i}_{i=1}^k is a polyhedral partition of X, i.e., X_i ∩ X_j = ∅ for every i ≠ j, i, j = 1, ..., k, and ∪_{i=1}^k X_i = X. The target of a piecewise surface regression problem is to reconstruct an unknown function f : X → R having a linear behavior in each region X_i:

$$f^*(x) = f_i(x, \beta_i) \quad \text{if } x \in X_i, \; i = 1, \ldots, k \tag{2}$$

when only a training set D containing m samples (x^h, y^h), h = 1, ..., m, is available. The output y^h gives a noisy evaluation of f(x^h), with x^h ∈ X; the region X_i to which x^h belongs is not given in advance. The parameter set β_i for i = 1, ..., k characterizes the function set f_i, and its estimation is one target of the piecewise surface regression problem. The regions X_i are polyhedral, i.e., each is defined by a set of l_i linear inequalities, which can be written in the following form:

$$A_i \begin{pmatrix} 1 \\ x \end{pmatrix} \geq 0 \tag{3}$$

where A_i is a matrix with l_i rows and n + 1 columns; its estimation, for every i = 1, ..., k, is the other target of the learning process. According to (2) and (3), the target of the learning problem is two-fold: to generate both the regions X_i and the parameter sets β_i for the unknown functions f_i, utilizing the information contained in the training set.
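For completeness, note that once the region assignment is fixed, each β_i has the standard closed-form least-squares estimate (a textbook fact stated here for illustration, not a result of the paper):

$$\hat{\beta}_i = \operatorname*{arg\,min}_{\beta} \sum_{x^h \in X_i} \bigl( y^h - f_i(x^h, \beta) \bigr)^2 = (M_i^{\top} M_i)^{-1} M_i^{\top} y_i$$

where M_i stacks the rows (1, x^h) for the training samples falling in region X_i and y_i collects the corresponding outputs. The algorithm of Section 5.2 alternates this per-region estimation with the re-assignment of samples to regions.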

5.2 Algorithm

The idea of PWLearning is described in Algorithm 1. The Expectation Maximization (EM) algorithm [13] has been adapted in our approach. The partition of the input space is obtained by applying a double-fold k-means clustering algorithm that incorporates the value of the target variable. After the clustering of polyhedral regions, a multi-category SVM library [14] is called to find the boundary matrix A_i in (3) for every polyhedral region; each region is represented by a boundary matrix A_i.


For each polyhedral region, a surface regression model is easily learned by robustfit [15]. Similarly to the EM algorithm, an iteration process is involved in our approach: first, the local models are trained according to the resulting clusters; then the data points in every cluster are re-assigned to the local model with the best predictive performance, and the local models are updated again based on the newly created clusters of polyhedral regions. The iteration process is repeated until the termination criterion is reached.

Algorithm 1. The EM-based Multi-step Piecewise Surface Regression Learning
Input: data set D with size m, number of clusters k
Output: local models f_i and cluster boundary matrices A_i for i = 1, ..., k
1 (Local regression) foreach h = 1, ..., m do
  1.1 Build the local dataset E_h containing the sample (x_h, y_h) and the pairs (x, y) ∈ S given by the e − 1 closest neighbors x of x_h.
  1.2 Perform a linear regression to obtain the feature vector v_h (of dimension n + 1) of a linear unit fitting the samples in E_h.
2 (Clustering) Perform the clustering process in the feature vector space.
  2.1 Run regular k-means on the feature vector space R^{n+1} with assigned feature vector center set CV to subdivide the set of feature vectors v_h into k groups U_i.
  2.2 Build a new training set D' containing the m pairs (x_h, i_h), where U_{i_h} is the cluster including v_h.
3 repeat
4   (Classification) Run multi-category classification on the training set D' to compute the cluster boundary matrix A_i for every region X_i.
5   (Regression) For every i = 1, ..., k, run a linear regression on the samples (x, y) ∈ D with x ∈ X_i; the parameter set β_i returned represents the i-th surface function f_i.
    Update the cluster index of each data point, and hence the training set D', according to the minimal predictive error among the surface functions f_i for i = 1, ..., k.
  until the maximum number of iterations is reached or no cluster index is re-assigned
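A minimal sketch of the EMMPSR loop follows, assuming NumPy and scikit-learn are available. Several pieces are simplified stand-ins rather than the authors' implementation: the local-model feature vectors of Step 1 are replaced by clustering the raw inputs augmented with the target, the multi-category SVM of [14] by sklearn's LinearSVC, and robustfit by ordinary least squares. The sketch also assumes every region keeps enough points across iterations.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def emmpsr(X, y, k, max_iter=20):
    # Returns per-region coefficient vectors (intercept first) and a
    # classifier playing the role of the boundary matrices A_i.
    m, n = X.shape
    # Step 2 (simplified): cluster in a space that includes the target,
    # standing in for the local-model feature vectors v_h of Step 1.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(
        np.column_stack([X, y]))
    X1 = np.column_stack([np.ones(m), X])  # design matrix with intercept

    for _ in range(max_iter):
        # Steps 3-4: learn linear region boundaries from current labels.
        clf = LinearSVC(max_iter=5000).fit(X, labels)
        # Step 5: per-region least squares (robustfit in the paper).
        betas = np.zeros((k, n + 1))
        for i in range(k):
            idx = labels == i
            if idx.sum() > n:  # need enough points to fit a surface
                betas[i], *_ = np.linalg.lstsq(X1[idx], y[idx], rcond=None)
        # Re-assign each sample to the surface with the smallest error.
        errors = np.abs(X1 @ betas.T - y[:, None])  # m x k residuals
        new_labels = errors.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no cluster index re-assigned
        labels = new_labels
    return betas, clf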

6 Experimental Results

To evaluate our EM-based multi-step piecewise surface regression algorithm, EMMPSR, we generate synthetic high-dimensional data which is piecewise-defined. We compare the performance of EMMPSR with those of M5P (weka.classifier.trees) [16], classregtree (MATLAB statistical toolbox) [9], and MultilayerPerceptron (a three-layer neural network, weka.classifier.functions) [16] on three sets of synthetic data.


The data sets are generated using three different piecewise models. Each model has linear boundaries between regions and linear functions within each region. Models 1 and 2 each have three regions and two independent variables; model 3 has five regions and nine independent variables. Data in each model are generated with additive Gaussian noise with zero mean and 0.1 variance. We generated 300 sample points for model 1, 900 data points for model 2 and 1500 data points for model 3. The second data set is generated from the piecewise function

$$f^*(x_1, x_2) = \begin{cases} 3 + 4x_1 + 2x_2 & \text{if } 0.5x_1 + 0.29x_2 \ge 0 \text{ and } x_2 \ge 0 \\ -5 - 6x_1 + 6x_2 & \text{if } 0.5x_1 + 0.29x_2 < 0 \text{ and } 0.5x_1 - 0.29x_2 < 0 \\ -2 + 4x_1 - 2x_2 & \text{if } 0.5x_1 - 0.29x_2 \ge 0 \text{ and } x_2 < 0 \end{cases} \tag{4}$$

This target function is depicted in Equation (4) and the data points are plotted in Figure 3. In total, 900 samples are drawn uniformly from I^2 = [−1, 1] × [−1, 1] and y is determined as y = f^*(x_1, x_2) + ε, where ε ∼ N(0, 0.1). In this setting, the target value needs to be combined with the inputs to determine the appropriate cluster prototypes.

Fig. 3 Synthetic Data Set Generated in Model 2
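The model-2 data set can be regenerated from Equation (4) with a short script; a sketch assuming NumPy, where the noise is drawn with standard deviation sqrt(0.1) to match the stated variance of 0.1.

import numpy as np

rng = np.random.default_rng(0)
m = 900
x1, x2 = rng.uniform(-1, 1, m), rng.uniform(-1, 1, m)

def f_star(x1, x2):
    # Target function of Equation (4): three linear pieces on a
    # polyhedral partition of [-1, 1] x [-1, 1].
    out = np.empty_like(x1)
    top = (0.5 * x1 + 0.29 * x2 >= 0) & (x2 >= 0)
    left = (0.5 * x1 + 0.29 * x2 < 0) & (0.5 * x1 - 0.29 * x2 < 0)
    bottom = (0.5 * x1 - 0.29 * x2 >= 0) & (x2 < 0)
    out[top] = 3 + 4 * x1[top] + 2 * x2[top]
    out[left] = -5 - 6 * x1[left] + 6 * x2[left]
    out[bottom] = -2 + 4 * x1[bottom] - 2 * x2[bottom]
    return out

y = f_star(x1, x2) + rng.normal(0, np.sqrt(0.1), m)  # variance-0.1 noise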

The following function estimate is yielded by the EMMPSR algorithm:

$$f(x_1, x_2) = \begin{cases} 3.0067 + 3.9940x_1 + 1.9977x_2 & \text{if } 0.5x_1 + 0.32x_2 \ge 0.005 \text{ and } x_2 \ge 0 \\ -5.0217 - 6.0201x_1 + 6.0056x_2 & \text{if } 0.5x_1 + 0.32x_2 < 0.005 \text{ and } 0.5x_1 - 0.31x_2 < 0.01 \\ -2.0035 + 3.9793x_1 - 2.0330x_2 & \text{if } 0.5x_1 - 0.31x_2 \ge 0.01 \text{ and } x_2 < 0 \end{cases} \tag{5}$$

As noted, the generated model is a good approximation of the unknown function to be learned in Equation (4). Five-fold cross validation is adopted to evaluate the learning performance by randomly dividing the data set into 5 equal parts; each part is held out in turn while the learning method is trained on the remaining four. The root mean squared error (RMSE) [17] is calculated on the unseen data. The results are summarized in Table 7.
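This evaluation protocol can be written down directly; a sketch assuming scikit-learn, where model stands for any of the compared learners wrapped behind fit/predict.

import numpy as np
from sklearn.model_selection import KFold

def cv_rmse(model, X, y, folds=5, seed=0):
    # Each fold is held out once while the model is trained on the
    # remaining four; the mean RMSE over folds is returned.
    rmses = []
    for train, test in KFold(folds, shuffle=True, random_state=seed).split(X):
        model.fit(X[train], y[train])
        resid = model.predict(X[test]) - y[test]
        rmses.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(rmses))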


It is obvious that EMMPSR outperforms the other methods with respect to RMSE.

Table 7 RMSE values for performance comparison experiments on synthetic data sets
Model   M5P      MultilayerPerceptron  Classregtree  EMMPSR
Model1  1.0925   3.0657                2.8899        0.3759
Model2  0.7599   1.8773                0.4995        0.2538
Model3  37.6910  47.8030               33.3755       30.8755

Another metric to be compared among the different methods is the average number of rules generated by each model for a data set. In EMMPSR this is the number of regions, while in M5P and Classregtree it is the number of rules generated during the process of building the tree. EMMPSR uses only a fraction of the rules generated by M5P and Classregtree.

The corresponding tables are generated for the predictive model of Equation (5). FUNC TABLE represents the three piecewise functional forms with their coefficients, BOUNDARY TABLE represents the boundary matrix with its coefficients, and CASE FUNC TABLE joins FUNC TABLE and BOUNDARY TABLE together.

Table 8 FUNC TABLE
PID  C0       C1       C2
P1   3.0067   3.9940   1.9977
P2   -5.0217  -6.0201  6.0056
P3   -2.0035  3.9793   -2.0330

Table 9 BOUNDARY TABLE
BID  A0     A1   A2
B1   0.005  0.5  0.32
B2   0.01   0.5  -0.31
B3   0      0    -1

Table 10 CASE FUNC TABLE
CASE  BID  LESS THAN FLAG
P1    B1   -1
P1    B3   -1
P2    B1   1
P2    B2   1
P3    B2   -1
P3    B3   1
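Reading Tables 8-10, the PREDICT logic amounts to finding the surface whose boundary rows are all satisfied and evaluating its coefficients. A Python sketch of that lookup follows; the interpretation of LESS THAN FLAG used here (1: A1·x1 + A2·x2 < A0, -1: A1·x1 + A2·x2 ≥ A0) is our assumption, since the paper does not spell the convention out.

# In-memory mirrors of Tables 8-10.
FUNC = {"P1": (3.0067, 3.9940, 1.9977),
        "P2": (-5.0217, -6.0201, 6.0056),
        "P3": (-2.0035, 3.9793, -2.0330)}
BOUNDARY = {"B1": (0.005, 0.5, 0.32),
            "B2": (0.01, 0.5, -0.31),
            "B3": (0.0, 0.0, -1.0)}
CASE = {"P1": [("B1", -1), ("B3", -1)],
        "P2": [("B1", 1), ("B2", 1)],
        "P3": [("B2", -1), ("B3", 1)]}

def predict(x1, x2):
    # Table-driven evaluation mirroring the stored function PREDICT:
    # pick the surface whose boundary constraints all hold, then apply
    # its coefficients (c0 + c1*x1 + c2*x2).
    for pid, constraints in CASE.items():
        ok = all((a1 * x1 + a2 * x2 < a0) if flag == 1
                 else (a1 * x1 + a2 * x2 >= a0)
                 for (bid, flag) in constraints
                 for (a0, a1, a2) in [BOUNDARY[bid]])
        if ok:
            c0, c1, c2 = FUNC[pid]
            return c0 + c1 * x1 + c2 * x2
    raise ValueError("point not covered by any region")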

7 Decision Optimization

The optimization process for the transportation network is described in the following query sequence written in SQL.

CREATE VIEW supply-transport-from AS
SELECT s.SID, s.SPAMT1 as total-supply1, s.SPAMT2 as total-supply2,
       sum(t.TAMT1) as total-transp1, sum(t.TAMT2) as total-transp2
FROM Supplier s, Transportation Amount t
WHERE s.SID = t.SID
GROUP BY s.SID
CHECK total-supply1 = total-transp1, total-supply2 = total-transp2

CREATE VIEW destination-transport-to AS
SELECT d.DID, d.DPAMT1 as total-dest1, d.DPAMT2 as total-dest2,
       sum(t.TAMT1) as total-transp1, sum(t.TAMT2) as total-transp2
FROM Destination d, Transportation Amount t
WHERE d.DID = t.DID
GROUP BY d.DID
CHECK total-dest1 = total-transp1, total-dest2 = total-transp2

CREATE VIEW TOTAL-COST AS
SELECT SID, DID, sum(PREDICT(SID + DID, TAMT1, TAMT2))
FROM Transportation Amount t

MINIMIZE TOTAL-COST

The queries follow the syntax of SQL with two exceptions. First, the attributes TAMT1 and TAMT2 in Table 4, Transportation Amount, are marked with the special annotation 'TBD'. Second, the objective of the decision optimization is given as 'MINIMIZE TOTAL-COST'. In DGQL, a decision optimization problem is written as a regular data problem, i.e., a sequence of relational views and accompanying integrity constraints, together with annotations of which database table columns need to be decided by the system (i.e., the variables) and toward what goal (i.e., the optimization objective). Here, existing queries in the reporting software can be used directly. Essentially, DGQL allows users to write an optimization problem as if writing a reporting query in a forward manner. In other words, DGQL has SQL-like syntax, yet it uses mathematical programming algorithms to solve optimization problems by invoking action statements like (maximize) or (minimize). The above two exceptions in the SQL queries can be represented by DGQL queries:

SELECT dgql.augment('Transportation Amount', 'TAMT1', null, null);
SELECT dgql.augment('Transportation Amount', 'TAMT2', null, null);
SELECT * FROM dgql.minimize('TOTAL-COST');

In the optimization process for the transportation network, the total cost is calculated in the view TOTAL-COST by summing the cost of each supplier-destination pair. The transportation cost between supplier 'SID' and destination 'DID' is calculated and returned by the stored function PREDICT(SID + DID, TAMT1, TAMT2); the function identifier FUNC ID is specified by the concatenation of 'SID' and 'DID'. The amounts TAMT1 for product 1 and TAMT2 for product 2 are the explanatory variables.
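Because each rate schedule charges higher rates for later units, the total cost is convex piecewise-linear, so the optimization that DGQL hands to a mathematical-programming solver can be sketched as a plain LP by splitting each flow into per-level segment variables. The sketch below uses the PuLP modeling library with one product and invented supplies, demands, level capacities and rates; the paper instead evaluates the learned PREDICT function inside DGQL.

import pulp

# Invented, balanced instance: 2 suppliers, 2 destinations, one product.
supply = {"s1": 60, "s2": 30}
demand = {"d1": 40, "d2": 50}
# Three rate levels per pair: (segment capacity, unit rate), rates rising.
levels = [(20, 2.0), (20, 3.5), (100, 5.0)]

prob = pulp.LpProblem("transportation", pulp.LpMinimize)
seg = {(s, d, i): pulp.LpVariable(f"x_{s}_{d}_{i}", 0, cap)
       for s in supply for d in demand
       for i, (cap, _) in enumerate(levels)}

# Convexity lets the solver fill cheap segments first, so no binary
# variables are needed to order the levels.
prob += pulp.lpSum(seg[s, d, i] * rate
                   for s in supply for d in demand
                   for i, (_, rate) in enumerate(levels))

flow = {(s, d): pulp.lpSum(seg[s, d, i] for i in range(len(levels)))
        for s in supply for d in demand}
for s, amt in supply.items():  # everything available is shipped
    prob += pulp.lpSum(flow[s, d] for d in demand) == amt
for d, amt in demand.items():  # each destination receives its amount
    prob += pulp.lpSum(flow[s, d] for s in supply) == amt

prob.solve()
print(pulp.value(prob.objective))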


8 Conclusion

An intelligent decision system is proposed in this paper which involves a loop of data collection, learning, optimization and prediction. The EMMPSR algorithm is designed to solve the piecewise surface regression problem in the learning stage. Experimental results show that EMMPSR outperforms several widely used regression tools in terms of RMSE and simplicity of functional forms. Piecewise surface regression is added as a predictive-model extension to the RDBMS. The decision optimization, built on the outcome of the learning stage, shows how to find the optimal solution while satisfying the constraints. Future research topics include the selection of the main features to be used for regression and how to choose the initial number of clusters given as input to the EMMPSR algorithm.

References

1. Hunter, J., McIntosh, N.: Knowledge-Based Event Detection in Complex Time Series Data. In: Proceedings of the Joint European Conference on Artificial Intelligence in Medicine and Medical Decision Making, pp. 271–280 (1999)
2. Fox, J.: Applied Regression Analysis, Linear Models, and Related Methods (1997)
3. Draper, N., Smith, H.: Applied Regression Analysis. Wiley Series in Probability and Statistics (1998)
4. Brodsky, A., Wang, X.: Decision-Guidance Management Systems (DGMS): Seamless Integration. In: The 41st Hawaii International Conference on System Sciences (HICSS-41 2008), pp. 7–10 (2008)
5. Kanellakis, P., Kuper, G., Revesz, P.: Constraint Query Languages. In: Symposium on Principles of Database Systems, pp. 299–313 (1990)
6. Revesz, P.: Constraint databases: A survey. In: Semantics in Databases, pp. 209–246 (1995)
7. Revesz, P., Chen, R.: The MLPQ/GIS Constraint Database System. In: SIGMOD Conference on Management of Data (2000)
8. Brodsky, A., Egge, N., Wang, X.: Reusing Relational Queries for Intuitive Decision Optimization. In: 44th Hawaii International Conference on System Sciences, pp. 1–9 (2011)
9. MATLAB, http://www.mathworks.com/products/matlab
10. The R Project for Statistical Computing, http://www.r-project.org/
11. IBM: IBM DB2 Intelligent Miner, http://www-386.ibm.com/software/data/iminer
12. AMPL, http://www.ampl.com
13. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 1–38 (1977)
14. Chang, C., Lin, C.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
15. Huber, P., Ronchetti, E.: Robust Statistics. Wiley, New York (1981)
16. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.: The WEKA Data Mining Software: An Update. SIGKDD Explorations (2009)
17. Alpaydin, E.: Introduction to Machine Learning. MIT Press, Cambridge (2004)
