DATA MINING FOR SYSTEM IDENTIFICATION SUPPORT

Sandro Saitta, Benny Raphael, Ian F.C. Smith
Informatique et Mécanique Appliquées à la Construction (IMAC), Ecole Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland
{sandro.saitta, benny.raphael, ian.smith}@epfl.ch

Abstract: A system identification methodology that makes use of data mining techniques to improve the reliability of identification is presented in this paper. An important aspect of this methodology is the generation of a population of candidate models. An indication of the reliability of system identification can be obtained through an examination of characteristics of the population. This paper presents data mining techniques that provide support for this examination.

1. INTRODUCTION

The goal of system identification [4] is to determine the state of a system and the values of system parameters through comparisons of predicted and observed responses. Correctly interpreting the models produced by such techniques is therefore important. Two challenges associated with system identification are that many model predictions might match the observations and that the best matching model may not be the correct model. For the purposes of this paper, the reliability of identification is defined as the probability that the candidate model(s) obtained through system identification correspond to reality. Reliability is poor when many models predict the same response at measured locations. Factors that affect the reliability of system identification have been studied in previous research [7]. The present work uses machine learning and data mining techniques [8] to estimate the reliability of identification. Three techniques are used: correlation [2], principal components analysis (PCA) [3] and decision trees [1].

2. METHODOLOGY

The system identification methodology developed in [6] is summarised below. Users input measurement data and specify sets of modelling assumptions related to the structure. The model selection process identifies a set of candidate models whose predictions best match the measurements. A feature extraction module extracts characteristics of these models in order to examine whether the system identification is reliable. If the characteristics of good models show wide variation, many different models match the measurements. This happens when parameter values compensate for the effects of incorrect assumptions or when the measurement system is inadequate. If good solutions are located in one well-defined region of the search space, parameters are more likely to be accurate.

A global search algorithm called PGSL [5] is used to identify candidate models. PGSL finds the minimum of a user-defined objective function by sampling the solution space using a probability distribution function that is updated using evaluations of previous samples. Tests carried out on highly non-linear benchmark functions indicate that PGSL performs better than other search methods when the number of variables is increased.

System identification involves the minimisation of an objective function that computes the difference between model predictions and measurements. Typically a Mean Square Error (MSE) function is used. All the models whose objective function values lie below a threshold are selected as candidates. The threshold is computed through an estimate of modelling and measurement errors. In order to perform data mining, all models (both good and bad) that are generated during the search are saved. Good models (candidates) that are identified within a PGSL run are usually located near the best solution found in that run. Therefore multiple PGSL runs are carried out in order to obtain good models in different parts of the search space.
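The bookkeeping described above (evaluate an MSE objective, save every model, label those below a threshold as candidates, repeat over several runs) can be sketched as follows. This is only an illustration under stated assumptions: uniform random sampling stands in for PGSL, which is not reproduced here; the two-parameter `predict` function is a hypothetical model, not the beam model of the study; and the threshold value is assumed rather than derived from modelling and measurement errors.

```python
import random


def mse(predicted, measured):
    """Mean square error between predicted and measured responses."""
    return sum((p - m) ** 2 for p, m in zip(predicted, measured)) / len(measured)


def predict(params, locations):
    """Hypothetical two-parameter model (assumption, not the paper's beam model)."""
    slope, offset = params
    return [slope * x + offset for x in locations]


def search_run(measured, locations, bounds, n_samples, rng):
    """One search run; uniform random sampling is a stand-in for PGSL."""
    models = []
    for _ in range(n_samples):
        params = tuple(rng.uniform(low, high) for low, high in bounds)
        models.append((params, mse(predict(params, locations), measured)))
    return models


def identify(measured, locations, bounds, threshold,
             n_runs=5, n_samples=2000, seed=0):
    """Save all models from all runs and flag those whose MSE is below the threshold."""
    rng = random.Random(seed)
    saved = []
    for _ in range(n_runs):
        saved.extend(search_run(measured, locations, bounds, n_samples, rng))
    # Each row: (parameter values, objective value, good/bad flag).
    dataset = [(params, error, error < threshold) for params, error in saved]
    candidates = [row for row in dataset if row[2]]
    return dataset, candidates


# Synthetic "measurements" generated from known parameters (illustration only).
locations = [0.0, 1.0, 2.0, 3.0]
measured = predict((2.0, 1.0), locations)
dataset, candidates = identify(
    measured, locations, bounds=[(0.0, 4.0), (0.0, 2.0)], threshold=0.05
)
print(len(dataset), "models saved,", len(candidates), "candidates")
```

With uniform sampling the separate runs are statistically identical; with PGSL each run would concentrate near its own best solution, which is why several runs are needed to cover different parts of the search space.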
A timber beam supported on springs has been used to illustrate the methodology described in the previous section. Although this study focuses on a beam structure, the methodology can be applied to other structures in other domains. After candidate models are selected using the system identification methodology, a set of parameters is chosen. Currently this is done manually.¹ In the present study, every model has the same set of parameters.² The data set used for mining includes an additional boolean attribute that indicates whether the model is good or bad according to the MSE function. This results in an m × (p + 1) matrix, where m is the number of models and p is the number of input parameters.

3. RESULTS

Correlation measurement avoids difficulties related to visualization in a p-dimensional space. It brings out information related to the reliability of identification of parameters. However, correlation measures can only accommodate linear relationships between two parameters, whereas non-linear relationships usually exist between parameters. Another limitation is that relationships between more than two parameters at a time cannot be obtained.

This last limitation is partly overcome through principal components analysis (PCA). Models consist of parameters p_i, and two notable results are obtained using PCA. First, some parameters have more importance than others. This is reflected by the coefficients of the principal components: more important parameters have higher coefficient values than the others. Secondly, the coefficients for some parameters may always be zero. This does not mean that these parameters have no influence on separating good and bad models. On the contrary, it means that every good model has the same value for p_i. In other words, p_i is a reliable parameter for model identification. This result is confirmed by the two-dimensional plot of the MSE versus the parameter p_i. PCA brings out the fact that there are relationships between parameters of good models. However, it is difficult to determine the exact relationship: since PCA is a linear data mining method, it is not able to bring out non-linear relationships in the data. Decision trees provide new information related to the data and confirm other relationships.
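The correlation and PCA steps discussed above can be illustrated with a small, hedged sketch. Everything in it is an assumption made for illustration: the data set is synthetic (three hypothetical parameters, where the first two compensate for each other in good models and the third takes the same value in every good model), both analyses are applied to the good models only, and the first principal component is obtained by simple power iteration rather than by whichever PCA implementation was used in the study.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / (n - 1)
             for j in range(p)] for i in range(p)]


def first_principal_component(rows, n_iter=100, seed=0):
    """Coefficients of the first principal component, found by power iteration."""
    cov = covariance_matrix(rows)
    p = len(cov)
    rng = random.Random(seed)
    v = [rng.random() for _ in range(p)]  # random start vector
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Synthetic m x (p + 1) data set: three parameters plus a good/bad flag.
rng = random.Random(1)
dataset = []
for _ in range(200):  # good models: p1 and p2 compensate, p3 is always 1.0
    p1 = rng.uniform(1.0, 2.0)
    dataset.append((p1, 3.0 - p1 + rng.gauss(0.0, 0.01), 1.0, True))
for _ in range(200):  # bad models: unrelated parameter values
    dataset.append((rng.uniform(0.0, 3.0), rng.uniform(0.0, 3.0),
                    rng.uniform(0.0, 2.0), False))

good = [row[:3] for row in dataset if row[3]]
r12 = pearson([g[0] for g in good], [g[1] for g in good])
pc1 = first_principal_component(good)
print("correlation(p1, p2) among good models:", round(r12, 3))
print("first principal component coefficients:", [round(c, 3) for c in pc1])
```

In this synthetic case the strong negative correlation between the first two parameters signals that they compensate for each other, so neither is identified reliably, while the zero coefficient of the third parameter reflects that all good models share the same value for it.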
An example of the new information that decision trees provide is the importance of some other parameter. Therefore, decision trees have advantages over other techniques. One of the strengths of decision trees is that they generate easily understandable rules. Such trees help bring out the potential meaning of data. The limitation of decision trees is that the method does not perform well when combinations of variables (in the form of linear or non-linear relationships) determine classes of data points.

¹ In future, a semi-automatic feature selection methodology is planned.
² The current methodology will be extended to accommodate models that contain different sets of parameters.

4. CONCLUSIONS

Simple data mining methods such as correlation, principal components analysis and decision trees help support system identification and indicate the reliability of solutions. These techniques are complementary and each method can be used to verify conclusions obtained from the others. However, all three methods have limitations. The most important drawback is that non-linear relationships between parameters may not be discovered. Future work involves the use of other data mining methods based on kernels, such as Support Vector Machines (SVM) and kernel PCA. Other challenges include automatic feature selection and accommodating data containing varying sets of parameters. Finally, a range of data mining methods will be combined into a methodology for supporting system identification.

References

[1] Breiman L., Friedman J.H., Olshen R.A. and Stone C.J. Classification and regression trees. Wadsworth International Group, Belmont, California, 1984.

[2] Edwards A.L. An introduction to Linear Regression and Correlation. Second Edition, W.H. Freeman and Company, New York, 1984.

[3] Jackson J.E. A user's guide to principal components. Wiley series in probability and mathematical statistics, 1991.

[4] Ljung L. System Identification - Theory For the User. Prentice Hall, 1999.

[5] Raphael B. and Smith I.F.C. A direct stochastic algorithm for global search. Applied Mathematics and Computation, 146, 2003, pp 729-758.

[6] Robert-Nicoud Y. Une méthodologie mesures-modèles pour l'identification de systèmes de génie civil. PhD Thesis, EPFL, Lausanne, 2003.

[7] Robert-Nicoud Y., Raphael B. and Smith I.F.C. Improving the reliability of system identification. Next Generation Intelligent Systems in Engineering, Fortschritt-Berichte VDI, 4, No 199, VDI Verlag, 2004, pp 100-109.

[8] Witten I. and Frank E. Data Mining. Practical machine learning tools and techniques with Java implementations, Morgan Kaufmann Publishers, 2000.
