On the use of Advanced Inductive Methods for Knowledge Extraction from Complex Datasets

Jaz S. Kandola, Steve R. Gunn, Ian Sinclair, Philippa Reed

ISIS Research Group, Department of Electronics and Computer Science, and Materials Research Group, School of Engineering Sciences, University of Southampton, Southampton, UK. (jsk97r [email protected])

September 30, 1999

Abstract. Advanced inductive methods are increasingly being used for modelling tasks. However, a limitation of many approaches is the lack of model transparency which can aid in model validation and model interpretation. In the last few years, many authors have tried to address this shortcoming from a number of different angles, focusing primarily on selecting inputs which are relevant in predicting the output. The paper compares quantitative and qualitative methods and benchmarks them on artificial and `real world' modelling problems. The quantitative methods considered are: Bayesian neural networks, neurofuzzy networks, support vector machines and multivariate linear regression. Additionally, the qualitative method of graphical Gaussian models is applied for elucidating relationships in the data. Characteristics of the modelling approaches are outlined, and the quantitative methods are compared in terms of generalisation performance and transparency. In conclusion it is shown that models with good transparency and generalisation can be recovered.

Keywords: Knowledge discovery, data mining, inductive methods, transparency, generalisation.

1. Introduction

A goal of data mining is to discover the implicit and non-trivial relationships that exist in data. Advanced inductive methods (Cherkassky, 1998)(Williams, 1996) are well suited to learning the underlying model of a physical system from examples. However, these techniques can be considerably enhanced if they are equipped with the further attribute of transparency, allowing direct visualisation of the underlying relationships between inputs and outputs. This enables prior knowledge to be used to determine the confidence in a model's prediction, and provides a framework into which expert knowledge can be integrated. Neural network methods for function approximation are well documented in the literature (Hush, 1993)(Townsend, 1999). However, the majority of these techniques have been described as being black-box (Ljung, 1987); that is, the mathematical relationships produced by the network architecture are complex, making model interpretation difficult.

© 1999 Kluwer Academic Publishers. Printed in the Netherlands.


Transparent modelling approaches provide an attractive alternative, allowing a model's performance to be assessed not only on predictive performance, but also using expert knowledge. This can be done, for example, by allowing the influence of a particular input or combination of inputs on the output to be assessed. A number of characteristics that can aid in model transparency are robust feature selection algorithms, and techniques for decomposing the model into smaller, more interpretable portions that can be readily visualised. Throughout this paper this is viewed as an important property. However, the accuracy of the model is still paramount, since the interpretation of a model that poorly resembles a system's input/output relationship is of little use. After a review of data modelling and knowledge extraction techniques, the problem of interpreting the complex nonlinear relationships in an artificial and a commercial processing dataset is described.

2. Knowledge Representation and Transparency

The problem of regression is to approximate an unknown function from the observation of a limited sequence of (typically) noise corrupted input/output data pairs. More formally, consider a dataset $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, drawn from an unknown probability distribution, where $\mathbf{x}_i \in \mathbb{R}^n$ represents a set of inputs, $y_i \in \mathbb{R}$ represents a single output, and $N$ represents the number of training examples. The empirical modelling problem is to discover an underlying mapping $\mathbf{x} \rightarrow y$ that is consistent with the dataset $D$. The regression function is learnt from a training set, with its performance being measured on an independent test set. Given the high dimensionality of the modelling problem, the non-systematic distributions of the input data and the typically complex, non-linear underlying relationships, the task is well suited to a computer-based, data-driven approach. The following sections introduce the characteristics associated with model transparency and generalisation.
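As a concrete illustration of this setup, the following minimal sketch (with a hypothetical dataset and a simple stand-in model) fits a regressor on a training split and measures the mean squared error on an independent test split; it is illustrative only and not the experimental procedure of this paper.

```python
import numpy as np

# Hypothetical dataset D = {(x_i, y_i)}, i = 1..N, with x_i in R^n and y_i in R.
rng = np.random.default_rng(0)
N, n = 200, 10
X = rng.uniform(0.0, 1.0, size=(N, n))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + 0.1 * rng.standard_normal(N)  # unknown mapping plus noise

# Learn the regression function on a training set and measure
# generalisation on an independent test set (here a 90/10 split).
split = int(0.9 * N)
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

# Any inductive method could be plugged in here; ordinary least squares is
# used purely as a stand-in.
w, *_ = np.linalg.lstsq(np.c_[np.ones(split), X_train], y_train, rcond=None)
y_pred = np.c_[np.ones(N - split), X_test] @ w

mse_test = np.mean((y_test - y_pred) ** 2)
print(f"test MSE: {mse_test:.3f}")
```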

2.1. Transparency

The ability to visualise complex nonlinear relationships provides an important tool for knowledge extraction. Expert or prior knowledge can be used to increase the robustness of the modelling process. This knowledge can be incorporated into the modelling process directly by introducing a likelihood over the possible models. Alternatively, expert knowledge can be incorporated as an aid to model interpretation and model validation. Model structure can be interpreted by assessment of


parameter or hyperparameter values, as well as by assessing the trends between inputs and output. Qualitative data visualisation algorithms have been proposed in the literature (Bishop and Tipping, 1998). However, the majority of these techniques are limited because they are based on a projection of the data onto a low-dimensional visualisation space. Whilst such plots can reveal the structure of simple datasets, they are severely limited for nonlinear datasets with a large number of variables. Projection and visualisation techniques which are capable of revealing the underlying data structure in these cases are in demand, as complex interactions are commonplace in commercial applications. To this end the use of graphical representations has started to play an increasing role in data modelling (Whittaker, 1990)(Kandola et al, 1999). The learning problem can also be compounded if the data contains a large quantity of redundant information, resulting in poor performance. Data pre-processing can allow the salient information within a dataset to be retained or enhanced, by extracting and transforming features from a set of data. A common problem encountered when using datasets with a high input dimension is the curse of dimensionality (Bellman, 1961). Hence, one of the most important forms of pre-processing may involve a reduction in the dimensionality of the input data. An important consideration with all pre-processing techniques is that care should be taken to ensure that salient information is not lost from the dataset when they are used. However, a small reduction in model performance may be justified if a large number of inputs can be removed, enhancing the transparency of the model.

2.2. Generalisation

After a model has been trained, it is essential to assess its performance with data which is independent of the training process, ensuring an unbiased measure of model performance. The best generalisation performance will be achieved if the capacity of the model is matched to the complexity of the underlying process, and the model representation avoids problems associated with model mismatch (Burges, 1998). Many techniques for improving generalisation performance are inspired by Occam's razor, which states that the simplest model which accurately represents the data is the most desirable. Typically the optimal model size, or even a range of model sizes, is unknown prior to training the model. Hence, the search for the correct model size must be undertaken as an integral part of the inductive process. This is achieved in one of two ways: constrain the model structure, or constrain the model parameters. The addition of a penalty term to the cost function which is being minimised directly influences model capacity via


a smoothness constraint. The simplest penalty term uses the squared norm of the parameter vector, to favour models with small parameter magnitudes. This technique is referred to as zeroth order regularisation, or ridge regression (Bishop, 1995). Whilst regularisation provides a way of generating parsimonious models, Rao (Rao, 1999) argues that the presence of local minima in the error surface will tend to trap gradient based learning techniques, resulting in poor generalisation performance. If the desired regression function performs poorly because of this, the typical recourse is to optimise a larger model, under the assumption that the simpler model did not characterise the data. The larger model will give improved performance on the training set, but will not generalise well to new data. Clearly, techniques in which the error surface is convex are desirable. In fact, the support vector machine approach (described later) has this attractive feature. Overall, it may be seen that establishing an optimal modelling process requires careful consideration and balancing of modelling characteristics. A simple model which has good generalisation performance may not be transparent, since its model structure may not be open to interrogation. This limits the potential for knowledge extraction and validation through expert knowledge. Consequently, this work advocates transparent modelling approaches that can combine the advantages of a parsimonious model representation with good generalisation performance. The next section introduces the advanced inductive techniques which are used in this paper to model two datasets.
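A minimal sketch of zeroth order regularisation (ridge regression) for a linear-in-the-parameters model is given below; the closed-form estimator shown is standard ridge regression rather than the specific implementation used in this work, and the data are synthetic.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Minimise ||y - Xw||^2 + lam * ||w||^2 (zeroth order regularisation).

    The squared-norm penalty on w favours models with small parameter
    magnitudes, trading training error against model capacity.
    """
    n_features = X.shape[1]
    # Closed-form solution: w = (X^T X + lam I)^{-1} X^T y
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Example: a larger lam gives heavier shrinkage of the fitted parameters.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(50)
for lam in (0.0, 1.0, 100.0):
    print(lam, np.round(ridge_fit(X, y, lam), 2))
```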

3. Advanced Inductive Modelling Techniques

This section considers the advanced inductive methods used in this paper to model both an artificial dataset and a commercial dataset. These techniques were considered as they all embody some form of transparency. Five modelling techniques are considered: Graphical Gaussian Models, Multivariate linear regression (MLR), Bayesian multi-layer perceptron (MLP) modelling, Neurofuzzy modelling, and Support Vector Machine (SVM) modelling. The graphical Gaussian model was used to discover the conditional independence structure amongst the data variables. The remaining quantitative approaches were assessed against each other quantitatively using the mean squared error (MSE) test statistic, and qualitatively in terms of transparency. A quantitative approach to the discovery of model structure is the ANalysis Of VAriance (ANOVA) representation. It describes the decomposition of the regression function into a series of additive components, with the objective of representing a model by a subset of terms from the expansion,


$$f(\mathbf{x}) = f_0 + \sum_{i=1}^{N} f_i(x_i) + \sum_{i=1}^{N} \sum_{j=i+1}^{N} f_{i,j}(x_i, x_j) + \cdots + f_{1,2,\ldots,N}(\mathbf{x}) \qquad (1)$$

where $N$ represents the number of inputs, $f_0$ represents the bias and the other terms represent univariate, bivariate, etc., components. These basis functions are semi-local and are similar to the approaches used by Friedman (Friedman, 1991) in the Multivariate Adaptive Regression Splines (MARS) technique. The neurofuzzy and SVM techniques considered in this paper employ an ANOVA representation to enable model structure to be determined by providing trend information from additive sub-models, as well as the more common method of input selection. In contrast, the MLR and Bayesian MLP approaches only allow transparency to be incorporated via interpretation of parameter or hyperparameter values. Trend information in these techniques can be introduced by considering artificial datasets, although this will only reflect a low dimensional slice through the model space and hence it is severely limited.
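To make the ANOVA expansion of equation (1) concrete, the sketch below evaluates a truncated expansion containing a bias, two univariate terms and one bivariate term; the component functions are purely hypothetical and serve only to show why each additive term can be inspected in isolation.

```python
import numpy as np

# Truncated ANOVA expansion: f(x) = f0 + f1(x1) + f2(x2) + f12(x1, x2).
# The component functions below are illustrative placeholders.
f0 = 2.0
univariate = {
    0: lambda x: 3.0 * x,            # f1(x1)
    1: lambda x: np.sin(np.pi * x),  # f2(x2)
}
bivariate = {
    (0, 1): lambda a, b: a * b,      # f12(x1, x2)
}

def anova_predict(x):
    """Sum the additive components; each term can also be plotted on its
    own, which is what makes the representation transparent."""
    out = f0
    out += sum(f(x[i]) for i, f in univariate.items())
    out += sum(f(x[i], x[j]) for (i, j), f in bivariate.items())
    return out

x = np.array([0.4, 0.7, 0.1])  # inputs absent from every term have no effect
print(anova_predict(x))
```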

3.1. Graphical Gaussian Models

The principal role of a graphical Gaussian model is to convey the conditional independence structure in a dataset via a graphical representation. The graphical Gaussian model can aid in the selection of input variables as part of data pre-processing, retaining only those variables which are conditionally dependent upon the output, providing a powerful tool for indicating variable influence as part of a model interpretation strategy. In this paper, the graphical Gaussian model is compared with the quantitative approaches by consideration of variable influence and variable interactions. Let $X'$ be a $k$-dimensional vector of random variables. A conditional independence graph, $G = (V, E)$, describes the association structure of $X'$ by means of a graph, specified by the vertex set $V$ and the edge set $E$ (Whittaker, 1990). There is a directed edge between vertices $i$ and $j$ if the set $E$ contains the ordered pair $(i, j)$; vertex $i$ is a parent of vertex $j$, and vertex $j$ is a child of vertex $i$. A graphical model is then a family of probability distributions, $P_G$, that is a Markov distribution over $G$, where $X'_b$ and $X'_c$ represent the variables for which conditional independence is being tested, given the other variables in the dataset $X'_a$. Following the work of Whittaker (Whittaker, 1990)(Kandola et al,


1999), a graphical Gaussian model was constructed using the deviance statistic given by,

$$\mathrm{dev}(X'_b \perp\!\!\!\perp X'_c \mid X'_a) = -N \ln\left(1 - \mathrm{corr}^2(X'_b, X'_c \mid X'_a)\right) \qquad (2)$$

This test statistic has an asymptotic $\chi^2$ distribution with one degree

of freedom. Elements in this deviance matrix determine the significance of dependencies in the graphical model. A hypothesis test at the 95% confidence level of the $\chi^2$ distribution is used to extract the significant effects. However, it should be borne in mind that the dependencies that are depicted by the graphical model may represent true physical interactions, or they may reflect sampling artifacts.
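A sketch of the deviance test of equation (2) is given below; it assumes that the partial correlation of each pair of variables given the rest is obtained from the inverse sample covariance (precision) matrix, and applies the 95% $\chi^2$ threshold described above. This is an illustrative implementation, not the construction used by the authors.

```python
import numpy as np
from scipy.stats import chi2

def partial_corr(S_inv, b, c):
    """Partial correlation of variables b and c given all others,
    read off from the precision matrix S_inv."""
    return -S_inv[b, c] / np.sqrt(S_inv[b, b] * S_inv[c, c])

def deviance_edges(X, alpha=0.05):
    """Return pairs (b, c) whose deviance -N ln(1 - corr^2) exceeds the
    chi-squared threshold with one degree of freedom (equation 2)."""
    N, k = X.shape
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    threshold = chi2.ppf(1.0 - alpha, df=1)
    edges = []
    for b in range(k):
        for c in range(b + 1, k):
            r = partial_corr(S_inv, b, c)
            dev = -N * np.log(1.0 - r ** 2)
            if dev > threshold:
                edges.append((b, c))
    return edges

# Example with synthetic data: three correlated columns plus one independent one.
rng = np.random.default_rng(2)
z = rng.standard_normal((500, 1))
X = np.hstack([z + 0.1 * rng.standard_normal((500, 1)) for _ in range(3)])
X = np.hstack([X, rng.standard_normal((500, 1))])
print(deviance_edges(X))
```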

3.2. Multivariate Linear Regression

This is the simplest of all the modelling techniques considered, and was motivated by the need to provide a benchmark against which all of the other techniques could be compared. A multivariate linear regression model (MLR) is given by:

$$y = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_N x_N \qquad (3)$$

where $w_1, \ldots, w_N$ are unknown parameters to be estimated, $w_0$ is a bias term and $y$ is the predicted output. The unknown vector of parameters, $\mathbf{w}$, can be estimated in the least squares sense. The uncertainty ($\sigma$) in each of these parametric values is given by,

$$\sigma = \sqrt{\mathrm{MSE} \cdot \mathrm{diag}\left[(X^{T} X)^{-1}\right]} \qquad (4)$$

Low parametric uncertainty values, relative to their parameter values, are desirable since they imply more confidence in the parameter values, and hence more significance in the inputs. Despite the simplicity of the MLR model, normalisation of the data means that the magnitude of the parameter values can be directly interpreted to show the first order importance of these variables in contributing to the output. However, interpretation of parameter values and associated uncertainty provides a crude form of input selection, since the MLR technique typically suffers from the problem of model mismatch.
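A sketch of the MLR fit and the parameter uncertainty of equation (4) is given below; it assumes an ordinary least squares fit with the MSE computed from the training residuals, which is an illustrative choice rather than the exact procedure used in the paper.

```python
import numpy as np

def mlr_fit(X, y):
    """Least squares fit of y = w0 + w1*x1 + ... + wN*xN, returning the
    parameter vector and the per-parameter uncertainty of equation (4)."""
    N = X.shape[0]
    Xb = np.c_[np.ones(N), X]                        # prepend a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    residuals = y - Xb @ w
    mse = residuals @ residuals / (N - Xb.shape[1])  # residual mean squared error
    sigma = np.sqrt(mse * np.diag(np.linalg.inv(Xb.T @ Xb)))
    return w, sigma

# Inputs whose |w| is large relative to sigma are taken as significant.
rng = np.random.default_rng(3)
X = rng.uniform(size=(180, 5))
y = 10 * X[:, 0] + 5 * X[:, 1] + rng.standard_normal(180)
w, sigma = mlr_fit(X, y)
print(np.round(w, 2), np.round(sigma, 2))
```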


3.3. Bayesian Multi-layer Perceptron

A Bayesian MLP is an extension of the classical MLP in that network training takes place in a Bayesian learning framework (Bishop, 1995)(Neal, 1995). Bayesian learning is attractive since it allows prior knowledge to be included in the modelling process by specifying a probability distribution over network parameters. The Bayesian evidence framework of MacKay (MacKay, 1994) was chosen because it includes principled methods for model capacity control, via zeroth order regularisation. To overcome the black-box nature of the Bayesian MLP, Automatic Relevance Determination (ARD) (MacKay, 1994)(Neal, 1995) was incorporated as a method of input selection. In an ARD model, each input variable has an associated hyperparameter that controls the magnitudes of the weights on connections to the input unit. If the hyperparameter associated with an input indicates a small standard deviation for weights out of that input, these weights are likely to be small, and that input will have little effect on the output; if the hyperparameter specifies a large standard deviation the effect of that input will be significant (Neal, 1995). Interpretation of these hyperparameter values enables an input's influence on the network to be assessed, providing a method for knowledge extraction. A Gaussian prior (MacKay, 1994) was chosen for the initial values of the weights (including the bias), which corresponds to the use of weight decay regularisation, limiting the potential for overfitting by controlling the capacity of the model. Given a prior distribution, and an expression for the likelihood (Bishop, 1995), the posterior distribution of the weights can be found using Bayes' theorem. Bayesian learning then simplifies to finding the weight vector which minimises the cost function,

$$A(\mathbf{w}) = \frac{\beta}{2} \sum_{n=1}^{N} \left(y(\mathbf{x}_n; \mathbf{w}) - t^{(n)}\right)^2 + \frac{\alpha}{2} \sum_{i=1}^{W} w_i^2 \qquad (5)$$

The variance of the priors (1= ) and the noise variance ( ) are unknown. In a strict Bayesian framework, parameters which are not de ned must be integrated out using a process referred to as marginalisation (Bishop, 1995). However, this is avoided because of its computational expensive and the approximate method of the evidence framework is used (MacKay, 1994). The cost function is minimised by using the scaled conjugate gradient (SCG) algorithm (Moller, 1993). Empirical results from this work have shown that the nal solution obtained is sensitive to the initial values of and , causing the network to converge to a local rather than global minimum. The approach suggested by Nabney (Nabney, 1999) was used in this work; relatively small values of were initially chosen since they correspond to a small weight decay, allowing greater model exibility during the early stages of training. However, MacKay (MacKay, 1999) has suggested that this problem can be overcome by using Markov Chain Monte Carlo i
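The objective of equation (5) can be illustrated with a small sketch. For brevity a linear-in-the-parameters model stands in for the MLP, the hyperparameters $\alpha$ and $\beta$ are fixed rather than re-estimated by the evidence framework, and a standard (rather than scaled) conjugate gradient routine is used, so this is only a schematic of the cost being minimised.

```python
import numpy as np
from scipy.optimize import minimize

def cost(w, X, t, alpha, beta):
    """A(w) = (beta/2) * sum_n (y(x_n; w) - t_n)^2 + (alpha/2) * sum_i w_i^2,
    with a linear model y(x; w) = w0 + x.w standing in for the MLP."""
    y = w[0] + X @ w[1:]
    data_term = 0.5 * beta * np.sum((y - t) ** 2)
    prior_term = 0.5 * alpha * np.sum(w ** 2)   # Gaussian prior = weight decay
    return data_term + prior_term

rng = np.random.default_rng(4)
X = rng.uniform(size=(100, 3))
t = 2.0 * X[:, 0] - X[:, 2] + 0.1 * rng.standard_normal(100)

# Small alpha = weak weight decay (flexible model); large alpha shrinks the weights.
for alpha in (0.01, 10.0):
    res = minimize(cost, x0=np.zeros(4), args=(X, t, alpha, 1.0), method="CG")
    print(alpha, np.round(res.x, 2))
```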


3.4. Additive Neurofuzzy Networks

Neurofuzzy systems (Jang, Sun and Mizutani, 1997; Brown and Harris, 1994; Bossley, 1997) combine the learning ability of neural networks with a fuzzy representation, to provide an enhanced linguistic representation. Fuzzy systems (Zadeh, 1973) have been developed as a method of representing rule-based knowledge. The additive neurofuzzy networks considered here combine robust, linear optimisation algorithms with a non-linear search procedure in order to decompose the model into an additive structure. This is advantageous because it can simplify the interpretation of the network structure. The neurofuzzy model space is built from B-spline basis functions and the model is searched using the ASMOD (Additive Spline Modelling of Observational Data) algorithm (Bossley, 1997). B-spline basis functions are attractive since their order can be chosen to control the smoothness of the model and, additionally, they can be interpreted as fuzzy membership functions. In the ASMOD approach a set of piecewise polynomial basis functions are defined by a series of knots. The introduction of additional knots within the basis functions enables increasingly complex functions to be approximated, whilst an increase in the order allows potentially smoother functions to be obtained. The resulting model is a multidimensional polynomial surface which can be decomposed as a series of local, low order polynomials, which can be considered as a set of local kth order Taylor series approximations to the system. The model is constructed using a forward selection, backwards elimination algorithm that updates the model iteratively by selecting the best refinement from a set of possible refinements. These refinements can include: knot insertion, knot deletion, subnetwork deletion, as well as decreasing or increasing the order of the B-spline. At each stage in the model construction an MSE based statistical significance measure (Gunn, Brown and Bossley, 1997) is used to select the optimal refinement.
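The B-spline basis functions underlying this representation can be sketched with the Cox-de Boor recursion; the knot vector and order below are arbitrary illustrative choices, not those selected by ASMOD.

```python
import numpy as np

def bspline_basis(x, knots, j, k):
    """Cox-de Boor recursion: value of the j-th B-spline basis function of
    order k (degree k-1), defined on the given knot vector, evaluated at x."""
    if k == 1:
        return 1.0 if knots[j] <= x < knots[j + 1] else 0.0
    left_den = knots[j + k - 1] - knots[j]
    right_den = knots[j + k] - knots[j + 1]
    left = 0.0 if left_den == 0 else (x - knots[j]) / left_den * bspline_basis(x, knots, j, k - 1)
    right = 0.0 if right_den == 0 else (knots[j + k] - x) / right_den * bspline_basis(x, knots, j + 1, k - 1)
    return left + right

# Quadratic (order 3) basis on [0, 1]; inserting extra interior knots refines
# the model, while raising the order gives a smoother approximation.
knots = np.array([0.0, 0.0, 0.0, 0.5, 1.0, 1.0, 1.0])
order = 3
n_basis = len(knots) - order
x = 0.3
values = [bspline_basis(x, knots, j, order) for j in range(n_basis)]
print(np.round(values, 3), "sum =", round(sum(values), 3))  # partition of unity
```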


3.5. Support Vector Machines (SVMs)

SVMs (Vapnik, 1995) have recently received an intensified research effort, due to their strong theoretical foundations and promising empirical performance. The initial development of SVMs was driven by attempts to resolve the bias/variance trade-off, model complexity issues and the incidence of model overfitting. The exploration and formulation of these concepts has contributed to statistical learning and approximation theory. This formulation embodies the principle of structural risk minimisation (SRM) developed by Vapnik (Vapnik, 1995). SRM differs from the commonly used empirical risk minimisation (ERM) by trying to minimise an upper bound on the expected risk, rather than minimising the error on the training set. The notion of the Vapnik-Chervonenkis (VC) dimension is central to SVM learning, and it provides a measure of model capacity (Burges, 1998). If the VC dimension is low, the potential to overfit the data is low, enabling good generalisation performance to be obtained. In a regression framework the SVM approach is similar to zeroth order regularisation (Smola, 1998), and as such a link can be made with the Bayesian MLP. SVMs make use of reproducing kernels, which are functions that provide an elegant approach to dealing with nonlinear algorithms by enabling computations to be done in the input space as opposed to the potentially high dimensional feature space. This is possible because the data only appears in the training problem in the form of dot products, $\mathbf{x}_i \cdot \mathbf{x}_j$. The kernel function is given by

$$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j), \qquad (6)$$

where $\phi$ is the projection from input to feature space. Prior knowledge can be incorporated in SVMs by careful choice of the kernel function (Scholkopf et al, 1999). In this work spline kernels are used due to their ability to approximate arbitrary functions, and their close connection with the curvature penalty function through the representer theorem (Wahba, 1990). Additionally, SVMs allow a wide variety of loss functions to be employed, enhancing their applicability. In this paper the loss function employed within the SVM approach is quadratic. This loss function gives a solution that is identical to a Gaussian process (Burges, 1998)(Williams, 1998); the difference lies solely in their motivation. SVMs, like the Bayesian MLP, are essentially black box models; however, transparency can be introduced by use of the SUpport vector Parsimonious ANOVA (SUPANOVA) technique (Gunn and Brown, 1999). The SUPANOVA technique is designed to select a parsimonious model representation by selecting a small set of terms from the complete ANOVA representation. It achieves this by decomposing the non-linear modelling problem into three stages, as shown in figure 1. The first stage is used to select a complete set of ANOVA basis functions, from which the second stage selects a subset such that accuracy is maintained. The final stage then constructs a model with this sparse kernel. SUPANOVA differs from other parsimonious techniques, such as the ASMOD approach, by trying to find a complete model that is then decomposed into the significant terms. In contrast, ASMOD typically starts off from an empty initial model and then


Figure 1. The SUPANOVA construction framework (Data → ANOVA Basis Selection → Sparse ANOVA Selection → Parameter Selection → Model).

adds and deletes significant terms using a forward selection, backward elimination approach.
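As an illustration of kernel substitution for ANOVA-style modelling, the sketch below builds a tensor-product ANOVA kernel from a first-order spline kernel and passes it to a generic support vector regressor. The particular spline kernel formula and the use of scikit-learn's SVR (which uses an epsilon-insensitive rather than quadratic loss) are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np
from sklearn.svm import SVR

def spline_kernel_1d(x, z):
    """A commonly quoted first-order spline kernel on [0, 1] (assumed form)."""
    m = np.minimum(x, z)
    return 1.0 + x * z + x * z * m - 0.5 * (x + z) * m ** 2 + m ** 3 / 3.0

def anova_spline_kernel(X, Z):
    """Gram matrix of the ANOVA kernel: product over input dimensions of
    (1 + k(x_d, z_d)), so every additive ANOVA term is represented."""
    K = np.ones((X.shape[0], Z.shape[0]))
    for d in range(X.shape[1]):
        K *= 1.0 + spline_kernel_1d(X[:, d][:, None], Z[:, d][None, :])
    return K

rng = np.random.default_rng(5)
X = rng.uniform(size=(180, 5))
y = 10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2 + 10 * X[:, 3] + 5 * X[:, 4]

# SVR accepts a callable kernel that returns the Gram matrix between two sets.
model = SVR(kernel=anova_spline_kernel, C=10.0, epsilon=0.1).fit(X, y)
print("train MSE:", np.mean((model.predict(X) - y) ** 2))
```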

4. Experimental Modelling

Two modelling problems were used to assess the degree of transparency of all the techniques described above. The following section presents the results which were obtained for each of the modelling techniques considered in this paper, for both the artificial and commercial datasets. In each trial, 90% of the data was used for training the model, and 10% was used to estimate the generalisation performance. To reduce the effects of data partitioning on the generalisation estimate, the modelling algorithms were evaluated for 10 different (random) partitions of the data, and the mean values are quoted in the results section.

4.1. The Artificial Dataset

To demonstrate the performance of the different transparent modelling approaches, an artificial modelling problem proposed by Friedman (Friedman, 1991) was used. The model is a ten input function, which contains five redundant inputs, given by,

$$f(x_1, x_2, \ldots, x_{10}) = 10\sin(\pi x_1 x_2) + 20\left(x_3 - \tfrac{1}{2}\right)^2 + 10 x_4 + 5 x_5 + \epsilon \qquad (7)$$

where $\epsilon$ is zero mean, unit variance, additive Gaussian noise, corresponding to approximately 20% noise, and the inputs were generated independently and randomly from a uniform distribution in the interval [0, 1]. The experiments were performed using 200 examples, 180 for training and 20 for estimating the generalisation performance.
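The artificial data can be generated directly from equation (7); the sketch below does so with numpy (scikit-learn's make_friedman1 provides an equivalent generator) and applies the 90/10 split used in the experiments.

```python
import numpy as np

def friedman(n_samples, noise_std=1.0, seed=0):
    """Generate the ten-input Friedman benchmark of equation (7):
    five relevant inputs and five purely redundant ones."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_samples, 10))
    y = (10.0 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20.0 * (X[:, 2] - 0.5) ** 2
         + 10.0 * X[:, 3]
         + 5.0 * X[:, 4]
         + noise_std * rng.standard_normal(n_samples))
    return X, y

# 200 examples: 180 for training, 20 for estimating generalisation.
X, y = friedman(200)
X_train, y_train = X[:180], y[:180]
X_test, y_test = X[180:], y[180:]
print(X_train.shape, X_test.shape)
```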


4.1.1. Graphical Gaussian Model

A graphical Gaussian model provides a tool for visualising the interactions and conditional dependencies between the different data variables. Figure 2 shows the graphical Gaussian model for the artificial dataset.

Figure 2. Illustration of conditional independence structure via a graphical Gaussian model for the artificial dataset (variables shown: y, x1, x2, x4, x5, x8, x10).

For the ten input artificial dataset, a total of 45 relations are possible, of which approximately 10% are found to be significant. The graphical model can be used to provide a form of input selection based on conditional dependencies between the output and the input variables. This suggests that the output could be directly predicted from four input variables.

4.1.2. Multivariate Linear Regression

As noted in section 3.2, normalisation of the data means that the magnitude of the parameter values can be directly interpreted to show the first order importance of these variables in contributing to the output, and low parametric uncertainty values, relative to their parameter values, imply more confidence in the parameter values and hence more significance in the inputs. The parameter values and their associated uncertainty for the training dataset are given in table 1.

Table I. Parameter values and associated uncertainty for a multivariate linear regression model.

          Bias    x1      x2      x3      x4      x5      x6      x7      x8      x9      x10
  w       0.83    6.57    8.56   -0.31   10.0     5.18   -0.48    0.20   -0.96   -0.45   -0.92
  σ       0.93    0.63    0.64    0.62    0.61    0.67    0.65    0.58    0.64    0.67    0.63

Given the parameter values and the associated uncertainty, four inputs have been selected as being necessary to determine the output. These inputs are the same as those chosen by the graphical model.


4.1.3. Bayesian MLP

A single hidden layer network with varying numbers of hidden nodes was used to model the relationship between the inputs and the output. The network weights were initially randomised from a Gaussian distribution. A linear activation function was also used on the output node. A plot of the mean MSE for both the training and the test data sets is shown in figure 3 for an increasing number of hidden nodes. The optimal network structure was determined to be six hidden nodes, since this corresponds to the lowest error on the test set.

Figure 3. Variation of mean training and test MSE for a Bayesian MLP trained with varying numbers of hidden nodes.

The incorporation of ARD allows the influence of each input on the output to be assessed. Table 2 shows the mean ARD hyperparameter values (and associated standard deviation over the ten datasets) indicating the influence of each variable on the output for the optimal model structure. Table 3 shows the ranked selection of each input variable for each of the ten models trained.

Table II. Mean ARD hyperparameter values for six hidden nodes.

          x1      x2      x3      x4      x5      x6       x7      x8      x9      x10
  1/α     0.482   0.423   0.246   0.003   0.001   0.0001   0.00    0.066   0.001   0.001
  std     0.12    0.17    0.06    0.01    0.00    0.001    0.00    0.07    0.01    0.005


Table III. Ranked importance of the input variables.

  Dataset   x1    x2    x3    x4    x5    x6    x7    x8    x9    x10
  1         2nd   1st   3rd   -     -     5th   -     -     4th   -
  2         2nd   1st   3rd   7th   -     6th   -     4th   5th   7th
  3         1st   3rd   2nd   7th   -     -     -     5th   4th   6th
  4         2nd   1st   3rd   5th   -     -     -     4th   -     -
  5         2nd   1st   4th   5th   7th   6th   -     -     3rd   -
  6         1st   2nd   3rd   4th   6th   -     -     -     -     5th
  7         1st   2nd   3rd   4th   -     -     -     -     -     -
  8         1st   2nd   3rd   4th   7th   -     -     5th   6th   -
  9         2nd   1st   3rd   4th   6th   -     -     5th   -     -
  10        1st   2nd   3rd   5th   6th   4th   7th   -     -     -

4.1.4. Neurofuzzy Model

The neurofuzzy model identified all five of the inputs necessary to determine the output. The global transparency introduced by the use of sub-models in the neurofuzzy algorithm is illustrated in figure 4. The neurofuzzy model differs from the MLR and Bayesian MLP in that interpretation is achieved directly through the resulting model structure. In contrast, the MLR and Bayesian MLP are interpreted through their parameter values alone.

4.1.5. Support Vector Machines

Of the possible 1024 different terms in the full ANOVA expansion, only 6 terms were chosen as being significant. Five univariate terms corresponding to the first five data inputs were selected, together with a bivariate term combining the first two data inputs. These six terms define the entire model structure. The regression surfaces for these terms are shown in figure 5, and table 4 shows the stability of these terms across the ten different data partitions.


Figure 4. Submodel structure identified by a Neurofuzzy model: (i) additive subnetwork structure over inputs x1 to x5 feeding a summation to form the output; (ii) subnetwork plots against Inputs 1 to 5.

In summary, table 5 shows the MSE values obtained for each of the modelling techniques used to model this dataset.

4.2. The Commercial Dataset

A commercial processing-properties dataset for DC cast aluminium plate is considered, concentrating on prediction of the mechanical property 0.2% proof stress. This dataset is illustrative of the problems and challenges that arise in real world modelling: high input dimensionality, sparsely distributed data, and highly correlated inputs. The raw dataset consisted of 35 input variables and 2870 data pairs covering alloy


Table IV. Input Selection via ANOVA decomposition in a Support Vector Machine: ANOVA components x1, x2, x3, x4, x5 and x1 ⊗ x2 against datasets 1 to 10.

Table V. MSE values for each of the modelling techniques used on the artificial dataset.

              Linear   Neurofuzzy   Bayesian MLP   SVM
  MSE Train   4.20     1.11         0.69           0.49
  MSE Test    4.45     1.49         0.73           0.72

compositional and thermomechanical processing information. Given the complex nature of the processing problem, for a physically interpretable model to be constructed a smaller subset of inputs was selected based on prior knowledge and the data distribution. The number of inputs was initially reduced by selecting a data subset for a single tensile testing direction (LT) at one position in the final plate, which accounted for the majority of the dataset. All of the major alloying elements and the major impurities were retained as inputs to the model. This data pre-processing resulted in a reduced size dataset. Ten input variables remained: the final gauge (FG), Cu, Fe, Mg, Mn, Si (all in weight percent), cast slab length (SL), solution treatment time (STT), percentage stretch (%st.), and reduction ratio (RR). After removing the data entries with missing and repeated values, a total of 290 data points remained.
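A sketch of the kind of pre-processing described above is given below, assuming the raw records are held in a pandas DataFrame; the file name and the column names ("direction", "position", "proof_stress", "pct_stretch") are hypothetical placeholders, while the retained input columns follow the list given in the text.

```python
import pandas as pd

# Hypothetical raw table of 2870 records with 35 process/composition columns.
raw = pd.read_csv("plate_processing.csv")  # file name is illustrative only

inputs = ["FG", "Cu", "Fe", "Mg", "Mn", "Si", "SL", "STT", "pct_stretch", "RR"]
target = "proof_stress"

# Keep a single tensile testing direction (LT) at one plate position,
# retain the chosen inputs, then drop missing entries and exact repeats.
subset = (
    raw.loc[(raw["direction"] == "LT") & (raw["position"] == 1),
            inputs + [target]]
       .dropna()
       .drop_duplicates()
)
print(len(subset), "records retained")
```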


4.2.1. Graphical Gaussian Model

Figure 6 illustrates the graphical model obtained for the commercial dataset. This suggests that proof stress could be directly predicted from FG, STT and %st., since these are the only variables with which proof stress has direct links. This implies that the composition elements (within the available data ranges) have no direct effect on proof stress. Within this dataset the compositional inputs are also correlated with FG, STT and %st., so the direct effects of these variables may mask other relationships.

4.2.2. Multivariate Linear Regression

The normalised inputs allow the parameter values to be assessed in terms of their overall effect on the output. Table 6 shows the parameter and associated uncertainty terms for the commercial dataset.

Table VI. Parameter values and associated uncertainty for a multivariate linear regression model.

          Bias    FG      Cu      Fe      Mg      Mn      Si      SL      STT     %st.    RR
  w       360.0   -18.1   5.30    4.50   -0.58    4.86   -8.01   -0.41   24.4    21.5    -2.1
  σ       4.85    3.45    5.05    3.39    6.12    3.80    3.64    3.37    8.20    3.98    5.68

The bias term, the final gauge, solution treatment time and percentage stretch show the largest parameter values with lowest uncertainty. Proof stress is predicted to have a proportional dependence on STT and %st. and an inverse proportional dependence on FG. Selection of these variables is consistent with the graphical model dependencies. The mean MSE obtained for the linear model was 94.1 for training and 145.0 for testing.

4.2.3. Bayesian MLP

The Bayesian MLP network was trained in the same manner as it was for the artificial dataset. Figure 7 shows the variation of training and test set errors for increasing numbers of hidden nodes. The optimal network structure was determined to have seven hidden nodes, since this corresponds to the lowest error on the test set. Table 7 shows the mean ARD hyperparameter values (and associated standard deviation over the ten datasets) indicating the influence of each variable on the output for the optimal model structure, and table 8 shows the ranked selection of each input variable for each of the


ten models trained. From the values quoted, final gauge (FG), silicon (Si), percentage stretch (%st.) and slab length (SL) exhibit the largest values. Three of these four inputs are consistent with those inputs selected by the graphical Gaussian model and the MLR. The mean MSE obtained for the training data was 57.6, whilst that for the test data was 90.5, representing an effective standard deviation of ±9.5 MPa.

Table VII. Mean ARD hyperparameter values for seven hidden nodes.

          FG      Cu      Fe      Mg      Mn      Si      SL      STT     %st.    RR
  1/α     0.134   0.046   0.075   0.164   0.05    0.485   0.281   0.066   0.184   0.09
  std     0.08    0.05    0.03    0.24    0.04    0.15    0.18    0.03    0.07    0.04

Table VIII. Ranked importance of the input variables.

  Dataset   FG    Cu    Fe    Mg    Mn    Si    SL    STT   %st.  RR
  1         4th   -     -     -     -     1st   2nd   -     3rd   -
  2         2nd   -     -     -     -     1st   4th   5th   3rd   6th
  3         5th   -     -     3rd   6th   1st   2nd   -     4th   -
  4         -     -     -     1st   4th   2nd   3rd   -     5th   6th
  5         -     -     3rd   -     -     1st   -     -     2nd   4th
  6         -     -     -     -     -     1st   2nd   -     3rd   4th
  7         5th   -     -     4th   -     2nd   1st   -     3rd   -
  8         3rd   -     -     -     -     2nd   1st   -     -     -
  9         5th   -     -     4th   -     1st   2nd   -     3rd   -
  10        2nd   5th   7th   -     -     1st   4th   -     3rd   6th


4.2.4. Neurofuzzy Model

The neurofuzzy model identified only three inputs as being significant in determining the proof stress, one of which, FG, was identified as important by the graphical model and the multiple linear regression model. All three inputs were selected as being important by the Bayesian MLP. The global transparency introduced by the use of sub-models in the neurofuzzy algorithm is illustrated in figure 8. Figure 8(i) shows the model structure obtained in terms of basis functions defined on a series of knots, whilst figure 8(ii) shows trend information for the inputs which the neurofuzzy model selected as being important in determining proof stress. Two of these three inputs are consistent with those selected by the graphical model and MLR, whilst all of these inputs were chosen by the Bayesian MLP. Figure 8(ii) shows that the neurofuzzy model predicts that an increase in the final gauge will result in a decrease in the proof stress, and an increase in the percentage stretch will result in a proportional increase in the proof stress. Similar dependencies were predicted by the linear regression. The mean error for the training set was 130.6, whilst the generalisation error was 132.6, which corresponds to an effective standard deviation of 11.5 MPa on the proof stress.

4.2.5. Support Vector Machines

The SVM based SUPANOVA technique was applied to the ten input materials dataset; of the possible 1024 different terms in the full ANOVA expansion, only 12 terms were chosen as being significant. The full selection of terms is given in table 9. The univariate terms selected were the bias, Mg, Si, STT and %st., the bivariate terms were FG ⊗ Mg, FG ⊗ RR, Cu ⊗ Si, Fe ⊗ Si, Mn ⊗ SL and Si ⊗ RR, and the trivariate terms were FG ⊗ Cu ⊗ Si and Fe ⊗ Si ⊗ %st. Examples of these are illustrated in figure 9. Table 9 shows the stability of these terms across the ten different data partitions. These regression surfaces represent interaction terms; to see the overall effect of an input all the interaction terms associated with that input must be considered. Figure 9(i) is an example of a univariate interaction term. This allows visualisation of the global contribution of percentage stretch on proof stress, as an independent effect, as it does not appear in any other terms. Interpretation of the bivariate (figure 9(ii)) and trivariate (figure 9(iii)) terms can be less straightforward. By looking at all 12 terms of the type shown in figure 9 the entire structure of the model is defined. There is a degree of complexity associated with the interpretation of any one variable, as they may appear in several terms, e.g. final gauge in figures 9(ii) and 9(iii). The total contribution of final gauge towards the model


Table IX. Input Selection via ANOVA decomposition in a Support Vector Machine: ANOVA components Cu, Mg, Si, STT, %st., FG ⊗ Mg, FG ⊗ %st., FG ⊗ RR, Cu ⊗ Si, Fe ⊗ Si, Mn ⊗ SL, Si ⊗ RR, FG ⊗ Cu ⊗ Si, FG ⊗ Mg ⊗ %st., Cu ⊗ Mg ⊗ %st., Fe ⊗ Si ⊗ %st., Fe ⊗ Si ⊗ RR, Fe ⊗ SL ⊗ RR and Fe ⊗ Si ⊗ SL ⊗ RR against datasets 1 to 10.

Paper.tex; 3/11/1999; 11:28; p.19

Table X. MSE values for each of the modelling techniques used on the commercial dataset.

              Linear         Bayesian MLP   Neurofuzzy      SVM
  MSE Train   94.1 (±9.7)    57.6 (±8)      130.6 (±11.4)   61.4 (±7.8)
  MSE Test    145.0 (±12)    90.5 (±9.2)    132.6 (±11.5)   80.8 (±9)

5. Discussion

A range of modelling techniques have been applied to two empirical modelling problems. Figures 10 and 11 summarise the performance of each of these quantitative techniques by showing the target versus prediction plots for the artificial and commercial dataset respectively. Graphical Gaussian modelling was used to reveal the conditional independence structure between the data variables in both datasets. The structure obtained for the artificial dataset (figure 2) showed the output to be conditionally dependent upon only four of the input variables, and no significant amount of dependency on the irrelevant inputs. It is interesting to note that x3 was not selected as being conditionally dependent upon the output. This could be attributed to the deviance statistic being inaccurate, since it depends upon a linear correlation term which is unable to represent a quadratic term accurately. However, for the commercial dataset the output was shown to be conditionally dependent only upon three inputs, although there was a significant amount of dependence between the different input variables themselves. This would seem to make the separation of a particular input's influence on the output difficult. As such the result will be sensitive to the way a modelling technique handles these dependencies. This type of behaviour is typical of many manufacturing datasets. The MLR model exhibits the worst generalisation performance on both problems, and this is depicted in both target versus prediction plots, showing that the linear model has not been able to capture any of the complexity associated with the underlying process. This can be attributed to the problem of model mismatch, since the MLR expression is not sufficiently expressive to capture the underlying function. However, for the artificial dataset the MLR model is capable of selecting the four `linear' trends of the five relevant inputs which contribute to the output. In the case of the commercial dataset it selects three inputs


which are in keeping with those selected by the advanced nonlinear techniques. It would be expected that the additive nature of the neurofuzzy modelling technique would be suited to modelling the artificial learning problem, which is itself additive. The neurofuzzy modelling algorithm achieved good levels of performance on this dataset, being able to select all of the relevant inputs and reject the spurious ones. This is illustrated by looking at the target versus prediction plot for the neurofuzzy model. However, despite being more advanced than the MLR model, they both gave similar levels of performance on the commercial dataset. The target versus prediction plot for this model shows that the neurofuzzy model is capable of modelling the central range of the data accurately (which is where the majority of the data lies); however, it is not able to model the extremes of the data distribution. This can be attributed to neurofuzzy modelling being considered as a reductionist approach to data modelling, resulting in parsimonious models. The heuristic search procedure used to find these parsimonious models can produce local minima, trapping the model solution and therefore resulting in suboptimal model structures. This can result in fewer inputs than are required being included in the final model structure. The Bayesian MLP model showed improved predictive performance over both the MLR and neurofuzzy modelling for both datasets. The target versus prediction plots for both datasets show that the Bayesian MLP is capable of modelling the full range of the data. Incorporation of ARD allows input/output variable relevance to be assessed. The incorporation of weight decay regularisation acts to constrain model structure, resulting in parsimonious models. In the case of the artificial dataset the network gives some influence to those inputs which are irrelevant in determining the output. This is illustrated by table 3, where ARD for some data partitions gives influence to inputs x8, x9 and x10, which are simply noise. For the commercial dataset, ARD suggests that three inputs are important in determining the output and these are consistent across all of the ten dataset partitions. Empirical results from this work have shown that the Bayesian MLP is sensitive to the initial values selected for the variance of the prior ($1/\alpha$) and the estimate of the noise variance ($1/\beta$), resulting in convergence to a local rather than global minimum. Recent work by Penny and Roberts (Penny and Roberts, 1999) has shown the evidence framework of MacKay (MacKay, 1994) to be limited. The Gaussian approximation to the posterior distribution is central to MacKay's Bayesian evidence framework (MacKay, 1994)(Penny and Roberts, 1999). Walker (Walker, 1969) has shown that in the limit of an infinite training set the posterior distribution does become Gaussian. However, with a finite number of


22 Jaz S. Kandola, Steve R. Gunn, Ian Sinclair, Philippa Reed samples the assumption fails, hence the evidence cannot be computed accurately. These limitations can be attributed to not evaluating the Hessian matrix explicitly, which is avoided because of its computational expense. Algorithms such as SCG (which was used to minimise the Bayesian costfunction) do not evaluate the Hessian matrix explicitly, but instead compute it using a nite di erence approximation to the dot product of the Hessian and the search direction. However, when computing the ARD hyperparameters the evidence framework is constructed using only the positive eigenvalues. This is because at a local minimum, some eigenvalues of the Hessian may be negative implying that the posterior variance in some directions of weight space is negative (Penny and Roberts, 1999). This situation is handled by ignoring these problematic directions and just considering the distribution in the non-negative directions. In networks which have redundant weights it will be possible that the Hessian matrix is singular or nearly singular. In these cases, eigenvalues will be of machine precision order, hence the calculation of the determinant (which uses products of eigenvalues) will be unreliable. The support vector machine approach showed the best performance on both the arti cial and commercial datasets. The target versus prediction plots show that the SVM approach is able to generalise well to regions where a large amount of data does not exist. The promising performance of this technique can be attributed to a number of attractive features. The use of a quadratic loss function results in a convex learning surface. This has the advantage of having only a single global minima which removes the problems associated with convergence to a local minima. The choice of kernel function allows expert knowledge to impose constraints on the desired characteristics of the resulting solution. This can typically include constraints on the smoothness of the solution. The SUPANOVA technique provides a complete ANOVA expansion which can de ne the entire model structure. Selection of input variables via the transparent SUPANOVA approach for the arti cial dataset and commercial dataset are illustrated in table 4 and table 5 respectively. For the arti cial dataset the relevant inputs are consistently chosen across the di erent data partitions with a high degree of accuracy. Given the nature of the commercial processing problem, determination of relevant inputs is considerably more complex. Table 9 shows that a selection of univariate and bivariate terms are consistently chosen across the ten data partitions, however for certain data partitions trivariate terms are selected with high uncertainty. Incorporation of these terms requires further validation. The MLR and Bayesian MLP approaches whilst giving di erent levels of performance to each other on the datasets, both incorporate


transparency by allowing model structure to be interpreted via a single numerical parameter or hyperparameter value. In addition to providing a form of input selection, the neurofuzzy and SUPANOVA techniques differ by providing additional transparency through the visualisation of low order additive input/output sub-models. The Bayesian MLP approach can also be used to depict trend information. This can be done by the generation of artificial datasets, where an input is allowed to vary between its minimum and maximum values but with the other inputs set to their mean values. These datasets are limited because they only represent a particular slice through the input space, and hence the trend depicted will only represent highly local behaviour. In contrast, the ANOVA approach in the neurofuzzy and SUPANOVA techniques is a global depiction of the trends, and as such is more valuable. The predicted trends for the relevant inputs in the artificial dataset and for a selection of the terms for the commercial dataset are given in figure 5 and figure 9 respectively. For the commercial dataset the presence of a distinct silicon (Si) effect in all of the quantitative modelling techniques considered is consistent with the known behaviour of this type of metal (Wilson et al, 1967), although it is interesting to note that in the MLR model the parametric uncertainty is approximately half the parameter value. The predicted trends for the univariate input terms x3, x4 and x5, and the single bivariate term, are consistent with the function, although the trends for the univariate terms x1 and x2 require further validation.

6. Conclusions

This paper described the modelling of both an artificial and a commercial dataset, using a range of advanced adaptive numeric methods. In the context of this example the key empirical modelling themes of model validation, transparency and generalisation have been illustrated. The models produced have allowed a greater understanding of the nature of the data and the underlying physical process, and in the context of these examples, the limitations of empirical modelling have also been described. For accurate data driven models to be produced, the need for good quality and uncorrelated data is paramount. Reliance on a single empirical modelling approach will not gain the maximum amount of information from the data. A degree of transparency in the modelling process allows validation of the model against physical understanding, in addition to assessment of the generalisation error alone. The Bayesian MLP and support vector machine approaches have given similar levels of modelling performance on both the datasets used in this study.


However, the two techniques differ in their degrees of model transparency. The SUPANOVA approach has been shown to combine model transparency with good generalisation performance compared to the other approaches, showing that model transparency and generalisation performance need not be mutually exclusive.

7. Acknowledgments

The authors would like to thank I. Nabney and D. J. C. MacKay for helpful comments on ARD and the evidence framework. This work is supported by a grant from the EPSRC (Engineering and Physical Sciences Research Council) and British Aluminium Plate.

References

V. Cherkassky and F. Mulier. Learning from Data: Concepts, Theory and Methods. John Wiley Publishers, 1998.
P. M. Williams. Using neural networks to model conditional multivariate densities. Neural Computation, vol. 8, pp. 843-854, 1996.
L. Ljung. System identification: theory for the user. Prentice-Hall Publishers, 1987.
C. M. Bishop and M. E. Tipping. A hierarchical latent variable model for data visualisation. IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, 1998.
J. H. Friedman. Multivariate Adaptive Regression Splines. The Annals of Statistics, vol. 19, no. 1, pp. 1-141, 1991.
J. Whittaker. Graphical Gaussian Models in Applied Multivariate Statistics. Wiley Publishers, 1990.
L. A. Zadeh. Outline of a new approach to the analysis of complex systems and decision processes. IEEE Trans. on Systems, Man and Cybernetics, vol. 3, no. 1, pp. 28-44, 1973.
R. E. Bellman. Adaptive Control Processes. Princeton University Press, 1961.
S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, vol. 4, no. 1, pp. 1-58, 1992.
C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
J. S. Kandola, S. R. Gunn, I. Sinclair and P. A. S. Reed. Data Driven Knowledge Extraction for Materials Property Prediction. IEEE Intelligent Processing and Manufacturing of Materials, USA, 1999.
R. M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag Publishers, 1995.
D. J. C. MacKay. Bayesian non-linear modelling for the prediction competition. ASHRAE Transactions: Symposia, OR-94-17-1, 1994.
M. Moller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, vol. 6, no. 1, pp. 525-533, 1993.
J. S. R. Jang, C. T. Sun and E. Mizutani. Neurofuzzy and Soft Computing. Prentice-Hall Publishers, 1997.


M. Brown and C. J. Harris. Neurofuzzy Adaptive Modelling and Control. Prentice-Hall Publishers, 1994.
K. M. Bossley. Neurofuzzy Modelling for System Identification. Ph.D. thesis, University of Southampton, 1997.
V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag Publishers, 1995.
C. J. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, vol. 2, 1998.
A. J. Smola. Learning with Kernels. Ph.D. thesis, GMD First, available from http://www.gmd.first.de, 1998.
G. Wahba. Spline Models for Observational Data. Series in Applied Mathematics (SIAM), vol. 59, 1990.
C. K. I. Williams. Prediction with Gaussian Processes: From Linear Regression to Linear Prediction and Beyond. Learning and Inference in Graphical Models, MIT Press, 1998.
S. R. Gunn and M. Brown. SUPANOVA - A Sparse Transparent Modelling Approach. IEEE Neural Networks for Signal Processing, Wisconsin, USA, Aug. 1999.
W. D. Penny and S. J. Roberts. Bayesian neural networks for classification: how useful is the evidence framework? Available from http://www.ee.ic.ac.uk/hp/staff/sjrob/pubs.html, 1999.
I. T. Nabney. Personal communication. Neural Computing Research Group, School of Engineering and Applied Science, Aston University, UK, March 1999.
D. J. C. MacKay. Personal communication. Department of Physics, Cavendish Laboratory, Cambridge University, UK, March 1999.
A. M. Walker. On the asymptotic behaviour of posterior distributions. Journal of the Royal Statistical Society, B, vol. 31(1), pp. 80-88, 1969.
D. R. Hush and B. Horne. Progress in supervised neural networks. IEEE Signal Processing Mag., pp. 8-39, Jan. 1993.
N. W. Townsend and L. Tarassenko. Estimations of Error Bounds for Neural Network Function Approximators. IEEE Trans. Neural Networks, vol. 10, pp. 217-230, March 1999.
A. V. Rao, D. J. Miller, K. Rose, and A. Gersho. A Deterministic Annealing Approach for Parsimonious Design of Piecewise Regression Models. IEEE Trans. PAMI, Feb. 1999.
S. R. Gunn, M. Brown, and K. M. Bossley. Network Performance Assessment for Neurofuzzy Data Modelling. Advances in Intelligent Data Analysis: Reasoning about Data, Springer Verlag Publishers, pp. 313-324, 1997.
B. Scholkopf, S. Mika, C. J. C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. J. Smola. Input Space Versus Feature Space in Kernel-Based Methods. IEEE Trans. Neural Networks, vol. 10, pp. 1000-1018, Sep. 1999.
R. N. Wilson, D. M. Moore and P. J. E. Forsyth. Effects of 0.2% Silicon on Precipitation Processes in an Aluminium-2.5% Copper-1.2% Magnesium Alloy. J. Inst. Metals, vol. 95, pp. 177-183, 1967.


Figure 5. The univariate and bivariate interaction terms obtained from SUPANOVA for the artificial dataset: (i) x1 univariate term, (ii) x2 univariate term, (iii) x3 univariate term, (iv) x4 univariate term, (v) x5 univariate term, (vi) x1-x2 bivariate term.


Figure 6. Illustration of conditional independence structure via a graphical Gaussian model for the commercial dataset (variables: PS, FG, %st., STT, RR, Mg, Cu, Fe, Mn, Si, SL).

Figure 7. Variation of mean training and test MSE for a Bayesian MLP trained with varying numbers of hidden nodes (commercial dataset).


Figure 8. Submodel structure identified by a Neurofuzzy model: (i) additive subnetworks on Si, %Stretch and Final Gauge summed to give Proof Stress; (ii) subnetwork trend plots for Si, %Stretch and Final Gauge against Proof Stress.


Figure 9. Examples of univariate, bivariate and trivariate interaction terms obtained from SUPANOVA: (i) univariate term in % Stretch; (ii) bivariate term in Final Gauge and Reduction Ratio; (iii) trivariate term in Final Gauge, Cu and Si.


Figure 10. Target versus Prediction plots for each of the advanced inductive techniques used on the artificial dataset: (i) MLR, (ii) Neurofuzzy, (iii) Bayesian MLP, (iv) SVM.

Figure 11. Target versus Prediction plots for each of the advanced inductive techniques used on the commercial dataset: (i) MLR, (ii) Neurofuzzy, (iii) Bayesian MLP, (iv) SVM.
