
Automatic composition and optimisation of multicomponent predictive systems

arXiv:1612.08789v1 [cs.LG] 28 Dec 2016

Manuel Martin Salvador, Marcin Budka, and Bogdan Gabrys
Data Science Institute, Bournemouth University, Poole, United Kingdom

Abstract—Composition and parametrisation of multicomponent predictive systems (MCPSs) consisting of chains of data transformation steps is a challenging task. This paper is concerned with theoretical considerations and extensive experimental analysis for automating the task of building such predictive systems. In the theoretical part of the paper, we first propose to adopt the Well-handled and Acyclic Workflow (WA-WF) Petri net as a formal representation of MCPSs. We then define the optimisation problem in which the search space consists of suitably parametrised directed acyclic graphs (i.e. WA-WFs) forming the sought MCPS solutions. In the experimental analysis we focus on examining the impact of considerably extending the search space, resulting from incorporating multiple sequential data cleaning and preprocessing steps into the process of composing optimised MCPSs, and on the quality of the solutions found. In a range of extensive experiments, three different optimisation strategies are used to automatically compose MCPSs for 21 publicly available datasets and 7 datasets from real chemical processes. The diversity of the composed MCPSs found is an indication that fully and automatically exploiting different combinations of data cleaning and preprocessing techniques is possible and highly beneficial for different predictive models. Our findings can have a major impact on the development of high quality predictive models as well as on their maintenance and scalability aspects needed in modern applications and deployment scenarios.

Index Terms—Automatic predictive model building and parametrisation; Multicomponent predictive systems; KDD process; CASH problem; Bayesian optimisation; Data preprocessing; Predictive modelling; Petri nets

I. INTRODUCTION

PERFORMANCE of data-driven predictive models heavily relies on the quality and quantity of data used to build them. In real applications, even if data is abundant, it is also often imperfect, and considerable effort needs to be invested into the labour-intensive task of cleaning and preprocessing such data in preparation for subsequent modelling. Some authors claim that these tasks can account for as much as 60-80% of the total time spent on developing a predictive model [1], [2]. Therefore, approaches and practical techniques that reduce this effort, by at least partially automating some of the data preparation steps, can potentially transform the way in which predictive models are built.

In many scenarios one needs to sequentially apply multiple preprocessing methods to the data (e.g. outlier detection → missing value imputation → dimensionality reduction), effectively forming a preprocessing chain. Composition of such a chain, which apart from choosing the components to use and arranging them in a particular order also includes

setting their hyperparameters, is a challenging problem that has been studied in different fields and contexts (e.g. clinical data [3], [4], historical documents [5] or the process industry [6]). Despite these works attempting to be as abstract as possible, the frameworks they propose are optimised for particular case studies, and hence their underlying approaches are lacking in the context of cross-domain generalisation.

After the data has been preprocessed in an appropriate way, the next step in a data mining process is modelling. Similarly to preprocessing, this step can also be very labour-intensive, requiring evaluation of multiple alternative models. Hence automatic model selection has been attempted in different ways, for example using active testing [7], meta-learning [8] or information theory [9]. A common theme in the literature is comparison of different models using data that has always been preprocessed in the same way. However, some models may perform better if they are built using data specifically preprocessed with a particular model type in mind. In addition, hyperparameters play an important role in most models, and setting them manually is time-consuming for two main reasons: (1) there are typically multiple hyperparameters which can take many values (with an extreme case being continuous hyperparameters), and (2) they are validated using cross-validation (CV).

Hyperparameter optimisation is an additional step which is usually performed after the model has been selected (see e.g. [10]–[12]). However, the problem is in fact very similar to model selection, since different parametrisations of the same model (e.g. different kernels in an SVM, different structures of a neural network) can be treated as different models. Therefore it makes sense to approach both the model and hyperparameter selection problems jointly. The Combined Algorithm Selection and Hyperparameter optimisation (CASH) problem presented in [13] defines a search space in which both these tasks are merged. A major limitation of that study was the use of only one preprocessing step (i.e. feature selection), whose optimisation was decoupled from model building. That is, both the feature selection and the predictor were individually optimised, and therefore the selected features were not chosen based on their predictive power in conjunction with the learning algorithm used in the subsequent step, but on some other criteria (e.g. correlation with the target variable). In our approach we have included the feature selection and additional preprocessing components as part of the whole CASH problem.

In this paper we extend the approach presented in [13] to support joint optimisation of predictive models (classifiers and regressors) and preprocessing chains. We refer to a sequence of preprocessing steps followed by


a machine learning model as a MultiComponent Predictive System (MCPS). The motivation for automating the composition of MCPSs is twofold. In the first instance it will help to reduce the amount of time spent on the most labour-intensive activities related to predictive modelling, and therefore allow human expertise to be dedicated to other tasks. The second motivation is to achieve better results than a human expert could, given a limited amount of time. The number of possible methods and hyperparameter combinations increases exponentially with the number of components in an MCPS, and in the majority of cases it is not computationally feasible to evaluate all of them.

In our previous work [14] we presented the first approach for hyperparameter optimisation of WEKA¹ classifiers that modify their inner data preprocessing behaviour, by recursively expanding the search space constructed by Auto-WEKA (a tool for solving the CASH problem defined by WEKA algorithms and hyperparameters). This paper extends our previous work by: a) formally defining an MCPS as a Petri net, together with its theoretical considerations for the CASH problem; b) significantly enlarging the search space by allowing and testing the composition of MCPSs with an arbitrary number of WEKA filters; c) a thorough analysis and discussion of the discovered solutions; d) an analysis of the trade-off between exploration and exploitation in the investigated search algorithms and strategies; and e) performing further experiments with 7 new datasets from the process industry, where automation of the whole chain of preprocessing, data cleaning, model selection and parametric optimisation is critical for predictive performance in a production environment.

¹ A popular open-source data mining package developed at the University of Waikato: http://www.cs.waikato.ac.nz/ml/weka/

The paper is organised as follows. The next section reviews previous work in automating the CASH problem and highlights the available software. Section III formally describes multicomponent predictive systems. Section IV extends the CASH problem to MCPSs and describes the challenges related to automation of their composition. In Section IV-A, our contributions to the Auto-WEKA software, now allowing to create and optimise arbitrary chains of preprocessing steps followed by a predictive model, are presented. The methodology used to automate MCPS composition is discussed in Section V, followed by the results in Section VI, and conclusions in Section VII.

II. RELATED WORK

The Combined Algorithm Selection and Hyperparameter configuration (CASH) problem [13] consists of finding the best combination of learning algorithm $A^*$ and hyperparameters $\lambda^*$ that optimise an objective function $\mathcal{L}$ (e.g. Eq. 1 minimises the k-fold CV error) for a given dataset $\mathcal{D}$. Formally, the CASH problem is given by:

$$A^{*}_{\lambda^{*}} = \operatorname*{arg\,min}_{A^{(j)} \in \mathcal{A},\; \lambda \in \Lambda^{(j)}} \; \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\big(A^{(j)}_{\lambda},\, \mathcal{D}^{(i)}_{train},\, \mathcal{D}^{(i)}_{valid}\big) \qquad (1)$$

where $\mathcal{A} = \{A^{(1)}, \ldots, A^{(k)}\}$ is a set of algorithms with associated hyperparameter spaces $\Lambda^{(1)}, \ldots, \Lambda^{(k)}$. The loss function $\mathcal{L}$ takes as arguments an algorithm configuration $A_{\lambda}$ (i.e. an instance of a learning algorithm and hyperparameters), a training set $\mathcal{D}^{(i)}_{train}$ and a validation set $\mathcal{D}^{(i)}_{valid} = \mathcal{D} \setminus \mathcal{D}^{(i)}_{train}$.
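To make the objective of Eq. 1 concrete, the following minimal sketch evaluates a single algorithm configuration by k-fold CV. It is illustrative only: it uses scikit-learn as a stand-in for WEKA (the tool used in this paper), and `make_model` is a hypothetical factory that instantiates one configuration $A_{\lambda}$.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cash_objective(make_model, X, y, k=10):
    """Mean k-fold CV misclassification error of one configuration (cf. Eq. 1)."""
    errors = []
    for train_idx, valid_idx in StratifiedKFold(n_splits=k).split(X, y):
        model = make_model()   # A_lambda: an algorithm with fixed hyperparameters
        model.fit(X[train_idx], y[train_idx])
        errors.append(np.mean(model.predict(X[valid_idx]) != y[valid_idx]))
    return float(np.mean(errors))

# Example: evaluating one point of the search space, a 10-tree random forest
# from sklearn.ensemble import RandomForestClassifier
# err = cash_objective(lambda: RandomForestClassifier(n_estimators=10), X, y)
```

The CASH search then amounts to minimising this function over both the choice of `make_model` and its hyperparameters.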

TABLE I. AVAILABLE SOFTWARE TOOLS SUPPORTING SMBO.

| Tool | Language | URL | Description |
|---|---|---|---|
| SMAC [16] | Java | http://bit.ly/1KwlDjE | Library for SMAC and ROAR methods |
| Hyperopt [17] | Python | http://bit.ly/1V0iBuM | Library including random search and TPE methods |
| HPOLib [19] | Python | http://bit.ly/1YjVbjf | Wrapper for SMAC, TPE and Spearmint methods |
| Auto-WEKA [13] | Java | http://bit.ly/1KqxLhD | Uses both SMAC and Hyperopt to automatically select and optimise 39 classifiers and 24 possible feature selection methods included in WEKA |
| Auto-Sklearn [20] | Python | http://bit.ly/1gxpCjC | Uses HPOLib and meta-learning to automate the algorithm selection of scikit-learn (a machine learning library for Python) |
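As an illustration of how such libraries are driven, the sketch below runs Hyperopt's TPE over a toy conditional space; the space and the objective are made up for the example and are not the CASH setup used later in this paper.

```python
from hyperopt import fmin, tpe, hp

# Toy conditional space: 'gamma' exists only when the RBF kernel is chosen,
# i.e. a conditional attribute in the sense discussed in this section.
space = hp.choice('classifier', [
    {'name': 'svm_linear', 'C': hp.loguniform('lin_C', -5, 5)},
    {'name': 'svm_rbf',
     'C': hp.loguniform('rbf_C', -5, 5),
     'gamma': hp.loguniform('rbf_gamma', -5, 5)},
])

def objective(config):
    # Stand-in for the k-fold CV error of the sampled configuration
    return abs(config['C'] - 1.0)

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(best)
```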

The CASH problem can be approached in different ways. One example is grid search, i.e. an exhaustive search over all possible combinations of discretised parameters. Such a technique can however be computationally prohibitive in large search spaces or with big datasets. Instead, a simpler mechanism like random search, where the search space is randomly explored in a limited amount of time, has been shown to be more effective in high-dimensional spaces [12].

A promising approach that has been gaining popularity in recent years is based on the Bayesian optimisation framework [15]. In particular, Sequential Model-Based Optimisation (SMBO) [16] is a framework that incrementally builds a regression model using instances formed from an algorithm configuration λ and its predictive performance c, which is subsequently used to explore promising candidate configurations. Examples of SMBO methods are SMAC (Sequential Model-based Algorithm Configuration [16]) and TPE (Tree-structured Parzen Estimator [17]). SMAC models p(c|λ) using a random forest, while TPE maintains separate models for p(c) and p(λ|c). Both methods support continuous, categorical and conditional attributes (i.e. attributes whose presence in the optimisation problem depends on the values of other attributes, e.g. the Gaussian kernel width parameter in an SVM is only present if the SVM is using Gaussian kernels). Two other methods falling into the same category are ROAR (Random Online Aggressive Racing [16]), which randomly selects the set of candidates instead of using a regression model, and Spearmint [18], which uses Gaussian processes to model p(c|λ), similarly to SMAC. However, due to their limitations (it has been shown that SMAC outperforms ROAR [16], while Spearmint does not support conditional attributes), these methods are not used in this study.

Currently available software tools supporting SMBO methods are listed in Table I. It should be noted however, that to the best of our knowledge, there are no comprehensive studies or tools which would tackle the problem of flexible composition of many data preprocessing steps (e.g. data cleaning, feature selection, data transformation) and their simultaneous parametric optimisation, which is one of the key issues addressed in this paper.
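A minimal sketch of one SMBO iteration, under the simplifying assumptions that configurations are already encoded as numeric vectors and that a random forest surrogate with expected improvement is used (SMAC-style; all names here are illustrative):

```python
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor

def smbo_propose(history_X, history_c, sample_candidate, n_candidates=1000):
    """Fit a surrogate of p(c|lambda) on the configurations evaluated so far,
    then return the candidate maximising expected improvement."""
    X, c = np.asarray(history_X), np.asarray(history_c)
    surrogate = RandomForestRegressor(n_estimators=100).fit(X, c)
    candidates = np.array([sample_candidate() for _ in range(n_candidates)])
    # Mean/spread across the forest's trees approximate the predictive distribution
    per_tree = np.stack([t.predict(candidates) for t in surrogate.estimators_])
    mu, sigma = per_tree.mean(axis=0), per_tree.std(axis=0) + 1e-12
    z = (c.min() - mu) / sigma
    ei = (c.min() - mu) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
    return candidates[int(np.argmax(ei))]                    # next config to evaluate
```

In a full optimiser this proposal step alternates with evaluating the proposed configuration (e.g. by k-fold CV) and appending the result to the history.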


III. MCPS DESCRIPTION

Data-driven workflows have been used to guide data processing in a variety of fields. Some examples are astronomy [21], biology [22], clinical research [3], [4], archive scanning [5], telecommunications [23] or banking [24], to name just a few. The common methodology in all these works consists of following a number of steps to prepare a dataset for building a predictive model. The workflow resulting from connecting different methods to solve a predictive task is known as a Multi-Component Predictive System (MCPS [14]). There are different approaches that can be considered in order to formalise MCPSs:

1) Function composition: Each component is a function f : A → B that performs an operation on an input tensor² X and returns an output tensor Y. Several components can be connected by composing functions, i.e. f(g(X)). However, this approach cannot handle MCPSs with parallel paths or with varying numbers of components.

2) Directed Acyclic Graphs (DAG): A DAG G = (F, A) is a graph made of a set of components (or nodes) F and a set of directed arcs A connecting pairs of components. This approach is very flexible and makes it easy to understand the structure of the MCPS. However, DAGs on their own are not expressive enough to model system execution (e.g. how data is transformed, what meta-data is generated, multiple iterations, and preconditions) or temporal behaviour (e.g. duration and delays).

3) Petri nets (PN): Petri nets are a modelling tool that can be applied to many types of systems [25]. A Petri net PN = (P, T, F) is a directed bipartite graph consisting of a set of places P and transitions T connected by arcs F. Depending on the system, places and transitions can be interpreted in different ways. In this work, places are considered to be data buffers and transitions are data processing methods. PNs have been shown to be very useful for modelling workflows [26] since they are very flexible, can accommodate complex process logic and have a strong mathematical foundation. We therefore propose this approach to model MCPSs.

A multicomponent predictive system can be represented as a WA-WF-net (Well-handled and Acyclic Workflow Petri net [26], [27]). Formally,

$$MCPS = (P, T, F) \qquad (2)$$

is a directed acyclic graph where P and T are finite sets of nodes called places and transitions, respectively, and F is the set of arcs connecting the nodes. In an MCPS, a place p ∈ P can contain a single token, which can be either input data or predictions, depending on where the token is located. To generalise, we can say that a token represents a tensor (i.e. a multi-dimensional array). Each transition t ∈ T represents a method that takes a token, performs a certain operation on the data (e.g. normalisation), and then returns a new token with the modified data. The arcs indicate the direction in which the token is flowing.

² A tensor is a multi-dimensional array.

A place p ∈ P of an MCPS can only have a single input (denoted as •p) and a single output (p•). The initial place i has no input (•i = ∅) and is called the source, while the last place o of the net is a sink (o• = ∅). However, a transition can have multiple input and output arcs. We consider that all the transitions in an MCPS are either AND-split or AND-join nodes. That is, a transition t is only fired when all input places of t contain a token. After that, t consumes those tokens and produces a new token for each output place of t.

The lifetime of an MCPS is defined by a set of states M = {M0, ..., Mm}. Each state is the distribution of the tokens over P. At the initial state M0, only one token is available, at the source i. When the transitions with incoming tokens are fired, the state of the net changes from Mn to Mn+1. At the final state Mm, it is guaranteed that one single token is available at the sink o.

Definition 1: A Petri net is an MCPS iff all the following conditions apply:

• The Petri net is a workflow net³.
• The Petri net is well-handled and acyclic⁴.
• The places P \ {i, o} have only a single input and a single output.
• The Petri net is 1-sound⁵.
• The Petri net is safe⁶.
• All the transitions with multiple inputs or outputs are AND-join or AND-split, respectively.
• The token at i is a set of unlabelled instances and the token at o is a set of predictions for those instances.
• The Petri net can be hierarchical⁷.

³ WF-nets have source and sink places, and are k-sound (i.e. it is guaranteed that if the net starts with k tokens in i, it will end with k tokens in o at Mm). Please refer to Def. 6 of [26] for a formal definition.
⁴ WA-nets satisfy that there is exactly one arc between a place and a transition, and that there are no cycles in the graph. Please refer to Def. 5.1 of [27] for a formal definition.
⁵ The 1-soundness property guarantees that only one token is available in i at M0 and in o at Mm (and all other places are empty).
⁶ The safeness property implies that each place p ∈ P cannot contain more than one token at any given time.
⁷ Hierarchical nets contain special transitions made of Petri nets.

An MCPS can be as simple as the one shown in Figure 1, with a single transition. For example, the token in i can be a set of unlabelled instances, and t a classifier which consumes that token from the arc $f_{i,t}$ and generates a token in o with the predicted labels through $f_{t,o}$. An MCPS can however be hierarchically extended [27], since each transition t can be either atomic or another WA-WF-net (with additional starting and ending dummy transitions); see Figure 2, where atomic transitions are black and special transitions are grey. As a consequence, an MCPS can model very complex systems with multiple data transformations and parallel paths (see e.g. Figure 3 for a multi-hierarchy example).

An atomic transition t is an algorithm with a set of hyperparameters λ that affect how the token is processed. In an MCPS, the main transitions are: (1) preprocessing methods, (2) predictors, (3) ensembles and (4) postprocessing methods.


Fig. 1. Multicomponent predictive system with a single transition


Depending on the number of inputs and outputs, we can distinguish the following types of transitions:

• 1 → 1 transitions (e.g. a classifier that consumes unlabelled instances and returns predictions)
• 1 → n transitions (e.g. random subsampling that consumes a set of instances and returns several subsets)
• n → 1 transitions (e.g. a voting classifier that consumes multiple predictions per instance and returns a single prediction per instance)

An example showing all these types of transitions in a hierarchical MCPS is displayed in Figure 3, where the AdaBoost method uses a VotedPerceptron as its base classifier and applies several preprocessing steps to the input data.

IV. AUTOMATING MCPS COMPOSITION

Building an MCPS is typically an iterative, labour- and knowledge-intensive process. Despite a substantial body of research in the area of automated and assisted MCPS creation and optimisation (see e.g. [28] for a survey), a reliable fully automated approach still does not exist. Similarly to process mining [29], in which a WF-net is composed by extracting knowledge from the event logs of a working process, we tackle MCPS composition as a CASH problem.

To accommodate the definition of an MCPS into a CASH problem, we generalise $\mathcal{A}$ from Eq. 1 to be a set of MCPSs rather than individual algorithms. Hence each MCPS $A^{(j)} = (P, T, F)^{(j)}$ now has a hyperparameter space $\Lambda^{(j)}$, which is a concatenation of the hyperparameter spaces of all its transitions T. The CASH problem is now concerned with finding $(P, T_{\lambda^*}, F)^*$ such that:

$$(P, T_{\lambda}, F)^* = \operatorname*{arg\,min}_{(P,T,F)^{(j)} \in \mathcal{A},\; \lambda \in \Lambda^{(j)}} \; \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\big((P, T_{\lambda}, F)^{(j)},\, \mathcal{D}^{(i)}_{train},\, \mathcal{D}^{(i)}_{valid}\big) \qquad (3)$$

The main reason for MCPS composition being a challenging problem is the potentially large size of the search space.

Fig. 2. Hierarchical multicomponent predictive system with parallel paths
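The search space of Eq. 3 can be pictured as a nested structure in which choosing a method for a node conditionally activates that method's own hyperparameters. The sketch below, with made-up node names and continuous ranges only, draws one random, fully-parametrised MCPS:

```python
import random

# Illustrative-only encoding of the space behind Eq. 3: every node of the chain
# offers alternative methods, each carrying its own hyperparameters.
SPACE = {
    'missing_values': {'none': {}, 'mean': {}, 'em': {'iterations': (10.0, 100.0)}},
    'transformation': {'none': {}, 'standardize': {}, 'pca': {'variance': (0.5, 1.0)}},
    'predictor':      {'svm': {'C': (0.001, 1000.0)}, 'forest': {'trees': (10.0, 500.0)}},
}

def sample_mcps(space):
    """Draw one random point of the combined structure/hyperparameter space."""
    config = {}
    for node, methods in space.items():
        method = random.choice(sorted(methods))
        params = {name: random.uniform(lo, hi)
                  for name, (lo, hi) in methods[method].items()}
        config[node] = (method, params)
    return config

print(sample_mcps(SPACE))
```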




Fig. 3. The best MCPS in FULL search space for ‘dexter’ dataset as shown in Table VII. The type and name of the preprocessing methods and predictors used are explained in Tables II and III respectively.

To begin with, an undetermined number of nodes can make the workflow arbitrarily complex. Secondly, we do not know a priori the order in which the nodes should be connected. Also, even transitions belonging to the same category (e.g. missing value imputation) can vary widely in terms of the number and type of their hyperparameters (continuous, categorical or conditional), with the definition of a range for each hyperparameter being an additional problem in itself.

The size of the search space can be reduced by applying a range of constraints, like limiting the number of nodes, or restricting the list of methods using meta-learning [30], prior knowledge [31] or surrogates [32]. In this study we have opted for the first of these approaches, by limiting the maximum length of an MCPS. However, the CASH problem is not only challenging because of the size of the search space, but also due to the amount of time needed to evaluate every possible solution, which makes techniques like grid search infeasible. Even 'intelligent' strategies can struggle with exploration, because the high-dimensional search space is likely to be plagued with many local minima.

We use the predictive performance as a sole optimisation objective, as shown in Eq. 3, noting however that some problems may require optimising several objectives at the same time (e.g. error rate, model complexity and runtime [33]). In this paper we use our extended Auto-WEKA version, described in the next section, which supports automatic composition and optimisation of MCPSs with WEKA filters and predictive models as transitions.

Once the MCPS is composed and its hyperparameters


optimised, it is trained with a set of labelled instances. Then the MCPS is ready to make predictions, as shown in Figure 4.

A. Extension and generalisation of Auto-WEKA

Auto-WEKA is a software tool developed by Thornton et al. [13] which allows algorithm selection and hyperparameter optimisation both in regression and classification problems. The current Auto-WEKA version is 2.1 and it provides a blackbox interface for the CASH problem as expressed in Eq. 1, where the search space is defined by WEKA predictors. In this work we have extended Auto-WEKA 0.5, due to its flexibility for the necessary extensions to model MCPSs, not present in the current blackbox version. Both versions provide a one-click solution for automating algorithm selection and hyperparameter optimisation. However, version 0.5 is much more flexible, offering multiple customisation possibilities like preselection of WEKA predictors, choosing the optimisation strategy or setting the optimisation criteria. Auto-WEKA also supports various usage scenarios depending on user knowledge, needs and available computational budget. One can, for example, run several optimisations in parallel, ending up with multiple solutions that can then be analysed individually or used to build an ensemble [34].

In our previous work [14] we presented a new Auto-WEKA feature that allows automatic expansion of WEKA classes during creation of the search space. That is, if a hyperparameter is a category formed of several WEKA classes, Auto-WEKA will now add the hyperparameters of such a class recursively (a generic sketch of this expansion is given below). Any WEKA filter can now be included as part of the composition process. In addition, we have developed a new WEKA filter that creates a flexible chain of common preprocessing steps including missing value handling, outlier detection and removal, data transformation, dimensionality reduction and sampling. The following external WEKA packages⁸ have been included as part of the developed extended Auto-WEKA tool to increase the number of preprocessing methods: EMImputation, RBFNetwork, StudentFilters, baggedLocalOutlierFactor, localOutlierFactor, partialLeastSquares and wavelet. Furthermore, we have created a new filter that combines any outlier detection method and the removal of the detected outliers in a single step. We have also implemented a new data sampling filter for WEKA in which instances are periodically selected given a fixed interval of time, which is common in the process industry to reduce the size of datasets. While space restrictions do not allow us to include more implementation details, all the scripts for the analysis of the extended Auto-WEKA results, such as the creation of plots and tables, have also been released in our repository⁹.

⁸ http://weka.sourceforge.net/packageMetaData/
⁹ https://github.com/dsibournemouth/autoweka
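A generic sketch of this recursive expansion follows; the component descriptors are hypothetical (the actual implementation operates on WEKA class hierarchies):

```python
# If a categorical hyperparameter takes classes as values (a 'complex' parameter
# in the sense of Tables II and III), the hyperparameters of every candidate
# class are added to the search space recursively, with prefixed names.
def expand(component, space, prefix=''):
    for name, param in component['params'].items():
        key = prefix + name
        if isinstance(param, dict) and param.get('kind') == 'class':
            space[key] = list(param['choices'])           # the class choice itself
            for cls_name, cls in param['choices'].items():
                expand(cls, space, prefix=f'{key}.{cls_name}.')
        else:
            space[key] = param                             # numeric range or categories
    return space

# Example: a DecisionTable-like component whose 'search' parameter is a class
decision_table = {'params': {
    'useIBk': [True, False],
    'search': {'kind': 'class', 'choices': {
        'BestFirst': {'params': {'direction': [0, 1, 2]}},
        'GreedyStepwise': {'params': {'conservative': [True, False]}},
    }},
}}
print(expand(decision_table, {}))
```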

V. METHODOLOGY

The three main characteristics which define a CASH problem are: a) the search space, b) the objective function and c) the optimisation algorithm. In this study we have considered three search spaces:

• PREV: This is the search space used in [13], where predictors and meta-predictors (which take outputs from one or more base predictive models as their input) were considered (756 hyperparameters). It can also include the best feature selection method found after running the 'AttributeSelection' WEKA filter (30 hyperparameters) for 15 minutes before the optimisation process begins. We use it as a baseline.

• NEW: This search space only includes predictors and meta-predictors. In contrast with the PREV space, no previous feature selection stage is performed. We would like to note however that some WEKA classifiers already have internal preprocessing steps, as we showed in our previous work [14]. We take into account that a categorical hyperparameter can be either simple or complex (i.e. when it contains WEKA classes). In the latter case, we increase the search space by recursively adding the hyperparameters of each method belonging to such a complex parameter (e.g. the 'DecisionTable' predictor contains a complex hyperparameter whose values are three different types of search methods with further hyperparameters; see Table III for details). This extension increases the search space to 1186 hyperparameters.

• FULL: This search space has been defined to support a flow with up to five preprocessing steps, a predictive model and a meta-predictor (1564 hyperparameters). The nodes are connected in the following order: i → missing value handling → p1 → outlier detection and handling¹⁰ → p2 → data transformation → p3 → dimensionality reduction → p4 → sampling → p5 → predictor → p6 → meta-predictor → o. This flow is based on our experience with the process industry [6], but these preprocessing steps are also common in other fields (a code sketch of such a chain is given after the list of experimental scenarios below). If the meta-predictor transition is either 'Stacking' or 'Vote', its number of inputs can vary from 1 to 5. The methods that can be included in each component are listed in Tables II and III.

¹⁰ Outliers are handled in a different way than missing values.

Note that the FULL search space is more than twice as large as the one presented in [13] in terms of the raw number of hyperparameters. As the datasets we use in our experiments are intended for classification, we have chosen to minimise the classification error averaged over 10 CV folds within the optimisation process.

Two SMBO strategies (SMAC and TPE) have been compared against two baselines (WEKA-Def and random search). The following experimental scenarios were devised:

• WEKA-Def: All the predictors and meta-predictors listed in Table III are run using WEKA's default hyperparameter values. Filters are not included in this strategy, although some predictors may perform specific preprocessing steps as part of their default behaviour.

• Random search: The whole search space is randomly explored, allowing 30 CPU-hours for the process.

• SMAC and TPE: An initial configuration is randomly selected and then the optimiser is run for 30 CPU-hours to explore the search space in an intelligent way, allowing for comparison with the random search.
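For illustration, one point in the FULL search space corresponds to a fixed-order chain like the following scikit-learn sketch (the paper itself composes WEKA filters and predictors; the chosen methods and hyperparameter values here are arbitrary, and outlier handling and sampling are omitted because scikit-learn offers no drop-in transformers for them):

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier

# Chain order is fixed; the method and hyperparameters of every step are what
# the optimiser searches over.
mcps = Pipeline([
    ('missing_values', SimpleImputer(strategy='median')),
    ('transformation', StandardScaler()),
    ('dim_reduction', PCA(n_components=10)),
    ('meta_predictor', AdaBoostClassifier(n_estimators=50)),  # boosts decision stumps
])
# mcps.fit(X_train, y_train); error = (mcps.predict(X_test) != y_test).mean()
```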


TABLE II. NUMBER OF PARAMETERS OF THE AVAILABLE PREPROCESSING METHODS.

| Method | Num. | Categorical (simple) | Categorical (complex) |
|---|---|---|---|
| **Missing values** | | | |
| No handling | 0 | 0 | 0 |
| ReplaceMissingValues | 0 | 0 | 0 |
| CustomReplaceMissingValues | 0 | 0 | M |
| ↪ (M) Zero | 0 | 0 | 0 |
| ↪ (M) Mean | 0 | 0 | 0 |
| ↪ (M) Median | 0 | 0 | 0 |
| ↪ (M) Min | 0 | 0 | 0 |
| ↪ (M) Max | 0 | 0 | 0 |
| EMImputation | 3 | 1 | 0 |
| **Outliers** | | | |
| No handling | 0 | 0 | 0 |
| RemoveOutliers | 0 | 0 | O |
| ↪ (O) InterquartileRange (IQR) | 1 | 0 | 0 |
| ↪ (O) BaggedLOF | 1 | 1 | 0 |
| **Transformation** | | | |
| No transformation | 0 | 0 | 0 |
| Center | 0 | 0 | 0 |
| Standardize | 0 | 0 | 0 |
| Normalize | 2 | 0 | 0 |
| Wavelet | 0 | 0 | 0 |
| IndependentComponents | 3 | 1 | 0 |
| **Dimensionality reduction** | | | |
| No reduction | 0 | 0 | 0 |
| PrincipalComponents (PCA) | 3 | 1 | 0 |
| RandomSubset | 2 | 0 | 0 |
| AttributeSelection | 0 | 0 | S,E |
| ↪ (S) BestFirst | 1 | 2 | 0 |
| ↪ (S) GreedyStepwise | 2 | 3 | 0 |
| ↪ (S) Ranker | 1 | 0 | 0 |
| ↪ (E) CfsSubsetEval | 0 | 2 | 0 |
| ↪ (E) CorrelationAttributeEval | 0 | 0 | 0 |
| ↪ (E) GainRatioAttributeEval | 0 | 0 | 0 |
| ↪ (E) InfoGainAttributeEval | 0 | 2 | 0 |
| ↪ (E) OneRAttributeEval | 2 | 1 | 0 |
| ↪ (E) PrincipalComponents | 2 | 2 | 0 |
| ↪ (E) ReliefFAttributeEval | 2 | 1 | 0 |
| ↪ (E) Sym.UncertAttributeEval | 0 | 1 | 0 |
| ↪ (E) WrapperSubsetEval | 0 | 0 | 0 |
| PLSFilter | 1 | 4 | 0 |
| **Sampling** | | | |
| No sampling | 0 | 0 | 0 |
| Resample | 2 | 0 | 0 |
| ReservoirSample | 2 | 0 | 0 |
| Periodic sampling | 1 | 0 | 0 |

In order to compare our results with the ones presented in [13], we have replicated the experimental settings as closely as possible. We have evaluated the different optimisation strategies over 21 well-known datasets representing classification tasks and 7 real datasets from the process industry (see Table IV). Each dataset $\mathcal{D} = \{\mathcal{D}_{train}, \mathcal{D}_{test}\}$ has been split into 70% training and 30% testing sets, unless a partition was already provided. Please note that $\mathcal{D}_{train}$ is then split into 10 folds for Eq. 3, and therefore $\mathcal{D}_{test}$ is not used during the optimisation or training process at all.

To evaluate our work in a real-world setting, we have also performed a number of experiments using datasets representing real chemical processes. These datasets have been made available by a chemical company, and have been extensively used in our previous studies (e.g. [6], [35], [36]). The datasets are: i) 'absorber', which contains 38 continuous attributes

TABLE III. NUMBER OF PARAMETERS OF THE AVAILABLE PREDICTORS.

| Method | Num. | Categorical (simple) | Categorical (complex) |
|---|---|---|---|
| **Predictors** | | | |
| BayesNet | 0 | 1 | Q |
| ↪ (Q) local.K2 | 1 | 4 | 0 |
| ↪ (Q) local.HillClimber | 1 | 4 | 0 |
| ↪ (Q) local.LAGDHillClimber | 3 | 4 | 0 |
| ↪ (Q) local.SimulatedAnnealing | 3 | 2 | 0 |
| ↪ (Q) local.TabuSearch | 3 | 4 | 0 |
| ↪ (Q) local.TAN | 0 | 2 | 0 |
| NaiveBayes | 0 | 2 | 0 |
| NaiveBayesMultinomial | 0 | 0 | 0 |
| Logistic | 1 | 0 | 0 |
| MLP | 6 | 5 | 0 |
| SMO | 1 | 2 | K |
| ↪ (K) NormalizedPolyKernel | 1 | 1 | 0 |
| ↪ (K) PolyKernel | 1 | 1 | 0 |
| ↪ (K) Puk | 2 | 0 | 0 |
| ↪ (K) RBFKernel | 1 | 0 | 0 |
| SimpleLogistic | 0 | 3 | 0 |
| IBk | 1 | 4 | 0 |
| KStar | 1 | 2 | 0 |
| DecisionTable | 0 | 3 | S |
| ↪ (S) BestFirst | 1 | 2 | 0 |
| ↪ (S) GreedyStepwise | 2 | 3 | 0 |
| ↪ (S) Ranker | 1 | 0 | 0 |
| JRip | 2 | 2 | 0 |
| OneR | 1 | 0 | 0 |
| PART | 2 | 2 | 0 |
| ZeroR | 0 | 0 | 0 |
| DecisionStump | 0 | 0 | 0 |
| J48 | 2 | 6 | 0 |
| LMT | 1 | 6 | 0 |
| REPTree | 2 | 2 | 0 |
| RandomForest | 1 | 2 | 0 |
| RandomTree | 1 | 4 | 0 |
| **Meta-predictors** | | | |
| LWL | 0 | 2 | A |
| ↪ (A) LinearNNSearch | 0 | 1 | 0 |
| AdaBoostM1 | 1 | 3 | 0 |
| AttributeSelectedClassifier | 0 | 2 | S,E |
| ↪ (S) BestFirst | 1 | 2 | 0 |
| ↪ (S) GreedyStepwise | 2 | 3 | 0 |
| ↪ (S) Ranker | 1 | 0 | 0 |
| ↪ (E) CfsSubsetEval | 0 | 2 | 0 |
| ↪ (E) CorrelationAttributeEval | 0 | 0 | 0 |
| ↪ (E) GainRatioAttributeEval | 0 | 0 | 0 |
| ↪ (E) InfoGainAttributeEval | 0 | 2 | 0 |
| ↪ (E) OneRAttributeEval | 2 | 1 | 0 |
| ↪ (E) PrincipalComponents | 2 | 2 | 0 |
| ↪ (E) ReliefFAttributeEval | 2 | 1 | 0 |
| ↪ (E) Sym.UncertAttributeEval | 0 | 1 | 0 |
| ↪ (E) WrapperSubsetEval | 0 | 0 | 0 |
| Bagging | 2 | 0 | 0 |
| ClassificationViaRegression | 0 | 0 | 0 |
| FilteredClassifier | 0 | 0 | 0 |
| LogitBoost | 2 | 4 | 0 |
| MultiClassClassifier | 1 | 3 | 0 |
| RandomCommittee | 1 | 0 | 0 |
| RandomSubSpace | 2 | 0 | 0 |
| Stacking | 0 | 2 | 0 |
| Vote | 0 | 2 | 0 |

from an absorption process; ii) ‘drier’ which consists of 19 continuous features from physical sensors (i.e. temperature, pressure and humidity) and the target value is the residual humidity of the process product [35]; iii) ‘oxeno’ which contains 71 continuous attributes also from physical sensors and a target variable which is the product concentration measured in the laboratory [6]; and iv) ‘thermalox’ which


TABLE IV. DATASETS, CONTINUOUS AND CATEGORICAL ATTRIBUTE COUNT, NUMBER OF CLASSES, NUMBER OF INSTANCES, AND PERCENTAGE OF MISSING VALUES.

| Dataset | Cont | Disc | Class | Train | Test | %Miss |
|---|---|---|---|---|---|---|
| abalone | 7 | 1 | 28 | 2924 | 1253 | 0 |
| amazon | 10000 | 0 | 50 | 1050 | 450 | 0 |
| car | 0 | 6 | 4 | 1210 | 518 | 0 |
| cifar10 | 3072 | 0 | 10 | 50000 | 10000 | 0 |
| cifar10small | 3072 | 0 | 10 | 10000 | 10000 | 0 |
| convex | 784 | 0 | 2 | 8000 | 50000 | 0 |
| dexter | 20000 | 0 | 2 | 420 | 180 | 0 |
| dorothea | 100000 | 0 | 2 | 805 | 345 | 0 |
| germancredit | 7 | 13 | 2 | 700 | 300 | 0 |
| gisette | 5000 | 0 | 2 | 4900 | 2100 | 0 |
| kddcup09app | 192 | 38 | 2 | 35000 | 15000 | 69.47 |
| krvskp | 0 | 36 | 2 | 2238 | 958 | 0 |
| madelon | 500 | 0 | 2 | 1820 | 780 | 0 |
| mnist | 784 | 0 | 10 | 12000 | 50000 | 0 |
| mnistrot | 784 | 0 | 10 | 12000 | 50000 | 0 |
| secom | 590 | 0 | 2 | 1097 | 470 | 4.58 |
| semeion | 256 | 0 | 10 | 1116 | 477 | 0 |
| shuttle | 9 | 0 | 7 | 43500 | 14500 | 0 |
| waveform | 40 | 0 | 3 | 3500 | 1500 | 0 |
| wineqw | 11 | 0 | 11 | 3429 | 1469 | 0 |
| yeast | 8 | 0 | 10 | 1039 | 445 | 0 |
| absorber | 38 | 0 | 3 | 1119 | 480 | 0 |
| catalyst | 14 | 0 | 3 | 4107 | 1760 | 0 |
| debutanizer | 7 | 0 | 3 | 1676 | 718 | 0 |
| drier | 19 | 0 | 3 | 853 | 366 | 1.83 |
| oxeno | 71 | 0 | 3 | 12311 | 5277 | 0 |
| sulfur | 5 | 0 | 9 | 7057 | 3024 | 0 |
| thermalox | 38 | 0 | 9 | 1974 | 846 | 0 |

has 38 attributes from physical sensors and the two target values are concentrations of $NO_x$ and $SO_x$ in the exhaust gases [35].

Due to confidentiality reasons the datasets listed above cannot be published. We have however also evaluated our approach using 3 additional publicly available datasets from the same domain, which are: v) 'catalyst', consisting of 14 attributes, where the task is to predict the activity of a catalyst in a multi-tube reactor [37]; vi) 'debutanizer', which has 7 attributes (temperature, pressure and flow measurements of a debutanizer column) and where the target value is the concentration of butane at the output of the column [38]; and vii) the 'sulfur' recovery unit, which is a system for removing environmental pollutants from acid gas streams before they are released into the atmosphere [39]. The washed out gases are transformed into sulfur. The dataset has five input features (flow measurements) and two target values: the concentrations of $H_2S$ and $SO_2$. The discussion of the results from the experiments using these datasets is presented in Section VI-D.

To be consistent in terms of evaluation, we have transformed the target value of these regression datasets into classes, each of them representing the state of the output (high, normal, low). In datasets with 2 target values we used 9 classes, combining the 3 states of both targets. A sketch of this discretisation is given below.

For each strategy we performed 25 runs with different random seeds, within a 30 CPU-hour optimisation time limit per run, on an Intel Xeon E5-2620 six-core 2.00GHz CPU. If evaluating a configuration exceeds 30 minutes or 3 GB of RAM, its evaluation is aborted and not considered further. Once the optimisation process has finished, the returned MCPS is trained using the whole training set $\mathcal{D}_{train}$ and produces predictions for the testing set $\mathcal{D}_{test}$, as shown in Figure 4.
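A minimal sketch of this target discretisation, with illustrative quantile thresholds (the actual class boundaries used for each dataset are not specified here):

```python
import numpy as np

def discretise(y, low_q=0.25, high_q=0.75):
    """Map a continuous target to 3 states: 0 = low, 1 = normal, 2 = high."""
    lo, hi = np.quantile(y, [low_q, high_q])
    return np.digitize(y, [lo, hi])

def combine_two_targets(y1, y2):
    """Combine the 3 states of two targets into 9 classes in {0, ..., 8}."""
    return 3 * discretise(y1) + discretise(y2)
```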

Fig. 4. MCPS training and testing process

VI. RESULTS

We organised our analysis around the following aspects: a) impact of significantly extending the search space in the optimisation process; b) identification of promising methods for each dataset; c) trade-off between exploration and exploitation in the search strategies; and d) evaluation in the context of the chemical industry datasets.

A. Impact of extending the search space

Classification performance for each dataset can be found in Tables V and VI, which show the 10-fold CV error over $\mathcal{D}_{train}$, denoted as $\epsilon = \frac{1}{10} \sum_{i=1}^{10} \mathcal{L}(\hat{Y}^{(i)}, Y^{(i)})$, where $\mathcal{L}$ is the classification error and $Y^{(i)}, \hat{Y}^{(i)}$ are the ground truth and the predictions for fold $i$, respectively, and the holdout error over $\mathcal{D}_{test}$, denoted as $E = \mathcal{L}(Y_{test}, \hat{Y}_{test})$, achieved by each strategy (see Figure 4 for the overall process).

Random search, SMAC and TPE results have been calculated using the mean of 100,000 bootstrap samples (i.e. randomly selecting 4 of the 25 runs and keeping the one with the lowest CV error, as in [13]), while only the lowest errors are reported for WEKA-Def. A sketch of this bootstrap protocol is given below. PREV columns contain the values reported in [13], while NEW and FULL columns contain the results for the search spaces described in Section V. An upward arrow indicates an improvement when using the extended search spaces (NEW and FULL) in comparison to the previous results (PREV) reported in [13]. Boldface indicates the lowest classification error for each dataset.

The first thing to note is that, on average, SMBO approaches are considerably better than the best WEKA-Def results. The only exception is the holdout error of the 'amazon' dataset.
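The bootstrap protocol can be sketched as follows (array names are illustrative; `cv_errors` and `test_errors` hold the 25 per-run errors for one dataset/strategy pair):

```python
import numpy as np

def bootstrap_error(cv_errors, test_errors, n_boot=100_000, subset=4, seed=0):
    """Repeatedly draw `subset` runs at random, keep the run with the lowest
    CV error, and report the mean test error over all resamples."""
    rng = np.random.default_rng(seed)
    cv_errors, test_errors = np.asarray(cv_errors), np.asarray(test_errors)
    picks = rng.integers(0, len(cv_errors), size=(n_boot, subset))
    best = picks[np.arange(n_boot), np.argmin(cv_errors[picks], axis=1)]
    return test_errors[best].mean()
```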


TABLE V. 10-FOLD CROSS-VALIDATION ERROR ε (% MISCLASSIFICATION). AN UPWARD ARROW INDICATES AN IMPROVEMENT WITH RESPECT TO THE PREV SPACE. BOLDFACED VALUES INDICATE THE LOWEST CLASSIFICATION ERROR FOR EACH DATASET.

| dataset | WEKA-Def | RAND PREV | RAND NEW | RAND FULL | SMAC PREV | SMAC NEW | SMAC FULL | TPE PREV | TPE NEW | TPE FULL |
|---|---|---|---|---|---|---|---|---|---|---|
| abalone | 73.33 | 72.03 | 72.53 | 72.93 | **71.71** | 72.21 | 72.20 | 72.14 | 72.01 ↑ | 72.17 |
| amazon | 43.94 | 59.85 | 45.72 ↑ | 62.22 | 47.34 | **39.57** ↑ | 53.51 | 50.26 | 40.27 ↑ | 56.26 |
| car | 2.71 | 0.53 | 0.47 ↑ | 2.58 | 0.61 | 0.38 ↑ | 0.46 ↑ | 0.91 | **0.21** ↑ | 0.29 ↑ |
| cifar10 | 65.54 | 69.46 | 58.89 ↑ | 66.18 ↑ | 62.36 | 56.44 ↑ | 64.76 | 67.73 | **55.59** ↑ | 70.40 |
| cifar10small | 66.59 | 67.33 | 60.45 ↑ | 70.32 | 58.84 | 57.90 ↑ | 71.05 | 58.41 | **56.56** ↑ | 68.65 |
| convex | 28.68 | 33.31 | 25.02 ↑ | 34.82 | 25.93 | **21.88** ↑ | 25.52 ↑ | 28.56 | 23.19 ↑ | 29.93 |
| dexter | 10.20 | 10.06 | 7.54 ↑ | 8.92 ↑ | **5.66** | 6.42 | 7.51 | 9.83 | 6.19 ↑ | 7.76 ↑ |
| dorothea | 6.03 | 8.11 | 6.25 ↑ | 6.94 ↑ | **5.62** | 5.95 | 6.58 | 6.81 | 5.92 ↑ | 6.16 ↑ |
| germancredit | 22.45 | 20.15 | 21.31 | 21.80 | **17.87** | 19.65 | 20.49 | 21.56 | 19.88 ↑ | 20.07 ↑ |
| gisette | 3.62 | 4.84 | 2.30 ↑ | 3.66 ↑ | 2.43 | **2.21** ↑ | 2.76 | 3.55 | 2.35 ↑ | 3.07 ↑ |
| kddcup09app | 1.88 | 1.75 | 1.80 | **1.70** ↑ | **1.70** | 1.80 | 1.80 | 1.88 | 1.80 ↑ | 1.80 ↑ |
| krvskp | 0.89 | 0.63 | 0.42 ↑ | 0.56 ↑ | 0.30 | **0.28** ↑ | 0.35 | 0.43 | 0.31 ↑ | 0.33 ↑ |
| madelon | 25.98 | 27.95 | 19.20 ↑ | 27.60 ↑ | 20.70 | **15.61** ↑ | 17.24 ↑ | 24.25 | 16.03 ↑ | 19.14 ↑ |
| mnist | 5.12 | 5.05 | 3.78 ↑ | 7.85 | 3.75 | **3.50** ↑ | 5.51 | 10.02 | 3.60 ↑ | 7.19 ↑ |
| mnistr | 66.15 | 68.62 | 58.10 ↑ | 67.21 ↑ | 57.86 | **55.73** ↑ | 63.62 | 73.09 | 57.17 ↑ | 65.89 ↑ |
| secom | 6.25 | 5.27 | 5.85 | **4.87** ↑ | 5.24 | 6.01 | 6.10 | 6.21 | 5.85 ↑ | 6.05 ↑ |
| semeion | 6.52 | 6.06 | 4.82 ↑ | 7.85 | 4.78 | 4.48 ↑ | 4.92 | 6.76 | **4.28** ↑ | 5.96 ↑ |
| shuttle | 0.0328 | 0.0345 | 0.0121 ↑ | 0.0349 | 0.0224 | 0.0112 ↑ | 0.0112 ↑ | 0.0251 | **0.0104** ↑ | 0.0112 ↑ |
| waveform | 12.73 | 12.43 | 12.50 | 13.09 | **11.92** | 12.33 | 12.48 | 12.55 | 12.43 ↑ | 12.42 ↑ |
| wineqw | 38.94 | 35.36 | 33.08 ↑ | 35.95 | 34.65 | **32.64** ↑ | 33.24 ↑ | 35.98 | 32.67 ↑ | 32.97 ↑ |
| yeast | 39.43 | 38.74 | 37.16 ↑ | 39.63 | 35.51 | 36.50 | 37.23 | **35.01** | 36.17 | 36.43 |

TABLE VI. HOLDOUT ERROR E (% MISCLASSIFICATION). AN UPWARD ARROW INDICATES AN IMPROVEMENT WITH RESPECT TO THE PREV SPACE. BOLDFACED VALUES INDICATE THE LOWEST CLASSIFICATION ERROR FOR EACH DATASET.

| dataset | WEKA-Def | RAND PREV | RAND NEW | RAND FULL | SMAC PREV | SMAC NEW | SMAC FULL | TPE PREV | TPE NEW | TPE FULL |
|---|---|---|---|---|---|---|---|---|---|---|
| abalone | 73.18 | 74.88 | **72.92** ↑ | 73.76 ↑ | 73.51 | 73.41 ↑ | 73.16 ↑ | 72.94 | 73.04 | 73.26 |
| amazon | **28.44** | 41.11 | 39.22 ↑ | 58.94 | 33.99 | 36.26 | 49.90 | 36.59 | 35.69 ↑ | 62.52 |
| car | 0.7700 | 0.0100 | 0.1300 | 1.8400 | 0.4000 | 0.0526 ↑ | 0.1958 ↑ | 0.1800 | **0.0075** ↑ | 0.0484 ↑ |
| cifar10 | 64.27 | 69.72 | 58.23 ↑ | 67.81 ↑ | 61.15 | 55.54 ↑ | 69.99 | 66.01 | **54.88** ↑ | 70.06 |
| cifar10small | 65.91 | 66.12 | 59.84 ↑ | 72.93 | 56.84 | 57.87 | 72.64 | 57.01 | **56.41** ↑ | 72.58 |
| convex | 25.96 | 31.20 | 24.75 ↑ | 36.87 | 23.17 | **21.31** ↑ | 31.05 | 25.59 | 22.62 ↑ | 34.88 |
| dexter | 8.89 | 9.18 | 8.29 ↑ | 12.59 | 7.49 | 7.31 ↑ | 10.14 | 8.89 | **6.90** ↑ | 7.99 ↑ |
| dorothea | 6.96 | 5.22 | 5.27 | 5.37 | 6.21 | **5.12** ↑ | 5.51 ↑ | 6.15 | 5.25 ↑ | 5.15 ↑ |
| germancredit | 27.33 | 29.03 | **25.40** ↑ | 26.88 ↑ | 28.24 | 25.42 ↑ | 26.68 ↑ | 27.54 | 25.49 ↑ | 26.61 ↑ |
| gisette | 2.81 | 4.62 | 2.28 ↑ | 3.83 ↑ | **2.24** | 2.34 | 2.85 | 3.94 | 2.37 ↑ | 3.08 ↑ |
| kddcup09app | 1.7405 | 1.74 | **1.72** ↑ | 2.05 | 1.7358 | 1.74 | 1.74 | 1.7381 | 1.74 | 1.74 |
| krvskp | 0.31 | 0.58 | 0.34 ↑ | 0.39 ↑ | 0.31 | **0.23** ↑ | 0.39 | 0.54 | 0.36 ↑ | 0.33 ↑ |
| madelon | 21.38 | 24.29 | 19.10 ↑ | 24.48 | 21.56 | **16.80** ↑ | 18.26 ↑ | 21.12 | 16.91 ↑ | 18.35 ↑ |
| mnist | 5.19 | 5.05 | 4.00 ↑ | 13.92 | **3.64** | 4.10 | 22.70 | 12.28 | 3.96 ↑ | 25.48 |
| mnistr | 63.14 | 66.4 | 57.16 ↑ | 69.67 | 57.04 | **54.86** ↑ | 67.03 | 70.20 | 56.31 ↑ | 70.04 ↑ |
| secom | 8.09 | 8.03 | 7.88 ↑ | 8.20 | 8.01 | 7.87 ↑ | 7.87 ↑ | 8.10 | **7.84** ↑ | 7.87 ↑ |
| semeion | 8.18 | 6.10 | **4.78** ↑ | 8.27 | 5.08 | 5.10 | 5.46 | 8.26 | 4.91 ↑ | 6.31 ↑ |
| shuttle | 0.0138 | 0.0157 | 0.0071 ↑ | 0.0219 | 0.0130 | 0.0070 ↑ | 0.0075 ↑ | 0.0145 | **0.0069** ↑ | 0.0077 ↑ |
| waveform | 14.40 | 14.27 | 14.26 ↑ | 14.28 | 14.42 | 14.17 ↑ | **13.99** ↑ | 14.23 | 14.34 | 14.04 ↑ |
| wineqw | 37.51 | 34.41 | 32.99 ↑ | 36.72 | 33.95 | **32.90** ↑ | 34.14 | 33.56 | 32.93 ↑ | 34.09 |
| yeast | 40.45 | 43.15 | 37.68 ↑ | 40.86 ↑ | 40.67 | **37.60** ↑ | 39.02 ↑ | 40.10 | 37.89 ↑ | 38.92 ↑ |

However, the best solution in the FULL search space is still better than WEKA-Def (20.89% vs 28.44%). This was expected, as the default values for WEKA methods, set by human experts, are not optimised for any particular dataset.

As shown in Table V, in the majority of cases (49 of 63), the MCPSs found in the NEW search space achieve better results, i.e. ε(NEW) < ε(PREV). In 28 out of 63 cases, the FULL search space also achieves better performance, i.e. ε(FULL) < ε(PREV). However, finding a good MCPS is more challenging due to the large increase in search space size [40]. As an example, consider Figure 5, where the evolution of the best solution for the 'madelon' dataset and the SMAC strategy is represented over time for each of the 25 runs. Comparing Figures 5-a) and b) we can see that the rate of convergence is much higher in the smaller space (denoted as NEW).

Nevertheless, the overall best-performing model for 'madelon' was found in the FULL space, as seen in Table VII. The way in which the search space is extended can have a considerable impact on the accuracy of the MCPS found. Additional hyperparameters allowing for extra tuning flexibility (PREV to NEW) improved the performance in most of the cases. However, adding more transitions to the MCPS (NEW to FULL) does not seem to help on average, given the same CPU time limit. Even so, the MCPSs found in the FULL search space for 13 out of 28 datasets have better or comparable performance to the solutions found in the NEW space, as shown in Table VII.


Fig. 5. 10-fold CV error of best solutions found over time for ‘madelon’ dataset and SMAC strategy in a) NEW and b) FULL search spaces.

B. Identifying promising configurations

The best MCPSs found for each dataset are reported in Table VII. Each row of this table represents a sequence of data transformations and predictive models, as explained in Section V. See for example Figure 3, where the best MCPS for the 'dexter' dataset in the FULL search space is shown. Transitions are effectively WEKA methods that process the incoming tokens.

The solutions found for different datasets are quite diverse, and they often also vary a lot across the 25 random seed runs performed for each dataset. In order to better understand the observed differences in the MCPSs found, we have also measured the average pairwise similarity of the 25 MCPSs found for each dataset and the variance of their performances (see Figure 7). To calculate the similarity between configurations we use a weighted sum of Hamming distances given by

$$d(F, G) = 1 - \frac{\sum_{i=1}^{N} (w_i \cdot \delta_i)}{\sum_{i=1}^{N} w_i} \qquad (4)$$

where F and G are MCPSs with N transitions, $w_i$ is the weight for the i-th transition and $\delta_i$ is the Hamming distance (a standard measure of string dissimilarity) at position i. Weights have been fixed manually to W = {2, 1.5} in the NEW search space and W = {1, 1, 1, 1, 1, 2, 1.5} in the FULL search space. That is, preprocessing transitions all have the same weight, while both predictors and meta-predictors have higher weights because of their importance [40].

After computing the similarity matrix for the 25 MCPSs found in each dataset/strategy pair, we have grouped them using a complete-linkage hierarchical clustering algorithm. The resultant dendrogram for the 'waveform' dataset and the SMAC strategy in the NEW search space (i.e. only predictor and meta-predictor as transitions) is shown in Figure 6a. As can be seen, there are three main clusters defined by the base classifier which has been found to perform best: SimpleLogistic, Logistic and LMT (Logistic Model Tree). The dendrogram for the FULL scenario is shown in Figure 6b. Most of the MCPSs also use logistic classifiers (as in the smaller search space), but other good solutions include SMO (support vector classifier) and JRip (rule-based classifier). It can also be noted that the clusters themselves are no longer as pure as before in terms of the base classifier chosen.
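A sketch of Eq. 4 and the subsequent clustering step, assuming each MCPS is encoded as an equal-length sequence of method names (SciPy's complete-linkage implementation is used for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def similarity(f, g, w):
    """Weighted Hamming similarity between two MCPSs (Eq. 4)."""
    delta = np.array([fi != gi for fi, gi in zip(f, g)], dtype=float)
    return 1.0 - np.dot(w, delta) / np.sum(w)

def cluster_solutions(solutions, w, n_clusters=3):
    """Complete-linkage clustering of the 25 solutions found for one dataset."""
    n = len(solutions)
    dist = np.array([[1.0 - similarity(solutions[i], solutions[j], w)
                      for j in range(n)] for i in range(n)])
    iu = np.triu_indices(n, k=1)              # condensed distance vector for scipy
    Z = linkage(dist[iu], method='complete')
    return fcluster(Z, t=n_clusters, criterion='maxclust')

# e.g. NEW space (predictor + meta-predictor): w = np.array([2.0, 1.5])
```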

It is worth mentioning that most of the MCPSs found for the 'waveform' dataset include a missing value replacement method, even though there are no missing values in this dataset and therefore it has no effect on the data or the classification performance. The presence of such an unnecessary component likely stems from the fact that selecting a method for replacing missing values at random has a prior probability of 0.75 (i.e. 3 out of 4 possible actions, as seen in Table II), which means that it can be selected when randomly initialising the configurations of MCPSs to start from, combined with using a search method which does not penalise unnecessary elements in the data processing chains. This is not, however, the case with other nodes like 'Transformation', in which, although the prior probability of selecting one of the available transformation methods is 5/6, selecting an appropriate method has a potential impact on the performance, and therefore better transitions tend to be retained in the found solutions.

For illustrative purposes we have selected three interesting cases from Figure 7 for a more detailed analysis:

• Low error variance and high MCPS similarity. Most of the best solutions found follow a very similar sequence of methods, and therefore similar classification performance is to be expected. For example, a repeated sequence in the 'car' dataset with TPE optimisation is MultiLayerPerceptron (13/25) → AdaBoostM1 (22/25).

• Low error variance and low MCPS similarity. Despite having different solutions, the classification performance in a group of analysed datasets does not vary much. This can mean that the classification problem is not difficult and a range of different MCPSs can perform quite well on it. This is for instance the case of the solutions found for the 'secom' and 'kddcup09app' datasets.

• High error variance and low MCPS similarity. In such cases, there are many differences between both the best MCPSs found and their classification performances. This is for instance the case of the 'amazon' dataset, for which a high error variance was observed in all of the optimisation strategies (see Figure 7). We believe such differences likely result from a combination of the difficulty of the classification task (i.e. high input dimensionality, large number of classes and a relatively small number of training samples) and/or insufficient exploration from the random starting configuration in a very large space.


TABLE VII. BEST MCPS FOR EACH DATASET IN THE NEW AND FULL SPACES AND ITS HOLDOUT ERROR E.

| dataset | space | predictor | meta-predictor | E |
|---|---|---|---|---|
| abalone | NEW | MLP | RandomCommittee | 71.43 |
| abalone | FULL | Logistic | Bagging | 72.39 |
| amazon | NEW | SimpleLogistic | RandomSubSpace | 26.67 |
| amazon | FULL | NaiveBayesMult. | RandomSubSpace | 20.89 |
| car | NEW | SMO | MultiClassClassifier | 0.00 |
| car | FULL | SMO | AdaBoostM1 | 0.00 |
| cifar10 | NEW | RandomForest | MultiClassClassifier | 52.28 |
| cifar10 | FULL | RandomTree | Bagging | 59.63 |
| cifar10small | NEW | RandomTree | MultiClassClassifier | 54.48 |
| cifar10small | FULL | RandomTree | AdaBoostM1 | 59.97 |
| convex | NEW | RandomForest | AdaBoostM1 | 18.47 |
| convex | FULL | RandomTree | AdaBoostM1 | 22.97 |
| dexter | NEW | DecisionStump | AdaBoostM1 | 5.00 |
| dexter | FULL | VotedPerceptron | AdaBoostM1 | 5.00 |
| dorothea | NEW | OneR | RandomSubSpace | 4.64 |
| dorothea | FULL | REPTree | LogitBoost | 4.64 |
| germancredit | NEW | LMT | Bagging | 23.33 |
| germancredit | FULL | LMT | Bagging | 24.33 |
| gisette | NEW | NaiveBayes | LWL | 1.95 |
| gisette | FULL | VotedPerceptron | RandomSubSpace | 1.52 |
| kddcup09app | NEW | ZeroR | LWL | 1.67 |
| kddcup09app | FULL | BayesNet | MultiClassClassifier | 1.74 |
| krvskp | NEW | JRip | AdaBoostM1 | 0.10 |
| krvskp | FULL | JRip | AdaBoostM1 | 0.21 |
| madelon | NEW | REPTree | RandomSubSpace | 15.64 |
| madelon | FULL | IBk | LogitBoost | 12.82 |
| mnist | NEW | SMO | MultiClassClassifier | 2.66 |
| mnist | FULL | J48 | AdaBoostM1 | 5.15 |
| mnistr | NEW | RandomForest | RandomCommittee | 52.20 |
| mnistr | FULL | BayesNet | RandomSubSpace | 56.33 |
| secom | NEW | J48 | AdaBoostM1 | 7.66 |
| secom | FULL | ZeroR | FilteredClassifier | 7.87 |
| semeion | NEW | NaiveBayes | LWL | 3.98 |
| semeion | FULL | SMO | FilteredClassifier | 4.61 |
| shuttle | NEW | RandomForest | AdaBoostM1 | 0.01 |
| shuttle | FULL | REPTree | AdaBoostM1 | 0.01 |
| waveform | NEW | SMO | RandomSubSpace | 14.00 |
| waveform | FULL | SMO | Attr.SelectedClassifier | 13.40 |
| wineqw | NEW | RandomForest | AdaBoostM1 | 32.33 |
| wineqw | FULL | IBk | RandomSubSpace | 33.42 |
| yeast | NEW | RandomForest | Bagging | 36.40 |
| yeast | FULL | RandomTree | Bagging | 38.20 |
| absorber | NEW | REPTree | AdaBoostM1 | 51.04 |
| absorber | FULL | KStar | Bagging | 45.63 |
| catalyst | NEW | LMT | AdaBoostM1 | 55.34 |
| catalyst | FULL | JRip | FilteredClassifier | 48.64 |
| debutanizer | NEW | LMT | AdaBoostM1 | 49.03 |
| debutanizer | FULL | REPTree | FilteredClassifier | 50.14 |
| drier | NEW | JRip | Bagging | 43.72 |
| drier | FULL | Logistic | Bagging | 40.98 |
| oxeno | NEW | JRip | AdaBoostM1 | 35.74 |
| oxeno | FULL | JRip | RandomSubSpace | 38.24 |
| sulfur | NEW | PART | Bagging | 77.91 |
| sulfur | FULL | JRip | FilteredClassifier | 70.96 |
| thermalox | NEW | Logistic | MultiClassClassifier | 28.01 |
| thermalox | FULL | MLP | FilteredClassifier | 26.71 |


Fig. 6. Dendrogram for 'waveform' dataset and SMAC strategy in the NEW (a) and FULL (b) search space. Average CV error and standard deviation are shown for the main clusters.

Fig. 7. Error variance vs. MCPS similarity in FULL search space


C. Exploration vs exploitation analysis

An inherent trade-off in a search through a space whose size explodes exponentially with the number of parameters is that of exploration vs. exploitation. To study the impact of varying the relative importance of these two factors, we have selected two datasets ('amazon' and 'madelon') in which the performance evolution changes considerably between different runs, and therefore the starting point in the SMAC search strategy makes a large difference. Although SMAC includes a parameter¹¹ for adjusting the exploration/exploitation rate, we have selected a different approach, consisting of varying the number of seeds and the optimisation time. The reason behind this approach is that we found these two factors to be the most decisive in the optimisation process for the same search space.

We have varied both the number of seeds and the maximum optimisation time, keeping the same fixed CPU-hour budget as in the previous experiments (i.e. 25 seeds x 30 h/seed = 750 h). Increasing the number of seeds to 50 allows more regions of the search space to be explored. On the other hand, increasing the running time to 50 CPU-hours/seed makes it possible to further exploit the hyperparameter settings of potentially good configurations.

The results are presented in Figure 8, which contains the average holdout test error computed similarly as in Table VI. The results suggest that increasing the optimisation time (i.e. more exploitation) allows finding better solutions on average given the same search space. We can also observe a drastic difference between the NEW and FULL search spaces for the 'amazon' dataset, which links with what we observed in Figure 5 regarding the different convergence rates of the optimisation process in both search spaces. While this section describes a preliminary investigation into the potential trade-off in the exploration versus exploitation aspects of the investigated search strategies, it also highlights that there is a need for significantly improving the fundamental search algorithms in the context of MCPS optimisation, which is also discussed in the next section.

¹¹ --intensification-percentage sets the percent of time to spend intensifying versus model learning.

D. Process industry specific preprocessing

One motivation for automating the composition and optimisation of MCPSs was the need for speeding up the process of developing soft-sensors, which are essentially predictive models based on easy-to-measure quantities used in the process industry.


Fig. 8. Holdout test error E with 95% bootstrap confidence intervals varying the number of hours and seeds using SMAC.

Our experience in this field comes from past involvement in multiple projects with chemical engineering companies. Raw data from chemical plants usually requires considerable preprocessing and modelling effort. Although some WEKA predictors include inner preprocessing, such as removal of missing values or normalisation, as shown previously in [14], many datasets need additional preprocessing to build effective predictive models. The fixed order of preprocessing nodes in the FULL search space has not been set arbitrarily: it follows the preprocessing guidelines that are common in the process industry when developing predictive models (see e.g. [6], [41]).

The results presented in Figure 9 show that including such preprocessing steps has allowed, on average, better MCPSs to be found than in the NEW search space for several datasets from real chemical processes (listed in the bottom part of Table IV). Although the difference is very small, this implies that considerably extending the search space does not only avoid major negative effects but can even be positive for the predictive performance on these datasets. It is interesting to highlight that random search was able to find the MCPSs with the lowest holdout error in half of the cases, but on the other hand it also presents a higher error variance. We believe that the reason behind this is that random search evaluated more models than SMAC in almost all the runs (see Figure 9c). This suggests that random search, which evaluates more potential solutions and explores more regions of the search space, can find, and in these few cases has found, better solutions than SMAC.

The best MCPSs found for the chemical datasets are shown at the end of Table VII. We would like to highlight that MCPSs including an attribute selection step show a considerable performance improvement with respect to NEW (see e.g. 'absorber', 'catalyst' and 'sulfur'). We can also see that there is a large difference between the CV error and the holdout test error in some of these datasets (e.g. ε = 2.60% vs E = 61.27% in 'catalyst'). This is due to the evolving nature of some chemical processes over time: the test set (the last 30% of samples) can be significantly different from the initial 70% of the data used for training the MCPS. We have shown in [42] how it is possible to adapt these MCPSs


Fig. 9. a) Mean 10-fold CV error ε and b) holdout error E with 95% bootstrap confidence intervals for the process industry datasets. c) Total number of evaluations per dataset and strategy.

when there are changes in data.

VII. CONCLUSION AND FUTURE WORK

In this study we have demonstrated, via extensive experimental analysis, that it is indeed possible to automate the composition of multicomponent predictive systems (MCPSs). Our novel formulation of MCPSs using the Well-handled and Acyclic Workflow (WA-WF) Petri net formalism became a theoretical foundation of the study and enabled us to cast the problem within a rigorous mathematical framework [26]. At the same time, it opens the door to formally verifying that MCPSs are correctly composed [43], which is still an outstanding and non-trivial problem.

Our results have indicated that Sequential Model-Based Optimisation (SMBO) strategies perform better than random search, given the same optimisation time, in the majority of the analysed datasets. However, in some cases it was random search that found the best solution, especially in the process industry datasets (in half of the cases), most likely due to the sheer number of models that it was able to evaluate within the given computational budget. The analysis of the trade-off between exploration and exploitation of the search space suggests however that increasing the optimisation time (i.e. more exploitation) allows better solutions to be found on average. Investing time into fine-tuning the hyperparameters of a solution seems worthwhile only if this can significantly and quickly improve the quality of the solution, as otherwise the optimisation strategy can stall in a local minimum that keeps the exploration away from other promising regions of the search space. The above highlights a clear need for significantly improving the fundamental search algorithms in the context of MCPS optimisation.


In addition, it would also be valuable to investigate whether using a different data partitioning method such as Density-Preserving Sampling (DPS) [44], [45] would make any difference in the optimisation process. We believe that DPS could have a considerable impact on certain datasets when using the SMAC strategy, since SMAC discards potentially poor solutions early in the optimisation process based on performance on only a few CV folds. If the folds used are not representative of the overall data distribution, which as shown in [45] can happen quite often with CV, the effect on the solutions found can be detrimental.

At the moment, available SMBO methods only support single-objective optimisation. However, it would be useful to find solutions that optimise more than one objective, including for instance a combination of prediction error, model complexity and running time, as discussed in [33].

In contrast to the collection of datasets presented in [13] (also evaluated in this paper), the data distribution of the process industry datasets changes over time. In these cases there is a need to adapt the optimised MCPSs to the changing environment. A first approach to dealing with concept drift in such datasets has been presented in [42], though adaptation mechanisms for SMBO methods require further systematic research and form one of our future work directions.
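As a minimal illustration of the kind of adaptation the last paragraph refers to, the sketch below monitors a deployed chain's error on incoming batches and re-fits the whole chain when the error drifts above a threshold. This is our simplified example of the general idea only, not the hybrid adaptation strategies of [42]; the synthetic data, the threshold and the scikit-learn components are all assumptions made for the example.

# A minimal sketch of one simple adaptation strategy for a deployed MCPS:
# monitor prediction error on incoming batches and re-fit the whole chain
# when the error drifts above a threshold. Illustration only.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)

def make_batch(t, size=100):
    """Synthetic process batch whose input-output slope drifts with time t."""
    X = rng.normal(size=(size, 3))
    y = (1.0 + 3.0 * t) * X[:, 0] + 0.1 * rng.normal(size=size)
    return X, y

mcps = Pipeline([("scaling", StandardScaler()), ("predictor", Ridge())])
X0, y0 = make_batch(0.0)
mcps.fit(X0, y0)

THRESHOLD = 1.0                       # RMSE above which we adapt (assumed)
for step in range(1, 11):
    X, y = make_batch(step / 10.0)
    rmse = np.sqrt(np.mean((mcps.predict(X) - y) ** 2))
    if rmse > THRESHOLD:
        mcps.fit(X, y)                # adapt: re-fit the chain on recent data
        print(f"step {step}: RMSE {rmse:.2f} -> re-fitted MCPS")
    else:
        print(f"step {step}: RMSE {rmse:.2f}")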
REFERENCES

[1] D. Pyle, Data Preparation for Data Mining. Morgan Kaufmann, 1999.
[2] G. S. Linoff and M. J. A. Berry, Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management. Wiley, 2011.
[3] E. Teichmann, E. Demir, and T. Chaussalet, “Data preparation for clinical data mining to identify patients at risk of readmission,” in IEEE 23rd International Symposium on Computer-Based Medical Systems, 2010, pp. 184–189.
[4] J. Zhao and T. Wang, “A general framework for medical data mining,” in Future Information Technology and Management Engineering, 2010, pp. 163–165.
[5] I. Messaoud, H. El Abed, V. Märgner, and H. Amiri, “A design of a preprocessing framework for large database of historical documents,” in Proc. of the 2011 Workshop on Historical Document Imaging and Processing, 2011, pp. 177–183.
[6] M. Budka, M. Eastwood, B. Gabrys, P. Kadlec, M. Martin Salvador, S. Schwan, A. Tsakonas, and I. Žliobaitė, “From Sensor Readings to Predictions: On the Process of Developing Practical Soft Sensors,” in Advances in IDA XIII, vol. 8819, 2014, pp. 49–60.
[7] R. Leite, P. Brazdil, and J. Vanschoren, “Selecting classification algorithms with active testing,” in LN in Computer Science, vol. 7376 LNAI, 2012, pp. 117–131.
[8] C. Lemke and B. Gabrys, “Meta-learning for time series forecasting and forecast combination,” Neurocomputing, vol. 73, no. 10–12, pp. 2006–2016, 2010.
[9] A. McQuarrie and C.-L. Tsai, Regression and Time Series Model Selection. World Scientific, 1998.
[10] Y. Bengio, “Gradient-based optimization of hyperparameters,” Neural Computation, vol. 12, no. 8, pp. 1889–1900, 2000.
[11] X. C. Guo, J. H. Yang, C. G. Wu, C. Y. Wang, and Y. C. Liang, “A novel LS-SVMs hyper-parameter selection based on particle swarm optimization,” Neurocomputing, vol. 71, 2008, pp. 3211–3215.
[12] J. Bergstra and Y. Bengio, “Random Search for Hyper-Parameter Optimization,” Journal of Machine Learning Research, vol. 13, pp. 281–305, 2012.
[13] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms,” in Proc. of the 19th ACM SIGKDD, 2013, pp. 847–855.
[14] M. Martin Salvador, M. Budka, and B. Gabrys, “Towards Automatic Composition of Multicomponent Predictive Systems,” in 11th Int. Conf. Hybrid Artif. Intell. Syst., 2016, pp. 27–39.

[15] E. Brochu, V. M. Cora, and N. de Freitas, “A Tutorial on Bayesian Optimization of Expensive Cost Functions with Application to Active User Modeling and Hierarchical Reinforcement Learning,” University of British Columbia, Department of Computer Science, Tech. Rep., 2010.
[16] F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Sequential model-based optimization for general algorithm configuration,” LN in Computer Science, vol. 6683, pp. 507–523, 2011.
[17] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for Hyper-Parameter Optimization,” in Advances in NIPS 24, 2011, pp. 1–9.
[18] J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian Optimization of Machine Learning Algorithms,” Advances in NIPS 25, pp. 2960–2968, 2012.
[19] K. Eggensperger, M. Feurer, and F. Hutter, “Towards an empirical foundation for assessing Bayesian optimization of hyperparameters,” in NIPS Workshop on Bayesian Optimization in Theory and Practice, 2013, pp. 1–5.
[20] M. Feurer, A. Klein, K. Eggensperger, J. T. Springenberg, M. Blum, and F. Hutter, “Methods for Improving Bayesian Optimization for AutoML,” in ICML, 2015.
[21] G. B. Berriman, E. Deelman, J. Good, J. C. Jacob, D. S. Katz, A. C. Laity, T. A. Prince, G. Singh, and M. H. Su, Generating Complex Astronomy Workflows. Springer London, 2007, pp. 19–38.
[22] A. Shade and T. K. Teal, “Computing Workflows for Biologists: A Roadmap,” PLoS Biol., vol. 13, no. 11, pp. 1–10, 2015.
[23] A. Maedche, A. Hotho, and M. Wiese, “Enhancing Preprocessing in Data-Intensive Domains using Online-Analytical Processing,” in Second Int. Conf. DaWaK 2000, 2000, pp. 258–264.
[24] W. Wei, J. Li, L. Cao, Y. Ou, and J. Chen, “Effective detection of sophisticated online banking fraud on extremely imbalanced data,” World Wide Web, vol. 16, no. 4, pp. 449–475, 2013.
[25] T. Murata, “Petri nets: Properties, analysis and applications,” Proceedings of the IEEE, vol. 77, no. 4, pp. 541–580, 1989.
[26] W. M. P. van der Aalst, “The Application of Petri Nets to Workflow Management,” J. Circuits, Syst. Comput., vol. 8, no. 1, pp. 21–66, 1998.
[27] L. Ping, H. Hao, and L. Jian, “On 1-soundness and Soundness of Workflow Nets,” in Third Work. Model. Objects, Components, Agents, 2004, pp. 21–36.
[28] F. Serban, J. Vanschoren, J.-U. Kietz, and A. Bernstein, “A survey of intelligent assistants for data analysis,” ACM Computing Surveys, vol. 45, no. 3, pp. 1–35, 2013.
[29] W. van der Aalst, “Process Mining: Making Knowledge Discovery Process Centric,” ACM SIGKDD Explor. Newsl., vol. 13, no. 2, pp. 45–49, 2012.
[30] M. Feurer, J. T. Springenberg, and F. Hutter, “Using Meta-Learning to Initialize Bayesian Optimization of Hyperparameters,” in Proc. of the Meta-Learning and Algorithm Selection Workshop at ECAI, 2014, pp. 3–10.
[31] K. Swersky, J. Snoek, and R. P. Adams, “Multi-Task Bayesian Optimization,” in Advances in NIPS 26, 2013, pp. 2004–2012.
[32] K. Eggensperger, F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Efficient Benchmarking of Hyperparameter Optimizers via Surrogates,” in Proc. of the 29th AAAI Conf. on Art. Int., 2015, pp. 1114–1120.
[33] B. Al-Jubouri and B. Gabrys, “Multicriteria Approaches for Predictive Model Generation: A Comparative Experimental Study,” in IEEE Symposium on Computational Intelligence in Multi-Criteria Decision-Making, 2014, pp. 64–71.
[34] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter, “Efficient and Robust Automated Machine Learning,” Adv. Neural Inf. Process. Syst. 28, pp. 2944–2952, 2015.
[35] P. Kadlec and B. Gabrys, “Architecture for development of adaptive online prediction models,” Memetic Comput., vol. 1, no. 4, pp. 241–269, 2009.
[36] R. Bakirov, B. Gabrys, and D. Fay, “On sequences of different adaptive mechanisms in non-stationary regression problems,” in Proc. Int. Jt. Conf. Neural Networks, 2015, pp. 1–8.
[37] P. Kadlec and B. Gabrys, “Local learning-based adaptive soft sensor for catalyst activation prediction,” AIChE J., vol. 57, no. 5, pp. 1288–1301, 2011.
[38] L. Fortuna, S. Graziani, and M. G. Xibilia, “Soft sensors for product quality monitoring in debutanizer distillation columns,” Control Eng. Pract., vol. 13, pp. 499–508, 2005.
[39] L. Fortuna, A. Rizzo, M. Sinatra, and M. G. Xibilia, “Soft analyzers for a sulfur recovery unit,” Control Eng. Pract., vol. 11, pp. 1491–1500, 2003.


[40] F. Hutter, H. Hoos, and K. Leyton-Brown, “An Efficient Approach for Assessing Hyperparameter Importance,” in ICML 2014, vol. 32, no. 1, pp. 754–762, 2014.
[41] P. Kadlec, B. Gabrys, and S. Strandt, “Data-driven Soft Sensors in the process industry,” Comput. Chem. Eng., vol. 33, no. 4, pp. 795–814, 2009.
[42] M. Martin Salvador, M. Budka, and B. Gabrys, “Adapting Multicomponent Predictive Systems using Hybrid Adaptation Strategies with Auto-WEKA in Process Industry,” in AutoML ICML 2016, 2016.
[43] S. Sadiq, M. Orlowska, W. Sadiq, and C. Foulger, “Data Flow and Validation in Workflow Modelling,” in ADC ’04: Proc. 15th Australas. Database Conf., vol. 27, 2004, pp. 207–214.
[44] M. Budka and B. Gabrys, “Correntropy-based density-preserving data sampling as an alternative to standard cross-validation,” in IJCNN 2010, 2010, pp. 1–8.
[45] M. Budka and B. Gabrys, “Density-Preserving Sampling: Robust and Efficient Alternative to Cross-Validation for Error Estimation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 1, pp. 22–34, 2013.

Manuel Martin Salvador (MEng’09 – MSc’11) graduated with distinction as a Computer Engineer from the University of Granada (Spain) in 2009, where he also received an MSc in Soft Computing and Intelligent Systems in 2011. Manuel has worked on a range of industrial R&D projects in several sectors: renewable energy, process industry and public transport. He is currently a PhD candidate at Bournemouth University and works as an R&D engineer at We Are Base. His main research areas include automatic data preprocessing, predictive modelling and adaptive systems.

Marcin Budka received his dual BA+MA degree in finance and banking from the University of Economics, Katowice, Poland, in 2003, a BSc in computer science from the University of Silesia, Poland, in 2005, and a PhD in computational intelligence from Bournemouth University, UK, in 2010, where he currently holds the position of Principal Academic in Data Science. His current research interests include deep learning, meta-learning, adaptive systems, ensemble models and complex networked systems, with a focus on the evolution and dynamics of social networks.

Bogdan Gabrys is a Data Scientist and a Chair in Computational Intelligence at the Faculty of Science and Technology, Bournemouth University, UK. He received an MSc degree in Electronics and Telecommunication (Specialization: Computer Control Systems) from the Silesian Technical University, Poland, in 1994 and a PhD in Computer Science from Nottingham Trent University, UK, in 1998. Over the last 20 years, Prof. Gabrys has worked at various universities and in R&D departments of commercial institutions. His research activities have concentrated on the areas of data science, complex adaptive systems, computational intelligence, machine learning, predictive analytics and their diverse applications. Prof. Gabrys has published over 150 research papers, chaired conferences, workshops and special sessions, and served on programme committees of a large number of international conferences with Data Science, Computational Intelligence, Machine Learning and Data Mining themes. He is frequently invited to give keynote and plenary talks at international conferences and lectures at internationally leading research centres and commercial research labs. He is a Senior Member of the IEEE and a Fellow of the Higher Education Academy (HEA) in the UK. More details can be found at: http://bogdan-gabrys.com
