Data Driven Modelling Based on Recurrent Interval-Valued Metacognitive Scaffolding Fuzzy Neural Network

Mahardhika Pratama, Edwin Lughofer, Jie Lu, Meng Joo Er, Sreenatha Anavatti

Abstract—The Metacognitive Scaffolding Learning Machine (McSLM), combining the concept of metacognition – what-to-learn, how-to-learn, and when-to-learn – and the Scaffolding theory – a tutoring theory for a learner to learn a complex task – has been successfully developed to enhance the capability of Evolving Intelligent Systems (EIS) in processing non-stationary data streams. Three issues, namely uncertainty, temporal behaviour and unknown system order, are however uncharted by any existing McSLM, and all McSLMs in the literature are designed for classification problems. This paper proposes a novel McSLM, called the Recurrent Interval-Valued Metacognitive Scaffolding Fuzzy Neural Network (RIVMcSFNN), which is used to solve regression and time-series modelling problems from data streams. The RIVMcSFNN presents a novel recurrent network architecture as its cognitive constituent, which features double local recurrent connections at both the hidden layer and the consequent layer. The new recurrent network architecture is driven by the interval-valued multivariate Gaussian function in the hidden node and the nonlinear wavelet function in the consequent node. As with its predecessors, the RIVMcSFNN has an open structure: it can automatically grow, prune, adjust, merge and recall its hidden nodes, and can select relevant data samples on the fly using an online active learning methodology. The RIVMcSFNN is also equipped with an online dimensionality reduction technique to cope with the curse of dimensionality. All learning mechanisms are carried out in the single-pass and local learning mode and actualise the plug-and-play learning principle, which aims to minimise the use of pre- and/or post-training steps. The efficacy of our algorithm was tested using numerous data-driven modelling problems and comprehensive comparisons with its counterparts. The RIVMcSFNN demonstrated substantial improvements in both accuracy and complexity against existing variants of the McSLMs and EISs.

Index Terms—fuzzy neural network, type-2 fuzzy system, online learning, metacognitive learning, evolving fuzzy system

I. Introduction

The underlying motivation of Evolving Intelligent Systems (EIS) is to cope with two main issues – large data streams and dynamic learning environments (Angelov, 2004). EISs feature a flexible working principle: they can start the learning process from scratch with an empty knowledge base and can self-organise their structure from data streams in the single-pass learning mode (Lughofer, 2008). Such a learning trait is very relevant to today's real-world big data applications, because data streams may contain various concept drifts: slow, rapid, abrupt, gradual, local, global, cyclical or otherwise (Bose, van der Aalst, Zliobaite & Pechenizkiy, 2014). In addition, EISs are computationally efficient, because their single-pass learning manner incurs a low computational burden (Angelov, 2011). Nonetheless, EIS is still cognitive in nature, because it needs to learn all the presented data streams without being able to solve the issues of what-to-learn and when-to-learn (Elwell and Polikar, 2011). The Metacognitive Learning Machine (McLM) enhances the adaptive nature of EIS by translating the metamemory model of Nelson and Narens (1990).
In essence, metacognition is about the ability of human beings to assess new knowledge based on previous information and the learning environment (Joysula et al., 2009; Flavell, 1979). Referring to the meta-memory model of Nelson and Narens (1990), metacognition is composed of three fundamental components: termination of study (when to learn), selection of processing method (how to learn), and item selection (what to learn) (Isaacson & Fujita, 2006). The three components are formalised in the realm of machine learning with a sample deletion strategy (what-to-learn), a sample learning strategy (how-to-learn) and a sample reserved strategy (when-to-learn) (Suresh et al., 2010). Nevertheless, the McLM still does not characterise the plug-and-play learning concept, as it suffers from the absence of important learning modules, leading to the need for pre- and/or post-processing phases. Note that the plug-and-play algorithm brings a new paradigm of an online learning machine, where all learning components are carried out incrementally in a single learning phase, thus mitigating the use of pre- and/or post-processing steps (Lughofer et al., 2015(a)). The pre- and/or post-processing steps impose an offline process, which is intractable for online real-time processes. The concept of the plug-and-play learner was realised by Pratama et al. (2015(b)) by incorporating the Scaffolding theory as the framework of the how-to-learn feature of the McLM – the McSLM. Scaffolding theory in psychology is a tutoring theory for a student to solve a complex learning task by means of active and passive supervision mechanisms (Vygotsky, 1978; Reiser, 2004). Passive supervision concerns the output-feedback process, which can be represented by the parameter learning scenario. Active supervision comprises three parts: complexity reduction, problematizing, and fading (Wood, 2001). The complexity reduction component aims to alleviate learning complexity and is formulated by the sample deletion process, the feature selection method, or data normalisation. The problematizing component deciphers the problem characteristics: data distribution, nonlinearity degree, etc. This component is closely related to the drift handling strategy in the machine learning context. Because one problem may contain various concept drifts, this component is built upon several learning techniques: drift detection, forgetting mechanisms, rule growing, etc. The fading constituent is employed to avoid redundancy (Elwell and Polikar, 2011). In relation to the machine learning algorithm, this learning property can be implemented by merging and pruning of model components: neurons, rules, etc. Existing variants of the McSLM are however not robust against uncertainty, temporal system dynamics and unknown system order, because currently applied algorithms are crafted in the type-1 feed-forward network architecture. Moreover, almost all McLMs in the literature target classification problems, whereas to date, only the work of Das et al. (2015) is designed for regression problems. It is worth noting that the scaffolding learning strategy has also not been explored by Das et al. (2015). A novel McSLM, namely the Recurrent Interval-Valued Metacognitive Scaffolding Fuzzy Neural Network (RIVMcSFNN), is proposed in this paper to cope with the issues of uncertainty, temporal system dynamics and unknown system order in learning from large data streams. The RIVMcSFNN features fully flexible and computationally efficient learning principles by automating the structural and parameter learning scenarios with respect to the nonlinearity of the real data distribution. The algorithmic development of the RIVMcSFNN combines two sound theories in cognitive psychology – metacognitive learning and the Scaffolding theory. The RIVMcSFNN adopts the plug-and-play learning paradigm, where all learning modules are carried out together without any additional steps. The cognitive component of the RIVMcSFNN is driven by a new recurrent network architecture, which puts forward double local recurrent connections at both the rule layer and the consequent layer.
The recurrent link at the rule layer aims to induce a temporal firing strength by forming an internal memory component, while the local feedback loop at the rule consequent functions as another internal memory component to improve the reaction speed against rapidly changing environments. To the best of our knowledge, the existing recurrent fuzzy neural networks in the literature still implement a single recurrent connection using either local (Juang et al., 2010), global (Juang et al., 1999), or interactive (Lin et al., 2013) feedback loops. The new recurrent network architecture evolves a generalised type of
interval-valued fuzzy rule, where the rule antecedent is crafted by the interval-valued multivariate Gaussian function with uncertain means, while the rule consequent is built upon the nonlinear wavelet function. The amendment of the rule antecedent triggers a non-axis-parallel ellipsoidal cluster with uncertain means, which provides a more appropriate cluster shape than standard rules. On the other hand, the nonlinear wavelet function allows data to be mapped into various resolution levels, which rectifies the linear output mapping capability of the zero or first-order Takagi Sugeno Kang (TSK) fuzzy system (Sugeno et al., 1993). This fuzzy rule variant differs from its predecessor (Pratama et al., 2015), because this rule makes use of the wavelet function in lieu of the Chebyshev polynomial and embeds uncertain means in the interval-valued fuzzy set. The training procedure of the RIVMcSFNN starts with the what-to-learn scenario, which selects incoming data streams to be exploited for model updates in the how-to-learn component. In essence, the what-to-learn module is capable of avoiding the over-fitting issue and expediting execution time. The RIVMcSFNN utilises a novel sample deletion strategy, termed the Extended Sequential Entropy Method (ESEM). The ESEM is inspired by the concept of neighbourhood probability, where it measures the entropy of the neighbourhood probability (Xiong et al., 2014). Nonetheless, the ESEM enhances the work of Xiong et al. (2014) by computing the neighbourhood probability using the concept of local density in the single-pass learning mode (Lewis & Catlett, 1994; Settles, 2010). The how-to-learn scenario based on the Scaffolding theory processes qualified data streams from the what-to-learn component to adapt the cognitive component of the RIVMcSFNN. The passive supervision of the Scaffolding theory is depicted by the parameter learning scenario, employing the Fuzzily Weighted Generalized Recursive Least Square (FWGRLS) method (Pratama et al., 2014(b)) and Type-2 Zero Error Density Maximization (T2ZEDM) (Pratama et al., 2015(b)). Moreover, the active supervision is represented by four new learning modules as follows:

1) Generalised Type-2 Datum Significance (GT2DQ) and Generalised Type-2 Rule Significance (GT2RS) methods: the GT2DQ and GT2RS methods are proposed to govern, respectively, the rule growing and pruning processes of the RIVMcSFNN. These methods can be seen as enhanced versions of the T2DS and T2ERS methods (Pratama et al., 2015(c)), because they merge the two cursors of the datum and rule significance methods into a single compact form. Furthermore, we adopt the concept of Bortman & Aladjem (2009) by putting forward the Gaussian Mixture Model (GMM) to approximate a complex input density function, but extend this method to the generalised interval-valued fuzzy rule (Vukovic & Miljkovic, 2013). In the realm of the Scaffolding theory, the rule growing module is inherent to the problematizing component because it is capable of tracking changing system dynamics, as with the drift detection method. The rule pruning module plays the fading role of the Scaffolding theory, because it discards superfluous rules contributing little during their lifespan.

2) Type-2 Relative Mutual Information (T2RMI) method: the T2RMI is proposed to forget obsolete fuzzy rules by examining the correlation between the fuzzy rules and the desired variables. The RMI method was originally initiated in (Gan et al., 2014). In this paper, a type-2 version of the RMI method is developed, which adapts to the working principle of the interval type-2 fuzzy system and works in a strictly sequential framework. It is worth noting that the T2RMI detects outdated fuzzy rules based on their relevance to the current data trend. Furthermore, we also extend this method to undertake the rule recall scenario to deal with cyclic concept drift. Cyclic concept drift can be understood as a circumstance in which an old data distribution re-emerges in the future. It is worth noting that our rule recall scenario differs from (Pratama, 2015(a)), because it is not part of the rule growing module. In other words, the rule base can be expanded not only by the GT2DQ method, but also by the T2RMI method through reactivating previously pruned rules. Furthermore, the concept of relevance is introduced to monitor the validity of previously pruned rules in lieu of data density. This method happens to be very useful in coping with concept drifts. It is thus categorised as the problematizing component of the Scaffolding theory.

3) Sequential Markov Blanket Criterion (SMBC): to the best of our knowledge, the curse of dimensionality is an open problem for most EISs in the literature, owing to the absence of a feature selection scenario. Although several online feature selection scenarios (Angelov, 2010; Lughofer, 2011(a); Pratama et al., 2014(b); Lughofer et al., 2015(a)) have recently been designed, these approaches determine the importance of input attributes merely based on their relevance in explaining the target concept, but discount the possible redundancy among input attributes. In this paper, a novel online feature selection criterion, namely the SMBC, is put forward as an online version of the MBC (Yu et al., 2004). The SMBC not only investigates the relevance of input attributes, but also explores the redundancy of input features to gauge their significance – the complexity reduction role.

4) Type-2 Geometric Criteria (T2GC): the geometric criteria of Lughofer et al. (2015(a)) are extended to suit the working principles of the generalised interval-valued fuzzy rule of the RIVMcSFNN. The T2GC method focuses on resolving the redundancy problem of the network architecture, thereby refining the interpretability of the rule semantics and alleviating structural complexity at once. The fading aspect of active supervision is characterised by the T2GC method.

On the other hand, the when-to-learn module is carried out by means of the standard sample reserved strategy, which deals with samples accepted by the what-to-learn module that do not meet the learning criteria of the how-to-learn module. The local forgetting strategy and the rule splitting strategy could be added as problematizing components in the RIVMcSFNN (Pratama et al., 2015(c)). These modules are omitted here to keep the paper compact and to reduce the complexity of the algorithms. This paper proposes an effective solution to surmount the problems of uncertainty, temporal system dynamics and unknown system order in the data stream mining environment. The RIVMcSFNN makes four major contributions: 1) it provides a framework of the metacognitive scaffolding learning algorithm to address the regression problem, 2) it presents a novel recurrent network topology, featuring double local recurrent links, 3) it puts into perspective a new type of interval-valued fuzzy rule, 4) it designs four new learning modules, namely GT2DQ/GT2RS, T2RMI, SMBC and T2GC, as outlined in 1) - 4) above. The efficacy of the RIVMcSFNN was rigorously validated using various real-world and synthetic data streams and comparisons with state-of-the-art learning algorithms, showing that the RIVMcSFNN produced more reliable predictive accuracy while retaining lower complexity. The remainder of this paper is structured as follows: Section 2 details the literature review of related work; Section 3 elaborates the cognitive component of the RIVMcSFNN; Section 4 describes the metacognitive learning policy of the RIVMcSFNN; Section 5 outlines proof of concepts and comparisons with prominent algorithms, and also discusses the key components of the RIVMcSFNN which distinguish it from related works; concluding remarks are given in Section 6.

II. Literature Survey of the State-of-the-Art Work

The EIS concept was pioneered by Juang and Lin (1998) with the proposal of SONFIN. SONFIN has both structural and parameter learning scenarios and works completely in the single-pass learning mode. Kasabov and Song (2002) proposed DENFIS, in which the Evolving Clustering Method (ECM) is devised to evolve the fuzzy rules. Nevertheless, SONFIN and DENFIS use a distance-based input partitioning approach, which is not robust against outliers. A prominent work in the field is eTS, proposed by Angelov and Filev (2004). eTS is built upon an online version of mountain clustering (Yager, Filev, 1994) and adopts the concept of recursive density estimation. After the proposal of eTS appeared in the literature, the area of EIS grew exponentially. eTS was simplified in simp_eTS (Angelov & Filev, 2005), replacing the concept of information potential with the scatter concept. The SAFIS was developed in (Rong et al., 2006), which amends the rule growing and pruning modules of GAP-RBF and GGAP-RBF (Huang et al., 2004; Huang et al., 2005) for the fuzzy system. The bottleneck of eTS, simp_eTS and SAFIS is found in the univariate Gaussian fuzzy rule, which does not feature the scale-invariant property. Furthermore, Lughofer (2008) proposed FLEXFIS, exhibiting an incremental version of vector quantisation. FLEXFIS, however, suffers from the absence of a rule base simplification procedure, which imposes expensive structural complexity. Thus, it was extended in Lughofer (2012) to the FLEXFIS++ approach, which overcomes this deficiency by automatically merging redundant rules. Angelov (2010) developed eTS+, embedding several online rule base simplification methods and an online dimensionality reduction in the original eTS algorithm. The simp_eTS+ was proposed in (Angelov, 2011). Simp_eTS+ modifies eTS+ by employing the density increment method to govern the rule addition process. eTS+ and simp_eTS+ trigger axis-parallel ellipsoidal clusters, which are not effective in dealing with non-axis-parallel data distributions. eMG was proposed by Lemos et al. (2011), which extends ePL (Lima, 2006) with a multivariable Gaussian function, generalising the standard TSK fuzzy rule. The eMG is based on the notion of participatory learning but does not yet incorporate an online dimensionality reduction technique. Another prominent EIS, namely AnYa, was proposed in (Angelov, Yager, 2011) with the underlying notion of the cloud-based rule. Although AnYa is equipped with an online feature selection, it is based on the concept of relevance, which is prone to the discontinuity problem. Pratama et al. (2014(a)) put forward PANFIS, which modifies the statistical contribution theory of SAFIS for the multivariate Gaussian function. More recently, PANFIS was extended in (Pratama et al., 2014(b)), which incorporates a new online feature selection scenario. FLEXFIS was extended in (Lughofer et al., 2015(b)), employing the Gaussian function with a non-diagonal covariance matrix, geometric rule merging criteria to achieve maximal compactness, and a smooth feature weighting concept to softly reduce the curse of dimensionality effect (i.e., features with low weights have little influence in all distance calculations). This work implements the statistical contribution measure of GENEFIS to approximate feature contribution. As with AnYa, the statistical contribution only takes into account feature saliency without considering mutual information across input attributes. The aforementioned algorithms make use of the feedforward network structure, which does not properly address temporal system dynamics. The feedforward network structure is well-known to be over-dependent on the time-delayed input attributes.
The idea of EIS has also been implemented in recurrent network topologies. This line of work was pioneered by Juang and Lin (1999) with RSONFIN, inserting a global feedback loop in the original SONFIN. RSONFIN, like SONFIN, is constructed with a distance-based clustering approach and is therefore prone to outliers. The so-called TRFN is another example of an EIS created in the global recurrent network topology, which combines the gradient descent and GA methods for the adaptation process of the EIS (Juang, 2002). The use of GA, however, limits scalability in the online environment due to its computationally prohibitive nature. Moreover, the global recurrent network architecture is deemed to be less effective for the online learning process because it undermines the local learning paradigm, which assumes the EIS is a loosely coupled fuzzy model. The global recurrent network architecture also augments the input dimension, which may lead to the curse of dimensionality. The local recurrent network architecture was devised for EIS in (Juang, 2010), where it possesses a local recurrent connection in the rule layer. The idea of the interactive recurrent network architecture was put forward in (Lin et al., 2013). As with its global counterpart, the interactive structure may harm the spirit of the local learning scenario because of the interconnections of recurrent components across different rules. All algorithms discussed so far are crafted in the type-1 fuzzy system, which features crisp and certain membership. The type-1 fuzzy system is deemed not adequately robust to cope with the issue of uncertainty, particularly when dealing with the imprecise, inexact and inaccurate nature of real-world data streams, because it necessitates an accurate parameter identification strategy. Note that the uncertainty of real-world data streams occurs because of disagreements in expert knowledge, noisy measurements, and noisy data. The issue of uncertainty led to the proposal of the type-2 fuzzy system, which introduces a fuzzy membership degree triggered by the fuzzy-fuzzy set (Zadeh, 1975). However, the type-2 fuzzy system has an over-complex working principle, which cannot be processed by the well-established type-1 fuzzy mathematics. The type-2 fuzzy system also incurs prohibitive computational complexity, resulting from the type reduction mechanism from type-2 to type-1. An interval type-2 fuzzy system was developed in (Liang, Mendel, 1999) to mitigate the complexity of the type-2 fuzzy system, which assumes the secondary grade of the type-2 fuzzy system to be unity. Note that the interval type-2 fuzzy system with a single interval primary membership is equivalent to the interval-valued fuzzy system (Bustince et al., 2015). The notion of the interval type-2 fuzzy system was integrated into EIS in (Juang et al., 2008), which essentially improves SONFIN with the type-2 fuzzy set. This concept was later extended to the local recurrent structure in (Juang et al., 2009) and to the interactive recurrent structure in (Lin et al., 2013). The distance-based clustering method makes this method vulnerable to outliers. The hybrid learning scenario of the interval type-2 EIS was proposed in (Castro et al., 2009), where the gradient descent method is designed for three different configurations of the interval type-2 network architectures. This work implements a static network structure, which does not cope with changing learning environments. The interval type-2 fuzzy system was also brought into the Mamdani-type fuzzy system in (Tung et al., 2014). Juang et al. (2013) addressed the interpretability problem of the interval type-2 EIS by proposing a new parameter learning scenario, which is capable of improving the rule semantics of the interval type-2 EIS. Both works suffer from the absence of feature selection, which must then be carried out as a separate preprocessing step.
All the interval type-2 EISs reviewed in this paragraph suffer from the scalability issue as a result of the intensive use of the KM type reduction method. The evolving interval type-2 fuzzy classifier, namely GT2FC, was proposed in (Bouchachia, Vanaret, 2014). Although GT2FC works in a truly sequential learning scenario, it is a zero-order classifier, which sets the rule consequent as the class label. Hence, its working principle can avoid the use of the KM method, but the zero-order classifier is usually less accurate than a first-order or higher-order classifier because it does not predict the decision surface of the classification problem. It is worth noting, though, that if there is no (monotonic) order of class indices, the accuracy of a higher-order classifier is compromised. The q coefficient was proposed in (Abiyev & Kaynak, 2010) as an alternative to the KM method in the fixed-structure interval type-2 FNN. The notion of the q design factor was adopted in the interval type-2 EIS in (Lin et al., 2014(a),(b)). Note that the interval type-2 fuzzy systems reviewed here are essentially interval-valued fuzzy systems, because they employ a single interval primary variable - a special case of the interval type-2 fuzzy system (Bustince et al., 2015). Although a large volume of work has been devoted to EIS research in the last decade or so, the underlying design of EIS is cognitive in nature, because it mainly deals with the issue of how-to-learn and discounts the two important issues of when-to-learn and what-to-learn. This implies that all data streams have to be learned in order, without paying close attention to their effect on the training progress and without the aptitude to select ideal time instants for consuming training data. The concept of the McLM was initiated in the seminal work of Suresh et al. (2010), called the Self-Adaptive Resource Allocation Network (SRAN). The SRAN argues that the adaptive and flexible nature of EIS can be improved by translating the meta-memory model of Nelson & Narens (1990) into the machine learning context. This notion also incorporates the issues of when-to-learn and what-to-learn, uncharted by conventional EIS, into the training process. This work has been extended to various cognitive components: the fully complex network architecture (Savitha et al., 2012), type-1 fuzzy systems (Subramanian et al., 2014(a)) and interval type-2 fuzzy systems (Subramanian et al., 2014(b)). The projection learning algorithm was crafted for the McLM in (Babu & Suresh, 2013). However, the McLM suffers from the absence of important learning modules in the main learning engine, consequently resulting in pre- and/or post-training steps being executed. This drawback noticeably distracts from the underlying spirit of the online real-time learner, which can be envisaged as a plug-and-play learning algorithm. Although the sample selection scenario has been integrated in these works, they still actualise a fully supervised learning scenario, incurring costly annotation effort by the operator. The McSLM aims to tackle this issue by referring to the Scaffolding theory – a prominent tutoring theory in psychology by which a learner can solve a complex learning task – to develop the how-to-learn component of the metacognitive learning scenario. The idea of the McSLM was pioneered by Pratama et al. (2015(a)) and this work was later modified in (Pratama et al., 2015(b)). Three issues remain unsolved in these works: uncertainty, temporal system dynamics and unknown system order. The uncertainty issue arises because existing McSLMs are still constructed in the type-1 FNN architecture, which is well-known to suffer from the issue of uncertainty. McSLMs are also built on the traditional feed-forward network topology, and are not robust against temporal system dynamics. The feed-forward network architecture entails prior knowledge of the lagged input features to form the I/O relationship of the regression model. In addition, the vast majority of McLMs and McSLMs in the current literature are designed for the classification problem and, to the best of our knowledge, only that of Das et al. (2015) handles regression cases.
Das et al.'s work still leaves open questions, because it shares similar characteristics with Subramanian et al. (2014(b)). Table 1 presents a classification of all the algorithms surveyed in this section.

III. Cognitive Component of RIVMcSFNN

This section details the cognitive component of the RIVMcSFNN, which produces the final output. The RIVMcSFNN utilises a new six-layered recurrent network architecture, which features double local recurrent connections at both the hidden layer and the consequent layer. The interval-valued multivariable Gaussian function with uncertain means is used in the hidden layer, and the functional-link wavelet polynomial is utilised in the consequent layer. The network architecture of the RIVMcSFNN is depicted in Fig.1. Suppose that a data stream at the n-th time instant arrives in the form $D_n = (X_n, T_n)$, where $X_n$ is an input variable vector $X_n = (x_1,.., x_j,..., x_{nu})$ and $T_n$ is a desired variable vector $T_n = (t_1,..., t_o,..., t_{no})$. nu and no stand for the number of input and output dimensions, respectively. Because the RIVMcSFNN works completely in the online real-time mode, only the n-th data stream $D_n = (X_n, T_n)$ is seen. After the learning process for a data stream is completed, the data stream is immediately discarded without any future access. Because of the possibly unbounded characteristic of data streams, the total number of data streams N is unknown a priori. The generic operations in each layer are defined as follows: $fa^P$ stands for the forward activation function in layer $P = 1,...,6$ and the corresponding output of each layer is denoted by $fo^P$. The mathematical operation of each layer is discussed as follows:

Layer 1 (Input Layer): an external crisp input data stream $X_n$ is fed into this layer and is directly passed to layer 2:

$fo_j^1 = fa^1(x_j) = x_j$. Although no specific operation is performed in this layer, the RIVMcSFNN is equipped with the online dimensionality reduction method, realising the complexity reduction of Scaffolding theory and capable of alleviating the number of input dimensions nu.

Layer 2 (Rule Layer): the RIVMcSFNN utilises the interval-valued multivariate Gaussian function with uncertain means in the hidden layer, which produces an interval-valued firing strength as follows:

$$\widetilde{fo}^2 = fa^2(fo^1) = \exp\left(-(FA_n^2 - \widetilde{C}_i)\Sigma_i^{-1}(FA_n^2 - \widetilde{C}_i)^T\right),\quad \widetilde{fo}^2 = [\underline{fo}^2, \overline{fo}^2] \qquad (1)$$

where $\widetilde{C}_i = [\underline{C}_i, \overline{C}_i]$ is the uncertain centroid of the i-th rule, formed by the lower and upper centroids $\underline{C}_i \le \overline{C}_i$. The uncertainty degree of the centroid determines the footprint of uncertainty of the i-th cluster, which contributes to the fuzziness of the membership degree (1). Unlike the type-1 fuzzy system, which generates a crisp firing strength, the interval-valued fuzzy system is more robust against uncertainty because its fuzzy membership degrees provide a degree of tolerance against uncertainty. $\Sigma_i^{-1} \in \Re^{nu \times nu}$ is the non-diagonal inverse covariance matrix, whose element $\Sigma_{j,\hat{j}}$ controls the orientation and size of the ellipsoidal cluster and also describes the interrelation of the two input features $j, \hat{j}$. The multivariable Gaussian function evolves an arbitrarily rotated ellipsoidal cluster, which does not necessarily span in the axis-parallel direction. This advantage leads to a more appropriate input space partition than the commonly used uni-variable or multivariable Gaussian function in the main axes, especially when data streams are not distributed in the main axes. This unique trait allows the fuzzy rule demand to be suppressed at a low level, ultimately compensating for the possible increase of parameters as a result of a full covariance matrix. Furthermore, the multivariate Gaussian function features the scale-invariant trait and sustains the interrelation among input variables, which disappears under the conventional fuzzy rule using the product t-norm operator (Pratama et al., 2014(a)). Supposing that we deal with a Multi-Input Single-Output system, the IF-Then rule of the RIVMcSFNN is expressed as follows:

$$R_i: \text{IF } X_n \text{ is } \widetilde{fo}_i^2 \text{ Then } y_i = x_{e_i}\Omega_i \qquad (2)$$

where $x_{e_i}$ and $\Omega_i$ are respectively an extended input variable, resulting from a nonlinear mapping by the wavelet function, and its connection weight. The details of the rule consequent can be found in the consequent layer section (Layer 5). The non-axis-parallel ellipsoidal rule, however, suffers from the transparency issue, because the atomic clause of the human-like linguistic rule is omitted in (2). In other words, it completely runs in the high-dimensional space and does not have a fuzzy set representation. The non-axis-parallel ellipsoidal rule cannot be applied directly to the interval type-2 fuzzy environment because its inference scheme is undertaken at the fuzzy set level. In light of this issue, a transformation strategy must be carried out, which aims to formulate the fuzzy set representation of the non-axis-parallel ellipsoidal cluster. To this end, we extend the second method of (Pratama et al., 2014(b)) to the interval-valued ellipsoidal cluster. The underlying reason to use this method is its fast mechanism, which is scalable in the online learning situation. The transformation strategy is defined using the average cardinality principle as follows:

$$\sigma_i = \frac{(\underline{r}_i + \overline{r}_i)}{2}\sqrt{\Sigma_{i,i}} \qquad (3)$$

where $\Sigma_{i,i}$ denotes the diagonal element of the covariance matrix and $\underline{r}_i, \overline{r}_i$ denote the lower and upper Mahalanobis distances $(X_n - \widetilde{C}_i)\Sigma_i^{-1}(X_n - \widetilde{C}_i)^T$, $\widetilde{C}_i = [\underline{C}_i, \overline{C}_i]$. No transformation is required for the centre of the multivariate Gaussian function, because it is applicable directly at the fuzzy set level. After forming the fuzzy set representation of the interval-valued multivariable Gaussian function, the inference scheme of the interval-valued fuzzy system starts with the fuzzification operation in respect of the upper and lower Gaussian membership functions with uncertain means $\widetilde{c}_{i,j} = [\underline{c}_{i,j}, \overline{c}_{i,j}]$ as follows:

$$\widetilde{fo}_{i,j}^{2.1} = \exp\left(-\left(\frac{fa_j^2 - \widetilde{c}_{i,j}}{\sigma_{i,j}}\right)^2\right) = N(\widetilde{c}_{i,j}, \sigma_{i,j}; fa_j^2),\quad \widetilde{c}_{i,j} = [\underline{c}_{i,j}, \overline{c}_{i,j}] \qquad (6)$$

$$\overline{fo}_{i,j}^{2.1} = \begin{cases} N(\underline{c}_{i,j}, \sigma_{i,j}; fa_j^2), & x_j < \underline{c}_{i,j} \\ 1, & \underline{c}_{i,j} \le x_j \le \overline{c}_{i,j} \\ N(\overline{c}_{i,j}, \sigma_{i,j}; fa_j^2), & x_j > \overline{c}_{i,j} \end{cases} \qquad (7)$$

$$\underline{fo}_{i,j}^{2.1} = \begin{cases} N(\overline{c}_{i,j}, \sigma_{i,j}; fa_j^2), & x_j \le \frac{(\underline{c}_{i,j} + \overline{c}_{i,j})}{2} \\ N(\underline{c}_{i,j}, \sigma_{i,j}; fa_j^2), & x_j > \frac{(\underline{c}_{i,j} + \overline{c}_{i,j})}{2} \end{cases} \qquad (8)$$

The fuzzy membership is defined as the interval-valued membership degree $\widetilde{fo}_{i,j}^{2.1} = [\underline{fo}_{i,j}^{2.1}, \overline{fo}_{i,j}^{2.1}]$. After finding the expression of the fuzzy set, we can transform (2) into a more interpretable form as follows:

$$R_i: \text{IF } x_1 \text{ is } \widetilde{fo}_{i,1}^{2.1} \text{ AND ... } x_j \text{ is } \widetilde{fo}_{i,j}^{2.1} \text{ ... AND } x_{nu} \text{ is } \widetilde{fo}_{i,nu}^{2.1} \text{ Then } y_i = x_{e_i}\Omega_i \qquad (9)$$

The transformation strategy also overcomes the issue of transparency of the multivariable Gaussian function. The validity of $\widetilde{fo}^2 = [\underline{fo}^2, \overline{fo}^2]$ is proven in Appendix A.
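To make the case structure of (6)-(8) concrete, the following is a minimal numpy sketch evaluating the upper and lower memberships of one interval-valued Gaussian fuzzy set with uncertain mean. It is an illustration of the reconstruction above, not the authors' implementation; all names and values are hypothetical.

```python
import numpy as np

def gauss(c, sigma, x):
    # Scalar Gaussian membership N(c, sigma; x) = exp(-((x - c) / sigma)^2)
    return np.exp(-((x - c) / sigma) ** 2)

def interval_membership(c_lo, c_hi, sigma, x):
    """Upper/lower membership of an interval-valued Gaussian fuzzy set
    with uncertain mean [c_lo, c_hi], following the cases of (7)-(8)."""
    # Upper membership: plateau of 1 between the two means
    if x < c_lo:
        upper = gauss(c_lo, sigma, x)
    elif x <= c_hi:
        upper = 1.0
    else:
        upper = gauss(c_hi, sigma, x)
    # Lower membership: switch at the mid-point of the uncertain means
    if x <= (c_lo + c_hi) / 2.0:
        lower = gauss(c_hi, sigma, x)
    else:
        lower = gauss(c_lo, sigma, x)
    return lower, upper

# Example: footprint of uncertainty around x = 0.4
lo, up = interval_membership(c_lo=0.3, c_hi=0.5, sigma=0.2, x=0.4)
print(lo, up)   # lower < upper = 1.0 inside the uncertain-mean band
```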

Layer 3 (Spatial Firing Layer): the upper and lower membership degrees of each fuzzy set are then connected using the product t-norm operator to generate the interval-valued spatial rule firing strength as follows:

$$\underline{fo}_i^3 = \prod_{j=1}^{nu} \underline{fa}_{i,j}^3 = \prod_{j=1}^{nu} \underline{fo}_{i,j}^{2.1},\quad \overline{fo}_i^3 = \prod_{j=1}^{nu} \overline{fa}_{i,j}^3 = \prod_{j=1}^{nu} \overline{fo}_{i,j}^{2.1} \qquad (10)$$

The product t-norm operator represents the AND in the fuzzy rule definition (9). The firing strength is widely used to illustrate the degree of compatibility of an existing rule. It is also employed as a rule growing cursor in almost all the interval type-2 EISs reviewed in Section II. Nevertheless, the bottleneck of such an approach is akin to the distance-based clustering approach, which is not robust against outliers.

Layer 4 (Temporal Firing Layer): to better address temporal system dynamics, the RIVMcSFNN incorporates a local recurrent connection that feeds the spatial firing strength of each fuzzy rule in previous observations back to itself, and outputs a temporal firing strength. The local feedback loop characterises an internal memory component, which is an effective solution to handling the temporal problem because it captures the past behaviour of the system. In addition, the local recurrent connection overcomes the dependency on delayed input attributes, which determine the system order. As a result, it reduces the number of input features for the modelling process. Since this layer is influenced by past and present firing strengths of fuzzy rules, the temporal firing strength is outputted as follows:

$$\underline{fo}_{i,o}^4 = \lambda_{o,i}\underline{fa}_i^4 + (1 - \lambda_{o,i})\underline{fo}_i^4(n-1),\quad \overline{fo}_{i,o}^4 = \lambda_{o,i}\overline{fa}_i^4 + (1 - \lambda_{o,i})\overline{fo}_i^4(n-1) \qquad (11)$$

where $\lambda_{o,i} \in [0,1]$ is a recurrent weight.

where oi  [0,1] is a recurrent weight. We make use of a unique temporal firing strength for the i-th rule of the o-th class, because system dynamics may vary in different local regions across different target variables. It is worth noting that, in the feed-forward network architecture, a regression model is usually a function of present and previous input and/or output. This trait results in over-dependence on the system order or the number of delayed input attributes. There exist several types of recurrent loops in the literature: global (Juang and Lin, 1999), local (Juang, 2010) and interactive (Lin et al., 2013), but the local feedback loop is deemed the most appropriate in the realm of EIS because its working principle aligns very well with the local learning property of EIS. Layer 5(Consequent Layer): RIVMcSFNN is inspired by the idea of the functional link neural network using the wavelet transform to construct the extended input feature x e (Abiyev et al., 2008). The wavelet function features the multi-resolution property, which is capable of extracting important information on various resolution levels, thus increasing predictive accuracy while retaining low network complexity. In the realm of EIS, this strategy is an attempt to rectify the output mapping capability of the zero or first order TSK rule consequent, which does not fully explore local output regions. Nonetheless, the use of the wavelet function often slows down the reaction speed against changing data distributions (Genjefar et al., 2014). This phenomenon usually occurs when a large number of rules are evolved in the training process. The recurrent connection in the consequent layer can be integrated to correct this shortcoming (Ganjefar et al., 2014). Suppose that we select the Mexican hat function as the mother wavelet function, the translated and dilated version of the mother wavelet function is defined as follows:

Layer 5 (Consequent Layer): the RIVMcSFNN is inspired by the idea of the functional-link neural network using the wavelet transform to construct the extended input feature $x_e$ (Abiyev et al., 2008). The wavelet function features the multi-resolution property, which is capable of extracting important information at various resolution levels, thus increasing predictive accuracy while retaining low network complexity. In the realm of EIS, this strategy is an attempt to rectify the output mapping capability of the zero or first-order TSK rule consequent, which does not fully explore local output regions. Nonetheless, the use of the wavelet function often slows down the reaction speed against changing data distributions (Ganjefar et al., 2014). This phenomenon usually occurs when a large number of rules are evolved in the training process. The recurrent connection in the consequent layer is integrated to correct this shortcoming (Ganjefar et al., 2014). Supposing that we select the Mexican hat function as the mother wavelet function, the translated and dilated version of the mother wavelet function is defined as follows:

$$\psi_{i,j}(z_{i,j}) = (1 - z_{i,j}^2)\exp\left(-\frac{z_{i,j}^2}{2}\right),\quad z_{i,j} = \left(\frac{d_{i,j} - a_{i,j}}{b_{i,j}}\right) \qquad (12)$$

where $a_{i,j}, b_{i,j}$ stand for the dilation and translation parameters and $d_{i,j}$ is a temporal input variable resulting from the local recurrent connection in the consequent layer. The output of the local recurrent layer is formulated as follows:

$$d_{i,j}(n) = fa_j^5(n) + \theta_{i,j}\psi_{i,j}(n-1) = fo_j^1 + \theta_{i,j}\psi_{i,j}(n-1) = x_j + \theta_{i,j}\psi_{i,j}(n-1) \qquad (13)$$

where $\theta_{i,j}$ is a weight of the self-feedback loop. The extended input variable is induced by using the product operator:

$$x_{e_i} = \prod_{j=1}^{nu} \psi_{i,j}(z_{i,j}) = \prod_{j=1}^{nu} \psi_{i,j}\left(\frac{d_{i,j} - a_{i,j}}{b_{i,j}}\right) \qquad (14)$$

The end output of this layer is formalised by weighting the extended input variable as follows:

$$fo_i^5 = x_{e_i}\Omega_i \qquad (15)$$

where $\Omega_i$ is a connection weight between the temporal firing layer and the output layer.
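A compact sketch of the consequent pipeline (12)-(15) for a single rule is given below; the parameter values are hypothetical and the code only mirrors the reconstructed equations.

```python
import numpy as np

def mexican_hat(z):
    # Translated/dilated mother wavelet, eq. (12)
    return (1.0 - z ** 2) * np.exp(-z ** 2 / 2.0)

def consequent_output(x, a, b, theta, psi_prev, omega):
    """One rule's consequent: self-feedback (13), extended input (14),
    weighted output (15); all arrays are per-feature for a single rule."""
    d = x + theta * psi_prev      # temporal input via self-feedback, eq. (13)
    z = (d - a) / b               # translated/dilated argument of eq. (12)
    psi = mexican_hat(z)          # wavelet activations, reused at step n+1
    x_e = float(np.prod(psi))     # extended input variable, eq. (14)
    return x_e * omega, psi       # rule output, eq. (15)

x = np.array([0.2, -0.1])         # current crisp inputs
a = np.array([0.0, 0.0])          # wavelet parameters (hypothetical values)
b = np.array([1.0, 1.0])
theta = np.array([0.1, 0.1])      # self-feedback weights
psi_prev = np.zeros(2)            # previous wavelet activations
y_i, psi = consequent_output(x, a, b, theta, psi_prev, omega=0.5)
print(y_i)
```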

Notwithstanding that the wavelet transform has been profoundly studied in the existing literature (Abiyev et al., 2008; Abiyev et al., 2013; Ganjefar et al., 2014), to the best of our knowledge, this approach has not been implemented with the interval-valued multivariable Gaussian function and the double-recurrent network architecture. Furthermore, other variants of functional-link neural networks have been proposed in the literature. Polynomial, Chebyshev (Patra et al., 2002) and Trigonometric (Lin et al., 2012) neural networks are among the most renowned functional-link types. The wavelet function is preferred over these variants for the RIVMcSFNN because it incurs fewer parameters to be stored in the memory and operates with a low degree of freedom, which is on a par with the zero-order TSK rule consequent. The wavelet function is also more adaptive than these variants in dealing with a non-stationary environment because its shape is adjustable by manipulating the translation and dilation parameters.

Layer 6 (Output Layer): in this layer, the type reduction mechanism from the interval-valued set to the type-reduced set (type-1 fuzzy set) is committed, using the q design coefficient in lieu of the prominent KM procedure. The working principle of the KM method involves a demanding iterative procedure, because the rule consequents have to be reordered first in ascending order before iteratively obtaining the cross-over points. This layer also performs the defuzzification function, leading to the final crisp output of the RIVMcSFNN. In short, the final crisp output of the RIVMcSFNN is produced as follows:

$$y_o = fo^6 = \frac{\sum_{i=1}^{R}(1 - q_o)\underline{fo}_{i,o}^4 fo_i^5}{\sum_{i=1}^{R}\underline{fo}_{i,o}^4} + \frac{\sum_{i=1}^{R} q_o \overline{fo}_{i,o}^4 fo_i^5}{\sum_{i=1}^{R}\overline{fo}_{i,o}^4} \qquad (16)$$

where R is the number of fuzzy rules and q is the design factor, $q \in \Re^{1 \times no}$. The type reduction mechanism via the q design factor functions by steering the proportion of the upper and lower rules in the final crisp output of the RIVMcSFNN. We modify the normalisation term of the original q design factor (Abiyev et al., 2010) because the original expression can result in an invalid interval. This point is evidenced as follows: given the temporal firing strength $\widetilde{fo}_{i,o}^4 = [\underline{fo}_{i,o}^4, \overline{fo}_{i,o}^4] = [0.65,0.75; 0.58,0.62; 0.34,0.44]$, the original normalisation expression results in an invalid interval $[0.414,0.414; 0.369,0.343; 0.22,0.24]$. Our amendment is based on the division law of interval arithmetic (Moore et al., 2009). Furthermore, the four basic interval operations, namely addition, subtraction, division, and multiplication, are outlined in Appendix B.
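The amended type reduction (16) can be illustrated as follows; note how each of the lower and upper parts is normalised by its own sum of firing strengths, which is the modification that avoids the invalid interval discussed above. The sketch is illustrative, with hypothetical values.

```python
import numpy as np

def crisp_output(fs_lo, fs_hi, y_rules, q):
    """Type reduction and defuzzification with the q design factor, eq. (16):
    the lower and upper parts are each normalised by their own sum."""
    lower_part = (1.0 - q) * np.sum(fs_lo * y_rules) / np.sum(fs_lo)
    upper_part = q * np.sum(fs_hi * y_rules) / np.sum(fs_hi)
    return lower_part + upper_part

fs_lo = np.array([0.65, 0.58, 0.34])   # lower temporal firing strengths
fs_hi = np.array([0.75, 0.62, 0.44])   # upper temporal firing strengths
y_rules = np.array([1.2, -0.3, 0.8])   # rule consequent outputs fo^5
print(crisp_output(fs_lo, fs_hi, y_rules, q=0.5))
```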

IV. Metacognitive Component of the RIVMcSFNN

This section details the metacognitive learning policy of the RIVMcSFNN. The pseudo-code of the RIVMcSFNN is provided in Algorithm 1. The learning structure of the RIVMcSFNN is visualised in Fig.2.

I. What-to-Learn

The what-to-learn component is used to discard inconsequential data streams, and is driven by the Extended Sequential Entropy Method (ESEM).

The what-to-learn components in the literature can be grouped into two versions: supervised and semi-supervised. The supervised variant is exemplified by the sample deletion strategy based on the hinge-loss function (Suresh et al., 2010), whereas the semi-supervised version is run with the use of the online active learning strategy (Pratama et al., 2015(a)). Nonetheless, all of these address the classification problem, which depends on the contour of the decision surface. To the best of our knowledge, the what-to-learn component in (Das et al., 2015) is the only method in the existing literature to address the regression problem. However, the work in (Das et al., 2015) is akin to its predecessors in (Suresh et al., 2010), which monitor the sample potential via the hinge-loss error function. Arguably, this is rather sensitive to the dynamics of system errors, because system errors are possibly high in the over-fitting case. The prominent facet of the ESEM is its aptitude to quantify the sample entropy incrementally. By extension, it signifies the entropy of the neighbourhood probability, thus being able to delineate the relationship between data streams and existing data clouds. The idea of neighbourhood probability was introduced in (Xiong et al., 2014), but it constitutes a batched learning scenario. The entropy measure is combined with the density concept, which is used to hamper outliers from being accepted into the training process. In short, the probability of a data stream occupying the existing fuzzy regions can be calculated as follows:

$$P(X_N \in N_i) = \frac{\frac{1}{N_i}\sum_{n=1}^{N_i} M(X_N, X_n)}{\sum_{i=1}^{R}\frac{1}{N_i}\sum_{n=1}^{N_i} M(X_N, X_n)} \qquad (17)$$

where $X_N$ denotes the latest incoming data stream and $X_n$ indicates the n-th support of the i-th cluster, while $M(X_N, X_n)$ is defined by the similarity measure.

Nevertheless, (17) requires revisiting all preceding data streams, attracting considerable computational power. This problem is tackled by formulating the recursive form of (17):

$$\frac{1}{N_i}\sum_{n=1}^{N_i} M(X_N, X_n) = \frac{\sum_{n=1}^{N_i}\sum_{j=1}^{u}(x_{n,j} - x_{N,j})^2}{(N_i - 1)u} = \frac{\sum_{j=1}^{u}(N_i - 1)x_{N,j}^2 - 2\sum_{j=1}^{u} x_{N,j}\alpha_{N_i,j} + \beta_{N_i}}{(N_i - 1)u} \qquad (18)$$

$$\alpha_{N_i,j} = \alpha_{N_i-1,j} + x_{N_i-1,j},\quad \beta_{N_i} = \beta_{N_i-1} + \sum_{j=1}^{u} x_{N_i-1,j}^2$$
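The practical benefit of (18) is that the average distance between a new sample and all past supports of a cloud can be maintained with O(u) memory. The sketch below shows the single-pass idea with generic accumulators; the exact normalisation constants of (18) were partially lost in the source, so only the recursion itself is illustrated here.

```python
import numpy as np

class CloudStats:
    """Single-pass accumulators so the mean squared distance between a new
    sample and all N_i past supports of a cloud is O(u), in the spirit of (18)."""
    def __init__(self, u):
        self.n = 0
        self.alpha = np.zeros(u)   # running per-feature sum of past samples
        self.beta = 0.0            # running sum of squared norms

    def add(self, x):
        self.n += 1
        self.alpha += x
        self.beta += float(x @ x)

    def mean_sq_dist(self, x_new):
        # (1/N_i) * sum_n ||x_n - x_new||^2, without revisiting old samples
        if self.n == 0:
            return 0.0
        return (self.n * float(x_new @ x_new)
                - 2.0 * float(x_new @ self.alpha) + self.beta) / self.n

stats = CloudStats(u=3)
for x in np.random.default_rng(0).normal(size=(100, 3)):
    stats.add(x)
probe = np.zeros(3)
print(stats.mean_sq_dist(probe))   # matches a brute-force pass over the stream
```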

The entropy of the membership is quantified as follows:

$$H(\mu_{X_n}) = -\sum_{i=1}^{R} P(X_n \in N_i)\log P(X_n \in N_i) \qquad (19)$$

Note that the uncertainty of membership signifies the uncertainty of the datum with respect to the existing network structure, and a data stream with high uncertainty should be accepted as a training sample, because it may minimise the uncertainty in learning the target function. Nonetheless, this concept opens the possibility for outliers to be fed to the how-to-learn component. The nearest neighbours approach is explored to correct this shortcoming, where the uncertainty measure (19) is weighted by its average distance to the R densest local regions of the cognitive component (Hajmohammadi et al., 2015). In essence, the R most populated regions are represented by the focal points of the data clouds, because they have been carefully extracted by the GT2DQ method. The average distance between the queried sample and the focal points is computed as follows:

$$A(X) = \frac{\sum_{i=1}^{R} similarity(X, C_i)}{R} \qquad (20)$$

where $similarity(X, C_i)$ can be defined as any distance function computing the pair-wise similarity value between two examples: Cosine, Euclidean, etc. Accordingly, the final expression of the ESEM method can be formed as follows:

$$H' = H(\mu_{X_N}) \times A(X) \qquad (21)$$

The condition to accept a data stream is set as follows:

$$H' \ge \delta \qquad (22)$$

where $\delta$ is an uncertainty threshold. The uncertainty threshold is not kept constant, but rather is adjusted to adapt to rapidly changing learning environments. One can envisage that the time-varying property of the system dynamics calls for a more intense training process, whereas the static case is often trivial for training to progress. To this end, the threshold is augmented as $\delta_{N+1} = \delta_N(1 + s)$ when a training sample is admitted to the training process ($H' \ge \delta$), in order to mitigate the computational burden, and is lowered as $\delta_{N+1} = \delta_N(1 - s)$ otherwise. s is the step size, which is assigned as s = 0.01, referring to the rule of thumb in (Zliobaite et al., 2014).

II. How-to-Learn

This section elaborates the how-to-learn scenario of the RIVMcSFNN, derived from the Scaffolding theory in cognitive psychology.

A) The Rule Growing Module: a new rule growing scenario, namely the GT2DQ, is developed, which adapts the neuron significance in (Bortman & Aladjem, 2009; Vukovic & Miljkovic, 2013) to the context of the interval-valued multivariable fuzzy rule. Bortman & Aladjem (2009) bring the definition of the neuron statistical contribution of Huang et al. (2004, 2005) to a more general case, where data streams do not follow a uniform distribution, and remove the nu-fold numerical integration, which is only compatible with a small input dimension, to derive the final expression of the neuron significance. The key idea of this method lies in the use of the GMM as the input density function to cope with complex and even irregular data clouds. Vukovic & Miljkovic (2013) extend the formula of neuron significance to the multivariable Gaussian neuron. We further extend the definition of neuron significance to the generalised interval-valued neuron. In light of the neuron significance definition, the significance of the i-th multivariable interval-valued rule is expressed as the $L_u$ norm of the error function weighted by the input density function as follows:

$$E_i = \|\Omega_i\|_u (1-q)\left(\int_{\Re^{nu}} \exp\left(-u\|x - \underline{c}_i\|_{\Sigma_i}^2\right) p(x)dx\right)^{\frac{1}{u}} + \|\Omega_i\|_u\, q\left(\int_{\Re^{nu}} \exp\left(-u\|x - \overline{c}_i\|_{\Sigma_i}^2\right) p(x)dx\right)^{\frac{1}{u}} \qquad (23)$$


where the Gaussian term under the integral is written as $(2\pi/u)^{nu/2}\det(\Sigma_i)^{-1/2}\, N(x; \widetilde{c}_i, \Sigma_i^{-1}/u)$, $\widetilde{c}_i = [\underline{c}_i, \overline{c}_i]$.

One can realise that the neuron significance relies on the input density $p(x)$. The input density $p(x)$ is assumed to follow simple data distributions in (Huang et al., 2005) or a uniform data distribution in (Pratama et al., 2014(a)). This flaw can be remedied using the GMM to cope with a complex input density $p(x)$ as follows:

$$p(x) = \sum_{m=1}^{M} \pi_m N(x; v_m, \Sigma_m) \qquad (24)$$

where $N(x; v_m, \Sigma_m)$ is a multivariable Gaussian probability density function with mean vector $v_m \in \Re^{1 \times nu}$ and covariance matrix $\Sigma_m \in \Re^{nu \times nu}$. $\pi_m$ are the mixing coefficients satisfying $\sum_{m=1}^{M}\pi_m = 1$, $\pi_m \ge 0$. (23) can be further derived using the GMM as the input density $p(x)$ as follows:

$$E_i = \|\Omega_i\|_u (1-q)\left((2\pi/u)^{nu/2}\det(\Sigma_i)^{-1/2}\sum_{m=1}^{M}\pi_m \int_{\Re^{nu}} N(x; \underline{c}_i, \Sigma_i^{-1}/u)\, N(x; v_m, \Sigma_m)dx\right)^{\frac{1}{u}} + \|\Omega_i\|_u\, q\left((2\pi/u)^{nu/2}\det(\Sigma_i)^{-1/2}\sum_{m=1}^{M}\pi_m \int_{\Re^{nu}} N(x; \overline{c}_i, \Sigma_i^{-1}/u)\, N(x; v_m, \Sigma_m)dx\right)^{\frac{1}{u}} \qquad (25)$$

The integral term of (25) can be solved as the integral of a product of two Gaussian distributions, $\int_{\Re^{nu}} N(x; \widetilde{c}_i, \Sigma_i^{-1}/u)\, N(x; v_m, \Sigma_m)dx = N(\widetilde{c}_i - v_m; 0, \Sigma_i^{-1}/u + \Sigma_m)$. Accordingly, the final expression of the GT2DQ method is formalised as follows:

$$E_i = \|\Omega_i\|_u (1-q)\left((2\pi/u)^{nu/2}\det(\Sigma_i)^{-1/2}\, \underline{N}_i \pi^T\right)^{\frac{1}{u}} + \|\Omega_i\|_u\, q\left((2\pi/u)^{nu/2}\det(\Sigma_i)^{-1/2}\, \overline{N}_i \pi^T\right)^{\frac{1}{u}} \qquad (26)$$

where $\pi$ is a vector of mixing coefficients $\pi = [\pi_1,..,\pi_m,..,\pi_M] \in \Re^{1 \times M}$ and $\underline{N}_i, \overline{N}_i$ are defined as follows:

$$\underline{N}_i = \left[N(\underline{c}_i - v_1; 0, \Sigma_i^{-1}/u + \Sigma_1),..., N(\underline{c}_i - v_m; 0, \Sigma_i^{-1}/u + \Sigma_m),..., N(\underline{c}_i - v_M; 0, \Sigma_i^{-1}/u + \Sigma_M)\right]$$

$$\overline{N}_i = \left[N(\overline{c}_i - v_1; 0, \Sigma_i^{-1}/u + \Sigma_1),..., N(\overline{c}_i - v_m; 0, \Sigma_i^{-1}/u + \Sigma_m),..., N(\overline{c}_i - v_M; 0, \Sigma_i^{-1}/u + \Sigma_M)\right]$$

We make use of the $L_2$ norm in this paper (u = 2) because it is commonly used in the literature. Moreover, the parameters of the GMM, namely the mean $v_m$, the covariance matrix $\Sigma_m$, the mixing coefficients $\pi_m$, and the number of mixture models M, are obtained using pre-recorded data points $N_{prehistory}$, as in (Huang et al., 2005; Bortman & Aladjem, 2009; Vukovic & Miljkovic, 2013). Being granted access to prehistory samples is in practice not difficult, especially in the big data era in which we live today. Moreover, the number of pre-recorded samples happens to be significantly smaller than the total number of training data points, $N_{prehistory} \ll N$. The sensitivity of the RIVMcSFNN to various numbers of pre-recorded samples is studied in Section V, which shows that performance is not sensitive to the choice of $N_{prehistory}$.
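Assuming the reconstruction of (26) above, the sketch below evaluates the GT2DQ statistical contribution of one interval-valued rule under a GMM input density, exploiting the fact that each Gaussian product integral collapses to a Gaussian evaluated at the centroid offset. The `norm_const` argument stands for the $(2\pi/u)^{nu/2}\det(\Sigma_i)^{-1/2}$ factor and `rule_cov` for the rule Gaussian covariance written $\Sigma_i^{-1}/u$ in the text; both are assumptions inherited from the reconstruction, and all numeric values are hypothetical.

```python
import numpy as np

def mvn_pdf(x, cov):
    # Zero-mean multivariate Gaussian density evaluated at point x
    k = len(x)
    quad = float(x @ np.linalg.solve(cov, x))
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))

def gt2dq_significance(c_lo, c_hi, rule_cov, norm_const, omega_norm, q,
                       weights, means, covs, u=2):
    # Each product integral collapses to a Gaussian at the centroid offset,
    # so the GMM-weighted contribution of eq. (26) becomes a finite sum.
    n_lo = sum(w * mvn_pdf(c_lo - v, rule_cov + cm)
               for w, v, cm in zip(weights, means, covs))
    n_hi = sum(w * mvn_pdf(c_hi - v, rule_cov + cm)
               for w, v, cm in zip(weights, means, covs))
    return omega_norm * ((1 - q) * (norm_const * n_lo) ** (1 / u)
                         + q * (norm_const * n_hi) ** (1 / u))

# Two-component GMM over a 2-D input space (all values hypothetical)
weights = [0.6, 0.4]
means = [np.zeros(2), np.ones(2)]
covs = [np.eye(2) * 0.5, np.eye(2) * 0.8]
E = gt2dq_significance(np.array([0.1, 0.1]), np.array([0.3, 0.3]),
                       rule_cov=np.eye(2) * 0.2, norm_const=1.0,
                       omega_norm=1.0, q=0.5,
                       weights=weights, means=means, covs=covs)
print(E)
```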


For the sake of the rule growing process, $\underline{c}_i, \overline{c}_i, \Sigma_i^{-1}$ are replaced by $\underline{c}_{R+1}, \overline{c}_{R+1}, \Sigma_{R+1}^{-1}$. In other words, a hypothetical rule is created based on an incoming data stream. The hypothetical rule is crafted as follows:

$$\widetilde{C}_{R+1} = X_N \pm \Delta_X,\quad diag(\Sigma_{R+1}) = \frac{\max\left((C_i - C_{i-1}), (C_i - C_{i+1})\right)}{\sqrt{\ln(1/\varepsilon)}},\ \varepsilon = 0.5 \qquad (27)$$

where $\varepsilon$ is a predefined constant, which determines the degree of rule base completeness and is simply set as 0.5 as in the existing literature. This initialisation strategy of the covariance matrix (Wu et al., 2003) is selected because it ensures that the rule base attains the $\varepsilon$-completeness condition. $\Delta_X$ is an uncertainty factor, which controls the uncertainty level to be embedded into a new interval-valued rule. It is set as 0.1 for simplicity. The hypothetical rule is ascertained to be a new rule, given that some rule growing criteria are satisfied. In the realm of the GT2DQ method, the hypothetical rule is deemed important for the training process if it offers a significant statistical contribution as follows:

$$\max_{i=1,...,R}(E_i) \le E_{R+1} \qquad (28)$$

The hypothetical rule is deemed important provided (28) is met. It can be observed in (28) that we come up with a different rule growing condition from its predecessors, where the hypothetical rule is appended as a new rule when it makes a higher statistical contribution than the existing rules. This case reflects the compatibility of the hypothetical rule with the current data trend over existing ones. In addition, the GT2DQ conveys the possible future contribution of the hypothetical rule, because the statistical contribution is approximated over the whole training region, as indicated by the integration using the input density function. This makes the GT2DQ method more robust against outliers. On the other hand, we would also like to avoid the use of predefined thresholds as in the case of its predecessors, which are problem specific and usually require a laborious trial-and-error process to find the best value for a specific problem. Although an exact solution to choose the threshold is provided by Bortman & Aladjem (2009), it is impractical or even impossible to obtain the minimum root mean square error in the online real-time environment. Notwithstanding that the GT2DQ method contains an implicit distance measure of an incoming data stream as a result of setting the hypothetical rule in (27), it still does not pinpoint the clear location of an incoming data stream in the input space. The location of the data stream plays an important role in circumventing significant overlap of fuzzy rules in the input space. To correct this shortcoming, another rule growing criterion is integrated using the distance-based input partitioning principle, which is accomplished using the spatial firing strength in (10). This is defined as follows:

$$FS \le \rho,\quad FS = \max_{i=1,..,R}\left(q\,\overline{fo}_i^3 + (1-q)\underline{fo}_i^3\right) \qquad (29)$$

where $\rho$ refers to the critical value of the chi-square distribution $\chi^2$ with nu degrees of freedom and a significance level of $\alpha$, $\rho = \exp(-\chi^2(\alpha))$ (Tabata & Kudo, 2010). A typical value of $\alpha$ is 5%. We apply the q design factor in computing FS in order to take into account the effect of the upper and lower rules. The rule growing condition (29) ensures that the hypothetical rule lies at a sufficiently distant proximity from the existing rules, thus inducing a low risk of overlap. The same approach is also implemented by Huang et al. (2005), Bortman & Aladjem (2009) and Vukovic & Miljkovic (2013), where a distance-based rule growing method is fitted in the fuzzy rule generation process. Nevertheless, our strategy differs somewhat from these works, as the spatial firing strength is utilised as an alternative to a point-to-point distance measure. This strategy eases the selection of the predefined threshold $\rho$, because we can refer to the chi-square distribution to find a plausible value for the threshold. Note that the rule which has the highest spatial firing strength, as indicated by the second part of (29), is also called the winning rule. The winning rule can also be chosen based on the rule posterior probability, as shown in (Pratama et al., 2014(b)). Although the Bayesian method is more accurate than the firing strength method, it incurs additional computational cost. The hypothetical rule is ascertained as a new rule, provided that it complies with (28) and (29). The antecedent part of a new rule is set as the hypothetical rule (27), whereas the consequent part is assigned as follows:

$$\Omega_{R+1} = \Omega_{win},\quad \Psi_{R+1} = \omega \qquad (30)$$

where $\omega$ is a large positive constant, fixed at $\omega = 10^5$.

order as the zero-order TSK fuzzy system, the output covariance term is a constant – a single dimension only. The new local sub-model – rule consequent – is allocated as the winning rule to shorten the convergence time because the winning rule is supposed to portray similar local output behaviour with the new rule. In the training process, the contribution of the hypothetical rule perhaps does not suffice to trigger the rule addition process, but it is still useful for refining the coverage of the network structure. This situation often occurs when the hypothetical rule offers minimum statistical contribution. In other words, (25), (26) are not met. Adding a new rule in this case is counterproductive because rule redundancy will be aggravated. This situation can be addressed by simply fine-tuning the antecedent part of the winning rule to absorb the information conveyed by the newest datum, while still retaining the current configuration of the network structure as follows: ~ ~ N win N 1 ~ N 1 ( X N  Cwin N 1 ) ~ Cwin N  C  , C win  [C win , C win ] win N win N 1  1 N win N 1  1  win ( N )  1 

$$\tilde{C}_{win}^{N} = \frac{N_{win}^{N-1}\,\tilde{C}_{win}^{N-1} + X_N}{N_{win}^{N-1}+1} = \tilde{C}_{win}^{N-1} + \frac{X_N - \tilde{C}_{win}^{N-1}}{N_{win}^{N-1}+1}, \qquad \tilde{C}_{win} = [\underline{C}_{win}, \overline{C}_{win}] \qquad (31)$$

$$\Sigma_{win}^{-1}(N) = \frac{\Sigma_{win}^{-1}(N-1)}{1-\alpha} - \frac{\alpha}{1-\alpha}\,\frac{\left(\Sigma_{win}^{-1}(N-1)\,(X_N-\hat{C}_{win}^{N-1})\right)\left(\Sigma_{win}^{-1}(N-1)\,(X_N-\hat{C}_{win}^{N-1})\right)^{T}}{1+\alpha\,(X_N-\hat{C}_{win}^{N-1})^{T}\,\Sigma_{win}^{-1}(N-1)\,(X_N-\hat{C}_{win}^{N-1})} \qquad (32)$$

$$N_{win}^{N} = N_{win}^{N-1} + 1 \qquad (33)$$

where $\alpha = 1/(N_{win}^{N-1}+1)$ and $\hat{C}_{win} = (\underline{C}_{win} + \overline{C}_{win})/2$. This adaptive mechanism is derived from the concept of the

sequential maximum likelihood principle, extended to the case of the interval-valued multivariate Gaussian function. We make use of the mid-point of the uncertain centroids to adapt the certain input covariance matrix. A direct adjustment of the inverse covariance matrix is applied without a re-inversion process. The re-inversion process retards the model update and even causes unstable computation when the covariance matrix is ill-defined. In the light of Scaffolding theory, the rule growing and adaptation mechanisms in this sub-section can be classified as the problematizing component of active supervision, because they relate to the drift handling approach as a result of governing the model update in accordance with the learning context. From the viewpoint of how drift is overcome, the RIVMcSFNN adopts a passive approach by continuously updating its structure as new information is presented, and does not rely on a dedicated drift detection approach (Ditzler et al., 2015).
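To make the premise adaptation above concrete, the following minimal sketch applies (31)-(33) to a single incoming sample. The function name, the flat NumPy representation of a rule, and the toy usage are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def update_winning_rule(c_lower, c_upper, inv_cov, n_win, x):
    """One-step premise adaptation of the winning rule, cf. (31)-(33).

    c_lower, c_upper : interval-valued centroid endpoints (1-D arrays)
    inv_cov          : current inverse covariance matrix of the rule
    n_win            : number of samples the rule has absorbed so far
    x                : newest input sample (1-D array)
    """
    alpha = 1.0 / (n_win + 1)

    # (31): recursive mean update, applied to both interval endpoints.
    c_lower = c_lower + alpha * (x - c_lower)
    c_upper = c_upper + alpha * (x - c_upper)

    # The mid-point of the uncertain centroids drives the covariance update.
    c_mid = 0.5 * (c_lower + c_upper)

    # (32): direct rank-one update of the inverse covariance matrix
    # (Sherman-Morrison style) -- no re-inversion is required.
    d = x - c_mid
    v = inv_cov @ d
    inv_cov = inv_cov / (1.0 - alpha) \
        - (alpha / (1.0 - alpha)) * np.outer(v, v) / (1.0 + alpha * (d @ v))

    # (33): increment the rule's support.
    return c_lower, c_upper, inv_cov, n_win + 1

# Toy usage with a hypothetical two-dimensional rule:
cl, cu = np.array([0.0, 0.0]), np.array([0.2, 0.2])
P, n = np.eye(2), 10
cl, cu, P, n = update_winning_rule(cl, cu, P, n, np.array([0.5, -0.1]))
```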

B) The Rule Pruning Module: The concept of neuron significance is also utilised in the rule pruning module, because it is capable of pinpointing a superfluous fuzzy rule which does not play a significant role during its lifespan. A rule pruning method, namely the GT2RS method, is proposed in this paper, which enhances those of (Bortman & Aladjem, 2009; Vukovic & Miljkovic, 2013) using the interval-valued multivariate Gaussian function. The GT2RS method shares the same principle as its rule growing counterpart, where the contribution of a fuzzy rule is judged based on its statistical significance (28). In short, a fuzzy rule is discarded from the training process if the following condition is observed:

$$E_i \leq mean(E_i) - 2\,std(E_i), \qquad mean(E_i) = \frac{\sum_{n=1}^{N} E_{i,n}}{N}, \qquad std(E_i) = \sqrt{\frac{\sum_{n=1}^{N}\left(E_{i,n} - mean(E_i)\right)^{2}}{N-1}} \qquad (34)$$

where the mean and standard deviation of (34) can be calculated recursively with ease. Condition (34) examines the

statistical contribution of the i-th rule during its lifespan and analyses the downtrend of that contribution. The efficacy of the GT2RS method is verifiable from its rigorous approximation of rule significance, which takes into account the overall training region. Furthermore, the GMM-based input density function p(x) is capable of handling complex and irregular data distributions. This also implies that the future contribution of the i-th fuzzy rule is considered when gauging rule significance. Moreover, the GT2RS formula (28) also takes into account the impact of the local sub-model $\Omega_i$, which is often overlooked by the vast majority of rule pruning methods. The GT2RS method represents the fading facet of active supervision in Scaffolding theory.
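Since the pruning test (34) only needs a running mean and standard deviation of each rule's statistical contribution, it can be evaluated in a single pass. The sketch below keeps Welford-style accumulators per rule; the class and attribute names are illustrative, not the authors' implementation.

```python
import math

class RuleSignificanceTracker:
    """Tracks the statistical contribution E_i of one rule, cf. (34)."""

    def __init__(self):
        self.n = 0          # samples observed during the rule's lifespan
        self.mean = 0.0     # running mean of E_i
        self.m2 = 0.0       # running sum of squared deviations (Welford)

    def update(self, e):
        # Single-pass update of the mean and variance accumulators.
        self.n += 1
        delta = e - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (e - self.mean)

    def should_prune(self, e):
        # Condition (34): prune when the current contribution falls more
        # than two standard deviations below its lifetime mean.
        if self.n < 2:
            return False
        std = math.sqrt(self.m2 / (self.n - 1))
        return e <= self.mean - 2.0 * std

tracker = RuleSignificanceTracker()
for e in [0.9, 0.8, 0.85, 0.9, 0.1]:   # hypothetical contributions
    prune = tracker.should_prune(e)
    tracker.update(e)
```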

C) Rule Forgetting Mechanism: Although the rule pruning scenario mainly represents the fading aspect of Scaffolding theory, it can also function as the problematizing aspect when the data distribution is used as a factor in the rule pruning decision. This is evidenced by its capability to capture obsolete rules, which are no longer relevant to the current data trend due to a shift or drift in the data concept. The RIVMcSFNN incorporates another rule pruning method – the T2RMI method – to detect obsolete rules, where the key idea is to investigate the correlation between the fuzzy rule and the target concept. Note that the underlying difference between the T2RMI method and the RMI method in (Gan et al., 2014) is noticeable in the incremental working framework of the T2RMI method. Moreover, the T2RMI method is specifically tailored to the working framework of the interval-valued fuzzy system. The relationship between the two variables can be analysed using a linear or nonlinear measure. Although the linear measure is simple to use and consumes low computational power, it is inaccurate because the interaction between two variables is nonlinear in nature (Mitra et al., 2002). The RIVMcSFNN uses the maximum compression index (MCI) (Mitra et al., 2002) to improve the robustness of the linear correlation measure. In contrast with other linear correlation measures such as the Pearson coefficient, the MCI is insensitive to rotation. The MCI is another salient feature distinguishing the T2RMI method from the RMI method, because the RMI method is still supported by the classic symmetrical uncertainty approach. The T2RMI method is formulated as follows:

$$\lambda(\tilde{fo}_i^{3}, y_o) = q_o\,\lambda(\overline{fo}_i^{3}, y_o) + (1-q_o)\,\lambda(\underline{fo}_i^{3}, y_o) \qquad (35)$$

$$\lambda(\underline{fo}_i^{3}, y_o) = \frac{1}{2}\left(var(\underline{fo}_i^{3}) + var(y_o) - \sqrt{\left(var(\underline{fo}_i^{3})+var(y_o)\right)^{2} - 4\,var(\underline{fo}_i^{3})\,var(y_o)\left(1-\rho(\underline{fo}_i^{3},y_o)^{2}\right)}\right) \qquad (36)$$

$$\rho(\underline{fo}_i^{3}, y_o) = \frac{cov(\underline{fo}_i^{3}, y_o)}{\sqrt{var(\underline{fo}_i^{3})\,var(y_o)}} \qquad (37)$$

where $var(\underline{fo}_i^{3})$, $cov(\underline{fo}_i^{3}, y_o)$ and $\rho(\underline{fo}_i^{3}, y_o)$ respectively denote the variance of the i-th lower fuzzy rule, the covariance of the i-th lower fuzzy rule and the output variable, and the Pearson index of the i-th lower fuzzy rule and the output variable, all of which can be calculated on the fly with ease. The same strategy also applies in the case of the i-th upper fuzzy rule $\lambda(\overline{fo}_i^{3}, y_o)$. The fuzzy rule is represented by the spatial firing strength, because the spatial firing strength abstracts the relevance of the fuzzy rule in the input space. In principle, $\lambda(\tilde{fo}_i^{3}, y_o)$ signifies the eigenvalue for the normal direction to the principal component of the two variables $(\tilde{fo}_i^{3}, y_o)$, where maximum information compression is achieved when data is projected along its principal component direction. Therefore, the MCI is able to enumerate the cost of discarding the i-th rule from the training process, aiming to attain the maximum amount of information compression. The MCI method has the following interesting properties: 1) $0 \leq \lambda(fo_i^{3}, y_o) \leq 0.5\left(var(fo_i^{3}) + var(y_o)\right)$; 2) a maximum correlation is indicated by $\lambda(fo_i^{3}, y_o) = 0$; 3) it has the symmetric property $\lambda(fo_i^{3}, y_o) = \lambda(y_o, fo_i^{3})$; 4) because the mean expression is overlooked, it is invariant against translation of the dataset; 5) it is also robust against rotation, since the perpendicular distance of a point to a line is not affected by the rotation of the input features. The fuzzy rule is deemed obsolete if the following condition is observed:

$$\lambda_{i,o} \leq mean(\lambda_{i,o}) - 2\,std(\lambda_{i,o}), \qquad mean(\lambda_{i,o}) = \frac{\sum_{n=1}^{N}\lambda_{i,o}^{n}}{N}, \qquad std(\lambda_{i,o}) = \sqrt{\frac{\sum_{n=1}^{N}\left(\lambda_{i,o}^{n} - mean(\lambda_{i,o})\right)^{2}}{N-1}} \qquad (38)$$

The T2RMI method adopts the same condition as the GT2RS method to prune the fuzzy rule, because this situation pinpoints a fuzzy rule which is no longer relevant to the current data distribution as a result of concept drift.
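A minimal sketch of the maximum compression index in (36)-(37) is given below, computed here from batch moments for clarity, although the same quantities can be maintained recursively on the fly; the function and variable names are ours. For the full T2RMI measure, the lower and upper firing-strength values are blended via (35).

```python
import numpy as np

def mci(f, y):
    """Maximum compression index lambda(f, y), cf. (36)-(37).

    f : samples of a rule's spatial firing strength
    y : samples of the output variable
    Returns the smallest eigenvalue of the 2x2 covariance matrix of
    (f, y), i.e. the variance normal to the principal direction.
    """
    var_f, var_y = np.var(f), np.var(y)
    rho = np.corrcoef(f, y)[0, 1]          # Pearson index, cf. (37)
    disc = (var_f + var_y) ** 2 - 4.0 * var_f * var_y * (1.0 - rho ** 2)
    return 0.5 * (var_f + var_y - np.sqrt(disc))    # cf. (36)

# A strongly correlated pair yields lambda close to 0 (property 2).
y = np.linspace(0.0, 1.0, 100)
f = 2.0 * y + 0.01 * np.random.randn(100)
lam = mci(f, y)
```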

D) The Rule Recall Module: The T2RMI method can also serve as the rule recall indicator, because it tracks the relevance of the fuzzy rule to the output of the RIVMcSFNN. In other words, a pruned fuzzy rule is not permanently discarded but is added to a list of rules pruned by the T2RMI method, $R^* = R^* + 1$, where $R^*$ is the number of rules deactivated by the T2RMI method. Such a rule can be revived in the future, provided that it becomes relevant again to the system's output. The rule recall technique is an effective solution for cyclic drift, where an old data distribution re-emerges, thus increasing the relevance of obsolete rules. Note that introducing a completely new rule in the presence of cyclic drift is not consistent with the flexible and evolving nature of EIS and also omits the learning history in a local region. The rule recall scenario is applied subject to the following criterion:

$$\max_{i^*=1,\dots,R^*}\left(\lambda_{i^*}\right) > \max_{i=1,\dots,R}\left(\lambda_i\right) \qquad (39)$$

The condition above reveals a situation where the validity of an obsolete rule is higher than that of any of the existing rules. Hence, the obsolete rule brings the most compatible concept to delineate the current data trend and should be reactivated as follows:

$$R = R + 1, \qquad \tilde{C}_{R+1} = \tilde{C}_{i^*}, \qquad \Sigma_{R+1}^{-1} = \Sigma_{i^*}^{-1}, \qquad \Omega_{R+1} = \Omega_{i^*} \qquad (40)$$

Although the rule recall scenario requires the obsolete rule to be retained in memory, the computational burden is still alleviated, because pruned rules are excluded from every learning scenario except (35). This rule recall concept also differs from the rule recall scenario in (Pratama et al., 2015(a)), because it is not a part of the rule growing module. In other words, this module also functions as another rule generation strategy; please refer to Algorithm 1 for clarity. Furthermore, we adopt the concept of fuzzy rule relevance through the T2RMI method instead of the density concept. Since the rule recall module is capable of overcoming cyclic drift, it portrays the problematizing part of Scaffolding theory.

E) Rule Merging Mechanism: The transparency aspect is an important consideration in the design process of EIS, because most EISs in the literature are built on the fuzzy system, which aims to address the black-box nature of other machine learning variants. In practice, it may occur that two rules, initially portraying disjoint regions, become significantly overlapping, because subsequent training samples fill the gap between them. This phenomenon is even more apparent in the context of data stream mining, where access to a complete dataset when clustering data clouds cannot be obtained. This consequently opens the possibility for two rules to move together, which may end in a significantly overlapping position (Lughofer et al., 2011). Hence, an online elimination mechanism for local redundancies – the online rule merging technique – is required to lower complexity while improving rule interpretability. This module corresponds to the fading aspect of Scaffolding theory. Some attempts have been devoted to integrating the online rule merging strategy in EIS (Lughofer et al., 2011; Pratama et al., 2014(b)). Nonetheless, these approaches are over-dependent on a problem-specific predefined threshold, limiting the flexibility of EIS. They merely merge two rules based on their similarity degree without looking deeply at their geometric interpretation in the product space. The RIVMcSFNN is equipped with the type-2 geometric criteria (T2GC). The T2GC constitutes a type-2 version of the geometric criteria of Lughofer et al. (2015), which were designed for classical type-1 fuzzy rules. The T2GC is built on a two-faceted strategy: overlapping degree and homogeneity. Note that these two criteria are applied merely to check the similarity of the winning rule, because the winning rule is the only one receiving the rule premise adaptation (31)-(33) – a major cause of overlap. Therefore, this procedure relieves the computational burden.

• Overlapping Degree: The overlapping degree examines whether or not two rules are redundant by studying their similarity level. Because we want to create a threshold-free rule merging process and the RIVMcSFNN is constructed with the multivariate Gaussian function, the Bhattacharyya distance is utilised (Bhattacharyya, 1943). The underlying advantage of the Bhattacharyya distance lies in its capability to provide an exact measure of whether two clusters are disjoint, touching or overlapping in the absence of a predefined threshold. The overlapping degree between the winning rule and the other rules $i \in \{1,\dots,R\} \setminus \{win\}$ is defined as:

$$s_1(win, i) = (1-q_o)\,\underline{s}_1(win,i) + q_o\,\overline{s}_1(win,i) \qquad (41)$$

$$\overline{s}_1(win,i) = \frac{1}{8}\left(\overline{c}_{win} - \overline{c}_i\right)^{T} \Sigma^{-1} \left(\overline{c}_{win} - \overline{c}_i\right) + \frac{1}{2}\ln\left(\frac{\det\left(\Sigma^{-1}\right)}{\sqrt{\det\left(\Sigma_{win}^{-1}\right)\det\left(\Sigma_i^{-1}\right)}}\right) \qquad (42)$$


where $\Sigma^{-1} = (\Sigma_{win}^{-1} + \Sigma_i^{-1})/2$. The lower counterpart $\underline{s}_1(win,i)$ is calculated analogously to $\overline{s}_1(win,i)$ in (42). Two clusters are overlapping when $s_1(win,i) < 0$ and disjoint when $s_1(win,i) > 0$, whereas $s_1(win,i) = 0$ signifies that the two clusters are touching. In the RIVMcSFNN, the rule merging process is deemed necessary provided that the two rules are overlapping or touching, as follows:

$$s_1(win, i) \leq 0 \qquad (43)$$

It is worth noting that the use of the Bhattacharyya distance fits the RIVMcSFNN's rule, because the multivariate Gaussian function in the Bhattacharyya distance has a one-to-one relationship with that of the RIVMcSFNN.
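The following sketch evaluates the upper-rule term (42) of the overlap measure for two multivariate Gaussian rules; the lower term is computed identically from the lower centroids, and (41) blends the two endpoints. Function and argument names are ours.

```python
import numpy as np

def bhattacharyya_overlap(c_win, inv_cov_win, c_i, inv_cov_i):
    """One endpoint of the overlapping degree, cf. (42).

    c_win, c_i             : rule centroids (1-D arrays)
    inv_cov_win, inv_cov_i : inverse covariance matrices
    """
    inv_cov = 0.5 * (inv_cov_win + inv_cov_i)   # averaged precision
    d = c_win - c_i
    term1 = 0.125 * d @ inv_cov @ d             # Mahalanobis-style part
    term2 = 0.5 * np.log(
        np.linalg.det(inv_cov)
        / np.sqrt(np.linalg.det(inv_cov_win) * np.linalg.det(inv_cov_i))
    )
    return term1 + term2

# Identical rules give 0 (touching); distant rules give a positive value.
c1, c2 = np.zeros(2), np.array([3.0, 0.0])
P = np.eye(2)
s_same = bhattacharyya_overlap(c1, P, c1, P)    # 0.0
s_far = bhattacharyya_overlap(c1, P, c2, P)     # > 0
```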

• Homogeneity Criterion: The homogeneity of clusters plays a crucial role in merging two clusters, because merging non-homogeneous clusters leads to cluster delamination, undermining the generalisation and representation of local data clouds (Lughofer et al., 2015(a); Lughofer et al., 2015(b)). Cluster delamination refers to an over-sized cluster covering two or more distinguishable data clouds. The homogeneity of two clusters can be perceived from the size of the cluster resulting from the rule merging process. This measure is formulated by examining the volume of the merged cluster in contrast with the independent volumes as follows:

$$\underline{V}_{merged},\ \overline{V}_{merged} \leq u\left(\underline{V}_i + \overline{V}_i + \underline{V}_{win} + \overline{V}_{win}\right) \qquad (44)$$

Equation (44) indicates a minor prospect of cluster delamination, because the volume of the merged cluster does not exceed the volume of the two independent clusters: thus, the two clusters form a joint homogeneous region. The term $u$ is involved in (44) to hinder the curse of dimensionality. If all rule merging conditions are met, the two merging candidates are coalesced. Because a rule containing more supports should have a higher influence on the ultimate shape and orientation of the merged cluster, the rule merging procedure is steered by the weighted average strategy (Lughofer et al., 2011(e)) as follows:

$$\tilde{C}_{merged}^{new} = \frac{\tilde{C}_{win}^{old}\,N_{win}^{old} + \tilde{C}_i^{old}\,N_i^{old}}{N_{win}^{old} + N_i^{old}}, \qquad \tilde{C}_i = [\underline{C}_i, \overline{C}_i], \qquad N_{merged}^{new} = N_{win}^{old} + N_i^{old} \qquad (45)$$

$$\Sigma_{merged}^{-1\,(new)} = \frac{\Sigma_{win}^{-1\,(old)}\,N_{win}^{old} + \Sigma_i^{-1\,(old)}\,N_i^{old}}{N_{win}^{old} + N_i^{old}} \qquad (46)$$

The two fuzzy rules are weighted with their accumulated supports to strengthen the leverage of the more populated cluster on the merged cluster. This strategy assures adequate coverage of the underlying data distribution and avoids a loss of cluster support.
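When (43) and (44) both hold, the two rules are coalesced; a sketch of the support-weighted merge (45)-(46) is given below. The dictionary container is a hypothetical representation, not the authors' data structure.

```python
import numpy as np

def merge_rules(rule_win, rule_i):
    """Support-weighted coalescence of two rules, cf. (45)-(46).

    Each rule is a dict with interval centroids 'c_lower'/'c_upper',
    an inverse covariance 'inv_cov' and a support count 'n'.
    """
    n_w, n_i = rule_win["n"], rule_i["n"]
    w = n_w + n_i
    return {
        # (45): centroids weighted by accumulated supports, applied to
        # both interval endpoints.
        "c_lower": (rule_win["c_lower"] * n_w + rule_i["c_lower"] * n_i) / w,
        "c_upper": (rule_win["c_upper"] * n_w + rule_i["c_upper"] * n_i) / w,
        # (46): the inverse covariance is blended with the same weights.
        "inv_cov": (rule_win["inv_cov"] * n_w + rule_i["inv_cov"] * n_i) / w,
        "n": w,
    }

r1 = {"c_lower": np.zeros(2), "c_upper": np.full(2, 0.1),
      "inv_cov": np.eye(2), "n": 30}
r2 = {"c_lower": np.full(2, 0.2), "c_upper": np.full(2, 0.3),
      "inv_cov": 2.0 * np.eye(2), "n": 10}
merged = merge_rules(r1, r2)   # pulled toward the more populated r1
```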

F) Online Feature Selection Mechanism: The feature selection mechanism plays a crucial role in the success of EIS, because it not only reduces computational complexity, but also makes modelling problems easier to solve. Because of its capability to simplify a problem's complexity, feature selection characterises the complexity reduction part of Scaffolding theory. Nevertheless, the vast majority of EISs and McLMs in the existing literature assume that feature selection is part of a pre-processing step. Some attempts have been devoted to devising an online feature selection scenario consolidated in the main training process. The first approach uses an online dimensionality reduction technique (Angelov, 2010; Pratama et al., 2014b). This approach often results in discontinuity of the learning process, which incurs instability of the training process. The online feature weighting approach (Lughofer, 2011; Lughofer et al., 2015(a)) was developed as an alternative, which minimises the contribution of inconsequential features by allocating them a low weight. Although this strategy rectifies the setback in learning performance, the complexity issue remains unsolved, because superfluous input attributes are still kept in memory. Furthermore, currently applied online feature selection scenarios in EIS have not considered redundancy among input attributes: the importance of an input attribute is investigated by measuring the relevance between input attributes and target variables only. The feature selection scenario of the RIVMcSFNN is built upon the MBC (Yu et al., 2004), modified to work completely in the single-pass learning scenario – the SMBC. The salient trait of the SMBC lies in the synergy of relevance and redundancy in inspecting the contribution of an input attribute. Furthermore, the SMBC assures the stability of the input pruning process, because the importance of an input attribute is judged from its mutual information with other input features, which does not depend on a variation of the data distribution. Adding an input feature which shares a strong similarity to the existing ones will not help to boost the learning performance. According to the Markov blanket theory, an input feature can be classified into four groups in respect of its contribution: irrelevant; weakly relevant but redundant; weakly relevant but non-redundant; and strongly relevant. The goal of the SMBC is to eliminate the irrelevant, and weakly relevant but redundant, input features from the training process, so that only weakly relevant but non-redundant, and strongly relevant variables are maintained in the training process. To this end, two tests, namely the C-Correlation and the F-Correlation, are developed, where the C-Correlation focuses on the issue of relevance and the F-Correlation concerns the issue of redundancy. The two correlations are defined as follows:

Definition 1 (C-Correlation) (Yu et al., 2004): The relevance of an input feature is indicated by the correlation of the input feature $x_j$ and the target variable $t_o$, which is measured by the C-correlation $C(x_j, t_o)$.

Definition 2 (F-Correlation) (Yu et al., 2004): The issue of redundancy is signified by the similarity degree of two different input variables $x_j, x_{j_1}$, $j \neq j_1$. The measure of similarity between two input attributes is called the F-correlation $F(x_j, x_{j_1})$.

The MCI method in (35)-(37) is adopted to analyse the C- and F-correlations, and this is done by simply replacing $(\tilde{fo}_i^{3}, y_o)$ in (35)-(37) with $(x_j, t_o)$ and $(x_j, x_{j_1})$. Otherwise, the symmetrical uncertainty method, combined with the differential entropy approach, can be used as an alternative; the differential entropy is, however, rather inaccurate, because it assumes the training samples to be uniformly distributed. The SMBC is implemented in a two-staged manner, where the C-correlation is carried out first so that inconsequential features are eliminated. This scenario relieves complexity, because the next step, the F-correlation, can be run using a smaller number of input variables. That is, the F-correlation merely evaluates those features coming through $C(x_j, t_o) \geq \delta$. Conversely, features violating this condition are pruned without significant loss of accuracy, because they are deemed irrelevant to the learning context. $\delta$ is a predefined threshold, set to its default value $\delta = 0.8$. After completing the procedure of the C-correlation, the F-correlation takes place, where it mainly seeks weakly relevant but redundant input attributes by investigating the similarity between two relevant or weakly relevant input variables satisfying $C(x_j, t_o) \geq \delta$. Since an input feature $x_j$ forms an approximate Markov blanket for $x_{j_1}$ if $C(x_j, t_o) \geq C(x_{j_1}, t_o)$ and $F(x_j, x_{j_1}) \geq C(x_{j_1}, t_o)$, the remaining input variables, after finishing the C-correlation, are sorted in descending order. The similarity check procedure starts from the feature in the highest position. In other words, a more relevant feature is exploited to filter out other input attributes for which it forms an approximate Markov blanket. Specifically, in the F-correlation phase, an input feature $x_{j_1}$ is pruned when $C(x_{j_1}, t_o) \leq F(x_j, x_{j_1})$. This procedure is pictorially illustrated in Fig. 4, where $C(x_1, t_o) \geq C(x_2, t_o) \geq C(x_3, t_o) \geq \dots \geq C(x_u, t_o)$. It can be perceived that $x_1$ is the point of departure of the F-correlation and forms approximate Markov blankets with $x_2$ and $x_4$; the procedure afterward proceeds with $x_3$, which happens to filter out $x_6$.
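The two-staged SMBC screening can be sketched as follows. The helper below is a simplified, single-batch illustration of the relevance-then-redundancy logic, not the single-pass bookkeeping of the actual method, and it assumes correlation measures where larger values mean stronger association (the MCI itself reaches 0 at maximum correlation, so it would be transformed accordingly).

```python
def smbc_select(c_corr, f_corr, features, delta=0.8):
    """Two-staged relevance/redundancy screening, cf. Section II.F.

    c_corr   : dict feature -> C-correlation with the target
    f_corr   : dict (feature, feature) -> F-correlation
    features : list of candidate feature names
    delta    : relevance threshold (default 0.8, as in the text)
    """
    # Stage 1 (C-correlation): drop irrelevant features.
    relevant = [f for f in features if c_corr[f] >= delta]
    # Sort the survivors by descending relevance.
    relevant.sort(key=lambda f: c_corr[f], reverse=True)

    # Stage 2 (F-correlation): a more relevant feature removes any
    # weaker feature for which it forms an approximate Markov blanket.
    selected = []
    for f in relevant:
        blanketed = any(
            c_corr[g] >= c_corr[f]
            and f_corr[tuple(sorted((g, f)))] >= c_corr[f]
            for g in selected
        )
        if not blanketed:
            selected.append(f)
    return selected

c = {"x1": 0.95, "x2": 0.9, "x3": 0.85, "x4": 0.5}
f = {("x1", "x2"): 0.92, ("x1", "x3"): 0.2, ("x2", "x3"): 0.3}
kept = smbc_select(c, f, ["x1", "x2", "x3", "x4"])
# x4 fails stage 1; x2 is blanketed by x1; x1 and x3 survive.
```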

G) Adaptation of Network Parameters: The free parameters of the RIVMcSFNN, namely the dilation and translation parameters, the design coefficients and the recurrent weights, are fine-tuned using the zero-order density maximisation (ZEDM) principle. The ZEDM method advances the conventional gradient descent scenario by minimising the error entropy as an alternative to the mean square error (MSE) as the cost function. Minimising the error entropy is equivalent to reducing the distance between the probability distribution of the target function and that of the predictive output, thereby leading to a more reliable approximation of high-order statistical behaviour than the MSE-based approach. Although this approach was articulated in (Pratama et al., 2015(a)), the scenario is recounted here because the use of the ZEDM method for the recurrent weights and the dilation and translation parameters is uncharted. Because an exact and precise model of the error entropy is too complex to be derived with the first-principle technique, the cost function is formulated using the Parzen window density estimation method:

$$\hat{f}(0) = \frac{1}{N\sqrt{2\pi}\,\sigma}\sum_{n=1}^{N}\exp\left(\frac{-e_{n,o}^{2}}{2\sigma^{2}}\right) = \frac{1}{N\sqrt{2\pi}\,\sigma}\sum_{n=1}^{N}K\left(\frac{-e_{n,o}^{2}}{2\sigma^{2}}\right) \qquad (47)$$

where

$e_{n,o}$ is the system error of the o-th output variable, $\sigma$ is a smoothing parameter, simply fixed as 1, and N is

the total number of samples seen so far. The adaptation process is formed using the gradient descent scenario as follows:

$$q_o(N) = q_o(N-1) + \eta_q\frac{\partial \hat{f}(0)}{\partial q_o} = q_o(N-1) - \frac{\eta_q}{N\sqrt{2\pi}\,\sigma^{3}}\sum_{n=1}^{N}K\left(\frac{-e_{n,o}^{2}}{2\sigma^{2}}\right)\frac{\partial E}{\partial q_o} \qquad (48)$$

$$a_{i,j}(N) = a_{i,j}(N-1) + \eta_a\frac{\partial \hat{f}(0)}{\partial a_{i,j}} = a_{i,j}(N-1) - \frac{\eta_a}{N\sqrt{2\pi}\,\sigma^{3}}\sum_{n=1}^{N}K\left(\frac{-e_{n,o}^{2}}{2\sigma^{2}}\right)\frac{\partial E}{\partial a_{i,j}} \qquad (49)$$

$$b_{i,j}(N) = b_{i,j}(N-1) + \eta_b\frac{\partial \hat{f}(0)}{\partial b_{i,j}} = b_{i,j}(N-1) - \frac{\eta_b}{N\sqrt{2\pi}\,\sigma^{3}}\sum_{n=1}^{N}K\left(\frac{-e_{n,o}^{2}}{2\sigma^{2}}\right)\frac{\partial E}{\partial b_{i,j}} \qquad (50)$$

$$\theta_{i,j}(N) = \theta_{i,j}(N-1) + \eta_\theta\frac{\partial \hat{f}(0)}{\partial \theta_{i,j}} = \theta_{i,j}(N-1) - \frac{\eta_\theta}{N\sqrt{2\pi}\,\sigma^{3}}\sum_{n=1}^{N}K\left(\frac{-e_{n,o}^{2}}{2\sigma^{2}}\right)\frac{\partial E}{\partial \theta_{i,j}} \qquad (51)$$

$$\varphi_{i,j}(N) = \varphi_{i,j}(N-1) + \eta_\varphi\frac{\partial \hat{f}(0)}{\partial \varphi_{i,j}} = \varphi_{i,j}(N-1) - \frac{\eta_\varphi}{N\sqrt{2\pi}\,\sigma^{3}}\sum_{n=1}^{N}K\left(\frac{-e_{n,o}^{2}}{2\sigma^{2}}\right)\frac{\partial E}{\partial \varphi_{i,j}} \qquad (52)$$


where  q , a ,b , , denote the learning rates, elicited using the Lyapunov stability criteria to guarantee the asymptotic convergence and

( yo  to ) 2 E stands for the gradient of the squared error with respect to the free parameters to 2 x N

be tuned. Note that the recursive formula of

 K ( e n 1

. The gradient of the parameters of interest

E

ai , j

2 n ,o

, E

2) is expressed as

bi , j

, E

qo

, E

 eN , o 2 N e 2 )  exp( n )  AN  AN  1  exp( 2 2 n 1

 i , j

, E

i , j can be obtained using

$$\frac{\partial E}{\partial q_o} = (y_n - t_n)\left(\frac{\sum_{i=1}^{R}\overline{fo}_{i,o}^{4}\,\overline{fo}_i^{5}}{\sum_{k=1}^{R}\overline{fo}_{k,o}^{4}} - \frac{\sum_{i=1}^{R}\underline{fo}_{i,o}^{4}\,\underline{fo}_i^{5}}{\sum_{k=1}^{R}\underline{fo}_{k,o}^{4}}\right) \qquad (53)$$

$$\frac{\partial E}{\partial a_{i,j}} = \frac{\partial E}{\partial y_o}\,\frac{\partial y_o}{\partial fo_i^{5}}\,\frac{\partial fo_i^{5}}{\partial \zeta_i}\,\frac{\partial \zeta_i}{\partial z_{i,j}}\,\frac{\partial z_{i,j}}{\partial a_{i,j}}, \qquad \frac{\partial z_{i,j}}{\partial a_{i,j}} = -\frac{z_{i,j}}{a_{i,j}} \qquad (54)$$

$$\frac{\partial E}{\partial b_{i,j}} = \frac{\partial E}{\partial y_o}\,\frac{\partial y_o}{\partial fo_i^{5}}\,\frac{\partial fo_i^{5}}{\partial \zeta_i}\,\frac{\partial \zeta_i}{\partial z_{i,j}}\,\frac{\partial z_{i,j}}{\partial b_{i,j}}, \qquad \frac{\partial z_{i,j}}{\partial b_{i,j}} = -\frac{1}{a_{i,j}} \qquad (55)$$

$$\frac{\partial E}{\partial \theta_{i,j}} = \frac{\partial E}{\partial y_o}\,\frac{\partial y_o}{\partial fo_i^{5}}\,\frac{\partial fo_i^{5}}{\partial \zeta_i}\,\frac{\partial \zeta_i}{\partial z_{i,j}}\,\frac{\partial z_{i,j}}{\partial d_{i,j}}\,\frac{\partial d_{i,j}}{\partial \theta_{i,j}}, \qquad \frac{\partial d_{i,j}}{\partial \theta_{i,j}} = fo_{i,j}(n-1) \qquad (56)$$

$$\frac{\partial E}{\partial \varphi_{i,j}} = \frac{\partial E}{\partial y_o}\,\frac{\partial y_o}{\partial \tilde{fo}_{i,o}^{4}}\,\frac{\partial \tilde{fo}_{i,o}^{4}}{\partial \varphi_{i,j}} = (y_n - t_n)\left(\frac{(1-q_o)\,\underline{fo}_i^{5}}{\sum_{k=1}^{R}\underline{fo}_{k,o}^{4}} + \frac{q_o\,\overline{fo}_i^{5}}{\sum_{k=1}^{R}\overline{fo}_{k,o}^{4}}\right)\left(fo_i^{3} - fo_{i,o}^{4}(n-1)\right) \qquad (57)$$

It is well known that the success of the gradient descent method is determined by the learning rate, because it controls the step size of the model update. A too-small learning rate results in a sluggish model update or, in the extreme case, a parameter that never converges, whereas a too-large learning rate causes the parameter to oscillate. Hence, the stable interval of the learning rate is derived using the Lyapunov stability method as follows:

Theorem 1: First, we form a vector of the learning rates $\eta_W = [\eta_q, \eta_a, \eta_b, \eta_\theta, \eta_\varphi]$, where $W = [q_o, a, b, \theta, \varphi]$ is the vector containing the respective parameters. We define $P_{W_o,\max}$ as

$$P_{W_o,\max} = \max_{n=1,\dots,N}\left[\frac{\partial y_o}{\partial q_o}, \frac{\partial y_o}{\partial a}, \frac{\partial y_o}{\partial b}, \frac{\partial y_o}{\partial \theta}, \frac{\partial y_o}{\partial \varphi}\right]$$

The asymptotic convergence is ensured when the learning rate vector $\eta_W$ lies in the interval

$$0 < \eta_W < \frac{2N\sqrt{2\pi}\,\sigma^{2}}{\left(P_{W_o,\max}\right)^{2} A_N}$$

Proof: The proof of Theorem 1 is akin to that in (Pratama et al., 2015(c)). The difference is that we now have the extra parameters $(a, b, \theta, \varphi)$ to be adjusted. Being well-discussed in the literature, the proof is omitted here.

Rather than being fixed, the learning rates are adapted in accordance with the learning context to expedite convergence. When the cost function (47) increases from the previous training episode, the learning rate augments, whereas it diminishes when the cost function decreases. In other words, we refer to the dynamics of the cost function to adjust the learning rates as follows:

$$\eta_W(N) = \begin{cases}\eta_1\,\eta_W(N-1), & \hat{f}(0)^{N} \geq \hat{f}(0)^{N-1}\\ \eta_2\,\eta_W(N-1), & \hat{f}(0)^{N} < \hat{f}(0)^{N-1}\end{cases}, \qquad 0 < \eta_2 < 1 < \eta_1 \qquad (58)$$

where $\eta_1 \in (1, 1.5]$ and $\eta_2 \in [0.5, 1)$ label the learning rate factors, which steer the change of the learning rates. It is worth noting that, as verified in (Pratama et al., 2015(a)), these parameters are not problem-specific and are set as $\eta_1 = 1.1$, $\eta_2 = 0.9$ for all simulations here. This learning mechanism represents the passive supervision of Scaffolding theory, because the adaptation process relies on the gradient of the parameters with respect to the cost function.
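Putting (47), (48)-(52) and (58) together, a generic scalar update can be sketched as below: the kernel sum is maintained recursively, the gradient step is scaled by it, and the learning rate is multiplied by $\eta_1$ or $\eta_2$ according to the movement of the cost (47). The class, the scalar parameter and the toy fitting task are illustrative assumptions, not the authors' implementation.

```python
import math

class ZEDMUpdater:
    """Sketch of a ZEDM-style update with an adaptive learning rate."""

    def __init__(self, eta=0.05, sigma=1.0, eta_up=1.1, eta_down=0.9):
        self.eta, self.sigma = eta, sigma
        self.eta_up, self.eta_down = eta_up, eta_down  # eta_1, eta_2 in (58)
        self.A = 0.0        # recursive kernel sum A_N
        self.n = 0
        self.f_prev = 0.0   # previous value of the cost (47)

    def step(self, w, error, dE_dw):
        # Recursive accumulation of the kernel sum: A_N = A_{N-1} + K(.).
        self.n += 1
        self.A += math.exp(-error ** 2 / (2.0 * self.sigma ** 2))
        # Parzen-window estimate of the error density at zero, cf. (47).
        f_hat = self.A / (self.n * math.sqrt(2.0 * math.pi) * self.sigma)
        # (58): grow the rate when the density improved, shrink otherwise.
        self.eta *= self.eta_up if f_hat >= self.f_prev else self.eta_down
        self.f_prev = f_hat
        # Gradient step scaled by the kernel sum, cf. (48)-(52).
        scale = self.A / (self.n * math.sqrt(2.0 * math.pi) * self.sigma ** 3)
        return w - self.eta * scale * dE_dw

# Toy usage: fit a scalar weight w so that y = w * x matches t = 2x.
upd, w = ZEDMUpdater(), 0.0
for x, t in [(1.0, 2.0), (0.5, 1.0), (2.0, 4.0)] * 10:
    e = w * x - t
    w = upd.step(w, e, dE_dw=e * x)   # dE/dw for E = e^2 / 2
```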

H) Adaptation of the Rule Consequent: The adaptation of the rule consequent represents the passive supervision of Scaffolding theory, because it relies on the system error, actualising the action-consequence mechanism. The rule output of the RIVMcSFNN is updated using the Fuzzily Weighted Generalised Recursive Least Square (FWGRLS) method (Pratama et al., 2014), which forms a local version of the Generalised Recursive Least Square (GRLS) method (Xu, Wong & Leung, 2006). The salient characteristic of the FWGRLS method is its weight decay term, which is capable of forcing the weight vector to hover around a small bounded interval. Moreover, the local learning scenario renders a flexible mechanism and greater robustness, because each rule is fine-tuned separately. Therefore, the learning procedures of a specific rule do not affect the stability and convergence of other rules. The local learning scenario also increases the interpretability of the TSK fuzzy rule (Lughofer, 2013), because a rule consequent can be interpreted as a local hyperplane, snuggling along the real trend of the approximation curve, and reflects a particular operating region of the target space. As the FWGRLS method has been well-deliberated in the literature (Pratama et al., 2014(a)-(c)), we do not recount it in this paper.

III. When-To-Learn

A data stream can be meaningful for future model updates even though it does not trigger sufficient conflict to activate the sample learning process in the current episode. This situation is pinpointed when a data stream meets (28) but is not accepted by (29). This aspect is corroborated by the finding of the GT2DQ method, where a data stream with a high statistical contribution may play an important role in a future training episode. Such a sample should be reserved for future use and is called a reserved sample, $(X_n, T_n) \rightarrow (XS_{NS+1}, TS_{NS+1})$, where NS is the number of reserved samples. The reserved sample is supposed to advance the completeness of the network structure, because it may convey new system behaviour uncharted by the underlying training samples. The condition to reserve a data stream for future use is as follows:

$$\max_{i=1,\dots,R}(E_i) \leq E_{R+1} \quad \text{and} \quad FS > \rho \qquad (59)$$

Reserved samples are injected into the training process after the main training samples have been fully consumed. In the realm of data stream mining, learning on reserved samples is triggered when the system is idle. The when-to-learn procedure thus controls the termination of the training process. In an ideal case, the training process ends when all samples have been learned. As the RIVMcSFNN deals with the life-long learning scenario, the training process is terminated when the number of reserved samples no longer changes.
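The when-to-learn gate can be sketched as a simple router: each sample either updates the model, is parked in a reserve buffer when it satisfies (59), or is skipped. The predicate names stand in for the tests (28) and (29) and are assumptions of this illustration, as is the duck-typed model object.

```python
def route_sample(x, t, model, reserve):
    """When-to-learn gate, cf. (59).

    model   : object exposing hypothetical predicates and a train step
    reserve : list collecting samples reserved for future episodes
    """
    high_contribution = model.passes_gt2dq(x)     # test (28)
    far_from_rules = model.passes_firing_test(x)  # test (29)

    if high_contribution and far_from_rules:
        model.train_on(x, t)            # grow/update immediately
    elif high_contribution:
        reserve.append((x, t))          # condition (59): park for later
    # Otherwise the sample is discarded (what-to-learn rejection).

def drain_reserve(model, reserve):
    """Inject reserved samples once the main stream is consumed; stop
    when the number of reserved samples no longer changes."""
    while True:
        pending, reserve[:] = list(reserve), []
        for x, t in pending:
            route_sample(x, t, model, reserve)
        if len(reserve) >= len(pending):   # no progress: terminate
            break
```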

IV. Complexity Analysis

This section discusses the computational burden and memory demand of the RIVMcSFNN. The RIVMcSFNN actualises the metacognitive scaffolding principle, which possesses the three learning components, what-to-learn, how-to-learn and when-to-learn, and adopts Scaffolding theory to construct the how-to-learn phase. The what-to-learn scenario, built on the online active learning scenario, namely the uncertainty measure, incurs a computational cost of $WTL = O(R)$. Since the spirit of the when-to-learn phase is to train the model using the reserved samples, the computational burden of this module is in the order of $WNTL = NS \times HTL$, where $NS$ denotes the number of reserved samples and $HTL$ stands for the computational load of the how-to-learn phase. The how-to-learn component based on the Scaffolding concept attracts a computational burden in the order of $HTL = O(7R + R^* + u^2 + u + u^2R + 2uR + RM)$. The total computational complexity is then the summation of the three constituents. Noticeably, this burden is still comparable with its counterparts, such as rClass (Pratama et al., 2015(a)) and gClass (Pratama et al., 2015(b)), whose computational complexities can be observed in the original publications. The resultant computational burden is then $WTL + \delta\,HTL + WNTL$, where $\delta$ stands for the likelihood

of a data stream being accepted for the model update. The computational burden of the RIVMcSFNN is, however, more economical than that of a standard EIS, because it is equipped with the what-to-learn component, exempting superfluous samples from model updates. This claim is numerically validated in Section V. The cognitive constituent of the RIVMcSFNN generates $3\times(R\times u) + (u\times u)\times R + 2\times(R\times m)$ network parameters as a result of the double local recurrent connections, the wavelet parameters, and the multivariate interval-valued hidden nodes. The memory demand is deemed more frugal than the interval-valued network structure of eT2Class (Pratama et al., 2015(c)), because the network topology of the RIVMcSFNN does not incorporate the interval uncertainty in the rule consequent. In addition, the consequent part of the RIVMcSFNN has a lower DoF than that of eT2Class, which is driven by the nonlinear Chebyshev polynomial.

V. Proof of Concepts

A. Sensitivity Analysis of the Pre-recorded History: This section studies to what extent the number of pre-recorded samples $N_{prehistory}$, used to initialise the parameters of the GMM, affects the learning performance of the RIVMcSFNN. It aims to confirm our claim that $N_{prehistory}$ is not problem-specific and is significantly smaller than the number of data points N. The sensitivity analysis is carried out using the formula of (Juang et al., 2010):

$$sensitivity = \frac{\sum_{j=1}^{k-1}\left(EC\left(N_{prehistory}^{j+1}\right) - EC\left(N_{prehistory}^{j}\right)\right)}{k\left(\xi_{\max} - \xi_{\min}\right)} \qquad (60)$$

where k is the number of variations used to test the sensitivity to $N_{prehistory}$, $\xi_{\max}, \xi_{\min}$ are the maximum and minimum values of the evaluation criterion respectively, and EC is the evaluation criterion. This formula investigates the linear correlation between the variation of the parameter of interest and the learning criterion. We rely on three learning criteria: the Non-Dimensional Error Index (NDEI), the number of fuzzy rules, and the execution time. The analysis was undertaken using the prominent Box-Jenkins gas furnace problem, which aims to model the CO2 level in off-gas. The prediction is guided by two input attributes: the methane flow rate $u(n)$ and the previous one-step output $t(n-1)$. From the literature, it is concluded that the best regression model of the Box-Jenkins gas furnace problem is given by $\hat{y}(n) = f(u(n-4), t(n-1))$. This problem consists of 290 samples, where 200 samples are used for the training process and the remainder are injected as testing samples. We set k = 5 and vary the number of pre-recorded samples as $N_{prehistory} = 10, 30, 50, 80, 100$ to analyse the impact of this parameter choice on the learning performance of the RIVMcSFNN. The learning performance of the RIVMcSFNN is evaluated using three criteria: the number of fuzzy rules, the execution time, and the NDEI. The numerical results are reported in Table 2. It is shown in Table 2 that the number of pre-recorded samples used to estimate the parameters of the GMM incurred a negligible impact on the learning performance of the RIVMcSFNN. This fact also confirms the viability of the RIVMcSFNN in the online learning scenario, because a small number of initial samples sufficed to construct a reliable GMM to estimate the complex input density $p(x)$. We choose $N_{prehistory} = 10$ in all simulations of this paper.

It is worth noting that such prehistory samples are not hard to collect in practice.

B. Efficacy of Learning Components: The advantages of the RIVMcSFNN's learning components are explored in this section. The investigation focuses on studying to what extent each learning module contributes to the resultant learning performance of the RIVMcSFNN. To this end, the same case study, namely the Box-Jenkins gas furnace problem, and the same experimental procedure as in Section V.A were deployed to test the effectiveness of each learning module, and the RIVMcSFNN was simulated under five different learning configurations: A) the RIVMcSFNN is built upon the feedforward network architecture; this setting is useful to draw conclusions on the benefits of the double recurrent network architecture. B) the what-to-learn and when-to-learn parts of the RIVMcSFNN are removed; in this setting, the RIVMcSFNN functions as a standard EFS, illustrating the learning performance of the RIVMcSFNN in the absence of the sample deletion and sample reservation strategies. C) we switch off the T2GC and GT2RS methods, which play the fading role of Scaffolding theory. D) this configuration rules out the T2RMI method, which performs the rule forgetting and recall scenarios; this setting aims to examine the impact of the problematizing component of the Scaffolding scenario. E) this configuration depicts the RIVMcSFNN when all learning modules are integrated. All numerical results are summarised in Table 3, and the learning performance of all configurations is portrayed using four criteria: fuzzy rules, runtime, NDEI and training samples. Referring to Table 3, each learning module plays a substantial role in the numerical results of the RIVMcSFNN, where the absence of a particular learning module undermines the learning performance. The what-to-learn and when-to-learn components of the RIVMcSFNN were capable of reducing the number of training samples, thus significantly speeding up the execution time with a subtle loss of predictive accuracy. The structural complexity of the RIVMcSFNN dramatically increased, while suffering a substantial decline in accuracy, without the fading part of Scaffolding theory. Furthermore, the efficacy of the problematizing component was proven with configuration D, where the accuracy and runtime of the RIVMcSFNN were affected when the problematizing component was switched off.

C. Numerical Studies in Various Real-World and Synthetic Problems: This section elaborates the efficacy of the RIVMcSFNN in tackling various real-world and artificial case studies, where five problems, namely tool-wear prognosis, appraisal of residential premise prices, identification of a SISO temporal system, prediction of the S&P 500 index time series, and modelling of NOx emissions, are put forward to assess the viability of the RIVMcSFNN. Furthermore, the RIVMcSFNN is compared with prominent machine learning algorithms against six criteria: predictive accuracy, fuzzy rules, input attributes, runtime, training samples, and network parameters. Nine learning algorithms, namely eT2Class (Pratama et al., 2016), Simp_eTS (Angelov et al., 2005), eTS (Angelov et al., 2004), BARTFIS (Oentaryo et al., 2014), PANFIS (Pratama et al., 2014a), GENEFIS (Pratama et al., 2014b), DFNN (Wu et al., 2001), GDFNN (Wu et al., 2003), and ANFIS (Jang, 1993), are consolidated in our numerical study for comparison purposes, and their salient characteristics are detailed as follows:

• Evolving Learning Algorithms: eT2Class, Simp_eTS, eTS, BARTFIS, PANFIS and GENEFIS can be classified as evolving learning algorithms, which incorporate an open structure philosophy while adopting the single-pass learning mode. They can be perceived as predecessors of metacognitive learning, because they merely focus on the how-to-learn process without the what-to-learn and when-to-learn components. eT2Class, Simp_eTS, eTS, BARTFIS, PANFIS and GENEFIS can be distinguished mainly in their structural learning scenarios, and all of them except GENEFIS are not yet equipped with an online feature selection.

• Semi-evolving Learning Algorithms: DFNN and GDFNN exemplify semi-evolving learning algorithms, because they demonstrate a flexible working framework with rule growing and pruning mechanisms, but they are only applicable to the offline learning scenario; they need to revisit previously seen samples when observing new samples. GDFNN is an enhanced version of DFNN with ellipsoidal rules and some additional learning modules.

• Offline Learning Algorithm: ANFIS is an offline algorithm, because it relies on a static structure and works on a batched learning principle due to its multi-pass nature. Nonetheless, it is deemed a hard benchmark for the first and second variants of learning algorithms, because it permits an iterative working procedure over multiple epochs. It has full access to the complete dataset during the training process, but it is impractical for the online real-time learning scenario due to its computationally prohibitive nature.

Our numerical studies were carried out in the environment of a Microsoft Surface Book with an Intel(R) Core(TM) i7-2600 CPU, 3.4 GHz processor and 8 GB memory.

• Tool wear prediction of a high speed milling process: This case study discusses tool wear prognosis in the high speed milling process, namely a ball-nose end milling process (courtesy of Dr. Li Xiang, Singapore). Tool wear prognosis of the high speed milling process remains a very complex issue for academia and industry because of the use of multi-point cutting tools at high speed, varying machining parameters, and the inconsistency and variability of cutter geometry/dimensions. Advanced machine learning techniques which can produce accurate tool wear predictions in the online real-time mode are urgently required, because they avoid unnecessary stoppages of the machining process to check a tool's condition. Our experiment was performed with a CNC milling machine (Röders Tech RFM760) with a spindle rate of up to 42,000 RPM, where raw data were acquired using a seven-channel DAQ attached to a dynamometer, an accelerometer, and an acoustic emission (AE) sensor. The dynamometer and accelerometer measure the cutting force and vibration in the three cutting axes (X, Y, Z) respectively, while the AE signal is captured by the AE sensor. The tool wear was determined from a visual inspection of the flank wear using an Olympus SZX16 microscope. We relied on the force signal to predict the tool wear, where 12 features were extracted, because the force signal provides the most detailed information on tool wear. A total of 630 data tuples were generated for our experiment, which was run in accordance with the 10-fold cross-validation (CV) procedure to cope with the data order dependency problem (Stone, 1974). The RIVMcSFNN was configured in two settings: with and without the online feature selection process (Section II.F). The numerical results refer to the average over the 10-fold CV process and are summarised in Table 4. Fig. 4(a) displays the evolution of the training samples at each time stamp and Fig. 4(b) visualises the trace of the input features. From Table 4, it can be seen that the RIVMcSFNN outperforms the other algorithms in attaining the trade-off between complexity and accuracy. The RIVMcSFNN was capable of achieving the highest accuracy, while delivering the fastest execution time. The RIVMcSFNN also generated relatively few network parameters, just behind eTS and Simp_eTS, which are structured under the type-1 feedforward fuzzy system. This result occurs because the RIVMcSFNN evolved the fewest rules and is equipped with online feature selection, lowering the input dimension. Moreover, the RIVMcSFNN is the sole algorithm capable of undertaking the online sample selection scenario. To the best of our knowledge, very few EISs in the literature have integrated the online sample selection scenario for the regression problem, and the online sample selection mechanism has been limited to classification problems only. Fig. 4(a) illustrates that the RIVMcSFNN is capable of dynamically ruling out inconsequential samples from the training process, while Fig. 4(b) shows the online feature selection strategy of the RIVMcSFNN. It can find a lower-dimensional space, in which the problem is best represented, without significant loss of accuracy, while discarding superfluous samples.

• Modelling of the S&P 500 Index Time Series: This case study concerns an application of the RIVMcSFNN to a real-world financial problem, namely the S&P 500 index time series. This problem is worth considering to evaluate the potency of EIS because of the volatile nature of real-world financial problems, which requires a self-organising model that can adapt to any variation of the data streams. The experiment was undertaken using daily S&P 500 index values collected from the Yahoo! Finance website for the period of January 3rd, 1950 to March 12th, 2009. 14,893 data points, non-uniformly distributed in the interval [16.66, 1565.15], were obtained and utilised to train the model. The reversed data were exploited to test the generalisation performance. The S&P 500 index time series data are highly dynamic and volatile, and the most complex part is seen after 1980. The RIVMcSFNN was set in two settings: with and without the online feature selection (Section II.F). As with the previous case study, the learning performance of the consolidated algorithms is portrayed from six angles, and the numerical results are tabulated in Table 5. The RIVMcSFNN was capable of delivering the most accurate prediction while retaining the lowest structural complexity. The effectiveness of the online feature selection is shown in Table 5, where the RIVMcSFNN attained the highest predictive accuracy when the online feature selection method was activated. This also suppresses the network parameters to a modest level. The online sample selection module, what-to-learn, can rule out superfluous samples from the training process, thereby leading to a significant reduction in the training samples and an improved training speed. It is worth mentioning that this learning strategy rejects redundant samples, thus boosting accuracy.

• Prediction of the NOx Emissions of a Car Engine: This section studies the learning performance of the RIVMcSFNN in modelling the NOx emissions of a car engine (Lughofer, 2011(c)). This problem offers relevant characteristics to evaluate the efficacy of EIS: highly uncertain and non-stationary features. The highly uncertain characteristic is justified by the noisy nature of this problem, because data were sampled from the raw data of a real car engine, while the non-stationary component is induced by two underlying attributes in the engine control, namely rotation speed and torque, which were not fixed but were varied to simulate various driving behaviours. Furthermore, this case study possesses an appealing trait for examining the online feature selection strategy of the RIVMcSFNN, because the predictive task is guided by 170 input attributes. The 170 input features were extracted from 17 different physical variables, recorded in 10 consecutive measurements by hard sensors mounted in the car engine. To demonstrate the effectiveness of the recurrent network structure and the online feature selection strategy, the RIVMcSFNN was run in two experimental settings: (A) the RIVMcSFNN was executed with all 170 input attributes; (B) the RIVMcSFNN was performed without any lagged input variables – it only used the last time-step sample n-1. The online feature selection strategy was switched on for both configurations. The numerical results are summarised in Table 6. Fig. 4(c) displays the predictive trend of the RIVMcSFNN. The RIVMcSFNN produced the highest predictive accuracy while generating the fewest fuzzy rules. The RIVMcSFNN also achieved a faster execution time than eTS and ANFIS as a result of the what-to-learn component, which reduces the number of training samples. This case study also showed that the RIVMcSFNN produced reliable numerical results in the absence of time-delayed input attributes. This fact bears out the potency of the double recurrent network architecture of the RIVMcSFNN and can be observed from Fig. 4(c), where the RIVMcSFNN models the NOx emissions of a car engine accurately.

• Identification of a Temporal SISO Dynamic System: This case study aims to investigate the spatio-temporal learning property of the RIVMcSFNN on a benchmark problem, namely SISO dynamic system identification. This problem possesses challenging characteristics because it features temporal behaviour resulting from the external control input. The numerical results are reported in Table 7. Fig. 4(d) portrays the predictive trend of the RIVMcSFNN and the trace of the fuzzy rules. This problem is governed by the following nonlinear mathematical model:

$$y(n+1) = 0.72\,y(n) + 0.025\,y(n-1)\,u(n-1) + 0.01\,u^{2}(n-2) + 0.2\,u(n-3) \qquad (61)$$

where $y(n), u(n)$ are respectively the system output and the control input. This problem excludes any lagged input attributes and the regression task is built upon $y(n+1) = f(y(n), u(n))$. As only two input attributes, namely $y(n), u(n)$, are fed to navigate the predictive task, the RIVMcSFNN is configured in only one setting, without the online feature selection mechanism. The training dataset consists of 900 records, where the control input for the first half of the training samples is uniformly distributed in the interval [-2, 2], whereas the remainder of the data rely on the sinusoidal function $u(n) = 1.05\sin(\pi n/45)$ as the control input. The validation data comprise 1000 data pairs with a temporal input variable as follows:

$$u(n) = \begin{cases}\sin(\pi n/25), & n < 250\\ 1.0, & 250 \leq n < 500\\ -1.0, & 500 \leq n < 750\\ 0.3\sin(\pi n/25) + 0.1\sin(\pi n/32) + 0.6\sin(\pi n/10), & 750 \leq n < 1000\end{cases} \qquad (62)$$
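For reproducibility, the benchmark series (61)-(62) can be generated as follows; the snippet mirrors the plant equation and the piecewise validation input described above, with our own function names.

```python
import numpy as np

def plant(y, u, n):
    """Nonlinear SISO plant, cf. (61)."""
    return (0.72 * y[n] + 0.025 * y[n - 1] * u[n - 1]
            + 0.01 * u[n - 2] ** 2 + 0.2 * u[n - 3])

def validation_input(n):
    """Piecewise temporal control input, cf. (62)."""
    if n < 250:
        return np.sin(np.pi * n / 25)
    if n < 500:
        return 1.0
    if n < 750:
        return -1.0
    return (0.3 * np.sin(np.pi * n / 25) + 0.1 * np.sin(np.pi * n / 32)
            + 0.6 * np.sin(np.pi * n / 10))

N = 1000
u = np.array([validation_input(n) for n in range(N)])
y = np.zeros(N + 1)
for n in range(3, N):
    y[n + 1] = plant(y, u, n)   # regression target: y(n+1) = f(y(n), u(n))
```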

It can be observed from Table 7 that the RIVMcSFNN generated the most encouraging numerical results. The RIVMcSFNN had the highest predictive accuracy while retaining the most compact and parsimonious rule base. This fact justifies the double recurrent structure of the RIVMcSFNN, which can address the temporal system dynamics effectively. Moreover, the effectiveness of metacognitive learning is noticeable here in the significant contrast in runtime with eT2Class, which represents the pure EIS without the what-to-learn and when-to-learn components. Fig. 4(c) shows the predictive quality of the RIVMcSFNN, demonstrating its very accurate nature. In addition, the evolving characteristic of the RIVMcSFNN is depicted in Fig. 4(d), where its fuzzy rules are automatically generated, pruned, recalled, and coalesced during the training process. At around n = 300, a fuzzy rule is removed from the training process because it is no longer relevant; this fuzzy rule is reactivated at n = 450 because the old data distribution re-appears. This finding is in line with the characteristic of the nonlinear dynamic system, featuring a temporal characteristic in the form of a cyclic drift.

• Appraisal of Residential Premise Prices: This part of the paper discusses an application of the RIVMcSFNN to the appraisal of residential property prices. Data were collected from one of the large cities in Poland, with 980K residents. 50,000 records were collected during an 11-year period from 1998 until 2008 (Lughofer, 2011(d)). This problem offers the non-stationary characteristic inherent to the property market, which is relevant for investigating the potency of EIS. Five input features, namely the usable area of the premises, the age of the building, the number of rooms in a flat, the floor on which a flat is located, and the distance from the city centre, are used to predict house prices. The experiment was carried out in accordance with the periodic hold-out process, where 5-consecutive-year data form the training set, while the testing phase was carried out using the subsequent year's data. The consolidated numerical results are shown in Table 8. The RIVMcSFNN was compiled in one configuration only, where all learning modules, including the feature selection module, are switched on. Moreover, no time-lagged input features were exploited to perform the one-step-ahead prediction of residential property prices, which relied on the most recent measurement n-1. From Table 8, it can be seen that the RIVMcSFNN arrived at the most encouraging numerical results, where it achieved high accuracy, just behind DFNN and GDFNN, while crafting the fewest fuzzy rules. Note that DFNN and GDFNN delivered reliable predictive accuracy at the cost of computational complexity. Although the RIVMcSFNN is built on the double recurrent network architecture, generating the generalised interval type-2 fuzzy rule, the RIVMcSFNN produced relatively few network parameters, coming in second place after GENEFIS. Note that GENEFIS is constructed under the type-1 feedforward fuzzy system, which has fewer free network parameters to be fine-tuned during the training process. From this result, it is evident that the cognitive component of the RIVMcSFNN requires fewer fuzzy rules than the conventional fuzzy system to cover the data distribution. Moreover, the feature selection scenario of the RIVMcSFNN is capable of lowering the input dimension without loss of generalisation.

D. Statistical Tests: This section presents a statistical test of the numerical results of the previous section, which aims to investigate whether the difference in performance between the RIVMcSFNN and its counterparts is statistically significant. Our statistical test is carried out with the two-tailed Bonferroni-Dunn test, which is used when all classifiers are compared to each other. To begin the statistical test, we rank the consolidated classifiers based on their numerical results in Tables 4-8; their rankings are reported in Table 9. From Table 9, the RIVMcSFNN outperforms the other algorithms in the six evaluation criteria. The next step is to inspect whether the RIVMcSFNN's numerical results are

significantly better than the others from a statistical viewpoint. Because the RIVMcSFNN is the only method with the what-to-learn component, and only the RIVMcSFNN and GENEFIS are fitted with the online feature selection component, we exclude the number of samples and the number of input features from our statistical test. To ensure a fair comparison, the RIVMcSFNN is merely benchmarked against the evolving algorithms: PANFIS, GENEFIS, eTS, Simp_eTS, BARTFIS, and eT2Class. The difference in performance between a pair of algorithms is statistically significant when it differs by at least the critical difference (CD):

CD  q

k (k  1) 6N

(63)

where k, N respectively stand for the number of consolidated algorithms, k = 7, and the number of case studies, N = 5, while the critical value $q_\alpha$ is computed as the Studentized range statistic divided by $\sqrt{2}$. For a confidence level $\alpha = 0.1$, $q_\alpha$ is 2.241. This accordingly results in CD = 2.14. We do not directly compute the difference in performance of two algorithms from their rankings in Table 9, but rather use the test statistic as follows:

$$z = \frac{R_i - R_j}{\sqrt{\frac{k(k+1)}{6N}}} \qquad (64)$$

This formula is meant to obtain the corresponding probability from the table of the normal distribution, which is then compared with an appropriate α. This statistic allows multiple comparisons to be performed by compensating with α. Table 10 summarises the values of z for the multiple comparisons of the RIVMcSFNN against its counterparts. It can be observed in Table 10 that the RIVMcSFNN is significantly better than most of the other algorithms in all four criteria. In the realm of predictive accuracy, the performance of the RIVMcSFNN is statistically better than that of all of its counterparts, while the RIVMcSFNN beats eTS and BARTFIS in the number of rules. The RIVMcSFNN outperforms all algorithms but eTS from a statistical standpoint in the criterion of runtime. Although GENEFIS and Simp_eTS overcome the RIVMcSFNN in the realm of network parameters, their differences are minor. In fact, the RIVMcSFNN is statistically better than eT2Class, PANFIS and BARTFIS in the context of network parameters.

VI. Contribution of the RIVMcSFNN Versus Similar Approaches

This part of the paper elaborates the key aspects of the RIVMcSFNN that differ from the existing work. The RIVMcSFNN is contrasted with state-of-the-art EISs: SLFRWNN (Ganjefar et al., 2014), TRFNS (Juang et al., 1998), RSEFNN-LF (Juang et al., 2010), RSEIT2FNN (Juang et al., 2008), MRIT2NFS (Lin et al., 2013), ST2Class (Pratama et al., 2016), McFIS (Subramanian, 2014(a)), gClass (Pratama et al., 2015(b)), rClass (Pratama et al., 2015(a)), and SRIT2NFIS (Das, 2015). The consolidated algorithms are examined using five indicators: rule premise, rule consequent, network architecture, learning policy, and application. The learning features of the consolidated algorithms on the five benchmarked criteria are tabulated in Table 11. It is perceived from Table 11 that the RIVMcSFNN pioneers the metacognitive Scaffolding theory for the regression problem; although the metacognitive scaffolding learning concept can be found in ST2Class, rClass, and gClass, all of these are still limited to the classification problem. On the other hand, McFIS is still built on the metacognitive learning algorithm without Scaffolding theory to govern the how-to-learn process, which leads to over-dependency on pre- and/or post-training steps. As with the vast majority of metacognitive learning algorithms, the


McFIS is designed for the classification problem. To date, only the SRIT2NFIS has integrated the metacognitive learning idea for the regression problem, but it still excludes Scaffolding theory; it suffers from the absence of important learning modules, which may incur pre- and/or post-training steps. The RIVMcSFNN also proposes some novel learning algorithms which do not exist in the benchmarked algorithms: GT2DQ, T2RMI, SMBC, and T2GC. In the realm of the fuzzy rule type, the RIVMcSFNN puts forward a unique fuzzy rule, synergising the interval-valued multivariate Gaussian function with the nonlinear wavelet function. It is worth noting that ST2Class is constructed in the feedforward network architecture and is used for the classification problem only. Hence, the RIVMcSFNN can be regarded as a generalised version of ST2Class. Furthermore, the RIVMcSFNN adopts the double recurrent network architecture, which goes one step beyond the single-recurrent-link network topology implemented in SLFRWNN, TRFNS, RSEFNN-LF, RSEIT2FNN, and MRIT2NFS. Furthermore, RSEFNN-LF, RSEIT2FNN, and MRIT2NFS actualise the concept of online evolving learning, which only focusses on the how-to-learn issue, leaving the two other fundamental learning issues, namely what-to-learn and when-to-learn, uncharted, whereas SLFRWNN and TRFNS are offline in nature because of the presence of the genetic algorithm (GA) to adjust the network parameters. SLFRWNN, TRFNS, RSEFNN-LF, RSEIT2FNN, and MRIT2NFS are equipped with more traditional fuzzy rule types than the RIVMcSFNN, because they still rely on the univariate Gaussian function or the multivariate Gaussian function with a diagonal covariance matrix, triggering a less flexible cluster shape than the multivariate Gaussian function with a non-diagonal covariance matrix that the RIVMcSFNN has. In addition, the TSK rule consequents are implemented in SLFRWNN, TRFNS, RSEFNN-LF, RSEIT2FNN, and MRIT2NFS.

VII. Conclusion and Future Study

A novel metacognitive scaffolding learning algorithm, namely the recurrent interval-valued metacognitive scaffolding fuzzy neural network (RIVMcSFNN), is proposed in this paper, which aims to address three open research issues: uncertainty, temporal system dynamics, and unknown system order. It is worth mentioning that, although the metacognitive learning machine is well established for the classification problem, very few existing metacognitive learning systems are designed for the regression problem. Furthermore, the RIVMcSFNN puts into perspective a novel recurrent network architecture with double local feedback loops and generates the generalised interval type-2 fuzzy rule combining the interval-valued multivariate Gaussian function and the wavelet function. The RIVMcSFNN is equipped with four novel learning modules: GT2DQ, T2RMI, SMBC, and T2GC. Rigorous numerical studies and comparisons with prominent algorithms were undertaken to conclude the efficacy of the RIVMcSFNN, where the RIVMcSFNN outperformed its counterparts in four aspects: predictive accuracy, fuzzy rules, runtime, and training samples. Our future work will be devoted to developing a metacognitive ensemble to improve the bias and variance trade-off.

VIII. Appendix

VIII. Appendix

A) The Proof of the Interval-Valued Set

This section aims to verify the validity of the interval arithmetic of the RIVMcSFNN (Mazandarani and Najariyan, 2014(a); Mazandarani and Najariyan, 2014(b)). Because the logic of (16) has already been covered, we only present the proof of $\tilde{fo}^2 \in [\underline{fo}^2, \overline{fo}^2]$. The RIVMcSFNN utilises the interval-valued multivariate Gaussian function with uncertain means as follows:

$$\tilde{fo}^2 = \exp\left(-(FA_n^2 - \tilde{C}_i)\,\Sigma_i^{-1}\,(FA_n^2 - \tilde{C}_i)^T\right) \qquad (A.1)$$

where $\tilde{C}_i \in [\underline{C}_i, \overline{C}_i]$. The exponential function is monotonic, so $X \in [\underline{X}, \overline{X}]$ implies $\exp(X) \in [\exp(\underline{X}), \exp(\overline{X})]$. This means we can focus on the element inside the exponential function, because the exponential function does not affect the interval endpoints. Because the uncertainty occurs only in the means of the interval-valued Gaussian function, the inverse covariance matrix can be discounted. We need to check whether $(FA_n^2 - \underline{C}_i)^2 < (FA_n^2 - \overline{C}_i)^2$, which should be carried out in the one-dimensional space because the comparison may turn out differently per input variable. This requires the fuzzy set extraction, which makes it possible to obtain the matching degree per input dimension. The fuzzy set representation is constructed by finding the radii of the Gaussian fuzzy set in the one-dimensional space (3), while the centre of the Gaussian fuzzy set is akin to that in the multidimensional space. The matching degree per input attribute can be computed by referring to the degree of membership of the conventional interval type-2 Gaussian fuzzy set as follows:

$$\overline{fo}_{i,j}^{2.1} = \begin{cases} N(\underline{c}_j^i, \sigma_{i,j}; fa_j^2), & fa_j^2 < \underline{c}_j^i \\ 1, & \underline{c}_j^i \le fa_j^2 \le \overline{c}_j^i \\ N(\overline{c}_j^i, \sigma_{i,j}; fa_j^2), & fa_j^2 > \overline{c}_j^i \end{cases} \qquad (A.2)$$

$$\underline{fo}_{i,j}^{2.1} = \begin{cases} N(\overline{c}_j^i, \sigma_{i,j}; fa_j^2), & fa_j^2 \le \dfrac{\underline{c}_j^i + \overline{c}_j^i}{2} \\ N(\underline{c}_j^i, \sigma_{i,j}; fa_j^2), & fa_j^2 > \dfrac{\underline{c}_j^i + \overline{c}_j^i}{2} \end{cases} \qquad (A.3)$$

This is followed by applying the product t-norm operator to fuse the membership degrees of each input variable as follows:

$$\overline{fo}_i^3 = \prod_{j=1}^{nu} \overline{fa}_{i,j}^3 = \prod_{j=1}^{nu} \overline{fo}_{i,j}^{2.1}, \qquad \underline{fo}_i^3 = \prod_{j=1}^{nu} \underline{fa}_{i,j}^3 = \prod_{j=1}^{nu} \underline{fo}_{i,j}^{2.1} \qquad (A.4)$$

It is worth noting that the product t-norm operator does not change the endpoints of the intervals $[\underline{fo}_{i,j}^{2.1}, \overline{fo}_{i,j}^{2.1}]$ and $[\underline{fo}_i^3, \overline{fo}_i^3]$. This concludes the proof of $\tilde{fo}^2 \in [\underline{fo}^2, \overline{fo}^2]$.
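As a numeric sanity check of (A.2)-(A.4), the sketch below evaluates the upper and lower memberships of an interval type-2 Gaussian set with an uncertain mean and fuses them with the product t-norm; the function names, the standard Gaussian form and the test values are illustrative assumptions, not the paper's implementation.

# A numeric check that the product t-norm preserves the ordering
# lower <= upper in (A.2)-(A.4); the Gaussian form and values are assumptions.
import numpy as np

def gauss(c, sigma, x):
    return np.exp(-0.5 * ((x - c) / sigma) ** 2)

def it2_membership(c_lo, c_hi, sigma, x):
    """Upper/lower membership of an interval type-2 Gaussian fuzzy set with
    uncertain mean [c_lo, c_hi] and fixed width sigma, as in (A.2)-(A.3)."""
    if x < c_lo:
        upper = gauss(c_lo, sigma, x)   # left of the footprint of uncertainty
    elif x > c_hi:
        upper = gauss(c_hi, sigma, x)   # right of the footprint
    else:
        upper = 1.0                      # inside the footprint
    # the lower bound is generated by the farther of the two uncertain means
    lower = gauss(c_hi, sigma, x) if x <= 0.5 * (c_lo + c_hi) else gauss(c_lo, sigma, x)
    return lower, upper

degrees = [it2_membership(0.0, 0.5, 1.0, x) for x in (-0.3, 0.2, 1.1)]
lower_fire = float(np.prod([lo for lo, _ in degrees]))   # (A.4), product t-norm
upper_fire = float(np.prod([up for _, up in degrees]))
assert lower_fire <= upper_fire
print(lower_fire, upper_fire)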

B) Four Basic Interval Operations

This section describes the four basic operations of interval arithmetic, namely addition, subtraction, multiplication and division (Moore et al., 2009). Computing with intervals is equivalent to computing with sets: the result is the set containing the outcomes for all pairs of operand values. Hence, the four basic interval operations can be defined using the interval endpoints. We start our discussion by formulating the addition of two intervals.

Addition: Suppose that $a \in A$ where $\underline{A} \le a \le \overline{A}$, and $b \in B$ where $\underline{B} \le b \le \overline{B}$. When performing the addition $a + b \in A + B$, the following condition should be satisfied:

$$\underline{A} + \underline{B} \le a + b \le \overline{A} + \overline{B} \qquad (B.1)$$

This can then be expressed as follows:

$$A + B = [\underline{A} + \underline{B},\; \overline{A} + \overline{B}] \qquad (B.2)$$

Subtraction: As with addition, we can define $A - B$ in terms of the interval endpoints. Subtraction is formed by adding the following two inequalities:

$$\underline{A} \le a \le \overline{A} \quad \text{and} \quad -\overline{B} \le -b \le -\underline{B} \qquad (B.3)$$

$$\underline{A} - \overline{B} \le a - b \le \overline{A} - \underline{B} \qquad (B.4)$$

This leads to:

$$A - B = A + (-B) = [\underline{A} - \overline{B},\; \overline{A} - \underline{B}] \qquad (B.5)$$

It is worth noting that $-B = [-\overline{B}, -\underline{B}]$. In other words, we reverse the endpoints when we take the negative of an interval.

Multiplication: the product of two intervals A and B, denoted $A \cdot B$, can be defined in terms of the minimum and maximum of the four products of endpoints:

$$A \cdot B = [\min(Z), \max(Z)], \quad Z = \{\underline{A}\,\underline{B},\; \underline{A}\,\overline{B},\; \overline{A}\,\underline{B},\; \overline{A}\,\overline{B}\} \qquad (B.6)$$

By considering the signs of the endpoints, the endpoints of the interval product $[\underline{A \cdot B}, \overline{A \cdot B}]$ can be described by nine special cases. These cases are tabulated in Table 12.

Division: the quotient is described as the product of the first term with the inverse of the second term as follows:

$$\frac{A}{B} = A \cdot \left(\frac{1}{B}\right), \quad \frac{1}{B} = \left[\frac{1}{\overline{B}},\; \frac{1}{\underline{B}}\right] \qquad (B.7)$$

Note that we assume $0 \notin B$.
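As a compact illustration of (B.2), (B.5), (B.6) and (B.7), the sketch below implements the four endpoint formulas; the class name and operator choices are illustrative conveniences, not part of the RIVMcSFNN itself.

# A minimal sketch of the four interval operations using only their endpoints.
from dataclasses import dataclass

@dataclass
class Interval:
    lo: float
    hi: float

    def __add__(self, other):                  # (B.2): add matching endpoints
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __neg__(self):                         # negation reverses the endpoints
        return Interval(-self.hi, -self.lo)

    def __sub__(self, other):                  # (B.5): A - B = A + (-B)
        return self + (-other)

    def __mul__(self, other):                  # (B.6): min/max of four products
        z = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(min(z), max(z))

    def __truediv__(self, other):              # (B.7), assuming 0 is not in B
        if other.lo <= 0.0 <= other.hi:
            raise ZeroDivisionError("divisor interval contains zero")
        return self * Interval(1.0 / other.hi, 1.0 / other.lo)

a, b = Interval(1.0, 2.0), Interval(-3.0, 4.0)
print(a + b, a - b, a * b, a / Interval(2.0, 4.0))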

Acknowledgements

This project is fully supported by the La Trobe University start-up grant. The second author acknowledges the support of the Austrian COMET K2 programme of the Linz Center of Mechatronics (LCM), funded by the Austrian federal government and the federal state of Upper Austria.

References

Abiyev, R.H., Kaynak, O. (2008). Fuzzy Wavelet Neural Networks for Identification and Control of Dynamic Plants - A Novel Structure and a Comparative Study. IEEE Transactions on Industrial Electronics, 55(8), 3133-3140.
Abiyev, R.H., Kaynak, O. (2010). Type-2 fuzzy neural structure for identification and control of time-varying plants. IEEE Transactions on Industrial Electronics, 57(12), 4147-4159.
Abiyev, R.H., Kaynak, O., Kayacan, E. (2012). A type-2 fuzzy wavelet neural network for system identification and control. Journal of the Franklin Institute, 550, 1658-1685.
Angelov, P.P., Filev, D. (2004). An approach to online identification of Takagi-Sugeno fuzzy models. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 34, 484-498.
Angelov, P., Filev, D. (2005). Simpl_eTS: A simplified method for learning evolving Takagi-Sugeno fuzzy models. In IEEE International Conference on Fuzzy Systems (FUZZ), 1068-1073.
Angelov, P. (2010). Evolving Takagi-Sugeno Fuzzy Systems from Data Streams (eTS+). In Evolving Intelligent Systems: Methodology and Applications (Angelov, P., Filev, D., Kasabov, N., Eds.), John Wiley and Sons, IEEE Press Series on Computational Intelligence, pp. 21-50, ISBN: 978-0-470-28719-4.
Angelov, P.P. (2011). Fuzzily connected multi-model systems evolving autonomously from data streams. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 41(4), 898-910.
Angelov, P., Yager, R.R. (2012). A new type of simplified fuzzy rule-based system. International Journal of General Systems, 41(2), 163-185.
Bustince, H., Fernandez, J., Hagras, H., Herrera, F. (2015). Interval Type-2 Fuzzy Sets are generalization of Interval-Valued Fuzzy Sets: Towards a Wider View on their Relationship. IEEE Transactions on Fuzzy Systems, doi: 10.1109/TFUZZ.2014.2362149, in press.
Bouchachia, A., Vanaret, C. (2014). GT2FC: An Online Growing Interval Type-2 Self-Learning Fuzzy Classifier. IEEE Transactions on Fuzzy Systems, 22(4), 999-1018.
Bortman, M., Aladjem, M. (2009). A growing and pruning method for radial basis function networks. IEEE Transactions on Neural Networks, 20(6), 1039-1045.


Bose, R.P.J.C., van der Aalst, W.M.P., Zliobaite, I., Pechenizkiy, M. (2014). Dealing with concept drifts in process mining. IEEE Transactions on Neural Networks and Learning Systems, 25(1), 154-171.
Das, A.K., Subramanian, K., Suresh, S. (2015). An Evolving Interval Type-2 Neurofuzzy Inference System and Its Metacognitive Sequential Learning Algorithm. IEEE Transactions on Fuzzy Systems, 23(6), 2080-2093.
Ditzler, G., Polikar, R. (2012). Incremental learning of concept drift from streaming imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 25(10), 2283-2301.
Elwell, R., Polikar, R. (2011). Incremental learning of concept drift in non-stationary environments. IEEE Transactions on Neural Networks, 22(10), 1517-1531.
Flavell, J.H. (1996). Piaget's legacy. Psychological Science, 7(4), 200-203.
Ganjefar, S., Tofighi, M. (2014). Single-hidden-layer fuzzy recurrent wavelet neural network: Applications to function approximation and system identification. Information Sciences, online and in press.
Gan, H., et al. (2014). Nonlinear Systems Modeling Based on Self-Organizing Fuzzy-Neural-Network with Adaptive Computation Algorithm. IEEE Transactions on Cybernetics, 44(4), 554-564.
Hajmohammadi, M.S., Ibrahim, R., Selamat, A., Fujita, H. (2015). Combination of active learning and self-training for cross-lingual sentiment classification with density analysis of unlabelled samples. Information Sciences, 317, 67-77.
Huang, G-B., Saratchandran, P., Sundararajan, N. (2004). An efficient sequential learning algorithm for growing and pruning RBF (GAP-RBF) networks. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 34, 2284-2292.
Huang, G-B., Saratchandran, P., Sundararajan, N. (2005). A generalized growing and pruning RBF (GGAP-RBF) neural network for function approximation. IEEE Transactions on Neural Networks, 16, 57-67.
Isaacson, R., Fujita, F. (2006). Metacognitive knowledge monitoring and self-regulated learning: Academic success and reflection on learning. Journal of the Scholarship of Teaching and Learning, 6(1), 39-55.
Jang, J-S.R. (1993). ANFIS: Adaptive-network-based fuzzy inference system. IEEE Transactions on Systems, Man, and Cybernetics, 23, 665-684.
Juang, C.F., Lin, C.T. (1999). A recurrent self-organizing neural fuzzy inference network. IEEE Transactions on Neural Networks, 10, 828-845.
Juang, C.F., Tsao, Y.W. (2008). A self-evolving interval type-2 fuzzy neural network with online structure and parameter learning. IEEE Transactions on Fuzzy Systems, 16(6), 1411-1424.
Juang, C.F., Lin, Y.Y., Tu, C.C. (2010). A recurrent self-evolving fuzzy neural network with local feedbacks and its application to dynamic system processing. Fuzzy Sets and Systems, 161(19), 2552-2568.
Juang, C.F., Chen, C.Y. (2013). Data-driven interval type-2 neural fuzzy system with high learning accuracy and improved model interpretability. IEEE Transactions on Cybernetics, 43(6), 1781-1795.
Joysula, D.P., Vadali, H., Donahue, B.J., Hughes, F.C. (2009). Modeling metacognition for learning in artificial systems. In Proceedings of the World Congress on Nature and Biologically Inspired Computing, 1419-1424.
Lemos, A., Caminhas, W., Gomide, F. (2013). Adaptive fault detection and diagnosis using an evolving fuzzy classifier. Information Sciences, 220, 64-85.
Lewis, D., Catlett, J. (1994). Heterogeneous uncertainty sampling for supervised learning. In Proceedings of the International Conference on Machine Learning, 148-156.
Lin, Y-Y., Chang, J-Y., Lin, C.T. (2012).
Identification and Prediction of Dynamic Systems Using an Interactively Recurrent Self-Evolving Fuzzy Neural Network. IEEE Transactions on Neural Networks and Learning Systems, 24(2), 310-321.
Lin, Y-Y., Chang, J-Y., Pal, N.R., Lin, C.T. (2013). A Mutually Recurrent Interval Type-2 Neural Fuzzy System (MRIT2NFS) With Self-Evolving Structure and Parameters. IEEE Transactions on Fuzzy Systems, 21(3), 492-509.
Lin, Y.Y., Chang, J.Y., Lin, C.T. (2014(a)). A TSK-Type-Based Self-Evolving Compensatory Interval Type-2 Fuzzy Neural Network (TSCIT2FNN) and its applications. IEEE Transactions on Industrial Electronics, 61(1), 447-459.
Lin, Y.Y., Liao, S.H., Chang, J.Y., Lin, C.T. (2014(b)). Simplified Interval Type-2 Fuzzy Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, 25(5), 959-969.
Liang, Q., Mendel, J.M. (2000). Interval Type-2 Fuzzy Logic Systems: Theory and Design. IEEE Transactions on Fuzzy Systems, 8(5), 535-550.
Lima, E., Hell, M., Ballini, R., Gomide, F. (2010). Evolving fuzzy modelling using participatory learning. In Evolving Intelligent Systems: Methodology and Applications (Angelov, P., Filev, D., Kasabov, N., Eds.), New York: John Wiley & Sons, pp. 67-86.
Lughofer, E. (2008). FLEXFIS: A robust incremental learning approach for evolving Takagi-Sugeno fuzzy models. IEEE Transactions on Fuzzy Systems, 16(6), 1393-1410.
Lughofer, E. (2011(a)). On-line incremental feature weighting in evolving fuzzy classifiers. Fuzzy Sets and Systems, 163(1), 1-23.
Lughofer, E. (2011(b)). Evolving Fuzzy Systems - Methodologies, Advanced Concepts and Applications. Springer, Heidelberg.
Lughofer, E., Macian, V., Guardiola, C., Klement, E.P. (2011(c)). Identifying Static and Dynamic Prediction Models for NOx Emissions with Evolving Fuzzy Systems. Applied Soft Computing, 11.
Lughofer, E., Trawinski, B., Trawinski, K., Kempa, O., Lasota, T. (2011(d)). On Employing Fuzzy Modeling Algorithms for the Valuation of Residential Premises. Information Sciences, 181, 5123-5142.
Lughofer, E., Bouchot, J.-L., Shaker, A. (2011(e)). On-line Elimination of Local Redundancies in Evolving Fuzzy Systems. Evolving Systems, 2(3), 165-187.
Lughofer, E. (2012). Flexible Evolving Fuzzy Inference Systems from Data Streams (FLEXFIS++). In Learning in Non-Stationary Environments: Methods and Applications (Sayed-Mouchaweh, M., Lughofer, E., Eds.), Springer, New York, pp. 205-246.
Lughofer, E. (2013). On-line Assurance of Interpretability Criteria in Evolving Fuzzy Systems - Achievements, New Concepts and Open Issues. Information Sciences, 251, 22-46.
Lughofer, E., Sayed-Mouchaweh, M. (2015(a)). Autonomous Data Stream Clustering implementing Incremental Split-and-Merge Techniques - Towards a Plug-and-Play Approach. Information Sciences, 204, 54-79.
Lughofer, E., Cernuda, C., Kindermann, S., Pratama, M. (2015(b)). Generalized Smart Evolving Fuzzy Systems. Evolving Systems, 6(4), 54-79.
Mitra, P., Murthy, C.A., Pal, S.K. (2002). Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 301-312.
Mazandarani, M., Najariyan, M. (2014(a)). Differentiability of type-2 fuzzy number-valued functions. Communications in Nonlinear Science and Numerical Simulation, 19, 710-725.
Mazandarani, M., Najariyan, M. (2014(b)). Type-2 Fuzzy Fractional Derivatives. Communications in Nonlinear Science and Numerical Simulation, 19, 2354-2359.
Nelson, T.O., Narens, L. (1990). Metamemory: A theoretical framework and new findings. Psychology of Learning and Motivation, 26, 125-173.


Oentaryo, R.J., Er, M.J., Linn, S., Li, X. (2014). Online probabilistic learning for fuzzy inference systems. Expert Systems with Applications, 41(11), 5082-5096.
Patra, J.C., Kot, A.C. (2002). Nonlinear dynamic system identification using Chebyshev functional link artificial neural networks. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 32(4), 505-511.
Pratama, M., Anavatti, S., Angelov, P.P., Lughofer, E. (2014(a)). PANFIS: A novel incremental learning machine. IEEE Transactions on Neural Networks and Learning Systems, 25(1), 55-68.
Pratama, M., Anavatti, S.G., Lughofer, E. (2014(b)). GENEFIS: towards an effective localist network. IEEE Transactions on Fuzzy Systems, 22(3), 547-562.
Pratama, M., Er, M-J., Anavatti, S.G., Lughofer, E., Wang, N., Arifin, I. (2014(c)). A novel meta-cognitive-based scaffolding classifier to sequential non-stationary classification problems. In Proceedings of the 2014 International Conference on Fuzzy Systems, 369-376.
Pratama, M., Anavatti, S., Lu, J. (2015(a)). Recurrent classifier based on an incremental meta-cognitive scaffolding algorithm. IEEE Transactions on Fuzzy Systems, 23(6), 2048-2066.
Pratama, M., Lu, J., Anavatti, S., Lughofer, E., Lim, C-P. (2015(b)). An incremental meta-cognitive-based scaffolding fuzzy neural network. Neurocomputing, 171, 89-105.
Pratama, M., Lu, J., Zhang, G., Anavatti, S. (2015(c)). Evolving Type-2 Fuzzy Classifier. IEEE Transactions on Fuzzy Systems, in press (10.1109/TFUZZ.2015.2463732).
Pratama, M., Lu, J., Lughofer, E., Zhang, G., Anavatti, S. (2016). Scaffolding type-2 classifier for incremental learning under concept drifts. Neurocomputing, in press (10.1016/j.neucom.2016.01.049).
Rong, H-J., Sundararajan, N., Huang, G-B., Saratchandran, P. (2006). Sequential Adaptive Fuzzy Inference System (SAFIS) for Nonlinear System Identification and Time Series Prediction. Fuzzy Sets and Systems, 157(9), 1260-1275.
Rong, H-J., Sundararajan, N., Huang, G-B., Zhao, G-S. (2011). Extended Sequential Adaptive Fuzzy Inference System for Classification Problems. Evolving Systems, 2(2), 71-82.
Savitha, R., Suresh, S., Sundararajan, N. (2012). Metacognitive Learning in a Fully Complex-Valued Radial Basis Function Neural Network. Neural Computation, 24(5), 1297-1328.
Settles, B. (2010). Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison.
Subramanian, K., Suresh, S., Sundararajan, N. (2014(a)). A Meta-Cognitive Neuro-Fuzzy Inference System (McFIS) for sequential classification systems. IEEE Transactions on Fuzzy Systems, 21(6), 1080-1095.
Subramanian, K., Das, A.K., Suresh, S., Savitha, R. (2014(b)). A meta-cognitive interval type-2 fuzzy inference system and its projection based learning algorithm. Evolving Systems, 5(4), 219-230.
Suresh, S., Dong, K., Kim, H. (2010). A sequential learning algorithm for self-adaptive resource allocation network classifier. Neurocomputing, 73(16), 3012-3019.
Stone, M. (1974). Cross-Validatory Choice and Assessment of Statistical Predictions. Journal of the Royal Statistical Society, 36, 111-147.
Tabata, K., Kudo, M.S.M. (2010). Data compression by volume prototypes for streaming data. Pattern Recognition, 43(9), 3162-3176.
Tung, S.W., Quek, C., Guan, C. (2013). eT2FIS: An Evolving Type-2 Neural Fuzzy Inference System. Information Sciences, 220, 124-148.
Vukovic, N., Miljkovic, Z. (2013). A growing and pruning sequential learning algorithm of hyper basis function neural network for function approximation. Neural Networks, 46, 210-226.
Vygotsky, L.S. (1978).
Mind and Society: The Development of Higher Psychological Processes. Cambridge, U.K.: Harvard University Press.
Vigdor, B., Lerner, B. (2007). The Bayesian ARTMAP. IEEE Transactions on Neural Networks, 18(6), 1628-1644.
Wang, N., Er, M-J., Meng, X. (2009). Fast and Accurate Self Organizing Scheme for Parsimonious Fuzzy Neural Network. Neurocomputing, 72.
Wood, D. (2001). Scaffolding, contingent tutoring and computer-based learning. International Journal of Artificial Intelligence in Education, 12(3), 280-292.
Wu, S-Q., Er, M-J. (2000). Dynamic fuzzy neural networks - a novel approach to function approximation. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 30, 358-364.
Wu, S-Q., Er, M-J., Gao, Y. (2003). A fast approach for automatic generation of fuzzy rules by generalized dynamic fuzzy neural networks. IEEE Transactions on Fuzzy Systems, 9(4), 578-594.
Xu, Y., Wong, K.W., Leung, C.S. (2006). Generalized Recursive Least Square to the Training of Neural Networks. IEEE Transactions on Neural Networks, 17(1).
Xiong, S., Azimi, J., Fern, X.Z. (2014). Active Learning of Constraints for Semi-Supervised Clustering. IEEE Transactions on Knowledge and Data Engineering, 26(1), 43-54.
Yager, R.R., Filev, D.P. (1994). Approximate clustering via the mountain method. IEEE Transactions on Systems, Man and Cybernetics, 24(8), 1279-1284.
Yu, L., Liu, H. (2004). Efficient Feature Selection via Analysis of Relevance and Redundancy. Journal of Machine Learning Research, 5, 1205-1224.
Zadeh, L.A. (1975). The concept of a linguistic variable and its application to approximate reasoning. Information Sciences, 8(3), 199-249.
Zliobaite, I., Bifet, A., Pfahringer, B., Holmes, G. (2014). Active Learning with Drifting Streaming Data. IEEE Transactions on Neural Networks and Learning Systems, 25(1), 27-39.


Algorithm 1: Pseudocode of the RIVMcSFNN

Define: input attributes and desired outputs: (X_n, T_n) = (x_1, ..., x_u, t_1, ..., t_m)
Reserved samples: (XS_n, TS_n) = (xs_{1,n}, ..., xs_{u,n}, ts_1, ..., ts_m)
Predefined thresholds: threshold_1 = 1.1, threshold_2 = 0.9, threshold_3 = 0.8, s = 0.01, threshold_4 = 0.5, threshold_5 = exp(.)

/* Phase 4.1: What-to-Learn Strategy */
For i = 1 to P do
    Compute the probability of the training sample belonging to existing clusters (17)
End For
Compute the sample entropy (21)
IF (22) Then accept the data stream for the how-to-learn phase

/* Phase 4.2: How-to-Learn Strategy */
/* Passive Scaffolding Theory */
/* Phase 4.2.A: Rule Growing Process - the problematizing phase */
For i = 1 to P do /* measuring data significance */
    Compute the GT2DQ method (26) and the spatial firing strength (29) for all rules
End For
Compute the GT2DQ method for a hypothetical rule (26)
For i = 1 to P* do update the MCI for the P* rules (35) End For
IF (28) and (29) Then /* add a new rule */
    Add the hypothetical rule as a new rule (27), (30)
Else IF (59) Then
    Append the reserved samples with the current sample: (XS_{NS+1}, TS_{NS+1}) = (X_N, T_N), N_{P+1} = 1
Else /* fine-tuning phase of the winning rule */
    Update the premise parameters of the winning rule (30)-(34) and increase the population of the winning rule: N_win = N_win + 1
End IF
/* Phase 4.2.C: Rule Forgetting Strategy - the fading phase */
For i = 1 to P do
    Enumerate the T2RMI method (35)-(37)
    IF (48) Then deactivate the fuzzy rule subject to the rule recall mechanism, P* = P* + 1 End IF
    /* Phase 4.2.B: Rule Pruning Strategy - the fading phase */
    IF (34) Then prune the fuzzy rule End IF
End For
/* Phase 4.2.E: Rule Merging Strategy - the fading phase */
For i = 1 to P do
    For z = 1 to P do
        Compute the overlapping degree (43) and the homogeneity criterion (44)
    End For
End For
IF (43), (44) Then coalesce the fuzzy rules (45), (46) End IF
/* Phase 4.2.D: Rule Recall Mechanism - the problematizing phase */
IF (39) Then recall the previously pruned rule and add it as a new rule (40) End IF
/* Phase 4.2.F: Online Feature Selection Mechanism - the complexity reduction phase */
For j = 1 to u do
    For o = 1 to m do
        Compute the symmetrical uncertainty SU(x_j, t_o)
        IF SU(x_j, t_o) < threshold Then prune the j-th feature
        Else append the j-th feature as a relevant variable x_re, N_re = N_re + 1 End IF
    End For
End For
For j = 1 to N_re do
    For re = 1 to N_re, j ≠ re, do
        Analyse the redundancy of the input attributes SU(x_j, x_re)
        IF SU(x_j, T) ≤ SU(x_j, x_re) Then prune the re-th input variable End IF
    End For
End For
/* Passive Scaffolding Theory */
/* Phase 4.2.G: Rule Consequent Adaptation */
For i = 1 to P do
    Adjust the fuzzy rule consequents using the FWGRLS method
    /* Phase 4.2.H: Adaptation of Wavelet Parameters, Recurrent Weights and Design Factors */
    For j = 1 to u do adjust the dilation and translation parameters (49)-(52) End For
    For o = 1 to m do fine-tune the design factors (48) End For
End For
End IF (22)
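To make the what-to-learn gate of Phase 4.1 concrete, the toy sketch below admits a sample only when its entropy with respect to the existing clusters exceeds a threshold; the Gaussian cluster model, the threshold value and the function names are illustrative assumptions, not the paper's equations (17), (21) and (22).

# A self-contained toy of the what-to-learn gate: compute cluster-membership
# probabilities, their entropy, and admit only sufficiently novel samples.
import numpy as np

def cluster_probs(x, centers, sigma=1.0):
    d2 = ((centers - x) ** 2).sum(axis=1)       # squared distance to each cluster
    w = np.exp(-d2 / (2 * sigma ** 2))          # Gaussian similarity (assumed)
    return w / w.sum()

def sample_entropy(x, centers):
    p = cluster_probs(x, centers)
    return -(p * np.log(p + 1e-12)).sum()

centers = np.array([[0.0, 0.0], [3.0, 3.0]])
for x in (np.array([0.1, 0.0]), np.array([1.5, 1.5])):
    h = sample_entropy(x, centers)
    print(x, "entropy:", round(h, 3), "-> learn" if h > 0.3 else "-> skip")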


Table 1. Classification of state-of-the-art algorithms

References | Working Principle | Structure | Hidden Node
Juang and Lin (1998) | Evolving | Feedforward | Type-1 spherical rule
Angelov and Filev (2004) | Evolving | Feedforward | Type-1 spherical rule
Angelov and Filev (2005) | Evolving | Feedforward | Type-1 spherical rule
Rong et al. (2006) | Evolving | Feedforward | Type-1 spherical rule
Angelov (2010) | Evolving | Feedforward | Type-1 axis-parallel rule
Angelov (2011) | Evolving | Feedforward | Type-1 axis-parallel rule
Lemos et al. (2011) | Evolving | Feedforward | Type-1 non-axis-parallel rule
Angelov and Yager (2011) | Evolving | Feedforward | Type-1 cloud-based rule
Pratama et al. (2014(a)) | Evolving | Feedforward | Type-1 non-axis-parallel rule
Pratama et al. (2014(b)) | Evolving | Feedforward | Type-1 non-axis-parallel rule
Lughofer et al. (2015(b)) | Evolving | Feedforward | Type-1 non-axis-parallel rule
Juang and Lin (1999) | Evolving | Global Recurrent | Type-1 spherical rule
Juang (2002) | Evolutionary | Global Recurrent | Type-1 spherical rule
Juang (2010) | Evolving | Local Recurrent | Type-1 spherical rule
Lin et al. (2013) | Evolving | Interactive | Type-1 spherical rule
Juang et al. (2008) | Evolving | Feedforward | Type-2 spherical rule
Juang et al. (2009) | Evolving | Local Recurrent | Type-2 spherical rule
Juang et al. (2013) | Evolving | Feedforward | Type-2 spherical rule
Tung et al. (2014) | Evolving | Feedforward | Type-2 spherical rule
Bouchachia and Vanaret (2014) | Evolving | Feedforward | Type-2 non-axis-parallel rule
Lin et al. (2014(a)) | Evolving | Feedforward | Type-2 compensatory axis-parallel rule
Lin et al. (2014(b)) | Evolving | Feedforward | Type-2 axis-parallel rule
Suresh et al. (2010) | Metacognitive | Feedforward | Type-1 spherical rule
Savitha et al. (2012) | Metacognitive | Feedforward | Type-1 spherical rule
Subramanian et al. (2014(a)) | Metacognitive | Feedforward | Type-1 spherical rule
Subramanian et al. (2014(b)) | Metacognitive | Feedforward | Type-2 spherical rule
Babu and Suresh (2013) | Metacognitive | Feedforward | Type-1 spherical rule
Pratama et al. (2015(a)) | Metacognitive scaffolding | Feedforward | Type-1 spherical rule
Pratama et al. (2015(b)) | Metacognitive scaffolding | Feedforward | Type-2 spherical rule

Table 2. Sensitivity analysis of pre-recorded samples

No. of samples | Rule | Runtime | NDEI
10 | 2 | 0.296 | 0.288
30 | 2 | 0.26 | 0.288
50 | 2 | 0.25 | 0.288
80 | 2 | 0.268 | 0.288
100 | 2 | 0.328 | 0.288

Table 3. Learning performance of each learning configuration

Configuration | Rule | Runtime | NDEI | Samples
A | 2 | 0.5 | 0.29 | 98
B | 2 | 1.5 | 0.28 | 200
C | 4 | 0.98 | 0.6 | 98
D | 2 | 0.68 | 0.3 | 98
F | 2 | 0.3 | 0.288 | 98

Table 4. Tool condition monitoring problem of a complex manufacturing process

Model | Type | RMSE | Rule | Input | Runtime | Sample | Parameters
RIVMcSFNN | S-2-R | 0.04±0.01 | 2 | 12 | 0.7±0.11 | 516.5 | 432
RIVMcSFNN + Section II.F | S-2-R | 0.07±0.02 | 2 | 9 | 0.4±0.06 | 516.5 | 270
eT2Class | S-2-R | 0.05±0.01 | 5.9±0.1 | 12 | 1.1±0.7 | 572 | 2065
PANFIS | S-1-F | 0.045±0.01 | 2 | 12 | 0.98±0.1 | 572 | 636.7
GENEFIS | S-1-F | 0.0418±0.008 | 4.1±0.9 | 11.1±0.3 | 1.12±0.1 | 572 | 636.7
Simp_eTS | S-1-F | 0.26±0.08 | 5 | 12 | 1.4±0.3 | 572 | 137
eTS | S-1-F | 0.046±0.01 | 5.1±0.3 | 12 | 1.3±0.02 | 572 | 139.5
BARTFIS | S-1-F | 0.0632±0.01 | 20.6±4 | 12 | 1.43±0.0 | 572 | 762.2
FAOSPFNN | B-1-F | 0.27±0.003 | 12.7±0.7 | 12 | 1.91±0.1 | 572 | 902.2
DFNN | B-1-F | 0.12±0.1 | 101.9±2 | 12 | 24.3±2.4 | 572 | 4444.2
GDFNN | B-1-F | 0.05±0.06 | 5.3±1.15 | 12 | 7.25±1.5 | 572 | 709.2
ANFIS | B-1-F | 0.05±0.01 | 11 | 12 | 183.05±5 | 572 | 979
S: Sequential, B: Batch, 1: Type-1, 2: Type-2, R: Recurrent, F: Feedforward


Table 5. S&P 500 Index Time Series Model Type NDEI Rule Input Runtime Sample Parameters RIVMcSFNN S-2-R 5 6.22 11493 110 0.01 2 RIVMcSFNN + Section II.F S-2-R 54 0.01 2 3 2.96 7448 eT2Class S-2-F 0.05 5 5 21.6 14893 385 PANFIS S-1-F 0.09 4 5 55.3 14893 144 GENEFIS S-1-F 0.07 5 48.3 14893 72 2 Simp_eTS S-1-F 0.04 7 5 158.00 14893 39 eTS S-1-F 0.04 14 5 89.9 14893 75 BARTFIS S-1-F 0.02 8 5 12.3 14893 128 DFNN B-1-F 0.06 5 5 548.5 14893 14953 GDFNN B-1-F 0.07 4 5 951.4 14893 14957 FAOSPFNN B-1-F 0.07 13 5 159.8 14893 14984 ANFIS B-1-F 0.02 32 5 384.9 14893 15115 S: Sequential, B: Batch, 1: type-1, 2: Type 2, R: Recurrent, F: Feed-forward Table 6. Prediction of Nox emission Model Type RMSE Rule Input Runtime Samples Parameters RIVMcSFNN (A) S-2-R 146 44.9 44384 0.04 2 357 RIVMcSFNN (B) S-2-R 0.09 12 491 432 2 1.32 eT2Class S-2-F 0.045 170 17.98 667 117304 2 PANFIS S-1-F 0.052 5 170 3.37 667 146205 GENEFIS S-1-F 0.048 1.41 667 2 2 18 Simp_eTS S-1-F 0.14 5 170 5.5 667 1876 BARTFIS S-1-F 0.11 4 4 2.55 667 52 DFNN B-1-F 0.18 548 170 4332.9 667 280865 GDFNN B-1-F 0.48 215 170 2144.1 667 109865 eTS S-1-F 0.38 27 170 1098.4 667 13797 FAOS-PFNN B-1-F 0.06 6 170 14.8 667 2883 ANFIS B-1-F 0.15 170 100.41 667 17178 2 S: Sequential, B: Batch, 1: type-1, 2: Type 2, R: Recurrent, F: Feed-forward Table 7. SISO dynamic system identification Model Type RMSE Rule Input Runtime Sample Parameters RIVMcSFNN S-2-R 0.001 2 2 0.34 598 32 eT2Class S-2-F 0.01 4 2 1.1 900 80 Simp_eTS S-1-F 0.08 4 2 1.87 900 84 eTS S-1-F 0.08 8 2 1.88 900 48 BARTFIS S-1-F 0.2 27 2 1.14 900 189 PANFIS S-1-F 0.08 4 2 1.2 900 36 GENEFIS S-1-F 0.07 4 2 1.2 900 36 DFNN B-1-F 0.06 5 2 1.5 900 940 GDFNN B-1-F 0.35 3 2 1.88 900 921 FAOS-PFNN B-1-F 0.38 139 2 60.95 900 1734 ANFIS B-1-F 0.41 5 2 1.3 900 935 *: the result was obtained under different computer environment, S: Sequential, B: Batch, 1: type-1, 2: Type 2, R: Recurrent, F: Feed-forward Table 8. Appraisal of Residential Premise Price ALGORITHMS Type MSE RULE INPUT RUNTIME SAMPLE PARAMETERS RIVMcSFNN S-2-R 0.01±0.002 4.7±0.5 108.83 2 0.1±0.03 2298.5 GENEFIS S-1-F 0.0119±0.0011 5.33±3.14 2.83±1.17 0.3±0.2 2417.7 78.18 PANFIS S-1-F 0.0159±0.0051 6.33±4.14 5 0.25±0.04 2417.7 227.9 eTS S-1-F 0.0497±0.0682 11±1.89 5 0.22±0.02 2417.7 126 Simp_eTS S-1-F 0.03±0.009 3.3±1 5 0.27±0.05 2417.7 41.7 BARTFIS S-1-F 0.0584±0.0223 10.2±0.1 5 0.21±0.09 2417.7 184 ANFIS S-1-F 0.0250±0.0235 32 5 376.5±0.1 2417.7 50512 DFNN B-1-F 0.004±0.07 33.1±0.01 5 543.2±1.1 2417.7 2869.1 GDFNN B-1-F 9.67±1 5 67.8±15.5 2417.7 2529 0.001±0.06 FAOS PFNN B-1-F 0.04±0.03 11.5±0.55 5 0.33±0.05 2417.7 126 eT2Class S-1-F 0.03±0.009 5 4.4±1.3 2417.7 72 2 S: Sequential, B: Batch, 1: type-1, 2: Type 2, R: Recurrent, F: Feed-forward


Table 9. Ranking of consolidated classifiers (each tuple is (A, NR, NI, RT, NS, NP))

Algorithms | Tool Condition Monitoring | S&P 500 time series | Nox emission | SISO dynamic | Residential Premise Price | Average
RIVMcSFNN | (1,1,1,1,1,3) | (1,1,1,1,1,2) | (1,1,3,1,1,3) | (1,1,1,1,1,1) | (3,1,2,1,1,4) | (1.4,1,1.6,1,1,2.6)
GENEFIS | (2,2,3,5,2,3) | (5,1,2,3,2,3) | (3,1,1,2,2,1) | (4,3,1,3,2,3) | (4,3,1,6,2,3) | (3.5,1.8,1.8,4,2,2.5)
PANFIS | (3,1,3,2,2,3) | (7,2,2,4,2,6) | (4,1,3,4,2,9) | (4,3,1,4,2,2) | (5,4,3,4,2,6) | (4.6,2.2,2.4,3.6,2,5.2)
eTS | (4,3,3,6,2,2) | (3,6,2,7,2,4) | (10,5,3,8,2,5) | (5,5,1,6,2,3) | (8,6,3,5,2,5) | (6,5,2.4,6.4,2,3.8)
Simp_eTS | (8,3,3,6,2,1) | (3,4,2,7,2,1) | (8,3,3,5,2,4) | (5,3,1,5,2,5) | (7,2,3,5,2,1) | (6.2,3,2.4,5.6,2,2.4)
BARTFIS | (9,9,8,8,2,6) | (2,5,2,3,2,5) | (6,2,2,3,2,2) | (6,5,1,3,2,6) | (9,5,3,2,2,5) | (6.4,5.2,3.2,3.8,2,5)
ANFIS | (5,7,3,11,2,9) | (2,8,2,9,3,11) | (8,1,3,8,2,6) | (11,4,1,2,9,8) | (6,10,3,10,2,11) | (6.4,6,2.4,8,3.6,9)
DFNN | (5,9,3,10,2,7) | (2,5,2,9,2,8) | (9,7,3,11,2,11) | (3,4,1,5,2,10) | (2,9,2,11,2,10) | (4.2,6.8,2.2,9.2,2,9.2)
GDFNN | (5,5,3,9,2,7) | (6,2,2,11,2,9) | (11,6,4,10,2,8) | (7,2,1,6,2,7) | (1,5,2,9,2,9) | (6,2,2.4,9,2,8)
FAOS PFNN | (11,6,3,9,2,8) | (6,6,2,9,2,10) | (5,4,4,6,2,4) | (7,7,1,8,2,10) | (8,8,2,7,2,5) | (7.4,6.2,2.4,7.8,2,7.4)
eT2Class | (5,5,3,4,2,8) | (4,5,2,3,2,7) | (2,1,4,6,2,9) | (2,3,1,2,2,3) | (7,1,2,8,2,2) | (4,3,2.4,4.6,2,5.8)
A: Accuracy, NR: Number of Rules, NI: Number of Input Attributes, RT: Runtime, NS: Number of Samples, NP: Number of Parameters

Table 10. Values of z between the RIVMcSFNN and other algorithms

Algorithm | A | NR | RT | NP
GENEFIS | 2.1 | 0.8 | 3 | -0.1
PANFIS | 3.2 | 1.2 | 2.6 | 2.6
eTS | 4.6 | 4 | 1 | 1.2
Simp_eTS | 4.8 | 2 | 4.6 | -0.2
BARTFIS | 5 | 4.2 | 2.8 | 2.4
eT2Class | 2.6 | 1.4 | 3.6 | 3.2

Table 11. Summary of learning features of benchmarked algorithms

Algorithm | Hidden Layer | Output Layer | Network Architecture | Learning Policy | Application
ST2Class | Interval-valued multivariate Gaussian function | Nonlinear Wavelet | Feedforward | Online Metacognitive Scaffolding Learning | Classification
rClass | Multivariate type-1 Gaussian function | Chebyshev function | Single Local Recurrent | Online Metacognitive Scaffolding Learning | Classification
gClass | Multivariate type-1 Gaussian function | Chebyshev function | Feedforward | Online Metacognitive Scaffolding Learning | Classification
McFIS | Uni-variable type-1 Gaussian function | Type-1 first-order TSK | Feedforward | Online Metacognitive Learning | Classification
SLFRWNN | Uni-variable type-1 Gaussian function | Nonlinear Wavelet | Single Local Recurrent | Offline Hybrid Learning | Regression
TRFNS | Uni-variable type-1 Gaussian function | First-order TSK | Single Global Recurrent | Offline Hybrid Learning | Regression
RSEFNN-LF | Uni-variable type-1 Gaussian function | First-order TSK | Single Local Recurrent | Online Evolving Learning | Regression
RSEIT2FNN | Interval-valued uni-variable Gaussian function | First-order TSK | Single Local Recurrent | Online Evolving Learning | Regression
MRIT2NFS | Interval-valued uni-variable Gaussian function | First-order TSK | Interactive Recurrent | Online Evolving Learning | Regression
SRIT2NFS | Interval-valued uni-variable Gaussian function | Zero-order TSK | Feedforward | Online Metacognitive Learning | Regression
RIVMcSFNN | Interval-valued multivariate Gaussian function | Nonlinear Wavelet | Double Local Recurrent | Online Metacognitive Scaffolding Learning | Regression


Table 12. Endpoint formulas for interval multiplication

Case | $\underline{A \cdot B}$ | $\overline{A \cdot B}$
$0 \le \underline{A}$ and $0 \le \underline{B}$ | $\underline{A}\,\underline{B}$ | $\overline{A}\,\overline{B}$
$\underline{A} < 0 < \overline{A}$ and $0 \le \underline{B}$ | $\underline{A}\,\overline{B}$ | $\overline{A}\,\overline{B}$
$\overline{A} \le 0$ and $0 \le \underline{B}$ | $\underline{A}\,\overline{B}$ | $\overline{A}\,\underline{B}$
$0 \le \underline{A}$ and $\underline{B} < 0 < \overline{B}$ | $\overline{A}\,\underline{B}$ | $\overline{A}\,\overline{B}$
$\overline{A} \le 0$ and $\underline{B} < 0 < \overline{B}$ | $\underline{A}\,\overline{B}$ | $\underline{A}\,\underline{B}$
$0 \le \underline{A}$ and $\overline{B} \le 0$ | $\overline{A}\,\underline{B}$ | $\underline{A}\,\overline{B}$
$\underline{A} < 0 < \overline{A}$ and $\overline{B} \le 0$ | $\overline{A}\,\underline{B}$ | $\underline{A}\,\underline{B}$
$\overline{A} \le 0$ and $\overline{B} \le 0$ | $\overline{A}\,\overline{B}$ | $\underline{A}\,\underline{B}$
$\underline{A} < 0 < \overline{A}$ and $\underline{B} < 0 < \overline{B}$ | $\min[\underline{A}\,\overline{B},\; \overline{A}\,\underline{B}]$ | $\max[\underline{A}\,\underline{B},\; \overline{A}\,\overline{B}]$
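The sketch below encodes the nine sign cases of Table 12, assuming the standard Moore formulation, and cross-checks them against the brute-force endpoints of (B.6); the function name and the randomised test are illustrative.

# The nine sign cases of interval multiplication, verified against (B.6).
from itertools import product
import random

def mul_cases(a_lo, a_hi, b_lo, b_hi):
    if a_lo >= 0 and b_lo >= 0:  return a_lo * b_lo, a_hi * b_hi
    if a_lo >= 0 and b_hi <= 0:  return a_hi * b_lo, a_lo * b_hi
    if a_hi <= 0 and b_lo >= 0:  return a_lo * b_hi, a_hi * b_lo
    if a_hi <= 0 and b_hi <= 0:  return a_hi * b_hi, a_lo * b_lo
    if b_lo >= 0:                return a_lo * b_hi, a_hi * b_hi   # A straddles 0
    if b_hi <= 0:                return a_hi * b_lo, a_lo * b_lo   # A straddles 0
    if a_lo >= 0:                return a_hi * b_lo, a_hi * b_hi   # B straddles 0
    if a_hi <= 0:                return a_lo * b_hi, a_lo * b_lo   # B straddles 0
    return min(a_lo * b_hi, a_hi * b_lo), max(a_lo * b_lo, a_hi * b_hi)

for _ in range(10_000):                        # randomised cross-check
    a = sorted(random.uniform(-5, 5) for _ in range(2))
    b = sorted(random.uniform(-5, 5) for _ in range(2))
    z = [x * y for x, y in product(a, b)]      # the four products of (B.6)
    assert mul_cases(*a, *b) == (min(z), max(z))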


Fig 1. Network Architecture of the RIVMcSFNN


Fig 2. Learning Structure of the RIVMcSFNN


Fig 3. Selection of Predominant Features

Fig 4. (a) the trace of training samples in the tool condition monitoring problem, (b) the evolution of input features in the tool condition monitoring problem, (c) the modelling of the Nox emission in a car engine, (d) the evolution of fuzzy rules and the identification of temporal system dynamics

