An Incremental Meta-Cognitive-based Scaffolding Fuzzy Neural Network Mahardhika Pratama a*) , Jie Lu a ) , Sreenatha Anavatti b ) , Edwin Lughofer c ) , Chee-Peng Lim d ) a)
b) c) d)
Centre for Quantum Computation and Intelligent System, University of Technology, Sydney, Australia, email:
[email protected],
[email protected]
School of Engineering and Information Technology, University of New South Wales, Canberra, Australia, email:
[email protected]
Department of Knowledge-Based Mathematical Systems, Johannes Kepler University, Linz, A-4040, Austria Email:
[email protected]
Centre for Intelligent System Research, Deakin University, Geelong Waurn Ponds Campus, Victoria 3216, Australia Email:
[email protected]
*Corresponding Author Abstract— the idea of meta-cognitive learning has enriched the landscape of evolving systems, because it emulates three fundamental aspects of human learning: what-to-learn; how-to-learn; when-to-learn. However, existing meta-cognitive algorithms still exclude Scaffolding theory, which can realize a plugand-play classifier. Consequently, these algorithms require laborious pre- and/or post-training processes to be carried out in addition to the main training process. This paper introduces a novel meta-cognitive algorithm termed GENERIC-Classifier (gClass), where the how-to-learn part constitutes a synergy of Scaffolding Theory – a tutoring theory that fosters the ability to sort out complex learning tasks, and Schema Theory – a learning theory of knowledge acquisition by humans. The what-to-learn aspect adopts an online active learning concept by virtue of an extended conflict and ignorance method, making gClass an incremental semi-supervised classifier, whereas the when-to-learn component makes use of the standard sample reserved strategy. A generalized version of the Takagi-Sugeno Kang (TSK) fuzzy system is devised to serve as the cognitive constituent. That is, the rule premise is underpinned by multivariate Gaussian functions, while the rule consequent employs a subset of the non-linear Chebyshev polynomial. Thorough empirical studies, confirmed by their corresponding statistical tests, have numerically validated the efficacy of gClass, which delivers better classification rates than state-of-the-art classifiers while having less complexity. Keyword: Evolving Fuzzy Systems, Fuzzy Neural Networks, Meta-cognitive Learning, Sequential Learning
1. INTRODUCTION The consolidation of the meta-cognitive aspect in machine learning was initiated by Suresh et al. in [7][11] based on a prominent meta-memory model proposed by Nelson and Naren [6]. The works in [7]-[11] identify that the meta-cognitive component, namely what-to-learn, how-to-learn and when-to-learn, can respectively be modelled with sample deletion strategy, sample learning strategy and sample reserved strategy. Nevertheless, their pioneering works still discount the construct of Scaffolding theory [12], rendering a plug-and-play classifier. They have also not addressed the issue of semi-supervised learning, since the what-to-learn phase requires the data to be fully labelled. A novel meta-cognitive-based Scaffolding classifier, the GENERIC-classifier (gClass), is proposed in this paper. The gClass learning engine comprises three elements: what-to-learn; how-to-learn; and whento-learn. The underlying novelty of gClass lies on the use of Schema and Scaffolding theories in the how-
to-learn component to realize it as a plug-and-play classifier. The plug-and-play learning paradigm emphasizes the need for all learning modules to be embedded in a single learning process without invoking any pre- and/or post-training processes. In respect of its cognitive constituent, the gClass fuzzy rule triggers a non-axis-orthogonal fuzzy rule in the input space, underpinned by the multivariate Gaussian function rule premise. Unlike the standard form of TSK fuzzy rule consequents, the rule consequent of gClass is built upon a non-linear function stemming from a subset of non-linear Chebyshev polynomials. All training mechanisms run in the strictly sequential learning mode to assure fast model updates and comply with the four principles of online learning [32]: 1) all training observations are sequentially presented one by one or chunk by chunk to gClass; 2) only one training datum is seen and learned in every training episode; 3) a training sample which has been seen is discarded without being reused; and 4) gClass does not require any information pertaining to the total number of training data. The gClass learning scenario utilizes several learning modules of our previous algorithms in [18], [19]: three rule growing cursors, namely Datum Significance (DS), Data Quality (DQ), and Generalized Adaptive Recursive Theory+ (GART+), are used to evolve fuzzy rules according to the Schema theory [14]; two rule pruning strategies, namely Extended Rule Significance (ERS) and Potential (P+) methods, are assembled to get rid of obsolete and inactive fuzzy rules and portray the fading aspect of Scaffolding theory. The P+ method also deciphers the rule recall process, manifesting the problematizing component of Scaffolding theory to cope with the recurring concept drift; the Fuzzily Weighted Generalized Recursive Least Square (FWGRLS) method is integrated to adjust the rule consequent of the fuzzy rule and in turn delineates the passive supervision of the Scaffolding theory. gClass operates as its counterparts in [7]-[11], where the sample reserved strategy is employed in the when-to-learn process. Nonetheless, several new learning modules are proposed in this paper:
The what-to-learn component is built upon a new online active learning scenario, called the
Extended Conflict and Ignorance (ECI) method. The ECI method is derived from the conflict and ignorance method [2], and the ignorance method is enhanced by the use of the DQ method instead of the classical rule firing strength concept. This modification makes the online active learning method more robust against outliers and more accurate in deciding the sample ignorance. Note that this mechanism can be also perceived as an enhanced version of the original what-to-learn module in [7]-[11]. In [7]-[11], the what-to-learn module is limited to ruling out redundant samples for model updates, and still assumes that data are fully labelled.
A new fuzzy rule initialization strategy is proposed and is constructed by the potential per-class
method. This method is used to avoid misclassifications caused by the class overlapping situation. A number of research efforts have been attempted in [7]-[10] and [70]-[72] to circumvent the class overlapping situation, however they rely on the distance ratio method, which overlooks the existence of
unclean clusters. An unclean cluster is a cluster that contains supports from different classes and is prevalent in real world-problems. This learning aspect actualizes the restructuring phase of Schema theory.
gClass is also equipped with a local forgetting scheme inspired by [28] to surmount gradual
concept drift, where the forgetting intensity is enumerated by a newly developed method, called the Local Data Quality (LDQ) method. It is worth stressing that gradual concept drift is more precarious than abrupt concept drift, because gradual concept drift cannot be detected by standard drift detection or the rule generation method. On the other side, it cannot be handled by the conventional parameter learning method either. This situation entails the local forgetting scheme, which adapts fuzzy rule parameters more firmly and is thereby able to pursue changing data distributions. In the realm of Scaffolding theory, the local drift-handling strategy plays a problematizing role in the active supervision of the theory.
gClass enhances the Fisher Separability Criterion (FSC) in the empirical feature space method
with the optimization step via the gradient ascent method. This step not only alleviates the curse of dimensionality, but it also improves the discriminatory power of input features. Noticeably, it triggers a direct impact on the classifier’s generalization. The online feature weighting technique is employed to address the complexity reduction scenario in the active supervision of the scaffolding concept. The contributions of this paper are summarized as follows: 1) The paper proposes a new class of meta-cognitive classifiers, which consolidates the Schema and Scaffolding theories to drive the how-tolearn module. 2) The paper introduces a novel type of TSK fuzzy rule, crafted by the multivariable Gaussian function in the premise component and the non-linear Chebyshev polynomial in the output component. 3) Four novel learning modules in the gClass learning engine are proposed: online feature selection; online active learning; class overlapping strategy; and online feature weighting mechanism. The viability and efficacy of gClass have been numerically validated by means of thorough numerical studies in both real-world and artificial study cases. gClass has also been benchmarked against various state-ofthe-art classifiers, confirmed by rigorous statistical tests in which gClass demonstrates highly encouraging generalization power while suppressing complexity to an acceptable level. The remainder of this paper is organized as follows: Section 2 discusses related works. Section 3 illustrates the gClass inference mechanism, i.e., its cognitive aspect. Section 4 outlines the algorithmic development of gClass, i.e., its meta-cognitive component. Section 5 deliberates the empirical studies and discussions of the research gap and contribution, which detail the viability and research gap of gClass. Concluding remarks are drawn in the last section of this paper. 2. Literature Review In this section, two related areas are discussed. A survey of the psychological concepts implemented in gClass is undertaken, as well as a literature review of state-of-the art evolving classifiers. 2.1 Human Learning
The main challenge of learning sequentially from data streams is how to deal with the stability and plasticity dilemma [15], [16], which requires a balance between new and old knowledge. In the realm of cognitive psychology, this dilemma is deliberated in Schema theory, which is a psychological model for human knowledge acquisition and the organization of human memory [14], [67], in which knowledge is organized into units, or schemata (sing. schema). Information is stored within the schemata, and Schema theory is thus the foundation of a conceptual system for understanding knowledge representation. In essence, Schema theory is composed of two parts: schemata construction and schemata activation. Schemata are built in the construction phase, and this is achieved by three possible learning scenarios that relate to the conflict level induced by an incoming datum – accretion, tuning and restructuring. Accretion pinpoints a conflict-free situation, where an incoming datum can be well-represented by an existing schema. Tuning represents a minor conflict circumstance in which only the adaptation of a schema is entailed. The most significant case is the restructuring phase, in which a datum induces a major conflict which demands the restructure of an existing schema or its complete replacement. Schemata activation describes a self-regulatory process to evaluate the performance of the schemata, or determines a compatible learning scenario to manage a new example. Scaffolding theory elaborates a tutoring theory, which assists students to accomplish a complex learning task [69]. This goal is achieved by passively and actively supervising the training process. Passive supervision implements a learning strategy by virtue of the experience and consequence mechanism, and depends on the predictive quality of fresh data. Passive supervision is particularly represented by the parameter learning of the rule consequent. Active supervision makes use of more proactive mechanisms and consists of three learning scenarios: complexity reduction; problematizing; and fading [68]. The complexity reduction component aims to relieve the learning burden and can be actualized by data preprocessing and/or feature selection. Problematizing copes with concept drift and can be realized by a local forgetting mechanism and/or rule recall strategy. The fading constituent deciphers a structure simplification procedure which inhibits redundancy in the rule base; this concept is usually executed by the rule pruning technique. In the psychology literature, the ability of human beings to evaluate their knowledge with respect to the environment and their capacity to self-organize that knowledge is well-known as metacognition. Nevertheless, characteristic of mainstream machine learning is cognitive in nature (see [1]-[5] and [60][63]). Conventional machine learning algorithms learn all the streaming data without being able to extract important training samples and are unable to pinpoint compatible time instants in which to consume the training data [52] [65] [66]. A prominent contribution was delivered by Nelson and Narens in [6], which identifies the monitoring and control connection between cognition and metacognition. This work paves the way for a simple but well-accepted model to be emulated by a machine learning algorithm. In
principle, the cognitive component memorizes pivotal information, or examples, acquired from past experiences, while the meta-cognitive element depicts learning strategies to update the cognitive module. The meta-cognitive scenario is composed of: the termination of study (when to learn); the selection of processing method (how to learn); and item selection (what to learn). The termination of study (when-tolearn) decides when the study of an item should end, the selection of processing method concerns the choice of strategies to use when an item is integrated into memory, and item selection (what-to-learn) specifies whether or not an item is worth studying. 2.2 State-of-The Art Learning Algorithms The notion of the meta-cognitive classifier stems from the so-called evolving classifier, which has the characteristic of being fully adaptive and evolving. The evolving classifier can start its learning process from scratch with an empty rule base, and the fuzzy rules can be autonomously generated from data streams. The evolving classifier was pioneered by Angelov and Zhou in [1]. In [4], several evolving classifier architectures were introduced which were driven by eClass and FLEXIS-Class as the base classifiers. Simp_eClass+ was proposed in [3] and eMG_class was put forward in [5]. More recently, an all-pair classifier architecture and a conflict and ignorance online active learning method were devised. In our previous work, we put forward the GENEFIS-class method [20], which amends GENEFIS in [18] to solve classification problems. This work was enhanced in [19], which produced the so-called Parsimonious Classifier (pClass). pClass and GENEFIS-class are underpinned by the non-axis-parallel ellipsoidal cluster, yet they still exploit a standard first order rule output which does not fully disclose a local approximation trait. A seminal work, namely FAOS-PFNN, was proposed in [73]. This work was enhanced in [74] with the use of asymmetric Gaussian function. Nevertheless, these two works merely constitute a semi-online learning algorithm, where they still require a retraining phase using an up-to-date dataset, when encountering a new training pattern. These works are modified in [75], and are called DPand CP-ELM where they presents a recursive version of orthogonal least square developed by sequential partial orthogonalization to grow and to prune the hidden nodes of the ELM. Recently, CP-and DP-ELM were extended in [76] by means of the polynomial weight vector, where it is called BR-ELM and its sequential version is termed in OSR-ELM. Although in-depth studies have been conducted by researchers, their works do not yet incorporate the meta-cognitive learning paradigm. This issue has led to the development of the meta-cognitive classifier in [7]-[11] and [70]-[72], which are built upon the meta-memory model in [6]. Meta-cognitive learning is transformed into the machine learning context with the sample deletion strategy (what-to-learn), sample learning technology (how-tolearn), and sample reserved mechanism (when-to-learn). Arguably, the works in [7]-[11] and [70]-[72] adopt similar learning scenarios, and their main contribution is to create a meta-cognitive learning algorithm with various cognitive components, ranging from Radial Basis Function Neural Network
(RBFNN) to Fuzzy Neural Networks (FNNs). Nevertheless, the meta-cognitive learning research area deserves more profound investigation for two underlying reasons: 1) The use of the scaffolding criteria, which provides a promising direction for a plug-and-play classifier, is still uncharted. Therefore, these machine learning variants usually enforce pre-and post-training processes, which undermine the logic of the online learning machine. 2) The issue of the considerable labelling effort is unsolved, because the traditional meta-cognitive classifiers are designed for a fully supervised learning environment. 3. COGNITIVE COMPONENT OF GCLASS gClass is endowed with a generalized fuzzy rule, in which the multivariate Gaussian function, which possesses a non-diagonal covariance matrix, is utilized as the rule antecedent. This rule premise is an attractive option for covering real-world data distributions because it can evolve non-axis parallel ellipsoids and is capable of conferring more exact coverage of data distributions. It is worth noting that this advantage cannot be achieved with axis parallel rules induced by the classical t-norm operation. This rule premise arguably has two underlying shortcomings. First, the fuzzy set cannot be explicitly formulated, thus inducing less transparent rule semantics. Second, it forces a more demanding memory burden because of the need to store extra parameters in the memory. The first issue can, in principle, be surmounted by our previous work in [18], in which two fuzzy set extraction strategies are developed. The second issue is not necessarily valid, since this rule type can be anticipated to dampen the requirement of fuzzy rules (e.g., being able to perform a more compact representation in the event of longer rotated data clouds with only one rule). Fig.1 depicts two distinct ellipsoidal contours compiled by two types of rule.
Fig.1 Cluster/Rule representations (solid lines: arbitrarily rotated ellipsoids, dotted lines: axis-parallel ellipsoids), in case of the longer thin data cloud using conventional axis-parallel rules either an inexact presentation with one rule or an exact representation with a high complexity (three rules) is enforced, whereas a rotated representation can almost perfectly model the data cloud
Each rule premise is fuzzily associated with a local non-linear sub-model by virtue of the Chebyshev polynomial having its root in [23]. This rule output consummates the local approximation ability via a non-linear mapping of the Chebyshev function augmenting the degree of freedom of rule output. It is worth-noting that the Chebyshev function possesses a simpler expression than the trigonometric rule consequent of [24], [30], [31]. Moreover, our approach is more resilient than the approach in [24],
because the rule consequents are adapted by the local learning scenario, assuring higher flexibility, stability and faster convergence speed [4] [39]. This rule is formed as follows:
Ri : IF X is close to Ri Then y i o x e i where Ri stands for a multi-dimensional kernel, constructed by the multidimensional Gaussian function, o
propelled by a non-diagonal covariance matrix, while y i denotes a regression output of a o-th class in ( 2u 1)m the i-th rule. i labels a weight vector, where m specifies an output dimensionality and u
denotes a number of input features. Conversely, x e constitutes an expanded input vector produced by a non-linear mapping based on the Chebyshev series up to the second order. Inspired by the ChebyshevFunctional Link Artificial Neural Network [23], the mathematical expression of the Chebyshev polynomial is given as follows:
Tn1 ( x) 2 x j Tn ( x j ) T n1( x j )
(1)
with T0 ( x j ) 1 , T1 ( x j ) x j , T2 ( x j ) 2 x j 1 2
Suppose X is a 2-D input pattern X [ x1 , x2 ] . The expanded input vector turns out to be
xe [1, x1 , T2 ( x1 ), x 2 , T2 ( x 2 )] . Note that we include the term 1 in this case to include the intercept of the rule consequent as the standard form of the TSK fuzzy rules [59] (otherwise, all consequent hyper-planes will arrive at the origin, thus leading to untypical gradients). Combining piecewise local predictors yi through a non-linear kernel (rule membership function R ) leads to the predicted output of the model : p
yˆo
p
Ri yi o
i 1 p
R
i
i 1
exp( ( X C ) A i
1
i
i 1 p
exp( ( X C ) A i
i
( X Ci )T ) yi o
1
( X Ci )T )
, y max( yˆ o )
(2)
o 1,..,m
i 1
where p signifies the number of fuzzy rules, Ci 1u denotes the centre of the Gaussian function of the i-th rule and Ai 1 uu stands for the inverse covariance matrix of the i-th rule, which defines the shape and orientation of the ellipsoidal contours. By extension, the Gaussian function is selected because it can forestall undefined input states due to infinite support. It can allow smooth approximation of the local data space, since it is steadily differentiable. In this paper, the MIMO architecture is used to infer the classification result [1]-[5], where the final output is simply assembled by the maximum operator. 4. META-COGNITIVE LEARNING An incoming datum is first vetted by the what-to-learn module (Section 4.2), which aims to rule out inconsequential samples for the model updates. The training samples, admitted by the what-to-learn component, are injected into the how-to-learn module (Section 4.1), which updates the cognitive component. The training samples, which do not satisfy the learning criteria set out in the how-to-learn component, are assigned as reserved samples. The reserved samples are utilized after the main training
patterns have all been consumed, with the aim of filling the gaps unexplored by the main training patterns (Section 4.3). Fig.2 illustrates the learning architecture of gClass, whose learning steps will be detailed in the next section. In addition, an overview of the gClass learning policy is articulated in Algorithm 1. Algorithm 1: Rule Base Management of gClass Define: Input attributes and Desired class labels: ( X n , Tn ) ( x1,...,xu , t1,..,tm )
Reserved samples:
( XS n , TS n ) ( xs1,n ,...,xs u,n , ts1 ,...,tsm ) 15
Predef. Thresholds 1 0.5, 3 0.01, 10 , 10 , 0.05 /*Phase 1: What to Learn Strategy: Active Learning Strategy/* For i=1 to P do Compute the data quality (6) and the output of the classifier (2) End for 5
IF (29) Then /* big IF, spanning until the end */ Label the data steam using expert knowledge /*Phase 2: Input Weighting Mechanism- complexity reduction of active scaffolding /* For j=1 to m do For z=1 to m do j z Compute the recursive formula (26) Construct the kernel gram matrix (25) End For End For Compute the gradient expression (27) Execute the gradient ascent method (28) /*Phase 3: Local Drift Handling Strategy-problematizing of active scaffolding /* For i=1 to P do Compute the local error and forgetting level (22),(23) End For /*Phase 4: Rule Growing, Adaptation of the Fuzzy Rule Premise-Schema theory /* For i=1 to P do Compute the posterior probabilities of the fuzzy rules (5) Update the volume of all rules (4) End For Determine the winning rule win arg max i 1,...,P P( Ri X )
For i=1 to P* do update the P+ method for P* rules End For IF (3)Then
i
max ( ) max ( DQ )
(15)
i Then IF i*1,..,P* i* i 1,.., p 1 Activate rule recal mechanism (17) Else IF
R
2 IF win Then Compute the potential per class method (7)
max ( DQ ) true _ class _ label
o IF o 1,..,m Then Initialize the new fuzzy rule as (8) Else IF Initialize the new fuzzy rule as (9) End IF Else IF Initialize the new fuzzy rule as (10) End IF Assign the output parameters as (11) End IF p Else IF V win 1 Vi i 1 Update the premise parameters of the winning rule (12)-(14) Else Append the reserved samples with the current sample
( XS NS 1 , TS NS 1 ) ( X N , TN ) End IF (3) /*Phase 5: Rule Pruning Strategy-fading of active scaffolding /* For i=1 to P do Enumerate the ERS and P+ methods (15) ˆ 2 Then IF i Prune the fuzzy rules End IF /*Phase 6: Rule Pruning Strategy-problematizing of active scaffolding /*
ˆ 2
Then IF i Deactivate the fuzzy rules subject to the rule recall mechanism P*=P*+1 End IF End For /*Phase 7:Adaptation of rule consequent-Passive Scaffolding Theories /* For i=1 to P do Adjust the fuzzy rule consequents (18)-(21) End For /*Phase 8: When-to-learn-sample reserved strategy /* For n=1 to NS do Execute all training processes from step 1-7 using the reserved
( XS , TS n ) ( xs1,n ,...,xs u,n , ts1 ,...,tsm )
n samples End For End IF (29)
4.1 How to learn 4.1.1 Autonomous Fuzzy Rule Recruitment The fuzzy rules are generated by three rule growing modules which aim to find streaming data with high potential and summarization power [18]-[19]. The first method, namely the Datum Significance (DS) method, is capable of appraising the statistical contribution of streaming data, indicating the expected contribution of rules to the overall system output, whereas the second approach makes use of (Generalized Adaptive Resonance Theory+) GART+, which is useful in inhibiting the cluster delamination effect by confining the size of the fuzzy region. The third method, namely the Data Quality
(DQ) method, determines the spatial proximity between the datum and all previous data to establish whether or not it occupies a valuable fuzzy region in the input space. This leads us to the following condition of fuzzy rule generation. p
V P 1 max(Vi ) and V win 1 i 1,..,P
V
i
and ( DQ N max ( DQi )orDQ N min ( DQi ) ) (3)
i 1
i 1,..,P
i 1,...,P
where Vi denotes the volume of the i-th rule and DQ N is the quality of N-th datum, while 1 labels a predefined constant whose value is stipulated in the range of [0.1,0.5]. Generally speaking, 1 governs the stability-plasticity dilemma, where the allocation of a lower value encourages plasticity, inducing a high number of rules and vice versa. If the first part of the condition in (3) holds, a data stream inevitably contributes well during their lifespan. The second part addresses the situation in which the volume of the winning rule is oversized; an adaptation of the winning rule would thus have the effect of enlarging the coverage span of the winning rule, exacerbating the condition of cluster delamination (covering more than one data cloud). The last part of (3) can be used as a precursor of data density. This strategy is capable of indicating an incoming datum lying on a populated fuzzy region or an uncharted input space, signifying a shift in the data trend. Note that the rule growing cursor performs a knowledge exploratory mechanism or quantifies the degree of conflict induced by the datum. Accordingly, the rule growing criteria determine suitable learning modules to be performed during the training process. In the realm of psychology, the conflict measure is in line with schemata construction in Schema theory. 4.1.2 Hyper-Volume Calculation The hyper-volumes of non-axis parallel ellipsoids can be simply elicited by the use of the determinant operator. However, this is a heuristic approach, which is rather inaccurate. We can enumerate a hypervolume of a generalized fuzzy rule more exactly as follows:
Vi
2*
u j 1
(ri / ij ) * u / 2
(u / 2)
, (u ) x u 1e x dx
(4)
0
where ri is the Mahalnobis distance radius of the i-th fuzzy rule, which defines its (inner) contour (with the default setting of 1) , ij is the j-th eigenvalue of the i-th fuzzy rule, and is the gamma function. To expedite the computation of the gamma function, a look-up table can be generated a priori and used during on-line learning with the current data set. 4.1.3 Winning Rule Elicitation The winning rule is selected using the Bayesian theory rather than the traditional firing strength measure, where the fuzzy rule, having the maximum posterior probability, is chosen as the winning rule, i.e., win arg max i1,...,p Pˆ ( Ri X ) . The advantage of this theory is its prior probability, which is capable of determining the winning rule in the probabilistic fashion. Such strategy is deemed efficient to deduce
the winning rule when two or more rules occupy similar proximities to the training datum. The posterior, prior probabilities as well as the likelihood function are illustrated respectively as follows: P ( Ri X )
pˆ ( X Ri ) Pˆ ( Ri ) p
pˆ ( X R ) Pˆ (R ) i
Pˆ ( Ri )
i
log( N i 1) p
log(N
i
Pˆ ( X Ri )
1)
1 1
(2 ) 2 Vi
1
exp( ( X Ci ) Ai 1 ( X Ci )T ) (5) 2
i 1
i 1
where N i stands for the number of populations of the i-th cluster. Note that the prior probability formula
Pˆ ( Ri ) is softened from the original version for allowing newly born clusters to win the competition and to develop its shape. Note that the new cluster is usually populated with a smaller number of samples than the old clusters, thus impeding them to be selected as the winning rule. 4.1.4 Recursive Computation of Actual Data Quality (DQ) According to [32], the DS method is unappealing to be benefited as the sole rule growing method, if the data distribution is not uniformly distributed. The work in [32] offers a solution to cope with this issue. Nevertheless, it is not compatible for the online learning scenario, because it is based on the sliding window-based approach. To remedy the bottleneck, we can estimate the quality of a new datum with respect to existing clusters on-the-fly without the use of past training stimuli as follows: N 1
DQ N
R
N
1
N 1
n 1
1
DQ
n (X n
X N ) AN 1 ( X n X N ) T
UN (6) U N (1 a N ) 2b N c N
n 1
N 1
DQ
n
n 1
Fig.2 learning architecture of gClass where U N U N 1 DQ N 1 , a N X N AN 1 X N T , bN DQ N 1 X N N , N N 1 AN 1 X N 1T i
i
c N c N 1 DQ N 1 X N 1 AN 1 X N 1 . X N denotes the latest incoming datum and X n labels the n-th
incoming datum. This formula effectively quantifies the firing strength of a hypothetical rule (the latest datum) in recursively accommodating already-seen training samples without maintaining them in the memory. In other words, it approximates the zone of influence of a cluster with respect to other training stimuli seen thus far. In the third term in (3), the first part, i.e. DQ N max ( DQi ) implies that a i 1,.., p
prospective cluster occupies a denser region in the input space than existing rules. Meanwhile, the second situation DQ N min ( DQi ) shows that the prospective cluster digs up an unexplored local region in i 1,...,p
the input space, or indicates a regime shifting property of the system. Note that we can arrive at the data quality for an i-th rule DQi ,i 1,...,p , substituting X N , AN 1 in (9) with C i , Ai 1 , where we specifically update DQi ,i 1,...,p in every training episode. It is worth noting that DQ N min ( DQi ) invokes the rule i 1,...,p
pruning scenario to be instilled in the learning engine to prevent outliers being mounted as new rules. Also, the DQ method is compatible with the generalized TSK fuzzy rule, utilizes the inverse multiquadratic kernel in lieu of the Cauchy kernel, and engages a weighting factor to resolve a large pair-wise distance problem after receiving noisy training samples [33]. In short, it can be envisioned as an extended version of the Recursive Density Estimation (RDE) method in [5]. 4.1.5 Initialization of New Fuzzy Rule Parameters Initialization of the fuzzy rule parameters should be contrived circumspectly in Evolving Fuzzy Classifiers (EFC), because the class overlapping contingency may be apparent. Generally speaking, a newly composed rule should not be adjacent to clusters supported by different classes. Suresh et al. in [7][11] offered a concept to especially deal with this problem by exploiting a distance ratio between interand intra-class clusters. Nevertheless, in real-world streaming data problems, a cluster is likely to comprise supports from different classes (as classes cannot be clearly separated, or samples may be affected by noise). Therefore, a cluster cannot usually be linked to a particular class (also well-known as a clean cluster). This issue is excluded in [7]-[11]. To remedy this stumbling block, we should first canvass the compatibility degree of the winning cluster to check its spatial proximity to a data stream. if Rwin 2 , a datum is adjacent to the winning 2 rule. 2 stands for a predefined constant that can be statistically represented by the critical value of a
distribution with Z degrees of freedom and a significance level of typical value of
2 [36], termed as p ( ) . A
is 5%, and the degree of freedom is represented by the dimensionality of the learning
2 problem, thus setting Z u . Therefore, we set 2 exp( p ( )) . As Rwin 2 , a new rule induced by
the new datum (center equal to datum coordinates) can be claimed as a redundant rule since it lies on the tolerance region. We switch on the so-called DQ per class method, which aims to approximate the class
interactions. The crux of this method is to probe the class relationship between a datum and existing data clouds recursively. In a nutshell, the potential per class method is formulated as follows:
1
DQ o
No u m
(x
1
j
no
xj ) N
(7)
no 1 j 1
( N o 1)
u m
where abn
( N o 1) ( N o 1)( abn 1) cb n 2bbn
u m
( x j N ) 2 , cb no cb no1
j 1
j 1
( x j N o 1 ) 2 , bbno
u m
x
N j
d j no , d no d no1 x No 1 and
j 1
N N o denotes the number of samples falling in the o-th class. Meanwhile, x j stands for the latest no incoming datum of the o-th class and x j denotes the streaming data falling to the o-th class. This
measure is useful for understanding whether or not the newest datum is closer to the data cloud of the same class. A precarious situation may arise if the new datum has a closer relationship with the data samples of different classes max ( DQo ) true _ class _ label . The class overlapping problem, which o 1,..,m
jeopardizes the classification rate, is most likely to occur in this situation. Let ir be the winning intra-class cluster, and ie be the winning inter-class cluster. We initialize a new rule as follows:
c P 1, j x j 2 (cie, j x j ) , dist j 1 c P 1, j cie, j and A p 1 1 (dist T dist) 1
(8)
where c P 1, j is the j-th component of the centroid of the new fuzzy rule (P+1st) and c ie, j is the j-th 1 component of the centroid of the winning inter-class fuzzy rule. A p 1 is the inverse covariance matrix
of the new fuzzy rule and 4 stands for an overlapping factor steering the overlapping degree of the new cluster and the nearest cluster. 3 [0.01 0.1] denotes a shifting factor fixed as 0.01 for all our numerical studies for the sake of simplicity. The predefined parameter 3 is not problem dependent according to our sensitivity analysis. That is, the variation of its values in the specified range does not affect learning performance. A plausible choice of 4 can be gained by setting 4 rir j rie j , where rir j labels a spatial proximity of the datum and the nearest intra-class cluster, whereas rie j exhibits a distance between the datum and the most adjacent inter-class cluster. One can concur that 4 rir j rie j is coherent to serve as 4 since it should be demoted when the datum neighbours the inter-class cluster and vice versa. As a result, (8) essentially shrinks the coverage span of the newly created cluster and shifts the rule centroid away from the inter-class cluster to stave off the class overlapping phenomenon. If the datum lies on the area near the data points in the same class max ( DQo ) true _ class _ label , o 1,..,m
we construct the new fuzzy rule as follows:
c P 1, j x j 2 (cir , j cie, j ) dist j 1 c P 1, j cir , j and A p 1 1 (dist T dist) 1
(9)
where c ir , j is the j-th component of the winning intra-class cluster. A low risk of misclassification is observed by the potential per class method in this case, thereby allocating more confident parameters. Although the new rule may result in overlap with the winning intra-class cluster in the future, this situation does not induce a substantial amendment of the decision boundary, making worse the nonlinearity of the decision surface. Another condition may ensue in the training process and is signified by Rwin 2 . In this case, a low risk of the class overlapping phenomenon is captured, since the new fuzzy rule possibly occupies a remote input region, uncharted by the existing fuzzy rules. We thus tailor the new fuzzy rule as follow:
c P 1, j x j dist j 1 x j cir , j and Ap 11 (distT dist)1
(10)
In all cases, the output parameters and the covariance matrix of the newly crafted fuzzy rules are constructed as follows:
p 1 winner and p 1 I
(11)
where is a large positive constant. The desired setting of is as verified in [38], because it can lead to the best solution as produced by the batch learning process instantaneously. Conversely, the consequent vector is determined as the rule consequent vector of the nearest rule, arguably inheriting its functional trend. This setting can also cope with discontinuities of the approximation surface and can reduce convergence time. 4.1.6 Adaptation of Existing Rules (when (3) is not fulfilled) Appending the fuzzy rules automatically as presented in Section 4.1.6 represents the restructuring phase of the Schema theory, because the datum is conflicting to the current belief of the system. Another circumstance, namely the tuning phase of the Schema theory, may occur in the training process, given that a data stream incurs a minor degree of conflict. This situation is managed by adjusting the premise of the winning rule obtained via the Bayesian concept. The tuning case is reflected as follows p
V P 1 max(Vi ) and V win 1 i 1,..,P
V
i
and ( DQ N max ( DQi )orDQ N min ( DQi ) )
i 1
i 1,..,P
i 1,...,P
The rule adjustment criterion implies that the winning cluster is allowed to expand its size without a risk of the cluster delamination phenomenon. The rule premise adaptations are carried out as follows:
Cwin N
N win N 1
N win N 1 1
Cwin N 1
( X N Cwin N 1) N win N 1 1
1 N 1))( A ( N 1) 1( X C N 1))T A ( N 1) 1 ( Awin ( N 1) ( X N Cwin win N win Awin ( N ) 1 win 1 1 1 ( X N Cwin N 1) Awin ( N 1) 1( X N Cwin N 1)T
N win N N win N 1 1
(12) (13) (14)
where 1 ( N win N 1 1) . Equation (13) is suitable in the online learning scenario, because it does not entail a re-inversion process after adjusting the winning rule. Note that the re-inversion phase is computationally considerable and results in numerical instability (i.e. matrix is ill-defined). 4.1.7 Rule pruning and recall strategies The rule pruning and recall modules, which act as the fading aspect of active supervision in Scaffolding theory, play a vital role in relieving the complexity of the rule base. An over-complex network topology inflicts many detrimental effects, including the over-fitting case, prohibitive memory demand, and considerable computational load. gClass has two mechanisms to pinpoint the significance level of fuzzy rules: the Extended Rule Significance (ERS) method and the Potential+ (P+) method. The ERS method is embedded to seize the statistical contributions of fuzzy rules, while the P+ method traces the footprint of the fuzzy rules or produces the density of fuzzy rules. In essence, the P+ method is capable of pruning out-dated fuzzy rules which are no longer relevant to the capture of recent data trends due to concept drift. The ERS method implies the contributions of fuzzy rules in the future and to the system output. The ERS strategy is therefore effective for getting rid of superfluous fuzzy rules, which contribute little during their lifespan. Both methods are defined as follows:
i
m 2u 1
o 1 j 1
Vi u
y ij o
u
p
V
, i
( N 1) n 1,i 2 ( N 1) n 1,i 2 ( N 2)(1 n 1,i 2 ) n 1,i 2 d i n
(15)
i
i 1
where i denotes the ERS of the i-th fuzzy rule, and i exhibits the P+ value of the i-th fuzzy rule, d i
n
stands for the Mahalanobis distance between the current training sample and the focal point of interest. In principle, the volume of clusters and output parameters as investigated in the ERS method are a point of departure for the appraisal of fuzzy rule contributions with respect to the system’s output and the rule significance in the future. This nevertheless disregards the issue of how strategic the cluster position is in the input space. The P+ method is suitable for covering this gap because it can appraise the evolution of the clusters. We arrive at the following two conditions to deduce whether or not the fuzzy rules are superfluous as follows:
i ˆ 2 or i ˆ 2
(16)
where ˆ , stand for the mean and Standard Deviation (SD) of the P+ method of existing rules, ˆ , label the mean and SD of the ERS method of existing rules. Nevertheless, the fuzzy rules deactivated in the earlier training episodes by the P+ method may become valid again due to the recurring concept drift. In other words, the old data distribution may be reactivated in the future. Adding a totally new rule to capture this concept drift would be counterproductive, because information granules conceived by the old rules would be catastrophically discarded. Hence, if the potential of the pruned fuzzy rules is
* substantiated in the future max ( i* ) max ( DQi ) , where P is the number of rules dispossessed by i*1,..,P*
i 1,.., p 1
the P+ method, already pruned fuzzy rules should be re-activated, because this situation is a firm indication of the cyclic drift in the data streams. Note that the rule recall mechanism should be synchronized with respect to (3) (the requirement of the fuzzy rule generation in (3) is satisfied). The parameter of the recalled rules is then allocated as follows:
C p 1 Ci* , p 1 1 i* 1 , p 1 i* , p 1 i*
(17)
It is worth-noting that the computational burden is still alleviated, because already pruned fuzzy rules are merely utilized to execute the P+ method. 4.1.8 Fuzzily Weighted Generalized Recursive Least Square (FWGRLS) Procedure The passive supervision in the realm of the Scaffolding concept in gClass is governed by the FWGRLS learning methodology. It constitutes a local learning version of the Generalized Recursive Least Square (GRLS) method [40], which strengthens the implicit weight decay effect of the Recursive Least Square (RLS) concept. The implicit weight decay effect reinforces the model’s generalization, because it arguably sustains the output parameters to hover around a small bounded interval. By extension, it boosts the compactness of the rule base, because the output parameters, which are too small, can be captured by the ERS method. The FWGRLS method is formulated as follows:
(n) i (n 1) F (n)(
i ( n) i ( n)
F (n)i (n 1) F T (n)) 1
i (n) i (n 1) (n) F (n)i (n 1) i (n) i (n 1) i (n) (i (n 1)) (n)(t (n) y(n)) y(n) xen i (n) and F (n)
y (n) x en (n)
(18) (19) (20) (21)
( P 1)( P1) where i (n) indicates a diagonal matrix whose diagonal elements consist of the firing
strength of fuzzy rule Ri
. The covariance matrix of the modelling error is shown by (n) , and is
managed as an identity matrix [40] for the sake of simplicity. i is a local forgetting factor, intended to compensate the gradual concept drift, and is further discussed in the next section. is a predefined constant specified as 10 15 and ( i (n 1)) stands for the gradient of the weight decay function. In our case, we choose the quadratic weight decay function ( yi (n 1))
1 ( i (n 1)) 2 , which results 2
in ( i (n 1)) i (n 1) . The weight decay function is capable of reducing the weight vector proportionally to its current values. 4.1.9 Drift Detection Several attempts in the literature were devoted to conquering concept drift in FLEXFIS+ of [41] and eTS+ of [42]. eTS and FLEXFIS+ intrinsically benefit from the concept of age and utility [65], which relies on the global forgetting scheme. The major shortcoming of the global forgetting scheme is that it
assigns the same forgetting level for all fuzzy regions, whereas in fact, concept drift may exist in each local region with different intensities. The local forgetting scheme was recently introduced in [26]. In contrast to the global method, a local drift handling method distributes a unique forgetting value to each rule. This approach is deemed more plausible, since the drift is handled locally with a specific local forgetting degree. Hence, a cluster hampered by high drift intensity is assigned a strong forgetting level and vice versa. Another local forgetting mechanism, namely Local Data Quality (LDQ) method, is proposed below. The key idea is akin to the DQ method in equation (9); however it merely quantifies the distance between the cluster focal point and the training samples supporting this cluster only, thereby being able to reliably monitor a local cluster evolution. The LDQ method of the i-th rule can be written as follows:
1
LDQ N i i
N i 1
1
(X n 1
1
ni
C i ) Ai ( X ni C i )
( N i 1) ( N i 1)(1 ac N i ) 2bc N i cc N i
(22)
( N i 1)
1 T 1 1 T where ac Ni ac Ni 1 X ni Ai X ni , dc Ni dc Ni X ni Ai , bc Ni dc Ni Ci and cc Ni Ci Ai Ci . N i is
the number of samples belonging to the i-th cluster. In principle, a decrease of equation (24) signifies that the data distribution has moved away from its previous region, and hints at local concept drift. To this end, we apply the first order derivative of (22) to determine the local drift rate. Referring to [26], a strong forgetting level is delivered by i 0.9 whereas i 1 designates no forgetting at all to the past data history. The drift handling strategy is organized in such a way as to assure i [0.9,1] as follows: i min(max(1 0.1LDQN i ,0.9),1) , N i N i N i min(trans ,0.99) , trans 9.9i 9.9 i
where
transi labels
i
(23)
the forgetting level of the premise of the i-th rule, which in turn alleviates the cluster
population. Note that the centre and spread drift phenomena of the cluster can be unravelled by lessening the cluster support (thus relaxing the strong converged position). This later enables the dispatch of stronger adjustments to the centroid and covariance matrix of the i-th local input space partition and thus shifts the cluster in respect of changing data distributions. Conversely, the drift in the output concept can be overcome by i explicitly encompassed in the FWGRLS method to adapt the rule output. This mechanism can be employed if the clusters hold adequate supports (containing at least 30 samples) to circumvent the unlearning effect [26]. 4.1.10 Feature Weighting Algorithm via Maximization of Separability Criterion in The Empirical Feature Space Several approaches have been designed to tackle the curse of dimensionality in an evolving system, including input pruning [19], Fisher Separability Criterion (FSC)-based feature weighting [27], and optimization of the separability criterion [43],[44]. Nevertheless, the input pruning strategy is less
effective in ambient learning environments because once features have been removed from the model, they cannot be re-activated without causing discontinuity in the incremental learning process (the structures and parameters of the current models have evolved in another, smaller feature space). A retraining phase has to be undertaken from scratch to guarantee the stability of the classifier [27]. Conversely, the feature weighting algorithms were devised in [43] with the use of the FSC method in the original feature space. FSC in the empirical feature space method is more desirable than FSC in the traditional feature space method, because it is more straightforward to acquire the information of the separability criterion in the orthogonal space. Another salient contribution of our work is the extension of the concept in [28],[29] to the scope of the on-line learning mechanism, enabling the optimization of the separability criterion in the empirical feature space. The optimization process is carried out with the use of the gradient ascent algorithm and the alignment concept. The FSC in the empirical feature space can be mathematically formulated as follows:
K N J trace( S w 1 S b ) tr ( K ) W W
(24)
Fig 3 Conflict and ignorance cases.
where J , S b , S w respectively exhibit the new FSC method in the empirical feature space, the between class scatter matrix and the within class scatter matrix. W signifies the sum of every dimension of matrix i, j W , K denotes a kernel-Gram-matrix. W and K are stipulated as follows:
W
1 diag ( K ooˆ N
K 11 , K 12 ,.., K 1o ,.., K 1m K 21 , K 22 ,.., K 2o ,.., K 2 m / N o ), o 1,...,m , K .................................... K m1 , K m 2 ,.., K mo ,.., K mm
(25)
N N Note that K11 1 1 denotes a kernel-Gram-sub-matrix emanating from data in class 1, while
K12 N1 N 2 labels a kernel-Gram-sub-matrix originating from data in classes 1 and 2, and so on. N o indicates the number of populations of the o-th class. The kernel function can be seen as Linear, Gaussian RBF, and Polynomial kernels, which however do not underpin swift online operations. Therefore, we rectify this drawback with the use of the Cauchy kernel function which constitutes a Gaussian-like
function due to the first order Taylor series approximation of the Gaussian function. The Cauchy kernel function is appropriate for the online learning scenario because it allows recursive operations. The elements of the kernel-Gram-matrix K can be identified by the Cauchy function as follows:
K ooˆ N
( N o 1) ( N o 1)( N o 1) N oˆ 2 N oˆ
where N o
u m
j 1
( x j N o ) 2 , N oˆ N oˆ 1
(26)
u m
(x j 1
j
) , N oˆ
N oˆ 2
um
x
j
No
j 1
N o , Noˆ Noˆ 1 x Noˆ . x j N o
N signifies the j-th element of the N o -th training sample of the o-th class and x j oˆ denotes the j-th
element of the N oˆ -th training sample of the oˆ -th class.
0
and
0
can be initialized as zero. The
alignment matrix is crafted afterwards [48] as follows:
A( K , K * )
K, K * F K
F
K
*
F
tr ( S b ) K F
(27)
*
where A( K , K ) defines the alignment matrix of the kernel-gram-matrix K, and K
F
stands for the
Frobenius norm of the kernel-Gram matrix K . By applying the gradient operator of the alignment matrix, we arrive at the gradient ascent optimization procedure as follows
tr ( S b ) ( A( K , K * )) K F
( W ) K
( K ) N N N 1 N ( A( K , K * ))
(28)
F
where N stands for the weighting factor initialized as 1. N
1 exhibits the learning rate which shrinks n
over time in the training process and is established according to the Robbins-Monroe conditions [66] to guarantee the convergence of the weights. The weighting factor is integrated in all learning scenarios including all distance calculations affecting the rule evolution criterion (thus preventing the evolution of rules when the criteria are violated as a result of unimportant features) and the rule re-setting criteria (the motivation for rule evolution). 4.2 What to Learn The major bottleneck of the what-to-learn learning component in Suresh et al. [7]-[11] concerns the operator annotation efforts, which are laborious to carry out. In this paper, we propose a novel active learning method which enhances the conflict and ignorance concept in [25]. Specifically, we embellish the ignorance aspect of the original conflict and ignorance method using the DQ method to figure out the position of the datum in the feature space. It is notable that the original version simply exploits the firing strengths of fuzzy rules to assess the need for the ignorance aspect. This is deemed inaccurate, because the compatibility measure is merely based on a single sample strategy, and clearly, other samples can affect the ignorance of this sample. In addition, the DQ method is more robust to noise or outliers than the
classical method, because it executes a sort of accumulated ignorance criterion over time. Fig.3 illustrates the conflict and ignorance cases. In Fig.3, querying point 1 is redundant, thus being capable of classifying it safely – learning this sample is not important for refining the decision boundary and even exacerbates the over-fitting problem. Querying point 2 represents a strong conflict condition, and the learning process of such training samples requires that the decision boundary is updated to diminish the number of misclassifications. Querying point 3 represents training samples that lie far away from the current cluster centre. It is beneficial to accommodate these samples in the classifier updates to cross unexplored regions and to avert similar extrapolation cases in the future (fuzzy classifiers, especially, worsen significantly in terms of the correctness of the classification decision in cases of extrapolation). We arrive at the following condition to rule out training samples for model updates as follows:
min ( DQi ) DQ N max ( DQi ) and conf final
i 1,..,P
i 1,..,P
score1 (0.5 ) score1 score2
(29)
where score1 and score 2 label the outputs of the most two dominant classes while denotes the tolerable constant, which is fixed as 0.05 in all of our empirical studies. Note that score1 and score 2 can be given by the classifier’s outputs yˆ 0 , if MIMO or one against all classifier architectures are used. Alternatively, they can be concluded from the weighted voting scheme of the preference relation matrix if the all-pairs architecture is explored [2]. It is conceivable that the first term in (29) signifies that the datum does not bring any new information because a datum is possibly well covered by existing clusters. Such data can be regarded as inconsequential examples. Another noteworthy aspect to override the data sample is a non-conflict case, which is usually provided by a confident classifier prediction, as defined by the second term in (29). Conversely, when score1 and score 2 are almost equal, thus arriving at
conf final 0.5 (0.5 ) , they indicate a hard decision or a conflict with one of the cases (both are almost equally supported by the sample => conflict), which clearly entails a learning process to correct the classifier’s confusion. Table.1 Computational load, memory requirement, structural cost gClass pClass Computationa l load Structural complexity
O( ( m( 2 P1)2 m2 U 5 P mU 2 P* ))
O( p m ( 2u 1) p (u u ) p u )
O( m2u 2 P* 4 P m( P 1)2 )
O( p m (u 1) p (u u ) p u )
GENEFIS-Class
O( P 2 2 P m mp m(u 1)2 mu pu)
O( p m (u 1) p (u u ) p u )
4.3 When to learn If the conditions in what-to-learn or how-to-learn are not satisfied, the datum is pushed into the rear stack and is assigned as a reserved sample ( XS n , TS n ) . This mechanism is widely known as the sample reserved strategy, where learning by means of the reserved samples is undertaken when the system is idle
or all centric data have been depleted. The reserved samples can be used to cover unexplored regions of regular training samples. In theory, the training process is complete when no further sample is available in the data stream. In practice, this is not plausible, because the number of reserved samples can be unbounded, as the nature of data streams. Therefore, the training process is terminated when the number of reserved samples remains the same [7]-[11]. Fig.4 visualizes the flowchart of the gClass learning procedure.
Fig.4 A flowchart of gClass learning scenario
4.4 Computational Complexity The computational complexity of gClass is influenced by every gClass learning module. However, this computational load hinges on whether or not the training samples are accepted by the what-to-learn learning module. The fuzzy rule recruitment scenario charges the computational complexity O( P * 2P) , which is compiled by the DS, DQ, GART+ and rule recall methods. Presumably, the rule pruning
methods have a computational complexity cost in the order of O(2P) , which is generated by the ERS and P+ methods. The allocation of fuzzy rule parameters, enforced by the potential per class composition, bears a computational burden in the order of O(( P m) U 2 ) . Roughly speaking, the feature weighting algorithm based on the optimization of the FSC in the empirical feature space incurs a computational cost in the order of O(m 2 U ) , while the FWGRLS method inflicts computational complexity in the order of
O(M (2P 1) 2 ) .
Consequently,
the
resultant
computational
cost
is
O( (m(2P 1) 2 m 2 U 5P m U 2 P * )) , where expresses the probability of admitting the streaming data. Table 1 details the computational burden and memory demand of the gClass, GENEFISclass and pClass algorithms [18], [19]. The gClass algorithm theoretically has a lower computational burden than the pClass and GENEFIS-class algorithms, because gClass is crafted in the metacognitive learning landscape. In contrast to the pClass feature weighting strategy, the gClass feature weighting technique has less computational cost, since it is not reliant on the Leave-One-Feature-Out (LOFO) mechanism. GENEFIS-class and pClass may be expected to confer fewer parameters to be salvaged in the memory due to a lower degree of freedom rule consequent and a diagonal covariance matrix in the rule antecedent. Nonetheless, the generalized fuzzy rule of gClass can be expected to grow fewer fuzzy rules, as experimentally verified in 5.1 and pictorially illustrated in Fig.1. datasets SEA Iris Wine Electricity pricing Weather Line Circle Sin Sinh Boolean Noise corrupted signal Image segmentation Ionosphere Hyper-planes Thyroid
Table 2. Dataset specifications.

Dataset                  Num of input attributes   Num of classes   Num of data points
SEA                      3                         2                60000
Iris                     4                         3                150
Wine                     13                        3                178
Electricity pricing      8                         2                45312
Weather                  4                         2                60000
Line                     2                         2                2500
Circle                   2                         2                2500
Sin                      2                         2                2500
Sinh                     2                         2                2500
Boolean                  3                         2                1200
Noise corrupted signal   1                         3                100K
Image segmentation       19                        7                2310
Ionosphere               34                        2                351
Hyper-planes             4                         2                120K
Thyroid                  21                        3                7200
5. PROOF OF CONCEPTS
5.1 Efficacy of gClass Learning Modules
This section evaluates the efficacy of gClass's learning modules. Three data sets, namely thyroid, wine, and ionosphere, obtained from the University of California, Irvine (UCI) machine learning repository (http://www.ics.uci.edu/mlearn/MLRepository.html), are used to assess the qualities of the proposed learning components. The weather dataset is also used, because it contains severe concept drift. In this section, we evaluate the weather data from Offutt Air Force Base in Bellevue, Nebraska, which is a subset of the U.S. National Oceanic and Atmospheric Administration (NOAA) data sets. It covers a long period of 50 years and is available online (ftp://ftp.ncdc.noaa.gov/pub/data/gsod/); hence, this version of the weather prediction problem not only depicts a cyclical seasonal change, but also characterizes a long-term climate change. The characteristics of the datasets are shown in Table 2, and the numerical results are tabulated in Table 3.
Table 3. Efficacy of the gClass learning modules. Each cell reports classification rate / # of rules / time (s) / # of training samples for the WINE, THYROID, IONOSPHERE and WEATHER datasets.

gClass:
  WINE 0.96±0.02 / 2 / 0.14±0.007 / 100.1±5.11;  THYROID 0.941±0.002 / 4 / 10.54±1.72 / 4039.7±294.33;  IONOSPHERE 0.91±0.09 / 2.2±0.45 / 0.11±0.02 / 31.6±12.3;  WEATHER 0.8±0.03 / 1.1±0.32 / 0.86±0.05 / 977±72.7

Section A
Axis-parallel ellipsoids (Section 3):
  WINE 0.94±0.01 / 2.4±0.21 / 0.2±0.26 / 110.2±2.32;  THYROID 0.92±0.04 / 4.33±2.3 / 9.36±0.78 / 4039.7±294.33;  IONOSPHERE 0.79±0.2 / 3.6±1.14 / 0.05±0.01 / 18.6±6.42;  WEATHER 0.73±0.05 / 2.9±0.6 / 0.58±0.14 / 977±72.7
Linear hyper-plane consequent (Section 3):
  WINE 0.92±0.08 / 2 / 0.36±0.006 / 160.2±0.42;  THYROID 0.91±0.02 / 5 / 9.83±0.88 / 3481.3±598.1;  IONOSPHERE 0.81±0.13 / 2.2±0.45 / 0.1±0.01 / 33.6±9.4;  WEATHER 0.79±0.05 / 2.9±0.6 / 0.89±0.25 / 977±72.7
Trigonometric consequent (Section 3):
  WINE 0.966±0.05 / 2 / 0.5±0.06 / 120.5±4.62;  THYROID 0.932±0.0008 / 4.33±0.6 / 10.9±0.2 / 4039.7±294.33;  IONOSPHERE 0.84±0.16 / 2.4±0.55 / 0.19±0.09 / 283.6;  WEATHER 0.8±0.03 / 1.1±0.32 / 0.9±0.04 / 978±69.6

Section B
Without drift handling (Section 4.1.9):
  WINE 0.96±0.02 / 2 / 0.14±0.007 / 100.1±5.11;  THYROID 0.94±0.02 / 4.33±1.15 / 11.5±1.98 / 4041±295.41;  IONOSPHERE 0.91±0.09 / 2.2±0.45 / 0.11±0.02 / 31.6±12.3;  WEATHER 0.79±0.04 / 1.1±0.32 / 0.86±0.01 / 977±72.7

Section C
Without meta-cognitive learning (Sections 4.2 and 4.3):
  WINE 0.95±0.05 / 2 / 0.2±0.05 / 160.2±0.42;  THYROID 0.938±0.002 / 4.33±0.6 / 11.01±0.34 / 4800;  IONOSPHERE 0.88±0.13 / 2.2±0.45 / 0.17±0.08 / 50;  WEATHER 0.8±0.03 / 1.1±0.32 / 1.1±0.05 / 1000

Section D
FSC in the empirical feature space (Section 4.1.10):
  WINE 0.93±0.09 / 2 / 0.4±0.06 / 160.2±0.42;  THYROID 0.935±0.02 / 4.33±0.8 / 15.01±0.54 / 4039.7±294.33;  IONOSPHERE 0.89±0.11 / 2.2±0.45 / 0.33±0.11 / 34.6±12.8;  WEATHER 0.7±0.04 / 2.4±0.7 / 1.6±0.13 / 977±72.7
FSC in the original feature space (Section 4.1.10):
  WINE 0.78±0.05 / 2 / 0.46±0.03 / 131.1±18.73;  THYROID 0.929±0.008 / 4 / 33.88±2.34 / 4322±336.79;  IONOSPHERE 0.88±0.11 / 2.2±0.45 / 0.9±0.5 / 33±12.38;  WEATHER 0.77±0.05 / 1.2±0.63 / 1.82±0.09 / 959.5±128.1
Without feature weighting (Section 4.1.10):
  WINE 0.95±0.02 / 2 / 0.12±0.007 / 100.1±5.11;  THYROID 0.932±0.6 / 4.67±0.6 / 9.65±0.59 / 4039.7±294.33;  IONOSPHERE 0.9±0.1 / 2.2±0.45 / 0.25±0.13 / 31.4±12.6;  WEATHER 0.79±0.03 / 1.1±0.32 / 0.73±0.05 / 977±72.7
The goals of the empirical study are articulated as follows: 1) The generalized fuzzy rule elaborated in this paper is numerically validated and benchmarked against three other fuzzy rule exemplars: the axis-parallel cluster [2], the linear hyper-plane consequent [19] and the non-linear trigonometric consequent [24]. The numerical results are abstracted in Section A of Table 3. 2) We also examine to what extent the local drift handling technique developed in this paper is capable of shielding gClass from a decline in predictive accuracy in the presence of concept drift; to this end, we analyze the performance of gClass in the absence of the local drift handling strategy. The experimental results are presented in Section B of Table 3. 3) We aim to study the impact of meta-cognitive learning. We scrutinize the gClass performance without the meta-cognitive learning scenario, applying only the how-to-learn component. Section C of Table 3 summarizes the numerical results of both learning configurations. 4) The leverage of the input weighting algorithm is also investigated, where we benchmark the gClass feature weighting scheme against the FSC method in the empirical feature space (no optimization) [19], the FSC method in the original feature space [27], and the configuration without any feature weighting. Section D of Table 3 displays the numerical results of this empirical study. We do not re-evaluate the other learning modules, because they were proposed in our previous works [17]-[20].
The 10-fold Cross Validation (CV) procedure is utilized as the experimental protocol for the first two data sets from the UCI machine learning repository. The experimental results are inferred from the average of 10 independent runs of the CV scheme. We carry out the periodic hold-out test as the experimental scenario in the other two study cases to simulate the training and testing phases in real time [52]. The final numerical results are deduced from the average of the 10 independent sub-processes of the periodic hold-out procedure. The MIMO classifier architecture is utilized to infer the classification decision for all learning configurations.
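For the streaming study cases, the periodic hold-out protocol can be sketched as follows. The block splitting, the train/test fraction, and the scikit-learn-style partial_fit/predict interface are assumptions made for illustration; they are not the exact experimental harness used in this paper.

```python
import numpy as np

def periodic_holdout(model, X, y, n_folds=10, train_fraction=0.5):
    """Hedged sketch of a periodic hold-out test for data streams.

    The stream is cut into consecutive blocks; the leading part of each
    block is used for incremental training and the trailing part for
    testing, and the final score is averaged over all blocks. `model` is
    assumed to expose scikit-learn-style partial_fit/predict methods.
    """
    accuracies = []
    classes = np.unique(y)
    for X_block, y_block in zip(np.array_split(X, n_folds),
                                np.array_split(y, n_folds)):
        split = int(train_fraction * len(y_block))
        model.partial_fit(X_block[:split], y_block[:split], classes=classes)
        y_hat = model.predict(X_block[split:])
        accuracies.append(np.mean(y_hat == y_block[split:]))
    return float(np.mean(accuracies)), float(np.std(accuracies))
```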
Fig.5 (a) Evolution of the feature weight for the temperature input feature; (b) local weighting strategy as drift detection; (c) fuzzy rule evolution; (d) system error.
The four learning modules deliver potent impacts on the resultant learning performance. The Chebyshev functional-link consequent is effective in boosting the classification rate, while sustaining the most compact and parsimonious rule base and keeping the training sample consumption at an economical level compared with the other fuzzy rule variants. The trigonometric functional-link consequent produces numerical results that are equivalent to gClass on the wine and weather data sets; however, it inevitably stores more parameters than the Chebyshev function. The arbitrarily rotated ellipsoidal clusters noticeably outperform the axis-parallel ellipsoidal clusters, delivering higher classification rates and a more compact rule base in all cases. The drift detection strategy does not significantly affect gClass performance on the drift-free data sets, namely the wine and ionosphere data sets. Conversely, performance improvements can be observed on the weather and thyroid data sets, which contain concept drift. The meta-cognitive learning is capable of reinforcing the generalization ability and curtailing the execution time. The what-to-learn module relieves over-fitting by discarding redundant samples before the how-to-learn phase, and the when-to-learn module adjusts the fuzzy rules using the reserved samples. The reserved samples may disclose states not yet covered by the already seen samples, thus intensifying the completeness of the rule base. The training data are not fully visited and the labeling process is governed by the active-learning-based what-to-learn component, which mitigates the execution time and thereby strengthens scalability to big data. The virtue of the feature weighting algorithm based on FSC optimization in the empirical feature space can be seen in its positive contribution to the classification rates and its ability to expedite the training process.
Table 4. Numerical results of the consolidated algorithms. Each entry reports classification rate / rules / time (s) / rule base parameters / number of samples.

SEA dataset:
  FAOS-PFNN 0.73±0.14 / 149.5±91.6 / 1656±1633 / 750.5±458.13 / 4200
  pClass 0.78±0.04 / 3.5±2.42 / 3.86±0.5 / 70 / 4200
  eClass 0.76±0.03 / 15.9±3.4 / 10.07±2.3 / 190.8 / 4200
  GENEFIS-class 0.76±0.01 / 2.9±1 / 3.02±0.26 / 58 / 4200
  gClass 0.87±0.09 / 2.3±0.5 / 1.03±0.2 / 46 / 106.6
  OS-ELM 0.61±0.001 / 50 / 0.006±0.008 / 300 / 4200
  McFIS 0.73±0.11 / 9.9±0.4 / 0.13±0.03 / 54.6 / 10.9*
Electricity pricing dataset:
  FAOS-PFNN 0.51±0.08 / 50.9±23.7 / 196.71±127.9 / 517±247.4 / 3172
  pClass 0.78±0.05 / 3.2±1.2 / 4.23±0.8 / 288 / 3172
  eClass 0.77±0.07 / 11.9±0.07 / 4.12±2.2 / 321.3 / 3172
  GENEFIS-class 0.75±0.0 / 3.5±1.5 / 4.49±0.4 / 315 / 3172
  gClass 0.79±0.08 / 2.7±0.5 / 2.3±0.5 / 243 / 8.7
  OS-ELM 0.57±0.09 / 50 / 2.43±0.2 / 550 / 3172
  McFIS 0.5±0.1 / 9.6±0.7 / 0.5±0.4 / 110.9 / 10.6*
Sin dataset:
  FAOS-PFNN 0.77±0.13 / 34.8±5.14 / 0.24±0.05 / 141.2 / 200
  pClass 0.82±0.2 / 3.3±1.2 / 0.17±0.04 / 40.6 / 200
  eClass 0.81±0.5 / 4±1.14 / 0.2±0.02 / 44 / 200
  GENEFIS-class 0.81±0.2 / 5.4±2.2 / 0.32±0.3 / 58.8 / 200
  gClass 0.92±0.3 / 3.3±0.9 / 0.16±0.02 / 39.6 / 56.4
  OS-ELM 0.8±0.2 / 50 / 0.25±0.02 / 500 / 200
  McFIS 0.76±0.18 / 9.1±1.2 / 0.08±0.03 / 68.6 / 10.1*
Circle dataset:
  FAOS-PFNN 0.8±0.12 / 27.3±5.4 / 0.15±0.03 / 111.2 / 200
  pClass 0.72±0.13 / 2.8±1.1 / 0.17±0.008 / 33.6 / 200
  eClass 0.7±0.11 / 3.6±0.84 / 0.19±0.01 / 32.4 / 200
  GENEFIS-class 0.7±0.03 / 3.2±1.03 / 0.25±0.01 / 38.4 / 200
  gClass 0.91±0.06 / 2.4±1.6 / 0.15±0.02 / 28.8 / 56.2
  OS-ELM 0.66±0.14 / 50 / 0.08±0.02 / 500 / 200
  McFIS 0.8±0.14 / 9.8±0.42 / 0.12±0.05 / 62.8±2.5 / 10.8*
Line dataset:
  FAOS-PFNN 0.91±0.06 / 26±7.2 / 0.17±0.04 / 106 / 200
  pClass 0.91±0.07 / 2.5±0.71 / 0.25±0.0009 / 30 / 200
  eClass 0.89±0.06 / 4.4±0.51 / 0.21±0.009 / 39.6 / 200
  GENEFIS-class 0.9±0.07 / 3.6±0.7 / 0.24±0.01 / 43.2 / 200
  gClass 0.94±0.1 / 2 / 0.14±0.06 / 24 / 11.4
  OS-ELM 0.91±0.08 / 25 / 0.04±0.02 / 250 / 200
  McFIS 0.84±0.13 / 9.4±1 / 0.1±0.03 / 60.4±6.4 / 10.4*
Sinh dataset:
  FAOS-PFNN 0.61±0.23 / 34.9±0.23 / 0.26±0.08 / 141.6 / 200
  pClass 0.71±0.09 / 3.6±1.9 / 0.27±0.01 / 43.2 / 200
  eClass 0.7±0.07 / 6.3±1.5 / 0.23±0.02 / 56.7 / 200
  GENEFIS-class 0.71±0.06 / 3.6±0.8 / 0.25±0.02 / 43.2 / 200
  gClass 0.71±0.04 / 2.1±0.3 / 0.11±0.08 / 33.6 / 37.8
  OS-ELM 0.68±0.04 / 50 / 0.07±0.02 / 500 / 200
  McFIS 0.64±0.15 / 10 / 0.1±0.02 / 64 / 11*
Weather dataset:
  FAOS-PFNN 0.67±0.06 / 99.8±62.8 / 209.8±250.1 / 1006±627.5 / 1000
  pClass 0.790±0.03 / 2.5±0.97 / 1.12±0.06 / 205 / 1000
  eClass 0.777±0.02 / 2.7±0.48 / 1.06±0.07 / 72.9 / 1000
  GENEFIS-class 0.790±0.01 / 2.5±0.51 / 1.2±0.07 / 202.5 / 1000
  gClass 0.8±0.03 / 1.1±0.32 / 0.86±0.05 / 99 / 37.7
  OS-ELM 0.74±0.06 / 40 / 0.56±0.7 / 1080 / 1000
  McFIS 0.61±0.14 / 10 / 0.41±0.08 / 108 / 11*
Hyper-plane dataset:
  FAOS-PFNN 0.58±0.3 / 63.8±3.7 / 2.7±0.7 / 840.4 / 100
  pClass 0.92±0.02 / 2.2±0.63 / 1.86±0.07 / 66 / 100
  eClass 0.91±0.02 / 8.6±2 / 13.48±3.61 / 124.4 / 100
  GENEFIS-class 0.91±0.01 / 3.39±0.12 / 3.4±0.05 / 90 / 100
  gClass 0.93±0.02 / 2.8±0.6 / 1.55±0.5 / 84 / 33.6
  OS-ELM 0.88±0.03 / 35.3±4.16 / 1.22±0.13 / 2118 / 100
  McFIS 0.73±0.06 / 10 / 0.5±0.1 / 64 / 11
Noise corrupted signal dataset:
  FAOS-PFNN 0.37±0.37 / 20.9 / 86.1±109.7 / 84.6 / 7000
  pClass 0.74±0.12 / 3±1.2 / 6.4±0.7 / 24 / 7000
  eClass 0.72±0.12 / 3.7±1.3 / 6.9±1.9 / 29.6 / 7000
  GENEFIS-class 0.73±0.09 / 4.5±1.1 / 7.5±0.9 / 36 / 7000
  gClass 0.75±0.11 / 2 / 5.95±1.7 / 12 / 37.2
  OS-ELM 0.72±0.14 / 50 / 2.23±0.11 / 400 / 7000
  McFIS 0.69±0.14 / 9.6±3.4 / 2.6±1 / 50.2 / 10.6*
Boolean dataset:
  FAOS-PFNN 0.77±0.13 / 11.7±3.1 / 0.04±0.007 / 61.5 / 100
  pClass 0.83±0.2 / 2.6±0.8 / 0.08±0.002 / 52 / 100
  eClass 0.85±0.12 / 4.7±1.3 / 0.05±0.01 / 56.4 / 100
  GENEFIS-class 0.82±0.2 / 2.6±1.1 / 0.09±0.05 / 52 / 100
  gClass 0.92±0.2 / 2.3±0.5 / 0.01±0.03 / 46 / 5.2
  OS-ELM 0.8±0.17 / 50 / 0.24±0.02 / 250 / 100
  McFIS 0.86±0.2 / 7.4±1.9 / 0.05±0.05 / 80.2 / 8.4*
5.2 Benchmarks with State-of-the-Art Evolving Classifiers
In this section, gClass is benchmarked against its counterparts: pClass [19], GENEFIS-Class [18], eClass [1], OS-ELM [49], FAOS-PFNN [43] and McFIS [8]. McFIS is akin to gClass and can be categorized as a meta-cognitive classifier. Meanwhile, pClass, GENEFIS-class and eClass are consolidated in our numerical study because they are evolving classifiers; the evolving classifier can be perceived as the predecessor of the meta-cognitive classifier. FAOS-PFNN represents a semi-online classifier, since it adopts a batched structural learning procedure that must revisit all previous data streams in each training episode. In contrast, OS-ELM is built upon an incremental learning scenario without any structural learning and is therefore deemed more traditional than evolving, meta-cognitive or even semi-online classifiers. All consolidated classifiers are evaluated on 10 synthetic and real-world data streams characterizing various concept drifts. The synthetic streaming data are pivotal for the analysis of learning performance because, in real-world problems, it is difficult to determine the drift variant and when the drift starts to interfere with the data distribution. We explore nine data streams, namely SEA [51], Electricity pricing 1), weather 2), hyper-plane [52], and five artificial study cases from the DDD database [53],[54], termed sin, sinh, line, circle, and boolean. In addition to these data sets, we make use of our own data set [55], which not only features various concept drifts but also demonstrates dynamic class labels. The predefined parameters of McFIS, eClass, GENEFIS-class, FAOS-PFNN, OS-ELM and pClass are set according to the rules of thumb in their original publications.
Table 5. Classifier rankings. Each tuple gives the rank in terms of (CR, R, ET, RB).

Sea dataset: FAOS-PFNN (5,7,7,7); pClass (2,3,5,4); eClass (4,5,6,5); GENEFIS-class (3,2,4,3); gClass (1,1,3,1); OS-ELM (7,6,1,6); McFIS (6,4,2,2)
Electricity dataset: FAOS-PFNN (7,7,7,7); pClass (2,2,5,3); eClass (3,5,4,5); GENEFIS-class (4,3,6,4); gClass (1,1,2,2); OS-ELM (5,6,3,6); McFIS (6,4,1,1)
Sin dataset: FAOS-PFNN (6,6,5,7); pClass (2,2,3,2); eClass (4,4,4,3); GENEFIS-class (3,3,7,4); gClass (1,1,2,1); OS-ELM (5,7,6,6); McFIS (7,5,1,5)
Circle dataset: FAOS-PFNN (2,6,3,6); pClass (4,2,5,3); eClass (6,4,6,2); GENEFIS-class (5,3,7,4); gClass (1,1,4,1); OS-ELM (7,7,1,7); McFIS (3,5,2,5)
Line dataset: FAOS-PFNN (2,7,3,6); pClass (2,2,5,3); eClass (5,4,7,2); GENEFIS-class (4,3,5,4); gClass (1,1,4,1); OS-ELM (3,6,1,7); McFIS (6,5,2,5)
Sinh dataset: FAOS-PFNN (7,6,6,6); pClass (2,3,7,2); eClass (4,4,4,3); GENEFIS-class (1,2,5,4); gClass (1,1,3,1); OS-ELM (5,7,1,7); McFIS (6,5,2,5)
Weather dataset: FAOS-PFNN (6,7,7,7); pClass (3,1,5,5); eClass (4,3,4,1); GENEFIS-class (2,2,6,4); gClass (1,1,3,2); OS-ELM (5,5,2,6); McFIS (7,5,1,3)
Hyper-plane dataset: FAOS-PFNN (7,7,5,7); pClass (2,1,4,2); eClass (4,4,7,5); GENEFIS-class (3,3,6,4); gClass (1,2,3,3); OS-ELM (5,6,2,6); McFIS (6,5,1,1)
Noise dataset: FAOS-PFNN (7,6,7,6); pClass (2,2,4,2); eClass (4,3,5,3); GENEFIS-class (3,4,6,4); gClass (1,1,3,1); OS-ELM (5,7,1,7); McFIS (6,5,2,5)
Boolean dataset: FAOS-PFNN (7,6,2,5); pClass (3,2,5,2); eClass (2,4,3,4); GENEFIS-class (4,3,6,3); gClass (1,1,1,1); OS-ELM (6,7,7,7); McFIS (5,5,4,6)
Average: FAOS-PFNN (5.6,6.5,5.2,6.4); pClass (2.4,2,4.7,2.9); eClass (4,4,5,3.3); GENEFIS-class (3.2,2.8,5.8,3.8); gClass (1,1.1,2.8,1.4); OS-ELM (5.3,6.4,2.5,6.5); McFIS (5.8,4.8,1.8,3.8)

CR: Classification rate, R: Rule, ET: Execution time, RB: Rule base parameters
The memory demand can be evaluated by means of the rule base parameters. The rule base parameters of gClass, pClass and GENEFIS-class are listed in Table 1. eClass generates rule base parameters in the order of O(UP + P + mP(U+1) + P), whereas McFIS, OS-ELM and FAOS-PFNN attract O(UP + P + mP) rule base parameters. eClass is driven by the TSK-based spherical-cluster fuzzy system, whereas OS-ELM, FAOS-PFNN and McFIS are propelled by a single-layer feed-forward network-like topology. The computational cost of the classifiers can be assessed with the runtime. Our numerical study is conducted on an Intel(R) Core(TM) i7-2600 CPU @ 3.4 GHz processor with 8 GB memory. The predictive quality is demonstrated by the classification rate on the testing data blocks. The experiments are conducted with the periodic hold-out process, and the classification boundary of all classifiers is crafted by the MIMO architecture. Table 4 lists the consolidated experimental results. Fig.5(a,b) shows the evolution of the feature weights and the trace of local forgetting for each fuzzy rule in the weather data set. The fuzzy rule evolution and the trace of the predictive error in the hyper-plane data set are illustrated in Fig.5(c,d). Referring to Table 4, gClass prevails over the other benchmarked algorithms in three evaluation criteria: classification rate, number of fuzzy rules, and rule base parameters. In particular, gClass delivers the most encouraging accuracy in all the study cases, with a 5% to 20% improvement over the other consolidated algorithms. gClass surpasses the other benchmarked algorithms in terms of the number of fuzzy rules in 9 out of 10 numerical studies, showing an improvement of 30% to 70% over the second-ranked classifier. From the rule base parameter standpoint, gClass outstrips the other classifiers in 7 out of 10 study cases. These numerical results firmly justify the generalized fuzzy rule of gClass, which boosts the classification rate while maintaining a frugal memory demand. McFIS outperforms gClass in the runtime category; note, however, that although McFIS consumes a smaller number of training samples, its what-to-learn module does not transform McFIS into a semi-supervised classifier.
Table 6. The performance difference between gClass and the other algorithms.

Algorithms                 Classification rates   Fuzzy rules   Runtimes   Rule base parameters
gClass vs pClass           1.46                   0.94          1.98       1.56
gClass vs eClass           3.12                   3.02          2.38       1.98
gClass vs GENEFIS-class    2.4                    1.77          3.12       2.49
gClass vs OS-ELM           4.8                    5.5           -0.35      5.3
gClass vs McFIS            4.99                   3.85          -1.04      2.5
gClass vs FAOS-PFNN        4.78                   5.62          2.5        5.2
Fig.5(a,b) visualizes the adaptive characteristic of the feature weighting mechanism and the local forgetting degree on the weather dataset. The dynamics of the input weights are in line with the learning rate characteristic of the gradient ascent method, shrinking over time during the training process. A regime-drifting property is expected to appear in the weather dataset; the concept drift is compensated for by the local drift handling strategy, which distributes a unique forgetting degree to each rule to sidestep severe misclassification, as depicted in Fig.5(b). The evolving characteristic of gClass is illustrated in Fig.5(c), where fuzzy rules are augmented, recalled, and pruned on the fly during the training process. Meanwhile, the system error remains stable within a bounded range, as presented in Fig.5(d), which confirms the potency of the FWGRLS method in performing a stable adaptation of the weight vector.
5.3 Statistical Tests
To confirm our numerical results, statistical tests are performed to reach a clear conclusion about the performance of each classifier [19]. The classifier rankings are shown in Table 5. The number of training samples exploited in the training process is not included in the statistical tests, since only gClass and McFIS have the ability to reduce the number of training samples. The first statistical test is the non-parametric Friedman test [56], well known in the machine learning literature for detecting performance differences between benchmarked algorithms. We accordingly arrive at χ²_F = 30.2, 49.9, 28.07, 45.7 for the four evaluation criteria; the critical value at a significance level of 0.1 with 6 degrees of freedom is only 10.645, and we can thereby reject the null hypothesis. The Friedman statistic is, however, known to be undesirably conservative. We therefore continue with the statistic proposed by Iman and Davenport [57], which offers a better test and generalizes the Friedman statistic. We arrive at F_F = 24.3, 19.8, 15.7, 18.3 for the four evaluation categories, respectively; the critical value at a significance level of 0.05 with (6,30) degrees of freedom is only 2.42, thus we can again reject the null hypothesis, as with the Friedman test.
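For reference, both statistics, and the critical difference used in the post-hoc test below, can be computed from a rank table such as Table 5 with the standard formulas. The sketch below assumes Demsar's formulation of the Friedman, Iman-Davenport and Bonferroni-Dunn procedures; the quoted q value is only an example of a tabulated critical value.

```python
import numpy as np

def friedman_iman_davenport(ranks):
    """ranks: (N datasets x k classifiers) matrix of ranks, as in Table 5.
    Returns the Friedman chi-square statistic and the Iman-Davenport F
    statistic, which is F-distributed with (k-1, (k-1)(N-1)) d.o.f."""
    ranks = np.asarray(ranks, dtype=float)
    N, k = ranks.shape
    R = ranks.mean(axis=0)  # average rank of each classifier
    chi2_f = 12.0 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4.0)
    f_f = (N - 1) * chi2_f / (N * (k - 1) - chi2_f)  # Iman-Davenport correction
    return chi2_f, f_f

def bonferroni_dunn_cd(k, N, q_alpha):
    """Critical difference of the Bonferroni-Dunn post-hoc test; q_alpha is
    a tabulated critical value (e.g. about 2.394 for alpha = 0.1, k = 7)."""
    return q_alpha * np.sqrt(k * (k + 1) / (6.0 * N))
```

With k = 7 classifiers and N = 10 data sets, the tabulated value q ≈ 2.394 for a significance level of 0.1 gives CD ≈ 2.31, consistent with the critical difference quoted in the next test.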
Table 7. Summary of the learning features of the benchmarked algorithms (premise part; consequent part; what-to-learn; how-to-learn; when-to-learn; classifier type).

gClass: non-axis-parallel ellipsoids; non-linear Chebyshev function; active learning; Schema and Scaffolding theory; sample reserved strategy; semi-supervised.
pClass [19]: non-axis-parallel ellipsoids; linear hyper-plane; N/A; Schema theory; N/A; fully supervised.
eClass [1]: spherical clusters; linear hyper-plane; N/A; Schema theory; N/A; fully supervised.
GENEFIS-class [20]: axis-parallel ellipsoids; linear hyper-plane; N/A; Schema theory; N/A; fully supervised.
PBL-McRBFNN [7]: spherical clusters; singleton consequent; sample deletion; Schema theory; sample reserved strategy; fully supervised.
McFIS [8]: spherical clusters; singleton consequent; sample deletion; Schema theory; sample reserved strategy; fully supervised.
FAOS-PFNN [73]: spherical clusters; singleton consequent; N/A; Schema theory; N/A; fully supervised.
GEBF-FAOSPFNN [74]: asymmetric axis-parallel clusters; singleton consequent; N/A; Schema theory; N/A; fully supervised.
CP- and DP-ELM [75]: spherical clusters; singleton consequent; N/A; ELM theory; N/A; fully supervised.
BR- and OSR-ELM [76]: spherical clusters; polynomial consequent; N/A; ELM theory; N/A; fully supervised.
These two tests serve to establish that a performance difference exists among the consolidated classifiers. Nevertheless, they do not lead to the conclusive finding that gClass outperforms the other classifiers. We therefore undertake another statistical test, the post-hoc Bonferroni-Dunn test [58], whose central notion is to investigate the difference between the performances of two classifiers. For brevity, we claim that the performances of two classifiers are substantially dissimilar if the difference of their average ranks exceeds the critical difference CD = 2.31. The performance differences between classifier pairs are detailed in Table 6. gClass is clearly more reliable than eClass, GENEFIS-class, OS-ELM, FAOS-PFNN and McFIS in terms of classification rate, whereas gClass outperforms eClass, OS-ELM, FAOS-PFNN and McFIS in terms of the number of fuzzy rules and rule base parameters. On the other hand, gClass is slightly inferior to OS-ELM and McFIS but substantially superior to eClass, FAOS-PFNN and GENEFIS-Class from the viewpoint of execution time.

5.4 Conceptual Comparisons
In this section, gClass is conceptually compared with nine prominent classifiers recently published in the literature: eClass [1]; pClass [19]; GENEFIS-class [20]; PBL-McRBFNN [7]; McFIS [8]; FAOS-PFNN [73]; GEBF-FAOSPFNN [74]; CP- and DP-ELM [75]; BR- and OSR-ELM [76]. The salient characteristics of all the consolidated algorithms are summarized in Table 7. Clearly, gClass employs the most sophisticated fuzzy rule, amalgamating the multivariate Gaussian function in the rule input and the non-linear Chebyshev function in the rule output. This fuzzy rule is more appealing than the classical fuzzy rules deployed in the other algorithms, as numerically validated in Section 5.2. Despite the non-axis-parallel ellipsoidal cluster in the rule antecedent, pClass and GENEFIS-class are still equipped with the standard linear hyper-plane in the rule output, which does not fully explore the local approximation trait. Although GEBF-FAOSPFNN is built upon the asymmetric Gaussian function, it still triggers axis-parallel ellipsoidal clusters. The rule premises of FAOS-PFNN, CP- and DP-ELM, BR- and OSR-ELM, PBL-McRBFNN and McFIS are deemed more traditional, because these classifiers cannot deal with different operating intervals of the input variables as a result of their hyper-spherical clusters. On the other hand, the rule consequents of FAOS-PFNN, GEBF-FAOSPFNN, CP- and DP-ELM, PBL-McRBFNN and McFIS are crafted by the zero-order TSK output, which relies on a lower Degree of Freedom (DoF) function than the first-order TSK output of eClass, pClass and GENEFIS-class and the non-linear output of gClass and BR- and OSR-ELM. It is worth noting that BR- and OSR-ELM do not actualize the functional-link output weight, because their rule consequent is generated by a polynomial function without a specific non-linear mapping. In the realm of algorithmic development, FAOS-PFNN, GEBF-FAOSPFNN, CP- and DP-ELM, BR- and OSR-ELM, pClass, eClass and GENEFIS-class are still cognitive in nature and have not yet integrated the meta-cognitive learning principle. Furthermore, although PBL-McRBFNN and McFIS employ the sample deletion strategy, this does not relieve the annotation effort of the operator, because the sample deletion strategy necessitates all data streams being fully labelled. PBL-McRBFNN and McFIS also suffer from the absence of important learning modules such as the local forgetting mechanism and the feature selection mechanism, because they do not incorporate Scaffolding theory. In contrast, gClass exemplifies the semi-supervised learning scenario due to its online active learning procedure and demonstrates the plug-and-play learning paradigm enabled by Scaffolding theory in the how-to-learn module.

6. CONCLUSIONS
A novel meta-cognitive classifier, namely gClass, is proposed in this paper. The major contribution of gClass lies in three learning attributes: 1) gClass introduces a generalized meta-cognitive learning paradigm,
in which the how-to-learn module is consistent with the Schema and Scaffolding theories; 2) gClass relies on a generalized TSK fuzzy rule, exploiting the multivariate Gaussian function in the premise component and the non-linear Chebyshev function in the consequent component; 3) four brand-new learning modules, namely the ECI method, the LDQ method, the class overlapping method and the enhanced FSC in the empirical feature space method, are devised in this paper. The efficacy of gClass has been thoroughly examined with 10 real-world and artificial datasets, featuring various concept drifts and dynamic class labels. In summary, gClass produces more encouraging numerical results than its counterparts in achieving a trade-off between accuracy and simplicity. In our future work, we will expand the meta-cognitive-based Scaffolding learning theory to the interval type-2 fuzzy system. We will also apply the proposed algorithm to tool wear prognosis and the diagnosis of surface roughness in the ball-nose end milling process.

ACKNOWLEDGEMENTS
The work presented in this paper is partly supported by the Australian Research Council (ARC) under Discovery Projects DP110103733 and DP140101366, and the first author acknowledges receipt of a UTS research seed funding grant.

REFERENCES
[1] P. Angelov and X. Zhou, "Evolving fuzzy-rule-based classifiers from data streams," IEEE Transactions on Fuzzy Systems, vol. 16 (6), pp. 1462-1475, (2008)
[2] E. Lughofer, O. Buchtala, "Reliable All-Pairs Evolving Fuzzy Classifiers", IEEE Transactions on Fuzzy Systems, vol. 21 (4), pp. 625-641, (2013)
[3] R.D. Baruah, P. Angelov, J. Andreu, "Simpl_eClass: Simplified Potential-Free Evolving Fuzzy Rule-Based Classifiers", in: Proceedings of the 2011 IEEE International Conference on Systems, Man and Cybernetics, SMC 2011, Anchorage, Alaska, USA, pp. 2249-2254, (2011)
[4] P. Angelov, E. Lughofer, and X. Zhou, "Evolving fuzzy classifiers using different model architectures," Fuzzy Sets and Systems, vol. 159 (23), pp. 3160-3182, (2008)
[5] A. Lemos, W. Caminhas and F. Gomide, "Adaptive fault detection and diagnosis using an evolving fuzzy classifier", Information Sciences, vol. 220, pp. 64-85, (2013)
[6] T.-O. Nelson and L. Narens, "Metamemory: A theoretical framework and new findings," Psychology of Learning and Motivation, vol. 26, no. C, pp. 125-173, (1990)
[7] G.S. Babu, S. Suresh, "Sequential Projection-Based Metacognitive Learning in a Radial Basis Function Network for Classification Problems", IEEE Transactions on Neural Networks and Learning Systems, vol. 24 (2), pp. 194-206, (2013)
[8] K. Subramanian, S. Suresh, N. Sundararajan, "A Meta-Cognitive Neuro-Fuzzy Inference System (McFIS) for sequential classification systems", IEEE Transactions on Fuzzy Systems, vol. 21 (6), pp. 1060-1095, (2013)
[9] K. Subramanian and S. Suresh, "A meta-cognitive sequential learning algorithm for neuro-fuzzy inference system," Applied Soft Computing, vol. 12, pp. 3603-3614, (2012)
[10] G. Sateesh Babu and S. Suresh, "Meta-cognitive RBF network and its projection based learning algorithm for classification problems," Applied Soft Computing, vol. 13 (1), pp. 654-666, (2013)
[11] K. Subramanian, R. Savitha, S. Suresh, "A meta-cognitive interval type-2 fuzzy inference system classifier and its projection-based learning algorithm", in: Proceedings of the IEEE Symposium Series on Computational Intelligence, Singapore, pp. 48-54, (2013)
[12] B.J.
Reiser, ―Scaffolding complex learning: The mechanisms of structuring and problematizing student work‖, Journal of Learning Sciences, Vol.13 (3), pp. 273-304, (2004) [13] F.C.Bartett, Remembering: A study in Experimental and Social Psychology, Cambridge, UK: Cambridge Press University Press, (1932) [14] J.H. Flavell, ―Piagiet’s legacy‖, Psychological Science, vol.7 (4), pp.200-203, (1996) [15] R. Elwell, R. Polikar, ―Incremental Learning of Concept Drift in Non-stationary Environments‖, IEEE Transactions on Neural Networks, vol.22 (11), pp.1517-1531, (2011) [16] G.A. Carpenter and S. Grossberg, ―A massively parallel architecture for a self-organizing neural pattern recognition machine‖, Computer Vision, Graphics, and Image Processing, vol. 37, pp.54-115, (1987) [17] M. Pratama, S. Anavatti, P. Angelov, E. Lughofer, ―PANFIS: A Novel Incremental Learning Machine‖, IEEE Transactions on Neural Networks and Learning Systems, vol. 25 (1), pp. 55-68, (2014) [18] M. Pratama, S. Anavatti, E. Lughofer, ―GENEFIS:Towards An Effective Localist Network‖, IEEE Transactions on Fuzzy Systems, vol.22 (3), pp.547-562, (2014) [19] M. Pratama, S. Anavatti, E. Lughofer, ―pClass: An Effective Classifier to Streaming Examples‖, IEEE Transactions on Fuzzy Systems,online and in press, 10.1109/TFUZZ.2014.2312983, (2014) [20] M. Pratama, S. Anavatti, E. Lughofer, ―Evolving Fuzzy Rule-Based Classifier Based on GENEFIS‖, in Proceedings of the IEEE Conference on Fuzzy System (Fuzz-IEEE), Hyderabad, India, pp.1-8, (2013) [21] M. Pratama, M-J. Er, X. Li, R.J. Oentaryo, E. Lughofer, I. Arifin, ―Data Driven Modelling Based on Dynamic Parsimonious Fuzzy Neural Network‖, Neurocomputing, vol.110, pp.18-28, (2013)
[22] M. Pratama, M-J. Er, S. Anavatti, E. Lughofer, I. Arifin, ―A Novel Meta-Cognitive-based Scaffolding Classifier to Sequential Non-Stationary Classification Problems‖, in Proceedings of the IEEE World Congress on Computational Intelligence (IEEEWCCI), Beijing, China, pp. 369-376, (2014) [23] J.C. Patra, A.C. Kot, ―Nonlinear dynamic system identification using Chebyshev functional link artificial neural networks―, IEEE Transactions on Systems, Man and Cybernetics-Part B: Cybernetics, vol.32 (4), pp.505-511, (2002) [24] Y-Y. Lin, J-Y. Chang, C-T. Lin, ―Identification and prediction of dynamic systems using an interactively recurrent selfevolving fuzzy neural network‖, IEEE Transactions on Neural Networks and Learning Systems, vol.24 (2), pp.310-321, (2013) [25] E. Lughofer, ―Hybrid active learning for reducing the annotation effort of operators in classification systems‖, Pattern Recognition, vol.45 (2), pp. 884-896, (2013) [26] A. Shaker and E. Lughofer, Self-Adaptive and Local Strategies for a Smooth Treatment of Drifts in Data Streams, Evolving Systems, vol.5 (4), pp. 239-257, (2014) [27] E. Lughofer, ―On-line incremental feature weighting in evolving fuzzy classifiers,‖ Fuzzy Sets and Systems, vol. 163 (1), pp. 1– 23, (2011) [28] M. Ramona, G. Riachard, B. David, ―Multiclass Feature Selection with Kernel Gram-Matrix-Based Criteria‖, IEEE Transactions on Neural Networks and Learning Systems, vol.23 (10), pp. 1611-1622, (2012) [29] H. Xiong, M.N.S. Swamy, M.O. Ahmad, ―Optimizing The Kernel in The Empirical Feature Space‖, IEEE Transactions on Neural Networks, Vol.16 (2), pp. 460-474, (2005) [30] Y.H. Pao, ―Adaptive Pattern Recognition and Neural Networks‖, Reading, MA: Addison-Wesley, (1989) [31] J.C. Patra, R.N. Pal, B.N. Chatterji, G. Panda, ―Identification of nonlinear dynamic systems using functional link artificial neural networks,‖ IEEE Transactions on Systems, Man and Cybernetics, vol.29 (2), pp. 254-262, (1999) [32] H-J. Rong, N. Sundarajan, G-B. Huang, G.-S. Zhao, ―Extended Sequential Adaptive Fuzzy Inference System for Classification Problems‖, Evolving Systems, vol.2 (2), pp. 71-82, (2011) [33] L. Wang, H-B. Ji, Y. Jin, ―Fuzzy Passive-Aggressive Classification: A Robust and Efficient Algorithm for Online Classification Problems‖, Information Sciences, vol.220, pp. 46-63, (2013) [34] B. Vigdor and B. Lerner, ―The Bayesian ARTMAP,‖ IEEE Transactions on Neural Networks, vol. 18 (6), pp. 1628–1644, (2007) [35] E. Lughofer, P. Angelov, ―Handling Drifts and Shifts in On-line Data Streams with Evolving Fuzzy Systems‖, Applied Soft Computing, vol.11 (2), pp. 2057-2068, (2011) [36] K. Tabata, M.S.M. Kudo, ―Data compression by volume prototypes for streaming data‖, Pattern Recognition, vol.43 (9),pp. 3162—3176, (2010) [37] M. Stone, ―Cross-Validatory Choice and Assessment of Statistical Predictions‖, Journal of Royal Statistic Society, vol.36, pp. 111-147, (1974) [38] E. Lughofer, Evolving Fuzzy Systems --- Methodologies, Advanced Concepts and Applications, Springer, Heidelberg, (2011) [39] E. Lughofer, On-line Assurance of Interpretability Criteria in Evolving Fuzzy Systems --- Achievements, New Concepts and Open Issues, Information Sciences, vol. 251, pp. 22--46, (2013) [40] Y. Xu, K.W. Wong, C.S. Leung, ―Generalized Recursive Least Square to The Training of Neural Network‖, IEEE Transactions on Neural Networks, vol.17 (1), pp. 19-34, (2006) [41] E. 
Lughofer, ―Flexible Evolving Fuzzy Inference Systems from Data Streams (FLEXFIS++)‖, in: Learning in Non-Stationary Environments: Methods and Applications, editors: Moamar Sayed-Mouchaweh and Edwin Lughofer, Springer, New York, pp. 205-246, (2012) [42] P. Angelov, ―Evolving Takagi-Sugeno Fuzzy Systems from Data Streams (eTS+)‖, In Evolving Intelligent Systems: Methodology and Applications, editors: P. Angelov, D. Filev, N. Kasabov, John Wiley and Sons, IEEE Press Series on Computational Intelligence, pp. 21-50, (2010) [43] G.D. Wu, Z.W. Zhu, and P.H. Huang, ―A TS-Type Maximizing-Discriminability-Based Recurrent Fuzzy Network for Classification Problems,‖ IEEE Transactions on Fuzzy Systems, vol. 19 (2), pp. 339-352, (2011) [44] G.D. Wu, P.H. Huang, ―A maximizing-discriminability-based self-organizing fuzzy network for classification problems‖, IEEE Transactions on Fuzzy Systems, vol.18 (2), pp. 362-373, (2010) [45] E. Lughofer, Single-Pass Active Learning with Conflict and Ignorance, Evolving Systems, vol. 3 (4), pp. 251-271, (2012) [46] E. Lughofer and M. Sayed-Mouchaweh, ―Autonomous Data Stream Clustering implementing Incremental Split-and-Merge Techniques --- Towards a Plug-and-Play Approach‖, Information Sciences, vol. 204, pp. 54-79, (2015) [47] W. Chu, M. Zinkevich, L. Li, A. Thomas, B. Zheng , ―Unbiased online active learning in data streams‖, In the Proceedings of 17th ACM SIGKDD international conference on Knowledge discovery and data mining, San Diego, California, pp. 195-203, (2011) [48] X. Zhu, P. Zhang, X. Lin, Y. Shi, ―Active Learning from stream data using optimal weight classifier ensemble‖, IEEE Transactions on System, Man and Cybernetics-part b: Cybernetics, Vol.40 (6), pp. 1607-1621, (2010) [49] N.-Y. Liang, G.-B. Huang, P. Saratchandran, N. Sundararajan, ―A fast and accurate online sequential learning algorithm for feedforward networks‖, IEEE Transactions on Neural Networks and Learning Systems, vol.17 (6), pp.1411-1423, (2006) [50] S. Haykin, Neural Networks: A Comprehensive Foundation (2nd Edition), Prentice Hall inc., Upper Saddle River, New Jersey, (1999) [51] W.N. Street, Y. Kim, ―A streaming ensemble algorithm SEA for large- scale classification‖, in the Proceedings of 7th ACM SIGKDD conference, pp. 377-382, (2001) [52] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, ―MOA: Massive online analysis,‖ Journal of Machine Learning Research, vol. 11, pp.1601–1604, (2010) [53] L.L. Minku, X. Yao, ―DDD: A New Ensemble Approach for Dealing with Drifts‖, IEEE Transactions on Knowledge and Data Engineering, vol.24 (4), pp. 619-633, (2012) [54] L.L. Minku, A.P. White, X. Yao, ―The Impact of Diversity on Online Ensemble Learning in The Presence Concept of Drift‖, IEEE Transactions on Knowledge and Data Engineering, vol.22 (5), pp. 730-742, (2010) [55] E-Y. Cheu, C. Quek, S-K. Ny, ―ARPOP: An Appretitive Reward-Based Pseudo-Outer-Product Neural Fuzzy Inference System Inspired from The Operant Conditioning of Feeding Behaviour‖, in IEEE Transactions on Neural Networks and Learning Systems, vol.23 (2), pp. 317-329, (2012)
[56] R. L. Iman and J. M. Davenport. ―Approximations of the critical region of the Friedman statistic‖. Communications in Statistics, pp. 571–595, (1980) [57] J. Demsar, ―Statistical Comparisons of Classifiers over Multiple Datasets‖, Journal of Machine Learning Research, vol.7, pp.130, (2006) [58] O. J. Dunn, ―Multiple comparisons among means‖, Journal of the American Statistical Association, vol. 56, pp.52–64, (1961) [59] T. Takagi end M. Sugeno, ―Fuzzy identification of systems and its appfications to modeling and control‖, IEEE Transactions on Systems Man Cybernetics. vol.15, pp.116-132, (1985) [60] A. Almaksour, E. Anquetil, ―LClass: Error-driven antecedent learning for evolving Takagi-Sugeno classification systems‖, Applied Soft Computing, vol. 19, pp. 419-429, (2014) [61] M. Han, C. Liu, ―Endpoint prediction model for basic oxygen furnace steel-making based on membrane algorithm evolving extreme learning machine‖, Applied Soft Computing, vol. 19, pp. 430-437, (2014) [62] A. Zdesar, D. Dovzan, I. Skrjanc, ―Self-tuning of 2 DOF control based on evolving fuzzy model‖, Applied Soft Computing, vol. 19, pp. 403-418, (2014) [63] R.-E. Precup, H.-I. Filip, M.-B. Radac, E. M. Petriu, S. Preitl, C.-A. Dragos, Online identification of evolving Takagi-SugenoKang fuzzy models for crane systems, Applied Soft Computing, vol. 24, pp. 1155-1163, (2014) [64] H.-J. Rong, N. Sundararajan, G.-B. Huang and G.-S. Zhao, ―Extended sequential adaptive fuzzy inference system for classification problems‖, Evolving Systems, vol. 2 (2), pp. 71--82, (2011) [65] M. Sayed-Mouchaweh and E. Lughofer, ―Learning in Non-Stationary Environments: Methods and Applications‖, Springer, New York, (2012) [66] J. Gama, ―Knowledge Discovery from Data Streams‖, Chapman & Hall/CRC, Boca Raton, Florida, (2010) [67] L.S. Vygotsky,‖ Mind and Society: The Development of Higher Psychological Processes”, Cambridge, UK: Harvard University Press, (1978) [68] D.Wood, ―Scaffolding contingent tutoring and computer-based learning‖, International Journal of Artificial Intelligence in Education, vol.12 (3), pp. 280-292, (2001) [69] B.J. Reiser, ‖Scaffolding complex learning: The mechanisms of structuring and problematizing student work‖, Journal of Learning Sciences, vol.13 (3), pp.273-304, (2004) [70] R. Savitha, S. Suresh, H.J. Kim, ―A Meta-Cognitive Algorithm for an Extreme Learning Machine Classifier‖, Cognitive Computation, vol.6 (2), pp. 253-263, (2013) [71] K.Subramanian, R. Savitha, S. Suresh, ―A Metacognitive Complex-Valued Interval Type-2 Fuzzy Inference System‖, IEEE Transactions on Neural Networks and Learning Systems, Vol.25 (9), pp. 1659-1672, (2014) [72] K. Subramanian, A.K. Das, S. Suresh, R. Savitha,‖ A meta-cognitive interval type-2 fuzzy inference system and its projection based learning algorithm‖, Evolving Systems, vol.5 (4), pp. 
219-230, (2014) [73] N.Wang, M-J.Er, X.Meng, ―A fast and accurate online self-organizing scheme for parsimonious fuzzy neural networks‖, Neurocomputing, Vol.72, pp.3818-3829, (2009) [74]N.Wang, ―A Generalized Ellipsoidal Basis Function Based Online Self-constructing Fuzzy Neural Network‖, Neural Processing Letters, Vol.34, pp.13-37, (2011) [75]N.Wang, M-J.Er, M.Han, ―Parsimonious Extreme Learning Machine Using Recursive Orthogonal Least Squares‖, IEEE Transactions on Neural Networks, Vol.25(1), pp.1828-1841, (2014) [76]N.Wang, M-J.Er, M.Han, ―Generalized Single-Hidden Layer Feedforward Networks for Regression Problems ‖,IEEE Transactions on Neural Networks and Learning Systems, in press, (2014),10.1109/TNNLS.2014.2334366