A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

DC*: an Algorithm for Automatic Acquisition of Interpretable Fuzzy Information Granules

Marco Lucarelli

University of Bari “Aldo Moro” Department of Informatics

Supervisor Dr. Corrado Mencar

Dedicated to my family, I love you

Contents

Abstract  9

1 Introduction  11

2 State of the art  17
   2.1 Fuzzy Systems  17
      2.1.1 Components of a fuzzy rule-based system  19
      2.1.2 Types of Fuzzy Rule-Based Systems  26
   2.2 Defining Interpretability  29
   2.3 Interpretability Constraints and their assessment  31
      2.3.1 Constraints on Fuzzy Sets  34
      2.3.2 Constraints on Fuzzy Partitions  37
      2.3.3 Constraints on Fuzzy Rule Bases  44
   2.4 Interpretability assessment  46
   2.5 Design of Interpretable Fuzzy Models  47
      2.5.1 Design choices  48
      2.5.2 Design tasks  49
   2.6 The interpretability-accuracy trade-off  50
   2.7 Design methods and tools  52
      2.7.1 Algorithms and Methods  52
      2.7.2 Tools  56

3 Double Clustering with A* (DC*)  59
   3.1 The Double Clustering Framework  59
   3.2 DC* v1.0  66
      3.2.1 First clustering step: Data Compression  67
      3.2.2 Second clustering step: A* solution search  69
      3.2.3 Granule Fuzzification and Rule Definition  80
      3.2.4 Summary  83

4 DC* v2.0  85
   4.1 Solution Search with A*  85
      4.1.1 The New Queue  86
      4.1.2 The New Heuristic function  88
   4.2 Solution Search with GA-Guided A*  99
      4.2.1 DCγ  100
      4.2.2 GA-guided A* search  102
   4.3 Granule Fuzzification  111
      4.3.1 Strong Fuzzy Partitions based on cuts  111
      4.3.2 Constant Slope  119
      4.3.3 Variable Fuzziness  121
      4.3.4 Core Points  123
   4.4 Summary  124

5 Experimental Results  127
   5.1 Dataset Description  128
   5.2 Experimentations  130
      5.2.1 Experiment 1: HFP / DC* comparative experimentation  130
      5.2.2 Experiment 2: The New heuristic function  137
      5.2.3 Experiment 3: GA-guided A*  141
      5.2.4 Experiment 4: Granule Fuzzification  148

6 Conclusions  155

Bibliography  159

List of Algorithms

3.1 The LVQ1 algorithm  69
3.2 The A* algorithm used for one-dimensional clustering  72
3.3 The heuristic function computation algorithm of DC* v1.0  79
4.1 Computation of the heuristic value for a single impure hyper-box  94
4.2 The new heuristic function algorithm of DC* v2.0  96

List of Figures

2.1 General structure of a Mamdani Fuzzy Rule-Based System  20
2.2 Example of a fuzzy partition  22
2.3 Example of a fuzzy partition of a linguistic variable  26
2.4 Interpretability constraints and criteria in different abstraction levels.  33
2.5 Example of a normal fuzzy set (solid) and a sub-normal fuzzy set (dashed). If the two fuzzy sets represent qualities, then the sub-normal fuzzy set expresses a quality that is never completely met by any element of the domain.  35
2.6 A convex fuzzy set (solid) and a non-convex fuzzy set (dashed). It is difficult to label the non-convex fuzzy set with a linguistic term.  36
2.7 Example of non-distinguishable fuzzy sets (dash vs. dash-dots) and distinguishable fuzzy sets (solid vs. dash or solid vs. dash-dots). Non-distinguishable fuzzy sets represent almost the same concept, hence it is difficult to assign different labels.  40
2.8 Example of completeness violation. In the highlighted regions of the Universe of Discourse (ellipses) 0.5-completeness is not verified. Elements in such regions are not well represented by any fuzzy set in the LV.  41
2.9 A linguistic variable with three fuzzy sets. While each individual fuzzy set has a well-defined semantics, they are not properly labelled, thus hampering their interpretability.  42
2.10 Example of sets of fuzzy sets violating the leftmost/rightmost fuzzy sets constraint (a) and verifying such constraint (b).  43
2.11 Some interpretability indexes organized in a double-axis taxonomy (adapted from [53]).  47
3.1 Overview of the Double Clustering Framework  60
3.2 DCf prototype projection (black dots over the axes), prototype clustering over each feature (circled dots), Universe of Discourse partition definition (cuts in red) and information granule identification (InfGr).  62
3.3 DCClass Universe of Discourse partition.  65
3.4 The first DC* clustering step: Data Compression. In this example a three-class dataset is represented by six prototypes (two for each class).  67
3.5 DC* prototype projections and cuts definition. In the left-hand figure, dashed lines represent the set of cuts Td. In the right-hand figure, red lines represent the subsets of cuts Sd ⊆ Td.  70
3.6 A bi-dimensional problem with ten prototypes of three different classes (square, circle, triangle). The application of two cuts (chosen among the candidate cuts) partitions the input space into four hyper-boxes: three pure hyper-boxes and one impure hyper-box.  71
3.7 Final state (all pure hyper-boxes) for a bi-dimensional problem.  75
3.8 State expansion. At the bottom of the figure the successor states of a generic state (top of the figure) are shown. Red lines depict the Sd set of cuts.  75
3.9 Example of the DC* search space for a bi-dimensional problem. Highlighted in yellow, the optimal solution; circled in red, non-optimal solutions.  76
3.10 Connected impure hyper-boxes. On the features, the cardinality of all the sets C(d, k) is reported.  77
3.11 Heuristic function evaluation for a bi-dimensional state. On the features the number of connected impure hyper-boxes is indicated.  78
3.12 In red, the last considered cut and the two distances.  80
3.13 Gaussian fuzzy sets defined over a problem feature  81
3.14 Example of a multi-dimensional fuzzy information granule over the Universe of Discourse  82
4.1 The ∆(σ) identification. In red, the cut which characterizes the new state σ.  87
4.2 Cut configurations for two different trivial problems with the same number of prototypes. Different data distributions define different cut configurations, leading to different complexity levels.  88
4.3 The “k-chessboard” test dataset with k=5. On the left side the dataset distribution is depicted; on the right side, the solution cut configuration.  90
4.4 DC* v1.0 heuristic function variation for the 5-chessboard problem.  90
4.5 Example of two non-goal states. In white, two impure hyper-boxes with four prototypes of different classes are depicted.  93
4.6 The same impure hyper-box with the same number of cuts applied differently over the axes.  94
4.7 Example of heuristic value computation.  97
4.8 Comparison between the heuristic functions on two-class datasets. On the left-hand side the old heuristic; on the right-hand side the new heuristic. Both heuristic functions provide the same value of 2 cuts.  99
4.9 Example of state representation for the GA computation.  102
4.10 Father states for a generic state in the search space. The number of fathers of a state σ equals the number of cuts present in σ itself (in red).  103
4.11 The GA-guided A* workflow. The GA, computed before the A* search, identifies a solution in the search space that is adopted as PoG by A*, which computes the actual optimal solution for the problem.  105
4.12 The feasible solution area (gray plus blue) and the PoG ideal region.  107
4.13 Comparison between two different DC* 2.0 solutions. On the right-hand side the optimal solution; on the left-hand side a sub-optimal solution. The two solutions do not share any cut, hence they are located in very different places in the search space.  107
4.14 The search space of DC* v2.0 with GA-guided A*. In the upper part of the search space the explored states are separated from the unexplored states by the frontier (states in the A* priority queue). The optimal solution (in red) and the PoG (in blue) are depicted.  108
4.15 Comparison between a SFP derived from prototypes (left) and a SFP derived from cuts (right). The clusters result improperly represented in the prototype-based approach.  113
4.16 A sequence of cuts that prevents the generation of a well-formed triangular fuzzy set (red dashed line).  116
4.17 Example of a Constant Slope SFP from cuts. Red dots indicate the centers of the partitions and show how those points are well covered by the cores of the fuzzy sets.  121
4.18 Example of a Variable Fuzziness SFP from cuts. Red dots indicate the centers of the partitions and show how those points are well covered by the cores of the fuzzy sets.  123
4.19 Example of a Core Points SFP from cuts. In this case, dots represent the core points of each partition. These have to be covered by the cores of the respective fuzzy sets because they are fully representative (the needed additional knowledge) of the underlying linguistic concept.  124
5.1 DC* - HFP comparative experimentation framework.  132
5.2 The experiment setup framework to test the heuristic functions.  138
5.3 The experiment setup framework to test the heuristic functions.  142
5.4 Fitness function graphs regarding a generic run of each dataset.  146
5.5 The synthetically generated datasets adopted for the numerical simulation.  149
5.6 Example of a TSFP obtained by fixing the fuzzy set cores as the midpoints between cuts. The 0.5-cut values are not verified over the original cut positions; hence the cut shifting is depicted and highlighted.  150
5.7 Example of a T0.5-cuts obtained by fixing the fuzzy set cores as the midpoints between cuts and respecting the 0.5-cut values over the original cut positions. The fuzzy sets overlap each other without deriving a SFP.  150
5.8 Solution cuts identified by DC*  151
5.9 Fuzzy partitions obtained by DC* on the SD4 dataset through the adoption of different strategies.  153
5.10 Comparison between bi-dimensional fuzzy sets generated by the TSFP approach (left-hand side) and the CP approach (right-hand side).  154

List of Tables

5.1 Dataset characteristics. *The second feature has been removed because it exhibits a constant value. **Class “4” has been removed since it is not represented by any sample.  129
5.2 A general picture of the experimental results (first part). Each column reports the average results (over 10-fold cross validation) ± the standard deviation. For DC* the number p of prototypes used in the first stage is also reported. In bold, the best results in terms of accuracy/interpretability trade-off are highlighted.  134
5.3 (cont'd from Table 5.2).  135
5.4 Datasets and experimental comparative results. Shuttle with 21 prototypes computed by the original heuristic is incomplete. *The second feature has been removed because it exhibits a constant value. **Class “4” has been removed since it is not represented by any sample.  139
5.5 Datasets and experimental comparative results between the A* search (referred to as classic A*) and the GA-guided A* approach. Reported values are the averages of ten executions in which prototype positions, for the same dataset, are not changed.  145
5.6 DC* classification error (percentage values) when different strategies are applied to generate fuzzy partitions for each of the five datasets.  152

Abstract

Several real-world problems require more than just accurate solutions. In many cases, users (physicians, managers, etc.) have to be convinced about the reliability of the knowledge base, and hence they may be interested in systems capable of offering good support in terms of both accuracy and comprehensibility of the knowledge base. When intelligent systems are used to acquire knowledge from data, a methodology is required to derive interpretable knowledge that final users can easily understand. Fuzzy rule-based systems (FRBSs) are tools that enable knowledge representation and inference through fuzzy rules denoted by linguistic terms. The main point of strength of FRBSs is the possibility of establishing a semantic similarity (or co-intension) between the fuzzy sets that are used in their rules and the implicit semantics of the linguistic terms that are used to denote them. In this way the users of a FRBS can read and understand fuzzy rules, as well as revise and integrate rules with domain knowledge. In other words, the FRBS can be interpretable for users. However, when FRBSs are acquired from data through some learning scheme, the semantic co-intension between fuzzy sets and linguistic meanings is often lost. This happens because fuzzy sets are usually shaped in order to optimize a specific performance measure, usually defined in terms of accuracy error. The loss of semantic co-intension in a rule base yields a FRBS that is no longer interpretable. The development of specific learning algorithms is intended to overcome the interpretability loss in the process of acquiring knowledge from data. Mainly, these learning schemes drive the adaptation process so that a number of interpretability constraints are satisfied. Many learning algorithms for acquiring interpretable models require fixing the granularity of fuzzy partitions, i.e. the number of fuzzy sets that partition each input feature: the aim of such algorithms is to find the best shapes of the fuzzy sets in the partition so as to optimally balance accuracy and interpretability of the final system. Moreover, human interaction is required in the model building process as well as in the final model choice.


As a result, in many cases a trial-and-error approach is used to select the best granularity for each feature (in terms of model accuracy) without looking at the granule characteristics (number, shape, position, etc.) which usually affect the model interpretability. As a matter of fact, the optimal number of fuzzy sets per feature is often unknown, may differ from feature to feature, and is strictly problem-dependent. Moreover, the granularity of a solution should be adjustable taking into account the user needs, the context and the model complexity. This gap is filled by the proposed approach, named DC*, derived from the Double Clustering framework. The key feature of DC* is its ability to provide an automatic interpretable fuzzy granulation of classified data, exploiting hidden relationships among data and thus discovering the optimal granularity level for each problem feature. The obtained partition can then be translated into a highly interpretable fuzzy rule base. It is worth noting that the whole process requires the definition of only one hyper-parameter, representing the maximum granularity level of the final Fuzzy Rule Base, i.e. the maximum number of rules; such a parameter is easily understandable and configurable by the user. Specifically, DC* clusters prototype projections on all dimensions simultaneously: in this way it is possible to minimize the number of clusters for each feature. This is accomplished through an informed search procedure based on the A* algorithm. The resulting one-dimensional clusters provide information to define fuzzy partitions that satisfy a number of interpretability constraints and exhibit variable granularity levels. The fuzzy sets in each partition can therefore be labeled by meaningful linguistic terms and used to represent knowledge in a natural language form. Experimental results on benchmark datasets highlight the DC* peculiarities compared with other algorithms in terms of the interpretability/accuracy trade-off, as well as its efficiency in terms of the resources required in the granulation process.


1 Introduction

When intelligent systems are used to acquire knowledge from data, a methodology is required to derive interpretable results that final users can easily understand. Several real-world problems require more than just accurate solutions. In many cases, users (physicians, managers, etc.) have to be convinced about the reliability of the knowledge base resulting from data processing, and hence they may be interested in systems capable of offering good support in terms of both accuracy and comprehensibility of the knowledge base. In order to process information, intelligent systems make use of information granulation. Information granulation is the process of forming meaningful entities, called information granules, that exhibit a functional and descriptive representation of observational data adhering to some level of abstraction [166]. Information granules are generally defined as agglomerates of data, arranged together due to their similarity, functional adjacency, indistinguishability, coherence or the like. They are the building blocks for information abstraction, since information granules highlight high-level properties and relationships about a universe of discourse, whereas they hide useless low-level details pertinent to single data. Once formed, information granules help understand hidden relationships among data. Granular computing [11, 99, 132, 160] is a computing paradigm oriented towards representing and processing information granules; it embraces a number of modeling frameworks based on different forms of representation, depending on the nature of data as well as on the applicative domain. Fuzzy set theory is a convenient modeling framework for Granular Computing, leading to the so-called Theory of Fuzzy Information Granulation (TFIG) [91, 164, 166]. The use of TFIG for granulating data produces information granules that are defined as compositions of fuzzy sets. Fuzzy sets capture the gradualness of concepts conceived by human beings; as a consequence, fuzzy information granules can describe hidden relationships in a way close to the mental representation of concepts.


This feature helps users to understand data through information granules, especially if these are described in natural language. The key factor for the success of fuzzy logic is its ability to model perceptions rather than measurements. In many cases, perceptions can be expressed in natural language terms: this makes knowledge expressed in fuzzy logic highly co-intensive with linguistic concepts; hence, it is easily interpretable by users. Nevertheless, interpretability does not come with fuzzy logic ipso facto: it must be ensured by a number of structural and semantic constraints. More specifically, while designing an interpretable fuzzy model the data domain is represented through linguistic variables (usually one for each data feature); given a linguistic variable, the fuzzy sets associated with each linguistic term form a fuzzy partition of the data feature. To ensure interpretability, a number of constraints are imposed on all the model components [115]. With the aim of obtaining interpretable models, the granulation process has to be driven to ensure the fulfillment of the interpretability constraints. Deriving fuzzy information granules from data, and describing them in natural language, is not an easy task. The fuzzy sets compounding information granules should accurately represent data and be interpretable, i.e. they should be shaped so as to be tagged with linguistic terms. Also, the number of derived information granules should be as small as possible, since the ability of humans to remember descriptions (and hence to understand them) is limited [119]. Interpretability and accuracy are conflicting requirements (thus a trade-off is often demanded) because interpretability preservation constrains the parameters of a model and, therefore, introduces a bias that in most cases prevents the model from reaching the same accuracy as an unconstrained – thus not interpretable – model. To reduce the bias, interpretability preservation techniques should avoid introducing constraints that are not strictly necessary for the purpose of interpretability. One way to tackle this problem is to enable variable granularity in information granules. In fact, many techniques impose a fixed granularity level on information granules (this is achieved by defining fuzzy sets with the same width, for example); however, it is still possible to preserve the interpretability of information granules with variable granularity. If the variability conforms to the data distribution, then the information granules are more conformant to the underlying data: as a result, these information granules enable more accurate – but still interpretable – fuzzy models.


Once information granules are derived from data, they can be used in a fuzzy knowledge base in order to describe the knowledge about the underlying system or, for classification problems, to predict class information of new data. A fuzzy knowledge base is usually composed of linguistic rules, i.e. rules in the IF-THEN form where only linguistic terms appear. Specifically, a linguistic rule describes a relation between two information granules: one represented in the rule antecedent and the other in the rule consequent. Fuzzy rule-based systems (FRBSs) are tools that enable knowledge representation and inference through fuzzy rules denoted by linguistic terms. The main point of strength of FRBSs is the possibility of establishing a semantic similarity (or co-intension) between the fuzzy sets that are used in rules and the implicit semantics of the linguistic terms that are used to denote them. In this way the users of a FRBS can read and understand fuzzy rules, as well as revise and integrate rules with domain knowledge. In other words, the FRBS can be interpretable for users [110]. However, when FRBSs are acquired from data through some learning scheme, the semantic co-intension between fuzzy sets and implicit semantics is often lost. This happens because fuzzy sets are usually shaped in order to optimize a specific performance measure, usually defined in terms of accuracy. The loss of semantic co-intension in a rule base yields a FRBS that is no longer interpretable. The development of specific learning algorithms is intended to overcome the interpretability loss in FRBSs acquired from data. Mainly, these learning schemes drive the adaptation process so that a number of interpretability constraints are satisfied. The choice of the interpretability constraints used to guide the learning process is usually application-dependent; nevertheless some of them have a general scope and are widely used in the literature. To achieve interpretable granulation, some algorithms have been proposed, such as Hierarchical Fuzzy Partitioning [61] and Fuzzy Decision Trees [72], as well as more complex methodologies, such as HILK++ [4], and complete tools like FisPro [62] and GUAJE [3]. Many learning algorithms for acquiring an interpretable FRBS require fixing the granularity of fuzzy partitions, i.e. the number of fuzzy sets that partition each input feature: the aim of such algorithms is to find the best shapes of the fuzzy sets in the partition so as to optimally balance accuracy and interpretability of the final system. However, the optimal number of fuzzy sets for each feature is often unknown and could be different for different features. As a result, in many cases a trial-and-error approach is used to select the best granularity for each feature. A framework for generating interpretable information granules is DCf (Double Clustering framework).


DCf is an abstract framework that can be implemented in a number of ways to achieve different techniques for interpretable information granulation. DC* (Double Clustering with A*) is an instance of DCf for classification problems that adds variable granularity as a key feature for information granulation [28]. DC* performs a two-step clustering on the available data to provide a set of information granules. Firstly, it identifies cluster prototypes in the multidimensional data space via the LVQ1 algorithm, so as to exploit class information and to find class-aware clusters. Then, it clusters the projections of these prototypes along each dimension by a properly defined search procedure based on the A* algorithm. The use of the A* algorithm has the twofold objective of deriving interpretable fuzzy sets and minimizing the number of information granules, so as to provide a compact and interpretable description of data. The resulting information granules can be directly translated into human-comprehensible fuzzy rules to be used for classification tasks. Another key feature of DC* is the possibility of automatically finding the optimal granularity level, i.e. the minimum number of granules that represent the data, thus relieving the user from an arbitrary choice of the granularity level for each feature. DC* has two critical points: (i) the granulation results are very sensitive to the positioning of the multi-dimensional prototypes resulting from LVQ1, and (ii) the computational complexity is exponential in the worst case. The performance of A* is heavily dependent on the heuristic function that enables an efficient exploration of the search space. However, the current heuristic function in DC* is not very informative in the earlier stages of the search, thus reducing A* to a greedy search in some cases. As a result, the worst-case scenario of a combinatorial explosion of the candidate solutions to be tested is likely to occur for problems of mid-to-large size. This work presents a new version of DC*, named DC* v2.0 (described in chapter 4). The proposed method aims to tackle the points of weakness of the original DC*, which become apparent when the method is applied to complex problems. Specifically, DC* v2.0 aims at filling the gaps inherent in the second clustering step of the method and at introducing a new way to generate information granules through fuzzy sets. In particular, in the second clustering step, a new structure for the A* priority queue is introduced (sec. 4.1.1) and a new heuristic function that exploits class information to effectively penetrate the A* search space is presented (sec. 4.1.2). Moreover, a new search approach, combining a Genetic Algorithm (GA) with the A* search, is introduced. Thanks to the efficiency of the GA, this approach, named GA-guided A*, makes it possible to enhance the efficiency of DC* without losing its solution optimality (sec. 4.2.2).


Experimental results, presented in chapter 5, show the capability of DC* to provide a better trade-off between accuracy and interpretability of the models automatically derived from data, compared with the well-known Hierarchical Fuzzy Partitioning algorithm, and prove the effectiveness of the innovations introduced in DC* v2.0.


2 State of the art

In this chapter an overview of the state of the art of fuzzy systems is provided, particularly focusing on the general aspects related to the interpretability of fuzzy rule-based systems and their design process. To this end, the chapter is organised as follows: a section on Fuzzy Systems (section 2.1) refers to system modelling in general and goes into the details of fuzzy modelling by describing the components and the types of fuzzy systems. Section 2.2 is devoted to the general concept of interpretability by giving a characterisation of this quality. Then, interpretability constraints are introduced in section 2.3, and those used in the current work are described in detail. A general overview of the entire process of designing an interpretable fuzzy model is given in section 2.5, in which a number of the most widely used methods and tools are also described.

2.1 Fuzzy Systems

A model can be viewed as a theoretical scheme that represents a real system or a complex reality, with the aim of enabling its understanding. System modelling is the activity of building such a model. Models make it possible to explain, control, simulate, predict, and even improve real systems. The usefulness of models is strongly based on their reliability (indicated as the model performance) and comprehensibility, both of which represent the main objectives in system modelling. In the system modelling area it is possible to identify at least three different paradigms. White box modelling is the most traditional approach, as it assumes the availability of a thorough knowledge of the system's nature and a suitable mathematical scheme. On the other hand, in black box modelling, models are generated from data, without additional a priori knowledge, by considering a sufficiently general structure [145].


In particular, white box modelling shows difficulties when dealing with complex and poorly understood systems, while black box modelling may yield structures that do not have any physical significance [6]. Generally speaking, the adoption of the former approach leads to comprehensible models with poor reliability, while the latter provides reliable models often without an acceptable comprehensibility. The third approach to system modelling is represented by the combination of the two paradigms, namely grey box modelling [66], where the known parts of the system are modelled by exploiting prior knowledge, while the unknown (or less certain) parts are modelled by adopting black box approaches. In order to build grey box models, one of the most successful approaches is represented by fuzzy modelling (FM), which uses a descriptive language based on fuzzy logic with fuzzy predicates to model a system [101, 147]. FM, by adopting different parametric system identification techniques, builds model structures in the form of fuzzy rule-based systems (FRBSs). Control, modelling and classification are the classical applications where fuzzy systems have demonstrated their ability [46, 134, 32]. Both the capability to provide an interpretable description of the system behaviour (for human beings) and the possibility to incorporate human expert knowledge - typically available for many real-world systems and expressed by vague and imprecise statements - are the key factors for the success of, and interest in, fuzzy systems. Due to these features, fuzzy systems are sometimes used to describe knowledge obtained by black box models such as neural networks [30]. Typically, FM is characterized by two conflicting features, interpretability and accuracy, based on which the quality of fuzzy models is assessed. Specifically:

• Interpretability is a subjective property, not easy to formalize, which depends on several factors such as the model structure, the number of input variables, the number of fuzzy rules, the number of linguistic terms, and the shape of the fuzzy sets. The literature about the interpretability of fuzzy systems gives different criteria such as compactness, completeness, consistency, or transparency. In general, interpretability can be intended as the capability of the fuzzy model to describe the system behaviour in a way that is comprehensible for a human being.

• Accuracy refers to the capability of the fuzzy model to faithfully represent the modelled system. In other words, the closer the model is to the system, the higher its accuracy. Closeness is intended as the similarity between the responses of the real system and of the fuzzy model. As Zadeh stated in his Principle of Incompatibility [168]:


«As the complexity of a system increases, our ability to make precise and yet significant statements about its behaviour diminishes until a threshold is reached beyond which precision and significance (or relevance) become almost mutually exclusive characteristics.»

It is straightforward to understand that interpretability and accuracy are features in contrast with each other and hence, in a fuzzy model, one of them could prevail over the other. Due to this aspect, the FM field may be split into two different areas [19]:

• Linguistic fuzzy modelling (LFM), where the main objective is to obtain fuzzy models with good interpretability;

• Precise fuzzy modelling (PFM), where the main objective is to obtain fuzzy models with good accuracy.

However, up until the last fifteen years, the effort in FM leaned towards increasing the accuracy of fuzzy models as much as possible, without giving the right importance to their interpretability, resulting in a deviation from the original purpose of FM, which aims at exploiting the descriptive power of the linguistic variable concept [168, 163]. Finding a good trade-off between accuracy and interpretability represents the modern tendency in FM. In the next section a general overview of the components of a fuzzy rule-based system is presented.

2.1.1 Components of a fuzzy rule-based system

The classical way to represent human knowledge in a rule-based system is the use of the “IF-THEN” rule form, where the fulfilment of the rule antecedent (the IF condition) enables the execution of the consequent part (typically represented by an action). These kinds of systems have found successful application in modelling human-centric problem solving and information processing. However, conventional approaches to representing human knowledge make use of bivalent logic, which cannot deal with uncertainty, imprecision and gradualness. Nevertheless, those concepts are heavily related to the human way of thinking. The result is that conventional approaches do not provide adequate models for human-like reasoning.


The application of Fuzzy Logic (FL) to rule-based systems leads to FRBSs, in which the “IF-THEN” rule parts are represented by fuzzy statements, i.e. fuzzy rules. FRBSs provide essential advantages w.r.t. classical rule-based approaches:

• the key features of knowledge captured by fuzzy sets involve the handling of uncertainty, imprecision and gradualness;

• inference methods become more robust and flexible with approximate reasoning schemas.

The use of linguistic variables and their linguistic values improves the capability to represent knowledge. Linguistic terms are context-dependent and their meanings are defined by gradual membership functions [163]. Moreover, the approximate reasoning structure is provided by FL inference methods such as the generalized modus ponens [168]. This provides a common computational basis for inference in rule-based systems. These considerations highlight a clear separation between two main concepts in a FRBS: the Knowledge and the Processing Structure (reasoning), as shown in Figure 2.1. This is a key aspect that allows FRBSs to be considered knowledge-based systems.


Figure 2.1: General structure of a Mamdani Fuzzy Rule-Based System

The first FRBS was presented by Mamdani in [106], who augmented Zadeh's initial formulation by applying Fuzzy Sets (FSs) to a control problem with real inputs and outputs. These kinds of models are referred to as FRBSs with fuzzifier and defuzzifier or, more commonly, as fuzzy logic controllers (FLCs), as proposed by the author in [107]. Control system design constituted the main application of Mamdani FRBSs and the term FLC quickly became popular.


In Figure 2.1 the generic structure of a Mamdani FRBS is depicted, with the clear separation between the Knowledge Base (KB) and the Processing Structure (PS). The former stores the available knowledge about the problem in the form of fuzzy “IF-THEN” rules. The PS, instead, executes the inference process by exploiting the KB rules. The fuzzification module operates a mapping between crisp values of the input domain U and fuzzy sets defined on the same universe of discourse. The defuzzification module, as opposed to fuzzification, operates a mapping between fuzzy sets defined in the output domain V and crisp values defined in the same universe. Focusing on Mamdani FRBSs, the next two subsections analyse the two mentioned main components: the KB and the PS.

2.1.1.1 Knowledge Base (KB)

As briefly explained, the KB is the repository of the available specific knowledge about the problem, and models the relationships between the input and the output of the underlying system. Exploiting the KB, the inference process maps an observed input to an associated output. Linguistic variables are usually adopted in the rule structure of Mamdani FRBSs [165]. Dealing with multiple-input single-output (MISO) systems, linguistic rules are represented as:

IF X1 is A1 and ... and Xn is An THEN Y is B,    (2.1)

with Xi and Y being the input and output linguistic variables, respectively, and with Ai and B being linguistic terms associated with these variables. The KB contains two different levels of information: the fuzzy rule semantics and the linguistic rules. The former is represented by fuzzy partitions while the latter represents the expert knowledge. These distinctions are depicted in Figure 2.1 as two different entities:

• Rule base (RB) - represented by a set of linguistic rules. This means that more than one rule can be activated by the same input pattern.

• Fuzzy partitions - the representation of the adopted linguistic term sets (considered in the linguistic rules).


Fuzzy membership functions define the semantics of the adopted linguistic terms. Each linguistic variable involved in the problem is associated with a fuzzy partition of its domain (as depicted in Figure 2.2).


Figure 2.2: Example of a fuzzy partition

A consideration about the RB structure is worth making. Adopting a connective other than the and operator to aggregate terms in the rule antecedent may lead to more generic linguistic rule structures. Nevertheless, the considered rule structure is generic enough to subsume other rule representation forms and hence is widely used in the literature due to its simplicity [158]. On the other hand, rules with a different structure can be considered, as discussed in subsection 2.1.2.
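To make the two levels of information stored in the KB concrete, the sketch below shows one possible in-memory representation: fuzzy partitions map each linguistic variable to its linguistic terms and membership functions, while the rule base refers only to variable names and term labels. All names (Rule, KnowledgeBase, the example variables and terms) are illustrative assumptions, not taken from this dissertation or from any specific library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

MembershipFn = Callable[[float], float]

@dataclass
class Rule:
    # Antecedent: (linguistic variable, linguistic term) pairs joined by "and";
    # consequent: a single (variable, term) pair, as in the rule form (2.1).
    antecedent: List[Tuple[str, str]]
    consequent: Tuple[str, str]

@dataclass
class KnowledgeBase:
    # Fuzzy partitions: variable -> {linguistic term -> membership function}.
    partitions: Dict[str, Dict[str, MembershipFn]]
    # Rule base: linguistic rules expressed only through variables and term labels.
    rules: List[Rule]

# A toy KB with one input and one output linguistic variable (hypothetical example).
kb = KnowledgeBase(
    partitions={
        "Temperature": {"Low": lambda x: max(0.0, 1 - x / 20),
                        "High": lambda x: max(0.0, min(1.0, (x - 15) / 20))},
        "Power": {"Small": lambda y: max(0.0, 1 - y / 50),
                  "Large": lambda y: max(0.0, min(1.0, (y - 40) / 50))},
    },
    rules=[Rule([("Temperature", "Low")], ("Power", "Large")),
           Rule([("Temperature", "High")], ("Power", "Small"))],
)
print(kb.partitions["Temperature"]["Low"](5.0))   # membership of 5 degrees in "Low" -> 0.75
```

Keeping the rule base purely symbolic (labels only) mirrors the separation described above: the partition dictionary carries the semantics of the linguistic terms, while the rules only manipulate their labels.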

2.1.1.2 Processing structure

As depicted in Figure 2.1, the PS of a general Mamdani FRBS is composed of the following modules:

• The fuzzification interface transforms the crisp input data into fuzzy values. These represent the input to the fuzzy reasoning process.

• The inference system, according to the information stored in the KB, infers the resulting output fuzzy sets from the fuzzy input.

• The defuzzification interface converts the output fuzzy sets, obtained from the inference process, into a crisp value.

The fuzzification interface. The fuzzification interface defines a mapping from crisp input values to fuzzy sets defined over the Universe of Discourse of the same input.


Exploiting a fuzzification operator F, it is possible to compute the membership function of the fuzzy set A′, defined over the Universe of Discourse U and associated with a particular crisp value x0, as:

A′ = F(x0)    (2.2)

There are many possible F operators. For example, the pointwise or singleton fuzzifier, the one usually adopted, builds A′ as a singleton whose support is x0, with the following membership function:

A'(x) = \begin{cases} 1, & \text{if } x = x_0 \\ 0, & \text{otherwise} \end{cases}    (2.3)
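As a minimal illustration (the helper name is hypothetical, not from the dissertation), the singleton fuzzifier of Eq. (2.3) can be written as:

```python
def singleton_fuzzifier(x0):
    """Singleton fuzzification (Eq. 2.3): membership 1 at the observed crisp
    value x0 and 0 elsewhere."""
    return lambda x: 1.0 if x == x0 else 0.0

a_prime = singleton_fuzzifier(3.2)
print(a_prime(3.2), a_prime(5.0))   # 1.0 0.0
```

With this choice, the compositional rule of inference described below effectively reduces to evaluating each rule antecedent directly at x0, which is one reason the singleton fuzzifier is the usual choice in practice.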

The inference system. Given the input fuzzy sets, the inference system derives the output fuzzy sets by exploiting the relations defined in the fuzzy rules. In particular, a mapping is defined between fuzzy sets in the input domain U = U1 × U2 × · · · × Un of X1, . . . , Xn and fuzzy sets in the output domain V of Y. The fuzzy inference system adopts an extension of the classical modus ponens, the generalised modus ponens [168]:

IF X is A THEN Y is B
X is A′
-----------------------------
Y is B′    (2.4)

Since A and B are fuzzy sets and X and Y are linguistic variables, the conditional statement “IF X is A THEN Y is B” is defined as a fuzzy conditional statement that represents a fuzzy relation between A and B defined in U × V. The fuzzy relation is expressed by a fuzzy set R with membership function:

µR(x, y) = I(µA(x), µB(y)),  ∀x ∈ U, y ∈ V    (2.5)


where µA(x) and µB(y) are the membership functions of the fuzzy sets A and B respectively, and I is a fuzzy operator that models the fuzzy relation. The result of the application of the generalised modus ponens (in this case the fuzzy set B′) is obtained by means of the compositional rule of inference [168]:

B′ = A′ ◦ R    (2.6)

In particular, when applied to the i-th rule of the RB, which is in the form:

Ri: IF Xi1 is Ai1 and ... and Xin is Ain THEN Y is Bi,    (2.7)

given an input vector x0 = (x1 , . . . , xn ), the expression of the compositional rule of inference becomes:

µB′i(y) = I(µAi(x0), µBi(y))    (2.8)

where µAi(x0) = T(µAi1(x1), . . . , µAin(xn)), T is a fuzzy conjunctive operator (a t-norm) and I is a fuzzy implication operator. Usually the common choice for both of them is the minimum function.

The defuzzification interface. The role of the defuzzification interface is to aggregate the information provided by the outputs of the rules in order to obtain a crisp output value. This can be done in two different ways: FATI (first aggregate, then infer) and FITA (first infer, then aggregate) [10, 70, 158]. Although in the first conception of FLCs Mamdani proposed FATI [106], in recent years, in particular for real-time applications which require fast response times, FITA has become more popular [70, 46, 147]. The defuzzification interface for FATI operates as follows:

• The fuzzy set B′ is defined by means of a fuzzy aggregation operator G that takes into account all the individual fuzzy sets B′i:

µB′(y) = G{µB′1(y), µB′2(y), . . . , µB′n(y)}    (2.9)

2.1 Fuzzy Systems • A defuzzification method D is adopted in order to transforming the fuzzy set B 0 into a crisp output value y0 : y0 = D(µB 0 )

(2.10)

The aggregation operator G is usually implemented by the maximum function (a t-conorm) and the defuzzifier D can be the “center of gravity” (CG) or the “mean of maxima” (MOM), among others:

• CG:

y_0 = \frac{\int_Y y \cdot \mu_{B'}(y)\,dy}{\int_Y \mu_{B'}(y)\,dy}    (2.11)

• MOM:

y_inf = inf{ z | µB′(z) = sup_y µB′(y) }
y_sup = sup{ z | µB′(z) = sup_y µB′(y) }
y_0 = (y_inf + y_sup) / 2    (2.12)
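The following sketch approximates both defuzzifiers on a sampled output fuzzy set; the grid-based discretization and the function names are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np

def center_of_gravity(mu, ys):
    """Discrete approximation of the CG defuzzifier of Eq. (2.11)."""
    ms = np.array([mu(y) for y in ys])
    return float((ys * ms).sum() / ms.sum())

def mean_of_maxima(mu, ys):
    """Discrete approximation of the MOM defuzzifier of Eq. (2.12): midpoint
    between the smallest and largest points reaching the maximum membership."""
    ms = np.array([mu(y) for y in ys])
    maximizers = ys[ms == ms.max()]
    return float((maximizers.min() + maximizers.max()) / 2)

ys = np.linspace(0.0, 1.0, 1001)
# A clipped triangular output fuzzy set B' (peak at 0.3, clipped at height 0.6).
mu_b = lambda y: min(0.6, max(0.0, 1 - abs(y - 0.3) / 0.3))
print(center_of_gravity(mu_b, ys), mean_of_maxima(mu_b, ys))   # both close to 0.3 here
```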

In FITA, instead, the contribution of each output fuzzy set is taken into account separately. In particular, each fuzzy set B′i is defuzzified and the final crisp value is obtained by means of an averaging or selection operation. In this case, the most common defuzzification operator is either the CG or the “maximum value” (MV), weighted by the matching degree as follows:

y_0 = \frac{\sum_{i=1}^{m} h_i \cdot y_i}{\sum_{i=1}^{m} h_i}    (2.13)

where yi is the CG or MV of the fuzzy set B′i (inferred from rule Ri) and hi = µAi(x0) is the matching value between the input x0 and the antecedent of the i-th rule. In the next section three different types of FRBSs are presented: linguistic fuzzy rule-based systems, Takagi–Sugeno–Kang fuzzy rule-based systems and fuzzy rule-based systems for classification.
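A compact sketch of the FITA scheme of Eq. (2.13) is given below, assuming the minimum t-norm for the matching degree and the maximum-value (MV) defuzzifier for each rule consequent; the helper names (triangular, fita_output) and the toy rule base are illustrative assumptions, not the dissertation's notation.

```python
def triangular(a, b, c):
    """Triangular membership function with support [a, c] and core at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

def fita_output(rules, x):
    """FITA inference (Eq. 2.13): each rule is ([antecedent mfs], consequent MV);
    h_i is the minimum of the antecedent memberships, and the crisp output is
    the h_i-weighted average of the consequent values y_i."""
    num = den = 0.0
    for antecedent_mfs, consequent_mv in rules:
        h = min(mu(xi) for mu, xi in zip(antecedent_mfs, x))   # matching degree h_i
        num += h * consequent_mv                               # y_i: crisp representative (MV) of B_i
        den += h
    return num / den if den > 0 else None                      # undefined if no rule fires

# IF X is Low THEN Y is Small (MV 0.2);  IF X is High THEN Y is Large (MV 0.8)
rules = [([triangular(-0.5, 0.0, 0.5)], 0.2),
         ([triangular(0.5, 1.0, 1.5)], 0.8)]
print(fita_output(rules, [0.4]))   # only "Low" fires -> 0.2
```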


2.1.2 Types of Fuzzy Rule-Based Systems

In the previous section a general description of FRBSs has been provided, based on the first FRBS proposal by Mamdani. In this section more specific descriptions of the different FRBS structures which can be considered during the design process are provided.

2.1.2.1 Linguistic fuzzy rule-based systems

A linguistic FRBS can be considered a Mamdani-type FRBS, this being the main tool to develop linguistic models [106, 107]. The structure of the rules composing a linguistic FRBS is:

IF X1 is A1 and ... and Xn is An THEN Y1 is B1, . . . , Ym is Bm,    (2.14)

where Xi and Yj are, respectively, the input/output linguistic variables, while Ai and Bj are the linguistic terms associated with the corresponding fuzzy partitions (defined over the linguistic variables - see Figure 2.3). This kind of FRBS structure provides a good framework to include expert knowledge about the problem in the form of fuzzy linguistic rules.


Figure 2.3: Example of a fuzzy partition of a linguistic variable

In linguistic FRBSs, the Knowledge Base (KB) module stores:

• the rule base, which collects all the linguistic rules;

• the fuzzy partitions, which describe the linguistic terms adopted in the system, defining the corresponding semantics by means of their membership functions.


As mentioned, linguistic FRBSs provide a natural framework to include expert knowledge in the system in the form of fuzzy linguistic rules. These can be easily combined with rules derived from data. Moreover, linguistic FRBSs are highly interpretable because their rules are composed of linguistic terms defined over the linguistic variables. Each linguistic term set is justified by an underlying semantics associated with the adopted linguistic labels. Linguistic FRBS rules can have a clear human interpretation, which leads to calling these systems linguistic or descriptive Mamdani FRBSs. This makes them suitable for applications in which model interpretability is a key feature, such as fuzzy control and linguistic modelling [46, 96, 134, 147]. Even if linguistic FRBSs possess several advantages, some drawbacks are present. The most common problem is the lack of accuracy for complex problems due to the linguistic rule structure. The limitations of fuzzy linguistic IF-THEN rules are due to the adoption of linguistic variables, as analysed in [12, 16].

2.1.2.2 Takagi–Sugeno–Kang Fuzzy Rule-Based Systems

Takagi, Sugeno and Kang proposed a different model, commonly referred to as the TSK FRBS, based on rules in which the rule antecedent is composed of linguistic variables while the rule consequent is a function of the input variables [151, 148]. Usually, the consequent function is a linear combination of the variables involved in the corresponding rule antecedent:

IF X1 is A1 and ... and Xn is An THEN Y = p0 + p1 · X1 + · · · + pn · Xn,    (2.15)

where Xi are the input variables, Y is the output variable and p = (p0, p1, . . . , pn) is a vector of real parameters. Given a KB composed of m rules, the TSK FRBS output is evaluated as a weighted average of all the individual rule outputs Yi, i = 1, . . . , m:

\frac{\sum_{i=1}^{m} h_i \cdot Y_i}{\sum_{i=1}^{m} h_i}    (2.16)


in which hi = T(Ai1(x1), . . . , Ain(xn)) is the matching degree between the antecedent part of the i-th rule and the current inputs to the system, x0 = (x1, . . . , xn), and T represents a conjunctive operator modelled by a t-norm. In order to design a TSK inference engine, only the conjunctive operator T has to be chosen - commonly the minimum or the algebraic product. Moreover, TSK systems do not need a defuzzification process, their outputs being crisp real numbers. On the other hand, however, interpretability in TSK FRBSs is penalised w.r.t. Mamdani FRBSs for two main reasons:

• the rule consequent structure is difficult to interpret for a human expert;

• the overall output simultaneously depends on the activation of the rule antecedents and on the evaluation of the function defining the rule consequent, which itself depends on the crisp inputs rather than being constant.
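The sketch below illustrates first-order TSK inference as in Eqs. (2.15)-(2.16), assuming the minimum t-norm; the function name and the two toy rules are illustrative assumptions, not taken from the dissertation.

```python
def tsk_output(rules, x):
    """First-order TSK inference: each rule output is an affine function of the
    crisp inputs (Eq. 2.15) and the system output is the weighted average of
    the rule outputs with matching degrees h_i (Eq. 2.16)."""
    num = den = 0.0
    for antecedent_mfs, p in rules:
        h = min(mu(xi) for mu, xi in zip(antecedent_mfs, x))       # matching degree h_i
        y = p[0] + sum(pk * xk for pk, xk in zip(p[1:], x))        # Y_i = p0 + p1*x1 + ... + pn*xn
        num += h * y
        den += h
    return num / den if den > 0 else None

low  = lambda x: max(0.0, 1.0 - abs(x - 0.2) / 0.3)   # illustrative fuzzy set "Low"
high = lambda x: max(0.0, 1.0 - abs(x - 0.8) / 0.3)   # illustrative fuzzy set "High"
rules = [([low],  [0.0, 1.0]),    # IF X is Low  THEN Y = 0.0 + 1.0 * X
         ([high], [1.0, -0.5])]   # IF X is High THEN Y = 1.0 - 0.5 * X
print(tsk_output(rules, [0.3]))   # only "Low" fires -> 0.3
```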

2.1.2.3 Fuzzy Rule-Based Systems for Classification

An FRBS for classification is tailored to perform classification tasks by exploiting fuzzy rules in order to represent knowledge. A fuzzy rule structure, in this case, has the same antecedent part as in a Mamdani FRBS, while the consequent is directly represented by a class label, as follows:

IF X1 is A1 and · · · and Xn is An THEN Y is C,

with C being a categorical variable (class label). Variants of FRBSs for classification include consequents in which each class label is present with a corresponding degree of membership, as well as variants that consider a certainty degree for each rule in the RB. For this kind of FRBS too, rule firing is operated by a composition between the linguistic variables involved in a rule. Such a composition is computed making use of a t-norm (usually the minimum) and its result represents the degree to which Y is C. Given a particular input pattern, more than one rule can be fired with different firing values; usually the corresponding class label is the one provided by the rule with the highest firing degree. Another approach provides a class label only if the firing value exceeds a certain threshold.


In particular, if more than one rule with a different class label is fired by the same input, it is possible to label that input only if the firing value exceeds a defined threshold. Otherwise, the system is unable to provide a classification for the input pattern; in other words, the designer chooses to leave those particularly uncertain decisions to the human expert. This behaviour is highly appreciated in sensitive fields, e.g. real-world applications such as decision support in medicine, finance, etc.
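A minimal sketch of the classification scheme just described (winner rule with an optional rejection threshold); the names and the toy rules are illustrative assumptions, not taken from the dissertation.

```python
def classify(rules, x, threshold=0.0):
    """Fuzzy rule-based classification: fire every rule with the minimum t-norm,
    return the class of the rule with the highest firing degree, and reject the
    pattern (None) when that degree does not exceed the threshold."""
    best_class, best_degree = None, 0.0
    for antecedent_mfs, class_label in rules:
        degree = min(mu(xi) for mu, xi in zip(antecedent_mfs, x))
        if degree > best_degree:
            best_class, best_degree = class_label, degree
    return best_class if best_degree > threshold else None

low  = lambda x: max(0.0, 1.0 - abs(x - 0.2) / 0.4)
high = lambda x: max(0.0, 1.0 - abs(x - 0.8) / 0.4)
rules = [([low], "class_A"), ([high], "class_B")]
print(classify(rules, [0.75], threshold=0.3))   # -> class_B
print(classify(rules, [0.5],  threshold=0.3))   # weak firing -> rejected (None)
```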

2.2 Defining Interpretability

Defining interpretability is challenging because it deals with the relation occurring between two heterogeneous entities: a fuzzy system and a human user acting as an interpreter of the system's knowledge base and working engine. To pave the way for defining such a relation, some fundamental properties need to be incorporated into a fuzzy system, so that its formal description becomes compatible with the user's knowledge representation. The definition of interpretability, therefore, calls for the identification of several features; among them, the adoption of a fuzzy inference engine based on fuzzy rules is a straightforward way to approach the linguistic formulation of concepts which is typical of human abstract thought. A distinguishing feature of a fuzzy rule-based system is the double level of knowledge representation: (i) the semantic level, made by the fuzzy sets defined in terms of their membership functions, as well as the aggregation functions used for inference, and (ii) the syntactic level of representation, in which knowledge is represented in a formal structure where linguistic variables are involved and reciprocally connected by some formal operators (e.g. “AND”, “THEN”, etc.). A mapping is defined to provide the interpretative transition that is quite common in the mathematical context: semantics is assigned to a formal structure by mapping symbols (linguistic terms and operators) to objects (fuzzy sets and aggregation functions). In principle, the mapping of linguistic terms to fuzzy sets could be arbitrary. Nevertheless, the mere use of symbols in the high level of knowledge representation implies the establishment of a number of semiotic relations that are fundamental for the preservation of interpretability of a fuzzy system. In particular, linguistic terms — as usually picked from natural language — must be fully meaningful for the expected reader since they denote concepts, i.e. mental representations that allow the reader to draw appropriate inferences about the entities she encounters.


As a consequence, concepts and fuzzy sets are implicitly connected by means of the common linguistic terms they are related to; the key essence of interpretability is therefore the property of cointension [167] between fuzzy sets and concepts, consisting in the capability of referring to similar classes of objects: such a possibility is assured by the use of common linguistic terms. The notion of semantic cointension is further strengthened by the Comprehensibility Postulate [118], which asserts that

«The results of computer induction should be symbolic descriptions of given entities, semantically and structurally similar to those a human expert might produce observing the same entities. Components of these descriptions should be comprehensible as single “chunks” of information, directly interpretable in natural language, and should relate quantitative and qualitative concepts in an integrated fashion.»

The key point of the postulate, which has been conceived in the general context of Machine Learning but can be directly applied to fuzzy systems, is the human centrality of the results of a computer induction process; the importance of the human component implicitly suggests this aspect should be taken into account in the quest for interpretability. Actually, semantic cointension is related to one facet of the interpretability process, which can be referred to as comprehensibility of the content and behaviour of a fuzzy system. On the other hand, when we turn to consider the cognitive capabilities of human brains and their intrinsic limitations, then a different facet of the interpretability process can be defined in terms of readability of the bulk of information conveyed by a fuzzy model. In that case, simplicity is required to perform the interpretation process because of the limited ability to store information in the human brain's short-term memory [119]. Comprehensibility and readability represent two facets of a common quality and both of them are to be considered in the design of interpretable fuzzy systems. Interpretability is a complex requirement that has an impact on the design process. Therefore, it must be justified by strong arguments, like those briefly outlined in the following:

1. In an interpretable fuzzy system the acquired knowledge can be easily verified and related to the domain knowledge of a human expert. In particular, it is easy to verify if the acquired knowledge expresses new and interesting relations about the data; also, the acquired knowledge can be refined and integrated with expert knowledge.

30

meant to explore the acquired knowledge; in practice, it can take place at the symbolic level (by adding new rules or modifying existing ones) and at the numerical level (by modifying the fuzzy sets denoted by linguistic terms, or by adding new linguistic terms denoting new fuzzy sets).

3. The acquired knowledge can be easily validated against common-sense knowledge and domain-specific knowledge. This capability enables the detection of semantic inconsistencies that may have different causes (misleading data involved in the inductive process, a local minimum where the inductive process may have been trapped, data overfitting, etc.). This kind of anomaly detection is important to drive the inductive process towards a qualitative improvement of the acquired knowledge.

4. The most important reason to adopt interpretable fuzzy models is their inherent ability to convince end-users of the reliability of a system (especially those users not concerned with knowledge acquisition techniques). An interpretable fuzzy rule-based model is endowed with the capability of explaining its inference process, so that users may be confident about how it produces its outcomes. This is particularly important in domains such as medical diagnosis, where a human expert is ultimately responsible for critical decisions.

2.3 Interpretability Constraints and their assessment

The issue of extracting interpretable fuzzy information granules from data has been tackled in several ways, including two main approaches. A first approach is mainly aimed at minimizing the complexity of the description of data, by balancing the trade-off between accuracy and simplicity. The idea of reducing the complexity of description is coherent with the well-known Occam's Razor principle, which is restated in Information Theory as the Minimum Description Length principle [136]. Complexity reduction can be achieved by reducing the dimensionality of the problem, e.g. by feature selection [39][152][156], while feature extraction (e.g. Principal Component Analysis) is not recommended because the extracted features do not have any directly interpretable meaning. Other methods to reduce complexity include the selection of an appropriate level of granularity [39][61], the fusion of similar information granules [51][81][146], the elimination of low-relevance information granules [9][138], and hierarchical structuring [152].

The Minimum Description Length principle has been effectively adopted in Artificial Intelligence for inductive learning of hypotheses [55]. However, in interpretable fuzzy information granulation such a principle is useful to guarantee simplicity, while it is not sufficient to guarantee interpretability, since labelling fuzzy sets with linguistic terms requires further constraints. Based on this consideration, some authors propose to introduce several constraints into the information granulation process to achieve interpretability.

Interpretability constraints force the process of information granulation to (totally or partially) satisfy a set of properties deemed necessary to allow the attachment of linguistic labels to the information granules. The first pioneering works concerning the application of interpretability constraints for fuzzy modelling include [43][34]. A preliminary survey of methods to guarantee interpretability in fuzzy modelling is given in [58][19]. A review of interpretability constraints is reported in [115].

Interpretability is a quality of fuzzy systems that is not easy to quantify. Nevertheless, a quantitative definition is required both for assessing the interpretability of a fuzzy system and for designing new fuzzy systems. A common approach to a quantitative definition of interpretability is based on the adoption of a number of constraints and criteria that, taken as a whole, provide an (at least partial) definition of interpretability. In the literature a large number of interpretability constraints and criteria can be found [115, 169].

[Figure 2.4 sketches the abstraction levels of a fuzzy rule-based system, from the low level to the high level: fuzzy sets (normality, continuity, convexity), fuzzy partitions (justifiable number of elements, distinguishability, coverage, relation preservation, prototypes on special elements), fuzzy rules (description length, granular output) and fuzzy rule bases (compactness, average firing rules, logical view, completeness, locality).]

Figure 2.4: Interpretability constraints and criteria in different abstraction levels.

A usual approach is to organize the interpretability constraints in a hierarchical fashion (Figure 2.4), starting from the most basic components of a fuzzy system, namely the involved fuzzy sets, and moving toward more complex levels, such as fuzzy partitions and fuzzy rules, up to considering the model as a whole. At the lowest level, interpretability concerns each single fuzzy set, whose role is to express an elementary yet imprecise concept that can be denoted by a linguistic term. Thus, fuzzy sets are the building blocks for translating a numerical domain into a linguistically quantified domain that can be used to communicate knowledge. However, not all fuzzy sets can be related to elementary concepts, since the membership function of a fuzzy set may be very awkward yet still legitimate from a mathematical viewpoint. Actually, a sub-class of fuzzy sets should be considered, such that its members can be easily associated with elementary concepts and tagged with the corresponding linguistic labels. Fuzzy sets of this sub-class must verify a number of basic interpretability constraints. In the following sections a subset of the most significant interpretability constraints is presented. As mentioned, each section regards a particular level of the abstraction scale.

2.3.1 Constraints on Fuzzy Sets

Constraints on fuzzy sets represent the first level of the interpretability analysis, i.e. they affect each fuzzy set involved in the knowledge base of a fuzzy system. In particular, these constraints concern the shape of fuzzy sets, so that fuzzy sets can be labelled with semantically sound linguistic terms. Some of them have a formal description while others are justified by common sense. In the following, a selection of the constraints most commonly adopted in the literature is provided. For these, a fuzzy set A is fully characterized by its membership function µA defined on the Universe of Discourse U. The set of all possible fuzzy sets definable on U is denoted by F(U).

One-dimensionality Usually fuzzy systems are defined on multidimensional domains characterized by several features. However, each fuzzy set being denoted by a linguistic term should be defined on a single feature, whose domain becomes the universe of discourse, which is assumed as a closed interval on the real line. Relations among features are represented as combinations of one-dimensional fuzzy sets, which can be linguistically interpreted as compound propositions.

Normality A fuzzy set A should be normal, i.e. there exists at least one element (called prototype) with full membership:

∃x ∈ U : µA (x) = 1

(2.17)

A fuzzy set is sub-normal if it is not normal. In Figure 2.5, normality and sub-normality are graphically depicted for two fuzzy sets. The normality requirement is very frequent. Indeed, it is implicitly assumed in almost all the literature concerning interpretable fuzzy modelling, while only a few authors require it explicitly [34, 137, 141].

Figure 2.5: Example of a normal fuzzy set (solid) and a sub-normal fuzzy set (dashed). If the two fuzzy sets represent qualities, then the sub-normal fuzzy set expresses a quality that is never completely met by any element of the domain.

Although there are attempts to provide linguistic interpretations of sub-normal fuzzy sets [93], in practice normality is almost always required, since it implies that at least one element of the Universe of Discourse exhibits full matching with the concept semantically represented by the fuzzy set [50, 131, 43].
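As a minimal Python sketch (not part of the original formulation; the trapezoidal helper and all parameter values are illustrative assumptions), normality can be verified numerically by sampling a membership function over the universe of discourse and checking whether full membership is reached:

import numpy as np

def trapezoid(x, a, b, c, d, height=1.0):
    # Trapezoidal membership function, scaled to a given height.
    rising = np.clip((x - a) / (b - a), 0.0, 1.0)
    falling = np.clip((d - x) / (d - c), 0.0, 1.0)
    return height * np.minimum(rising, falling)

def is_normal(mu, universe, tol=1e-6):
    # A fuzzy set is normal if at least one element reaches full membership.
    return bool(np.max(mu(universe)) >= 1.0 - tol)

U = np.linspace(-2.0, 6.0, 801)
normal_set = lambda x: trapezoid(x, 0.0, 1.0, 2.0, 3.0)          # prototypes in [1, 2]
subnormal_set = lambda x: trapezoid(x, 2.0, 3.0, 4.0, 5.0, 0.7)  # never exceeds 0.7

print(is_normal(normal_set, U))     # True
print(is_normal(subnormal_set, U))  # False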

Continuity A fuzzy set A should be continuous, i.e. its membership function µA is continuous on the universe of discourse. Continuity is motivated by the common assumption that physical attributes vary (or are perceived to vary) with continuity. As a matter of fact, most concepts that can be naturally represented through fuzzy sets derive from a perceptual act, which originates from external stimuli that usually vary continuously. Therefore, continuous fuzzy sets are in better accordance with the perceptive nature of the represented concepts. Actually, this constraint is always met when modelling interpretable fuzzy systems, but it is rarely made explicit in the literature. However, since fuzzy set theory does not guarantee continuity of fuzzy sets ipso facto, this constraint should be included in the interpretability analysis.

Convexity A fuzzy set A should be convex, i.e. the membership values of the elements belonging to any interval are not lower than the membership values at the interval's extremes:

∀a, b, x ∈ U : a ≤ x ≤ b → µA (x) ≥ min{µA (a), µA (b)}

(2.18)

The fuzzy set A is strictly convex if the inequalities in (2.18) are strict, i.e. without equality signs. Roughly speaking, a convex fuzzy set represents a concept that, if it applies to two elements to a certain degree, then it applies at least to that degree to all the elements lying between them. In this way, the concept represented by the fuzzy set can be conceived as elementary, i.e. related to a single specific property of a perceived object (see Figure 2.6).

Figure 2.6: A convex fuzzy set (solid) and a non-convex fuzzy set (dashed). It is difficult to label the non-convex fuzzy set with a linguistic term.

Non-convex fuzzy sets can be used for modelling complex concepts, e.g. “MEAL TIME”, which may have three peaks around 8:00 am, 12:00 noon and 8:00 pm and low membership outside these peaks [56]. However, such fuzzy sets can easily be redefined as a union of elementary fuzzy sets. Continuing the previous example, “MEAL TIME” could be redefined as the union of the elementary (convex) fuzzy sets “BREAKFAST TIME”, “LUNCH TIME” and “DINNER TIME”.
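On sampled data, condition (2.18) amounts to requiring that the membership values rise to a peak and never dip between two higher values; the following Python sketch (with illustrative membership functions of my own choosing) checks this property:

import numpy as np

def is_convex_fuzzy_set(mu_values, tol=1e-9):
    # Condition (2.18) on membership values sampled over an ordered grid:
    # non-decreasing up to the maximum, non-increasing afterwards.
    mu = np.asarray(mu_values, dtype=float)
    peak = int(np.argmax(mu))
    rising = np.all(np.diff(mu[:peak + 1]) >= -tol)
    falling = np.all(np.diff(mu[peak:]) <= tol)
    return bool(rising and falling)

x = np.linspace(-2.0, 6.0, 801)
convex = np.clip(np.minimum((x + 1.0) / 2.0, (5.0 - x) / 2.0), 0.0, 1.0)       # trapezoidal shape
bimodal = np.maximum(np.exp(-(x - 0.5) ** 2), 0.8 * np.exp(-(x - 4.0) ** 2))   # two humps

print(is_convex_fuzzy_set(convex))    # True
print(is_convex_fuzzy_set(bimodal))   # False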

2.3.2 Constraints on Fuzzy Partitions

The basic elements of linguistic models are linguistic variables (LVs). A LV requires the specification of a symbol denoting the name of the LV. Usually the name of a LV coincides with the attribute name of a complex object (e.g. the attribute “Age” of an object “Person”). The LV can take values from a set of linguistic terms (also known as linguistic values), which are symbols that usually coincide with natural language terms. Often, adjectives or other linguistic qualifications are used as linguistic terms (e.g. Old, About 70, Not very young, etc.), and they apply to a domain (e.g. the natural numbers from 0 to 120). In some cases the set of available linguistic terms is complex; hence a generative grammar is specified, so as to provide a computational machinery capable of enumerating all the linguistic terms, as well as of verifying whether a term belongs to the admissible set or not. A LV maps its linguistic terms into corresponding fuzzy sets, thus endowing them with a semantics. This mapping is crucial for translating linguistic structures (like rules) into fuzzy information granules, which can be involved in approximate inference. Overall, a LV can be formalized as follows [163]:

LV = ⟨X, T, U, G, µ⟩

(2.19)

where

• X is the name of the variable (a symbol);

• T is a (usually finite) set of linguistic terms (symbols);

• U is the domain of applicability of the linguistic terms; U is a Universe of Discourse in the sense defined in the previous section;

• G is a grammar that generates the symbols in T;

• µ : T → F(U) is the semantic interpretation of linguistic terms, realized by mapping terms in T into fuzzy sets on U. Given a term τ ∈ T, its interpretation as a fuzzy set will be denoted by µτ.

In interpretability analysis, much concern is directed towards the linguistic terms in T and their semantic interpretation µ(T). In particular, linguistic terms that do
not correspond to natural language terms should be avoided, as they would not convey any meaning to the final user. This requirement is often application-oriented; however, exotic linguistic constructs should always be avoided (e.g. Very very very medium [108]). The requirement of using natural language terms in T poses a semantic problem. Natural language terms, indeed, convey an implicit semantics that is roughly shared among all the users that use those terms for communicating information. Of course, each user may have a different semantics for the same term; however, the semantics of the same term for different users should highly match: this is necessary for enabling communication among users. As a consequence, to be interpretable, the explicit semantics of a linguistic term (as defined by the corresponding fuzzy set) should highly match the implicit semantics possessed by a user who reads the term in a knowledge base. In other words, implicit and explicit semantics must be cointensive. This problem is non-trivial and hard to formalize, because the implicit semantics conveyed by a linguistic term cannot be expressed formally. Thus, the objective of interpretability analysis is to define a set of constraints to partially meet the cointension requirement.
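To make the formalization (2.19) concrete, the following minimal Python sketch models a linguistic variable as a mapping from linguistic terms to one-dimensional membership functions; the class and the “Age” example are illustrative assumptions (in particular, the grammar G is omitted by assuming a finite, explicitly enumerated term set):

from dataclasses import dataclass
from typing import Callable, Dict, Tuple

MembershipFunction = Callable[[float], float]   # mu_tau : U -> [0, 1]

@dataclass
class LinguisticVariable:
    # A linguistic variable <X, T, U, G, mu> with a finite term set,
    # so that G is implicitly the enumeration of T.
    name: str                               # X, e.g. "Age"
    universe: Tuple[float, float]           # U as a closed interval
    terms: Dict[str, MembershipFunction]    # T together with its interpretation mu

    def membership(self, term: str, x: float) -> float:
        lo, hi = self.universe
        if not lo <= x <= hi:
            raise ValueError(f"{x} is outside the universe of discourse")
        return self.terms[term](x)

def young(x): return max(0.0, min(1.0, (40.0 - x) / 20.0))
def old(x):   return max(0.0, min(1.0, (x - 40.0) / 20.0))

age = LinguisticVariable("Age", (0.0, 120.0), {"Young": young, "Old": old})
print(age.membership("Young", 25))   # 0.75
print(age.membership("Old", 70))     # 1.0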

Justifiable number of elements The number of linguistic terms in a LV should not be too high, preferably less than 7 ± 2. This criterion is motivated by a psychological experiment reported in [119] and considered in a large number of works concerning interpretable fuzzy modelling, e.g. in [50, 77, 82, 83, 161, 43, 129, 140]. In his test battery, Miller proved that the span of absolute judgement (assigning scores to stimuli of different perceived magnitudes) and the span of immediate memory (the number of events that can be remembered in the short term) impose a strong limitation on the amount of information that a human being is able to perceive, process and remember. Experimentally, the author found that the number of entities that can be clearly remembered for a short time is around 7, plus or minus 2, depending on the subject under examination. Interestingly, and quite surprisingly from a cognitive standpoint, complex information can be mentally organized into unitary chunks of information that are treated as single entities. In this way, the human brain is capable of perceiving, processing or remembering in its short-term memory both simple perceptions (e.g. tones) and more complex entities (e.g. faces), provided
that the number of entities is less than the magical number seven, plus or minus two. Miller's findings stimulated further works, which either confirmed or reduced this number [139, 41]. As a general result, the human brain is capable of handling only a few chunks of information in its short-term memory, and this should be kept in mind when designing interpretable fuzzy models. In particular, this general criterion requires that the number of linguistic terms in a LV be kept as small as possible, since these terms denote attribute qualities that can be associated with stimuli of different perceived magnitudes. This constraint imposes a limit on the complexity of a LV and can be extended to other components of a fuzzy model (number of attributes, information granules and rules). As a consequence, the constraint introduces a bias on the final model which may affect its predictive accuracy. This is a long-standing problem in modelling interpretable fuzzy systems: often, most of the designer's effort goes into finding the right trade-off between accuracy and interpretability. To alleviate this problem, the type of knowledge representation can be considered as a design variable. In particular, multilevel representations can be considered to model complex relationships among data, where higher levels of representation provide rough yet comprehensible knowledge, whilst lower levels of representation enable accurate prediction without paying too much attention to interpretability [44].

Distinguishability Different terms in a LV should be interpreted by well distinguishable fuzzy sets. Distinguishability is one of the most common interpretability constraints adopted in the fuzzy modelling literature [7, 34, 50, 61, 77, 82, 85, 161, 122, 43, 129, 131, 137, 140]. Roughly speaking, distinguishable fuzzy sets are well disjoint, so they represent distinct concepts and can be assigned semantically different linguistic labels (see Figure 2.7). In a slightly more precise sense, distinguishability is a (fuzzy) relation between fuzzy sets defined on the same Universe of Discourse that is fully satisfied when the fuzzy sets are completely disjoint (i.e. ∀x ∈ U : min{µA(x), µB(x)} = 0), while it is not satisfied at all when the two fuzzy sets coincide.

Figure 2.7: Example of non-distinguishable fuzzy sets (dash vs. dash-dots) and distinguishable fuzzy sets (solid vs. dash or solid vs. dash-dots). Non-distinguishable fuzzy sets represent almost the same concept, hence it is difficult to assign them different labels.

Completely disjoint fuzzy sets are maximally distinguishable. However, linguistic concepts usually overlap partially, so that the passage from one concept to another is smooth (e.g. from small to tall). This explains the gradual nature of the property of distinguishability. Well distinguishable fuzzy sets represent well separated concepts; this makes their linguistic interpretation easier and less subjective [140, 154]. Distinguishable fuzzy sets are also useful to avoid model inconsistency by reducing redundancy [155]. As a side benefit, the computational effort required for inference is alleviated.

Coverage Each element of the Universe of Discourse should be well represented by at least one fuzzy set of a linguistic variable. Formally, given a linguistic variable LV = ⟨X, T, U, G, µ⟩, then

∀x ∈ U ∃τ ∈ T : µτ(x) > 0

Completeness is a property of deductive systems that has been used in the context of Artificial Intelligence to indicate that the knowledge representation scheme can represent every entity within the intended domain [155]. When applied to fuzzy models, completeness states that the fuzzy model should be able to infer a proper
conclusion for every input [98]. Hermann [67] justifies completeness (there called ‘cover full range’) by the fact that in human reasoning there will never be a gap of description within the range of the variable. On the contrary, incompleteness may be a consequence of model adaptation from data and can be considered a symptom of overfitting [84]. Usually, a stronger version of the requirement is adopted, which constrains the membership degree to be greater than a fixed threshold, thus avoiding formally complete LVs that cannot be intended as such in practice (see also [78]). See, e.g., Figure 2.8.

Figure 2.8: Example of completeness violation. In the highlighted regions of the Universe of Discourse (ellipses), 0.5-completeness is not verified. Elements in such regions are not well represented by any fuzzy set of the LV.
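The strong version of coverage can be checked directly on a sampled universe of discourse; the following Python sketch uses a 0.5 threshold, mirroring the 0.5-completeness of Figure 2.8, on a deliberately incomplete partition of illustrative triangular fuzzy sets:

import numpy as np

def tri(a, b, c):
    # Triangular membership function with prototype b and support [a, c].
    return lambda x: np.clip(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0, 1.0)

def coverage_gaps(fuzzy_sets, universe, eps=0.5):
    # Elements of the (sampled) universe where no fuzzy set reaches membership eps,
    # i.e. where eps-completeness is violated.
    best = np.vstack([mu(universe) for mu in fuzzy_sets]).max(axis=0)
    return universe[best < eps]

U = np.linspace(0.0, 4.0, 401)
partition = [tri(-1.0, 0.0, 1.0), tri(1.5, 2.5, 3.5), tri(3.0, 4.0, 5.0)]

gaps = coverage_gaps(partition, U)
print(len(gaps) > 0)            # True: some regions are not 0.5-covered
print(gaps.min(), gaps.max())   # approximate extent of the uncovered regions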

Relation preservation Given a linguistic variable LV = ⟨X, T, U, G, µ⟩, if a total ordering ≺ is defined for a subset of linguistic terms T≺ = {τ1, τ2, . . . , τn} ⊆ T, then for each couple τi, τj with τi ≺ τj there exists a threshold t such that

∀x ∈ U : (x ∈ [mU, t[ → µτi(x) > µτj(x)) ∧ (x ∈ [t, MU] → µτi(x) ≤ µτj(x))    (2.20)

where mU and MU denote the lower and upper bounds of U.

Figure 2.9: A linguistic variable with three fuzzy sets, labelled (from left to right) Low, High and Medium. While each individual fuzzy set has a well defined semantics, the sets are not properly labelled, thus hampering their interpretability.

Such a constraint will be referred to in the following as the proper ordering constraint; it poses some limitations on the choice of the fuzzy sets used in a LV. Figure 2.9 illustrates the case of a LV with three corresponding fuzzy sets which appear to be labelled in an unsuitable way.
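The proper ordering constraint can also be verified numerically on fuzzy sets sampled over an ordered grid: condition (2.20) holds for a pair τi ≺ τj if, once µτj takes over µτi, it never falls behind again. The Python sketch below (with illustrative shoulder-shaped fuzzy sets) implements this check:

import numpy as np

def respects_ordering(mu_i, mu_j, universe):
    # Constraint (2.20) for an ordered pair tau_i < tau_j: mu_i dominates on an
    # initial segment [m_U, t[ and mu_j dominates on [t, M_U].
    dominated = (mu_i(universe) - mu_j(universe)) <= 0.0   # True where mu_j has taken over
    if not dominated.any():
        return True                                        # mu_j never takes over on the sampled grid
    first_takeover = int(np.argmax(dominated))
    return bool(np.all(dominated[first_takeover:]))        # no switch back to mu_i

U = np.linspace(0.0, 4.0, 401)
low = lambda x: np.clip((2.0 - x) / 2.0, 0.0, 1.0)    # decreasing shoulder
high = lambda x: np.clip((x - 2.0) / 2.0, 0.0, 1.0)   # increasing shoulder

print(respects_ordering(low, high, U))   # True: "Low" correctly precedes "High"
print(respects_ordering(high, low, U))   # False: the labels would be swapped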

Prototypes on special elements In many problems some elements of the universe of discourse have a special meaning. A common case is the meaning of the bounds of the universe of discourse, which usually represent some extreme qualities (e.g., VERY LARGE or VERY SMALL). Other, more problem-specific examples are possible (e.g., the typical human body temperature). In all these cases, the prototypes of some fuzzy sets of the partition must coincide with such special elements. For the boundary of the universe of discourse the following constraint is used: given a linguistic variable LV = ⟨X, T, U, G, µ⟩, the lower and upper bounds of U should be prototypes for some fuzzy sets of the variable:

∃τ′, τ″ ∈ T s.t. µτ′(mU) = µτ″(MU) = 1    (2.21)

Figure 2.10: Example of a set of fuzzy sets violating the leftmost/rightmost fuzzy sets constraint (a) and a set of fuzzy sets satisfying it (b).

This constraint is often implicitly used in interpretable fuzzy modelling (with some exceptions, where it is explicitly mentioned, like [34]). The constraint states that there exist two extreme fuzzy sets that fully represent the limit values of the Universe of Discourse. Figure 2.10 depicts two sets of fuzzy sets, one violating the leftmost/rightmost fuzzy sets constraint and the other verifying it. Leftmost/rightmost fuzzy sets are hence important in those linguistic variables that express qualities on the Universe of Discourse (e.g. Low, Medium, High), so as to adhere to human intuition. However, this constraint is not necessary for linguistic variables that express fuzzy quantities (e.g. about 0, plus or minus 5, etc.), since these labels do not convey any meaning of extremity. On the other hand, the following constraint refers to problem-specific prototype values: given a linguistic variable LV = ⟨X, T, U, G, µ⟩, if the elements of a finite subset S ⊂ U have some special meaning, then each element of S should be a prototype of some fuzzy set in the LV. This constraint was initially proposed as “Natural zero positioning”, because the null value has a special role in control problems, where fuzzy rule-based systems are widely used [43, 155]. Natural positioning is a generalization of the previous constraint on leftmost/rightmost fuzzy sets because of the special role of the extreme values of the universe of discourse for any linguistic variable. Extreme values apart, other
“special” values are only problem specific. Simple examples may include 0 and 100 for LVs describing water temperatures (in Celsius degrees), or 37 for LVs describing body temperatures (again, in Celsius), etc. These values are natural prototypes for linguistic terms like IcingPoint, BoilingPoint, NormalTemperature when they make sense for the model to be designed.

2.3.3 Constraints on Fuzzy Rule Bases

In most problems a number of linguistic variables must be defined, one for each feature. Different assignments of linguistic variables can be combined together to form fuzzy rules. A fuzzy rule is a unit of knowledge that has the twofold role of determining the system behaviour and communicating this behaviour in a linguistic form. Some of the most general interpretability constraints and criteria for fuzzy rules are the following.

Description length The description length of a fuzzy rule is the number of linguistic variables involved in the rule. A small number of linguistic variables in a rule implies both high readability and semantic generality, hence short rules should be preferred in fuzzy systems.

The set of rules that defines the behaviour of a fuzzy system is named rule base. As previously stated, the interpretability of a rule base taken as a whole has two facets:

• (i) a structural facet (readability), which is mainly related to the ease of reading the rules, and

• (ii) a semantic facet (comprehensibility), which is related to the information conveyed to the users to understand the system behaviour.

The following interpretability constraints and criteria are commonly defined to ensure the structural and semantic interpretability of fuzzy rule bases:

Compactness A compact rule base is defined by a small number of rules. This is a typical structural constraint that advocates for simple representation of knowledge in order to allow easy reading and understanding.

Average firing When an input is applied to a fuzzy system, the rules whose conditions are verified to a degree greater than zero are “firing”, i.e. they contribute to the inference of the output. On average, the number of firing rules should be as small as possible, so that users are able to understand the contributions of the rules in determining the output.

Logical view Fuzzy rules resemble logical propositions when their linguistic description is considered. Since the linguistic description is the main means for communicating knowledge, it is necessary that logical laws be applicable to fuzzy rules; otherwise, the system behaviour may appear counter-intuitive. Therefore, the validity of some basic laws of propositional logic (like Modus Ponens) and of truth-preserving operations (e.g., application of distributivity, reflexivity, etc.) should be verified also for fuzzy rules.

Completeness The behaviour of a fuzzy system is well defined for all inputs in the universe of discourse; however when the maximum firing strength determined by an input is too small, it is not easy to justify the behaviour of the system in terms of the activated rules. It is therefore required that for each possible input at least one rule is activated with a firing strength greater than a threshold.

Locality Each rule should define a local model, i.e. a fuzzy region in the universe of discourse where the behaviour of the system is mainly due to that rule and only marginally to other rules that are simultaneously activated. A moderate overlapping of local models is admissible in order to enable a smooth transition from one local model to another when the input values gradually shift from one fuzzy region to another.

As seen, a number of interpretability constraints and criteria apply to all the levels of a fuzzy system. Sometimes interpretability constraints are conflicting (e.g. distinguishability vs. coverage) and, in many cases, they conflict with the overall accuracy of the system. A balance is therefore required, which in turn asks for a way to assess interpretability in a qualitative but also quantitative way.

2.4 Interpretability assessment

Assessing interpretability is hard because the analysis of interpretability is extremely subjective. In fact, it clearly depends on the background knowledge and experience of who is in charge of making the evaluation. Hence, it is necessary to consider both objective and subjective indexes. On the one hand, objective indexes are aimed at making fair comparisons among different fuzzy models designed for solving the same problem feasible. On the other hand, subjective indexes are conceived for guiding the design of customized fuzzy models, thus making it easier to take into account users' preferences and expectations during the design process [104]. Gacto et al. [53] proposed a double-axis taxonomy regarding semantic and structural properties of fuzzy systems, at both the partition and the rule base level. Accordingly, they pointed out four groups of indexes (see Figure 2.11). Structural indexes are mainly designed to assess the readability of a fuzzy system, while semantic indexes concern the quantification of its comprehensibility. Accordingly, structural indexes at the partition level relate the number of features and the number of membership functions per feature to the readability of a fuzzy partition; at the rule-base level the structural indexes relate readability to the number of rules and the total rule length (i.e. the sum of all the linguistic variables used in each rule). The indexes that try to assess the comprehensibility of a fuzzy system are far more complex. At the partition level it is worth mentioning the Context-Adaptation index [14], which is based on fuzzy ordering relations. As another example, the GM3M index [52] combines three indexes that assess how much a single fuzzy set changed after a tuning process. The Semantic Cointension index [111] belongs to the set of indexes at the rule-base level. For classification problems, this index evaluates the degree of fulfilment of a number of logical laws exhibited by a given fuzzy rule base. Finally, the Co-Firing based Comprehensibility Index [76] measures the complexity of understanding the fuzzy inference process in terms of information related to co-firing rules, i.e. rules firing simultaneously with a given input.

[Figure 2.11 arranges the indexes in four quadrants. Q1 (structural, fuzzy partition level): number of features, number of membership functions. Q2 (structural, fuzzy rule base level): number of rules, number of conditions. Q3 (semantic, fuzzy partition level): context-adaptation based index, GM3M index. Q4 (semantic, fuzzy rule base level): semantic-cointension based index, co-firing based comprehensibility index.]

Figure 2.11: Some interpretability indexes organized in a double-axis taxonomy (adapted from [53]).
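As a small illustration of the structural indexes of quadrant Q2 (the rule representation below is a hypothetical one, not taken from [53]), the number of rules and the total rule length can be computed directly from a symbolic rule base in Python:

# Each rule is a list of (variable, linguistic term) conditions in its antecedent.
rule_base = [
    [("Temperature", "High"), ("Humidity", "Low")],
    [("Temperature", "Low")],
    [("Temperature", "Medium"), ("Humidity", "High"), ("Wind", "Strong")],
]

number_of_rules = len(rule_base)
total_rule_length = sum(len(antecedent) for antecedent in rule_base)   # sum of all conditions
average_rule_length = total_rule_length / number_of_rules

print(number_of_rules, total_rule_length, average_rule_length)   # 3 6 2.0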

2.5 Design of Interpretable Fuzzy Models

Constrained information granulation of data can be achieved in several ways, which can be roughly grouped into three main categories: regularized learning, genetic algorithms and ad-hoc algorithms. In regularized learning the granulation process is aimed at extracting information granules so as to optimize an objective function that promotes the definition of accurate information granules but penalizes those solutions that violate interpretability constraints [156][85][103]. The objective function must be encoded in an appropriate way so as to be applicable to classical constrained optimization techniques (e.g. the Lagrange multipliers method). This approach is quite effective, but encoding interpretability constraints into proper regularization functions is not always easy. This limits the number of interpretability constraints that can be used with such a method. When using genetic algorithms, information granules are properly encoded into a population of individuals that evolve according to an evolutionary cycle, which involves a selection process that fosters the survival of accurate and interpretable information granules [51][69][84]. Genetic algorithms are especially useful when the interpretability constraints cannot be formalized as simple mathematical functions that can be optimized with classical optimization techniques (e.g. gradient descent, least-squares methods, etc.). Moreover, multi-objective genetic algorithms are capable of dealing with several objective functions simultaneously, e.g. one objective function that evaluates accuracy and another that assesses interpretability [58][73][80]. However, the adoption of genetic algorithms may raise efficiency issues in some application contexts.

Algorithms based on regularized learning or genetic algorithms may provide information granules with partially fulfilled interpretability constraints. To overcome this problem, ad-hoc algorithms for interpretable information granulation could be devised. Such algorithms directly encode interpretability constraints within the granulation procedure. Examples of ad-hoc algorithms are NEFCLASS [123][120] and algorithms of the DCf family [27].

2.5.1 Design choices

In order to implement the inference engine of a Mamdani FRBS, the following design choices are required [88]:

1. Select the implication operator I of the IF-THEN rules. In [106] the minimum t-norm has been proposed as the I operator. Other approaches use the algebraic product [64]

   I(x, y) = x · y    (2.22)

   or fuzzy implication operators (e.g. the Lukasiewicz one) [153]

   I(x, y) = min(1, 1 − x + y)    (2.23)

   among others [47, 37][15, 70, 88].

2. Select the conjunctive operator T used in the rule antecedent. Usually this operator is chosen from the t-norm family [64, 70].

3. Select the defuzzification interface mode between FATI and FITA; depending on this choice, different operators have to be selected, such as the Centre of Gravity (CG) or the Mean of Maxima (MOM).

Although some studies on fuzzy operators have been proposed in the literature, they do not take the interpretability point of view into account [15, 70, 88]. The relation between operator choice and interpretability preservation is a matter of current research.
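A minimal Python sketch of the operators mentioned above (the minimum and product t-norms and the Lukasiewicz implication); the function names and the example values are illustrative:

def t_min(x, y):
    # Minimum t-norm, proposed in [106] for both conjunction and implication.
    return min(x, y)

def t_product(x, y):
    # Algebraic product t-norm, as in (2.22).
    return x * y

def lukasiewicz_implication(x, y):
    # Lukasiewicz fuzzy implication, as in (2.23).
    return min(1.0, 1.0 - x + y)

# Firing strength of a two-condition antecedent and the implied consequent degree.
a1, a2, consequent = 0.8, 0.6, 0.9
firing = t_min(a1, a2)                                   # 0.6
print(round(t_product(firing, consequent), 2))           # 0.54
print(lukasiewicz_implication(firing, consequent))       # 1.0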

2.5.2 Design tasks

The generation of the fuzzy rule set is strictly related to the type of FRBS. However, some general tasks can be identified:

1. Selection of the most relevant features (input and output variables) of the system. This can be done by a human expert or by exploiting classical statistical methods or combinatorial methods [10].

2. Definition of the data base (DB) structure containing the semantics of the linguistic terms used in the model. This includes the following sub-tasks:

   a) selection of the term set for each linguistic variable; this quantity is exploited in order to regulate the granularity of the system;

   b) choice of the membership function type, usually among triangular, trapezoidal, Gaussian or exponential-shaped functions [46] (a sketch of the most common shapes is given at the end of this subsection); the former two are computationally simpler, while the latter two are differentiable and have smoother transitions;

   c) definition of the parameters of each membership function associated with the respective linguistic term.

3. Derivation of the rules (number and composition).

For effectively designing a FRBS, information about the problem plays a fundamental role. In particular, problem-specific domain knowledge and careful considerations can guide the fuzzy rule set derivation process in order to obtain better performance of the FRBS. Generally speaking, in typical FRBS applications such as modelling, control and classification, the FRBS designer has to deal with two types of information: numerical and linguistic. The derivation of the fuzzy rule set is devised according to these types, as it could be manual or automatic [158]. When rules are derived from human experts, they are asked to express their knowledge about the problem in the form of linguistic rules. In particular, the human expert defines the linguistic variables and the respective linguistic labels, their meanings, and the structure of the rules in the RB. This approach is very weak when facing complex problems, which involve many numerical features in the design process. Such difficulties have motivated different approaches based on inductive learning from data, such as: ad hoc data-driven generation methods [10, 22, 33, 37, 75, 128, 157], least-squares-based methods [10, 151], gradient descent methods
[126, 127], hybrid methods [78], neural networks [121, 142, 149, 150], clustering techniques [45, 162] and evolutionary algorithms [35, 38, 40], among others. In linguistic FRBSs, when both linguistic and numerical information is available, the DB and RB definitions can be designed simultaneously. The possibility of combining both linguistic and numerical information in a unique fashion is seen as a major advantage of FRBSs [10, 117, 158]. In the process of integrating human expert knowledge and induced knowledge there are different policies. Experts could be asked to formulate linguistic rules from scratch; then, in a second stage, numerical information (usually in the form of input-output data pairs) is exploited by the automatic learning scheme in order to complete the FRBS, or by the tuning phase with the aim of refining the FRBS. Conversely, a KB can be automatically built from data and then the human expert evaluates and refines the obtained KB. A more interesting, iterative approach requires the simultaneous adoption of both approaches during the whole design process. In particular, [4] proposed to exploit expert knowledge and knowledge extracted from data in combination, with the aim of obtaining robust, compact systems with a good accuracy/interpretability trade-off.
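For the membership function type choice in task 2(b) above, the three most common shapes can be sketched in Python as follows (the helper functions are illustrative and not tied to any of the cited methods):

import math

def triangular(x, a, b, c):
    # Triangular MF: support [a, c], prototype (core) in b.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def trapezoidal(x, a, b, c, d):
    # Trapezoidal MF: support [a, d], core [b, c].
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def gaussian(x, center, sigma):
    # Gaussian MF: smooth and differentiable, never exactly zero.
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))

print(triangular(2.0, 1.0, 2.0, 4.0))         # 1.0
print(trapezoidal(2.5, 1.0, 2.0, 3.0, 4.0))   # 1.0
print(round(gaussian(3.0, 2.0, 1.0), 3))      # 0.607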

2.6 The interpretability-accuracy trade-off

As previously explained, interpretability and accuracy are the main objectives to be addressed in FM. In general, being contradictory issues, both criteria cannot be satisfied to a high degree and, hence, one of them may take a higher priority than the other (this depends on the nature of the problem). For this reason two different approaches are identifiable with respect to the main objective: Linguistic Fuzzy Modeling (LFM), where interpretability has priority, and Precise Fuzzy Modeling (PFM), where accuracy has priority. Historically, accuracy in FM took the higher priority, while currently the aim is to find a good balance between the two characteristics, and improvement mechanisms have been proposed in order to compensate for the initial gap. LFM is quite rigid for a number of reasons, such as [12, 16]:

1. The rigid partitioning of the input and output spaces causes a lack of flexibility in the FRBS;
2. It is very hard to fuzzy-partition the input spaces when the system input variables have some functional dependency;

3. The homogeneous partitioning of the input and output spaces is inefficient and does not scale to high-dimensional spaces;

4. The number of variables and linguistic terms in the system defines the size of the KB. Obtaining an accurate linguistic FRBS requires a significant level of granularity, i.e. the creation of new linguistic terms. This leads to an increase in the number of rules and hence a decrease in interpretability.

To tackle these drawbacks, a FRBS modelled by LFM can be improved in accuracy by acting on the design process as well as on the design structure [18]. For example, in [102] a two-stage DB design is made by first deriving the DB and the RB simultaneously and then performing an a posteriori tuning. In [83], an initial generation of the RB with a subsequent three-stage DB design (input variable selection, simultaneous DB tuning and RB reduction, and DB fine tuning) is developed. The cooperative co-evolutionary paradigm [135] has attracted increasing interest thanks to its ability to handle huge search spaces and decomposable problems, and new simultaneous derivation methods are currently emerging using this technique [20, 130, 131]. In general, the automatic design of a DB gives more flexibility to the modelling process but runs the risk of losing interpretability and of overfitting the problem. To avoid this risk, a possibility is to perform a first modelling stage to obtain accurate initial models, and subsequently to apply a process that improves the interpretability of the obtained model even at the expense of losing some accuracy. To improve the interpretability of these models, a number of mechanisms can be applied, such as:

• Selecting input variables in the model. Feature selection is a very common approach (see, e.g. [17, 71, 90, 97, 143, 144]) but has the drawback of excluding features from the modelling process even if they could be of some interest under some conditions;

• Selecting input variables in the linguistic rules. To overcome the exclusion of features from the entire model, feature selection can be applied to each single rule, as in [29, 59, 105, 159];

• Selecting linguistic rules. A rule set is optimized by selecting a subset of rules according to different criteria [36, 37, 57, 68, 73, 74, 94]. Instead of making a
hard selection, rules can also be ordered according to some relevance criteria [95];

• Merging linguistic rules. As an alternative to selection, rule merging reduces the RB by fusing rules according to some criteria, such as neighbourhood [90];

• Linguistic approximation. In the case of models derived with accuracy as the main objective, the resulting fuzzy sets may not be intelligible. Linguistic approximation tries to find a linguistic description that represents a given fuzzy set [49, 109, 147].

2.7 Design methods and tools

As reported in the previous sections, the design process of a FRBS implies many different aspects to be considered and decisions to be taken. A number of methods have been proposed in the literature, involving different parts of a FRBS. Here the focus is on methods for granulating data in order to derive fuzzy partitions and rules. Some of these methods have been implemented in the form of software tools, which can be conveniently used to perform interpretable fuzzy modelling. The most widely used tools are briefly reviewed later in the section.

2.7.1 Algorithms and Methods

In this section, the best-known algorithms and methods adopted in the literature to obtain an interpretable granulation of data, and hence build FRBSs, are briefly described. To the best of the author's knowledge, there is no method (or algorithm) capable of providing a granulation of the data in total autonomy. This means that all the methods require, at least, the definition of the number of partitions for each input (thus forcing the granulation process) or a user selection of the best model from a family of models. As explained in the next chapter (chapter 3), the automatic selection of the best number of partitions for each input is one of the key features of the Double Clustering with A* (DC*), which provides an automatic interpretable granulation of preclassified data by requiring the definition of only one hyper-parameter related to
the final granularity of the model. The meaning of this hyper-parameter is very clear to the user, as it defines the maximal desired granularity of the model (and not the final granularity, which is strictly dependent on the data). This feature makes DC* very promising in the field of automatic interpretable granulation of data.

2.7.1.1 Hierarchical Fuzzy Partitioning (HFP)

HFP aims at generating a family of interpretable fuzzy partitions from data [61]. The members of this family are distinguished by their degree of interpretability and accuracy; the user can subsequently select the partitions that best balance accuracy and interpretability according to his/her needs. Preliminarily, HFP cycles over each data feature and operates a one-dimensional clustering of the data samples to define a first fuzzy partition for each feature. In the worst case a fuzzy set per data sample is generated; however, clustering is used to accelerate HFP by generating fuzzy partitions with a number of fuzzy sets considerably smaller than the number of data samples. The main stage of HFP iteratively merges adjacent fuzzy sets so that the new partition is as similar as possible to the previous partition (the one preceding the merging process). This is accomplished by defining a specific partition measure based on distances between fuzzy sets: the couple of fuzzy sets to be merged is selected so as to minimize the variation of this partition measure. Fuzzy set merging is carried out so as to guarantee strong fuzzy partitions, and the iterative merging process is stopped when only one fuzzy set is defined on each feature. The merging process is carried out over each input feature independently, resulting in a hierarchy of partitions for each feature. A combination of partitions (one for each feature) defines a granulation of the data space, where each information granule is defined by the Cartesian product of fuzzy sets belonging to different partitions. A selection process of the information granules is carried out by summing the membership degrees of all the data samples to each granule (Σ-count): the information granules whose Σ-count is below a threshold are discarded. The remaining information granules can be used to define the rules of a FRBS. To avoid the combinatorial explosion of FRBSs that can be generated by picking a partition in each hierarchy, a heuristic procedure is implemented to generate a sequence of FRBSs defined by combinations of partitions with decreasing granularity. The sequence of combinations of partitions is then returned by HFP.
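The Σ-count-based pruning of weakly supported granules can be sketched as follows in Python; the aggregation of one-dimensional memberships with the minimum t-norm and the threshold value are assumptions made only for this example:

import numpy as np

def sigma_count(granule, data):
    # Sigma-count of a multi-dimensional granule: sum over all samples of the
    # t-norm (here: minimum) of the one-dimensional membership degrees.
    degrees = np.array([[mu(x[d]) for d, mu in enumerate(granule)] for x in data])
    return degrees.min(axis=1).sum()

def select_granules(granules, data, threshold=0.5):
    # Keep only the granules whose sigma-count reaches the threshold.
    return [g for g in granules if sigma_count(g, data) >= threshold]

tri = lambda a, b, c: (lambda x: max(0.0, min((x - a) / (b - a), (c - x) / (c - b))))
data = np.array([[0.2, 0.8], [0.3, 0.7], [0.9, 0.1]])

granules = [
    (tri(-0.5, 0.25, 1.0), tri(0.0, 0.75, 1.5)),   # "x1 is LOW and x2 is HIGH"
    (tri(0.0, 0.9, 1.8), tri(-0.5, 0.1, 0.7)),     # "x1 is HIGH and x2 is LOW"
    (tri(0.4, 0.5, 0.6), tri(0.4, 0.5, 0.6)),      # granule not supported by the data
]
print([round(sigma_count(g, data), 2) for g in granules])   # [2.0, 1.0, 0.0]
print(len(select_granules(granules, data)))                 # 2: the weak granule is discarded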

2.7.1.2 Fuzzy Decision Trees (FDT)

Fuzzy Decision Trees (FDT), also known as Fuzzy ID3, are adopted to build fuzzy rule bases for classification problems [79]. They combine the uncertainty handling and approximate reasoning capabilities of fuzzy sets with the comprehensibility and ease of application of decision trees. In the classical ID3, at any point, the feature that provides the greatest gain in information or, equivalently, the greatest decrease in entropy is evaluated and selected to proceed in the tree building. Applying the procedure, a set of leaf nodes (sub-populations) of the decision tree is obtained, where the patterns are of a single class. Fuzziness is incorporated into the ID3 algorithm at the node level by modifying the conventional decision function with the introduction of different fuzzy measures. In particular, a fuzzy entropy is adopted, which considers the membership of a pattern to a class. In fact, in FDT all the input attributes are discretized into linguistic terms by means of fuzzy sets. Fuzzy rules for a particular class are generated by traversing the path from the root to a leaf node representing that class. In this manner it is possible to obtain a set of rules for all the classes, exploiting the intersection of the features/attributes encountered along the path. However, as mentioned, the initial fuzzification of the input attributes requires the user to fix the granularity for each of them, forcing the process to respect a particular imposed granularity which may not be actually related to the data.

2.7.1.3 DCf

DCf (Double Clustering framework) is a general framework centred on a two-step clustering process that can be customized by employing different clustering algorithms for both the first and the second step. The sole requirement for such algorithms is to produce prototypes from the available data. The choice of specific clustering algorithms leads to an instance of DCf. Different instances of DCf have been developed. In [23] FDC (Fuzzy Double Clustering) has been proposed, which integrates the Fuzzy C-means algorithm [13] for the multidimensional clustering (first step), and a hierarchical clustering scheme for the prototype clustering (second step), which turns out to be quite efficient for clustering one-dimensional
numerical data, provided that such data are sorted. To avoid the computational effort deriving from the use of a fuzzy clustering algorithm, another instance of DCf was developed, called CDC (Crisp Double Clustering) [26], in which a vector quantization algorithm that follows the Linde-Buzo-Gray (LBG) formulation [100] is used to accomplish the first clustering step and, as in FDC, a hierarchical clustering algorithm is used for the second step. Both FDC and CDC can be applied to any problem involving numerical data granulation. In the case of classification problems, the granulation process can be improved by exploiting class information. This was done in DCClass (Double Clustering for Classification) [25], where the first clustering step is performed through the LVQ1 (Learning Vector Quantization, version 1) algorithm [92] so that each multi-dimensional prototype is associated with a class label. Such labels can be used in the second step of DCf to determine the number of fuzzy sets on each dimension. DCClass exploits the class information provided with the data to self-determine the granularity level. Nevertheless, the granularity level is selected by a simple heuristic procedure, hence DCClass can just find a “good” granularity level. DC*, discussed in detail in the following and preliminarily presented in [28] and [114], is designed to achieve an “optimal” granularity level by minimizing the number of information granules describing the data and, hence, it provides more interpretable fuzzy information granules.

2.7.1.4 Highly Interpretable Linguistic Knowledge (HILK++)

HILK++ is a methodology capable of building interpretable FRBSs for classification problems [4]. It is based on HILK, a general framework that makes the design process of interpretable FRBSs easy and allows the combination of both expert knowledge and knowledge extracted from data. To build interpretable fuzzy models, the main HILK++ steps are:

1. Feature selection. This process finds the most discriminative variables of the problem by exploiting the capabilities of the well-known C4.5 algorithm.

2. Partition design. This step provides the design of the partitions involved in the model. It is composed of a partition learning phase and a partition selection phase. The available approaches are: regular partitioning, partitioning based on k-means, and partitioning based on the HFP algorithm.

3. Rule base learning. Exploiting the identified partitions, the rule base learning step automatically extracts rules from data. For this task the available
algorithms are: Wang and Mendel, Fuzzy Decision Tree, and Fast Prototyping Algorithm.

4. Knowledge base improvement. This step is an iterative refinement process that involves both partitions and rules, i.e. a Linguistic Simplification step and a Partition Optimization step (Solis-Wets and Genetic Tuning).

It is straightforward to verify that HILK++ represents a complete methodology making use of a number of algorithms and methods, involving a final refinement process, and hence requiring a high number of hyper-parameters as well as close expert-user interaction.

2.7.2 Tools

The scenario of tools that allow building interpretable FRBSs can be restricted to three main software packages that are worth mentioning: NEFCLASS, FisPro, and GUAJE.

NEFCLASS (short for NEuro-Fuzzy CLASSification, http://fuzzy.cs.uni-magdeburg.de/nefclass/) [124, 120, 125] is adopted for data analysis by neuro-fuzzy models, learning fuzzy rules and fuzzy sets by supervised learning. A neuro-fuzzy system aims at finding the parameters of a fuzzy system by exploiting the learning capabilities of neural networks. Learning in this context is split into structure learning (i.e. creation of a rule base, usually not taken from neural networks but relying on fuzzy decision tree learning or on genetic algorithms) and parameter learning (i.e. optimization of fuzzy sets exploiting algorithms inspired by neural network learning). Interpretability of the fuzzy system, in particular, is here viewed as the constraint that no linguistic expression can be represented by more than one fuzzy set. This is achievable with an appropriate design of the neural network, but the user should always supervise and interpret the learning process.

FisPro is an open source toolbox that allows the design of interpretable Fuzzy Inference Systems (FIS) by exploiting expert knowledge and data [62, 63]. It is worth mentioning that FisPro is not limited to classification problems. FisPro provides an entire suite of algorithms and methods, selectable to design an interpretable FIS. The main steps in this process, which may be viewed as a path through the building of a FIS, are:
1. Sample generation

2. Fuzzy partitioning and FIS with no rules

3. FIS learning

4. Viewing results

GUAJE (Generating Understandable and Accurate fuzzy models in a Java Environment, http://www.softcomputing.es/guaje/, http://sourceforge.net/projects/guajefuzzy/) is a free software tool (licensed under GPL-v3) that aims to provide support in the process of designing interpretable and accurate fuzzy systems (not limited to classification). It works by combining several open source tools and exploiting their main advantages (GUAJE makes use of the algorithms for knowledge induction provided by FisPro). The objective of GUAJE is to make knowledge extraction and representation easier in the context of fuzzy systems, paying special attention to interpretability issues. The user may define expert variables and rules and may also opt for a supervised and fully automatic learning process. Along the process both types of knowledge, expert and induced, may be integrated under expert supervision, respecting the interpretability, simplicity and consistency of the final knowledge base. In such a way, GUAJE represents the tool to put the HILK++ methodology (briefly described in subsubsection 2.7.1.4) into practice and is the most complete suite for building interpretable fuzzy models.

3 Double Clustering with A* (DC*)

In this chapter the Double Clustering with A* (DC*) method is described in detail. With the aim of providing a complete picture of the method, sec. 3.1 introduces the Double Clustering Framework, which represents a basic learning scheme for interpretable fuzzy information granules. Sec. 3.2 describes the initial version of the DC* method and, finally, chapter 4 focuses on the new version of DC* that has been developed in this work. In particular, to emphasize the difference between the first version and the DC* version proposed in this work, the former will be referred to as DC* v1.0 and the latter as DC* v2.0.

3.1 The Double Clustering Framework

DC* is a method conceived for extracting interpretable fuzzy information granules from classified data. Such information granules are represented through interpretable fuzzy partitions and can be used to define a set of fuzzy classification rules. In other words, given a multi-dimensional numerical dataset of pre-classified data, the aim of DC* is to automatically generate an interpretable Fuzzy Rule Base that describes the data through linguistic terms. DC* is an instance of the more general Double Clustering Framework (DCf), introduced in [27], which enables the extraction of interpretable fuzzy information granules from numerical data. The adoption of the fuzzy paradigm in the model design may enable the interpretability of the final model, but it is not a sufficient condition. Interpretability, as discussed in sec. 2.2, can be achieved by satisfying a number of interpretability constraints during the model design, as described in detail in sec. 2.3. Usually, the adoption of multi-dimensional clustering algorithms leads to non-interpretable fuzzy partitions, i.e. fuzzy sets that may not be described by qualitative
linguistic labels. This is because, although multi-dimensional clustering is well suited to capture the granularity of the data in the multi-dimensional space, it is not possible to impose on this process any kind of constraint to ensure the interpretability of the resulting one-dimensional fuzzy partitions. On the other hand, one-dimensional clustering (i.e. clustering over each problem feature) is capable of providing interpretable fuzzy sets but may lose the multi-dimensional nature of the problem, i.e. the information about the multi-dimensional relations. The combination of the two clustering processes makes it possible to capture the data granularity in the multi-dimensional space and to exploit the capability of one-dimensional clustering to describe those granules with interpretable fuzzy sets. DCf uses a combination of the two mentioned clustering steps and then operates a partition fuzzification. Namely, the three DCf steps are: Data Clustering, Prototype Clustering and Granule Fuzzification (see Fig. 3.1).

Figure 3.1: Overview of the Double Clustering Framework The Data Clustering step, the first one in the DCf computation, is a clustering performed over the whole multi-dimensional feature space of the problem. Here, similar numerical data are embraced together into granules described by multi-dimensional cluster prototypes. Prototypes can be viewed as elements in the Universe of Discourse that describe the hidden relationships discovered by the clustering process. Multi-dimensional data prototypes are then projected over each problem feature


and exploited by the second DCf clustering, the Prototype Clustering step, which operates over each problem feature. In fact, this clustering groups the prototype projections into a number of one-dimensional clusters defined over each problem feature, providing a number of one-dimensional prototypes.

Multi-dimensional and one-dimensional prototypes give useful information to derive information granules that can be conveniently represented by fuzzy sets. Moreover, such fuzzy sets are built in accordance with the interpretability constraints mentioned in sec. 2.3, which allow a qualitative description of the information granules. This process is carried out by the last step of DCf, the Granule Fuzzification, which involves the derivation of fuzzy information granules. This is achieved by first fuzzifying the one-dimensional granules defined by the prototypes in each problem feature and then by aggregating one-dimensional fuzzy sets to form multi-dimensional fuzzy information granules.
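To make the flow of the three steps concrete, the following is a minimal Python sketch of the DCf skeleton; the decomposition into three pluggable callables and all function names are illustrative assumptions of this sketch, not part of the original framework specification.

import numpy as np

def double_clustering_framework(data, data_clustering, prototype_clustering, fuzzify):
    # data: (n_samples, n_features) array of numerical samples.
    # data_clustering, prototype_clustering, fuzzify: the three pluggable DCf steps,
    # e.g. FCM/LBG/LVQ1 for the first step and a hierarchical scheme for the second.
    prototypes = data_clustering(data)               # step 1: multi-dimensional clustering
    partitions = []
    for d in range(data.shape[1]):
        projections = np.sort(prototypes[:, d])      # project prototypes on feature d
        clusters = prototype_clustering(projections) # step 2: one-dimensional clustering
        partitions.append(fuzzify(clusters))         # step 3: granule fuzzification
    return prototypes, partitions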

The definition of the fuzzy sets should take into account the information provided by the clustering stages and, at the same time, should meet the required interpretability constraints. To satisfy both requirements, the "cut points" are defined. A cut point is the middle point between two different prototypes (over the same feature). Cut points defined over the same feature are exploited to define the centers and widths of the fuzzy membership functions (see Fig. 3.2). It is worth mentioning that, as long as the guidelines mentioned above are respected, the type of partition fuzzification does not influence the characteristics and generality of the framework¹.

¹ In the first proposed version [27], Gaussian fuzzy sets were used.


Figure 3.2: DCf prototype projection (black dots over the axes), prototype clustering over each feature (circled dots), Universe of Discourse partition definition (cuts are in red) and information granule identification (InfGr).

Multi-dimensional fuzzy information granules can then be formed by combining one-dimensional fuzzy sets, one for each dimension. Among all possible combinations of one-dimensional fuzzy sets, only those that better represent the clusters discovered in the first step are selected. In this way, the selected combinations of fuzzy sets represent meaningful relations among data and the combinatorial explosion of information granules is avoided. The selection of such granules is accomplished on each problem feature by considering, for each cluster, the fuzzy set on that feature with the highest membership value. The final linguistic representation of the derived information granule is a conjunction of soft constraints (one for each problem feature) like:

G: attribute 1 is low AND attribute 2 is medium AND ... AND attribute n is high

The semantic facet of each information granule is defined by the t-norm composition of all the compounding one-dimensional fuzzy sets, such as:

µG = µ_{low}^{attribute1} ∧ µ_{medium}^{attribute2} ∧ . . . ∧ µ_{high}^{attributen}

being ∧ a suitable t-norm and each µ the membership function of the corresponding one-dimensional fuzzy set.
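As an illustration of this composition, the sketch below computes the membership of a point to a granule using the minimum t-norm and Gaussian one-dimensional membership functions; the function names and parameter values are illustrative assumptions.

import math

def gaussian_mf(center, width):
    # One-dimensional Gaussian membership function
    return lambda v: math.exp(-((v - center) ** 2) / (2.0 * width ** 2))

def granule_membership(x, one_dim_mfs):
    # Membership of the multi-dimensional point x to the granule obtained by
    # composing one membership function per feature with the minimum t-norm.
    return min(mf(v) for mf, v in zip(one_dim_mfs, x))

# Example: "attribute 1 is low AND attribute 2 is high"
granule = [gaussian_mf(0.2, 0.1), gaussian_mf(0.8, 0.1)]
print(granule_membership([0.25, 0.75], granule))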


When the granulation process is completed, a fuzzy rule-based model can be built on the basis of the derived fuzzy granules. This is aimed at verifying how much the fuzzy granules identified from data are useful in providing good mapping properties or classification capabilities. DCf is a general framework that can be customized by choosing appropriate clustering algorithms for both the first and the second step. The sole requirement for such algorithms is to produce prototypes in conformity with the granulation process specified above. The choice of specific clustering algorithms defines a particular implementation of DCf. Here, three possible implementation examples of DCf are briefly described:

Fuzzy Double Clustering  This implementation integrates the Fuzzy C-means algorithm (conceived by Dunn [48] and generalized by Bezdek [13]) for the multi-dimensional clustering (first DCf step), and a hierarchical clustering [86] scheme for the prototype clustering (second DCf step). The hierarchical clustering was chosen for its simplicity and the additional property of being very efficient for one-dimensional numerical data, provided that such data are sorted. This type of implementation, called FDC (Fuzzy Double Clustering), is particularly suited to enhance existing fuzzy clustering algorithms in order to perform interpretable fuzzy information granulation [24, 23].

Crisp Double Clustering  To reduce the computational effort due to the calculation of the partition matrix in the first stage, it is more convenient to use a vector quantization technique in place of the fuzzy clustering algorithm in the multi-dimensional data clustering stage of DCf. This leads to another implementation of DCf, called CDC (Crisp Double Clustering), in which the Linde-Buzo-Gray (LBG) [100] vector quantization scheme is used to accomplish the first clustering step and, as in FDC, a hierarchical clustering algorithm is used for the second step. Details about CDC can be found in [26].

Double Clustering for Classification  In case of classification problems, the class information can be effectively exploited to improve the granulation process. In this case, the first step is implemented through the LVQ1 (Learning Vector Quantization, version 1) [92] algorithm so that each multi-dimensional prototype is associated with a class label that is used in the second step of DCf to automatically determine the number of fuzzy sets on each dimension. This leads to an implementation of DCf that is particularly appropriate to tackle


classification problems, called DCClass (Double Clustering for Classification) [25]. The idea of exploiting class information to perform a better granulation, proposed in DCClass, also represents the base concept of DC*. In other words, DCClass can be viewed as a primitive implementation from which DC* was born. For this reason, some more details about DCClass are worth providing. In accordance with DCf, the DCClass implementation provides a set of information granules represented in the form of Cartesian products of one-dimensional fuzzy sets. As a key feature, the granularity of the derived one-dimensional fuzzy sets is automatically calculated by exploiting the available class information, thus relieving the user from an arbitrary choice of the granularity level of each fuzzy set (as required in FDC and CDC). Moreover, only the number of multi-dimensional prototypes to be discovered in the first step has to be specified. Each of the information granules returned by DCClass is associated with a class label, so fuzzy classification rules can be directly defined. Such rules constitute the knowledge base of a fuzzy inference system, which can be conveniently used to tackle fuzzy classification problems, as well as to validate the derived information granules in terms of their adherence to the available data. Being an instance of DCf, DCClass is defined as a composition of three sequential steps. In the first DCClass step, the available data is compressed by means of a vector quantization algorithm, with the aim of deriving a set of multi-dimensional prototypes that capture the multi-dimensional relationships in the data. In particular, the LVQ1 (Learning Vector Quantization, version 1) algorithm is used for this purpose [92]. It should be noted that each multi-dimensional prototype is associated with a class label, which will be used in the subsequent steps in which granules are formed. The multi-dimensional prototypes are projected onto each input feature, carrying the associated class label. For each feature, the projections are clustered together according to the following criterion: adjacent projections of the same class are grouped together, while projections of different classes belong to different clusters. The rationale behind this step consists in merging similar projections into a single cluster, as long as they belong to the same class. In this way, fuzzy sets will be shared by different information granules, thus improving the interpretability of the resulting knowledge base. As for DCf, cut points between clusters over the same feature are defined, which are used in the last fuzzification step (see Fig. 3.3).
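The one-dimensional clustering criterion described above can be illustrated with the following minimal Python sketch (an illustrative assumption of this text, not the original DCClass implementation), which groups sorted, class-labeled prototype projections into clusters of adjacent same-class projections.

def cluster_projections_by_class(projections):
    # projections: list of (value, class_label) pairs for one feature.
    # Adjacent projections with the same class label are merged into one cluster;
    # a new cluster starts whenever the class label changes.
    clusters = []
    for value, label in sorted(projections):
        if clusters and clusters[-1]["label"] == label:
            clusters[-1]["values"].append(value)
        else:
            clusters.append({"label": label, "values": [value]})
    return clusters

# Example: two clusters are produced, one per class
print(cluster_projections_by_class([(0.1, "A"), (0.3, "A"), (0.7, "B")]))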



Figure 3.3: DCClass Universe of Discourse partition.

The weakness of the proposed DCClass algorithm lies in the excessive number of one-dimensional fuzzy sets per input and in the number of input variables involved in each rule - i.e. in each DCClass fuzzy rule all the problem features are involved, thus compromising the resulting interpretability. Moreover, the identified information granules cover a restricted area of the prototype neighborhood, resulting too specific (and thus compromising the generalization capabilities of the resulting model). This problem is due to the fact that, although DCClass provides an automatic feature granulation (the number of partitions per feature is automatically driven by the class of the prototype projections), it uses all the identified cuts over all the input features without performing any optimization process. To overcome this problem, Double Clustering with A* (DC*) has been proposed in [28]. In DC*, the solution optimality aspect was introduced. In fact, both DC* and DCClass share the idea of exploiting the class information provided with the examples to self-determine the granularity level. Nevertheless, while DCClass can just find a "good" granularity level, DC* finds the "optimal" granularity level, i.e. it performs information granulation by minimizing the number of granules that well describe the available data, using only the needed features and providing more general information granules. In the next section the DC* algorithm is introduced as it was proposed in its first version.


3.2 DC* v1.0

DC* has been developed with the aim of performing a process of interpretable information granulation with the additional feature of automatically minimizing the number of granules that describe the available data (which represents the main difference between DCClass and DC*). Precisely, DC* is designed to exploit class information for extracting interpretable information granules from data. The number of extracted information granules is upper bounded by a user-defined threshold. This enables the user to control the level of granularity of the solution, but leaves to the algorithm the choice of the optimal number of information granules. On the other hand, the granularity on each dimension is automatically determined. Prototype projections belonging to the same class are clustered together, while prototype projections belonging to different classes are grouped in the same cluster only if the corresponding multi-dimensional prototypes can be separated in some other dimension. This strategy allows minimizing the number of cuts that partition the Universe of Discourse and hence the number of information granules describing the data. In some situations, DC* may find just one granule for some dimension. This implies that, whatever the value of the corresponding input feature is, it does not affect the class of the pattern. As a consequence, the input feature does not convey any useful information in describing the data classification. It can hence be safely removed from the granule representation. An automatic feature selection is therefore performed, which further improves the interpretability of the resulting information granules.

According to the Double Clustering framework, DC* is defined by two clustering steps. The first step, the Data Compression, is performed by the LVQ1 algorithm (Learning Vector Quantization, version 1) [92]. This is able to find a number of well-separated multi-dimensional prototypes (code vectors) by exploiting class information. To perform the second step, an informed search procedure based on the A* strategy is defined, which enables finding the optimal number of information granules [138]. Finally, in the last stage, information granules are fuzzified and the resulting Fuzzy Rule Base is built. In the following, each DC* step is described in deeper detail.


3.2.1 First clustering step: Data Compression

The first clustering step of the DC* algorithm is the Data Compression. This process allows representing the dataset by means of prototypes. In classification problems, in order to well represent the class distribution over the problem space, prototypes are labeled with class information and can be viewed as a summary of the underlying data samples (see Fig. 3.4). For this reason, the DC* Data Compression step has a significant influence over the entire DC* process.


Figure 3.4: The first DC* clustering step: Data Compression. In this example a three-class dataset is represented by six prototypes (two for each class).

The number of prototypes for the data compression phase is user-defined and represents the only hyper-parameter of the DC* algorithm. This parameter is used to regulate the compression ratio of the samples. In particular, the number of prototypes corresponds to the maximum number of fuzzy rules in the final model (multiple prototypes may belong to the same information granule, producing a single rule). Thus, this hyper-parameter has an immediate interpretation, as it regulates the desired granularity of the resulting Fuzzy Rule Base System: the lower this number, the coarser the rules, which are more readable but possibly less accurate. On the other hand, the higher the number of prototypes, the finer the rule granularity, which may result in a better accuracy counterbalanced by a higher complexity. Vector quantization is a well-known technique that exploits the underlying structure of input vectors for data compression. Learning Vector Quantization (LVQ) is a supervised learning technique that uses class information to derive a set of prototypes representing compressed data. The idea of learning vector quantization was originated by Kohonen in 1986 and a number of versions of such techniques are described in [92]. As mentioned, in the DC* Data Compression step, the first version of LVQ is used, referred to as LVQ1.


Let

X = [m1, M1] × ... × [mn, Mn] ⊆ R^n   (3.1)

be an n-dimensional Universe of Discourse (UoD),

C = {c1, c2, . . . , cnC}

a finite set of classes, and

D = {(xi, ci) ∈ X × C : i = 1, 2, ..., nD}   (3.2)

the available dataset of nD classified samples from X. It is assumed that for each class label in C at least one sample in D is classified with that class label.

The Data Compression stage is aimed at defining a collection

P = {(pj, cj) ∈ X × C : j = 1, 2, ..., nP}   (3.3)

of prototypes that represents aggregate information of the available samples (with nP ≪ nD).

In order to apply the LVQ1 algorithm, it is assumed that a collection of class-labeled input data D is available and a number of prototypes nP is defined. For better performance, the ratio of prototypes belonging to a specific class reflects the ratio of input data belonging to the same class. For each class, the initial prototype positions are taken randomly from the positions of the dataset samples with the same class label. Moreover, the learning rate α, a tolerance value ε and a maximum number of iterations maxI have to be defined as well. In particular, the last two are adopted as stop criteria. The α parameter is updated at each iteration in order to reduce its value during the computation. The LVQ1 steps are described in Algorithm 3.1.


Algorithm 3.1 The LVQ1 algorithm
input:  dataset D as in (3.2)
input:  the number of prototypes nP as in (3.3)
input:  learning rate α
input:  tolerance ε
input:  maximum number of iterations maxI
output: the collection of prototypes P as in (3.3)

stop ← false ;
i ← 0 ;
P ← random selection of nP samples in D ;   (* random selection preserves class distribution *)
                                            (* P is arranged as a matrix *)
while not stop and i ≤ maxI
    Ptemp ← P ;
    for each (x, c′) ∈ D
        (p, c″) ← closest prototype in P to (x, c′) ;
        if c′ = c″ then
            p ← p + α(x − p) ;
        else
            p ← p − α(x − p) ;
        end if
    end for
    if ||P − Ptemp|| ≤ ε then stop ← true ;
    α ← α − (α / maxI) ;
    i ← i + 1 ;
end while

The LVQ1 computation moves the prototypes suitably in the n-dimensional UoD X, with the aim of best representing the underlying class distribution. Each prototype carries a class label which is exploited in the second DC* clustering step after a prototype projection process, as described in detail in the next section.
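For illustration, a compact NumPy rendering of the update loop of Algorithm 3.1 is sketched below; the class-stratified random selection of the initial prototypes is left out, and the function signature is an assumption of this sketch.

import numpy as np

def lvq1(X, y, prototypes, proto_labels, alpha=0.1, eps=1e-4, max_iter=100):
    # X: (n_samples, n_features); y: class labels; prototypes: initial code vectors (float).
    P = prototypes.copy()
    for _ in range(max_iter):
        P_old = P.copy()
        for x, c in zip(X, y):
            j = np.argmin(np.linalg.norm(P - x, axis=1))  # closest prototype
            if proto_labels[j] == c:
                P[j] += alpha * (x - P[j])   # attract prototype of the same class
            else:
                P[j] -= alpha * (x - P[j])   # repel prototype of a different class
        alpha -= alpha / max_iter            # decrease the learning rate
        if np.linalg.norm(P - P_old) <= eps:
            break
    return P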

3.2.2 Second clustering step: A* solution search

The second DC* clustering step is aimed at finding the optimal partition of the UoD X by exploiting the prototype collection P (3.3) obtained in the previous step. In particular, an informed search procedure based on the A* strategy [138] is performed, which enables finding the optimal number of information granules on each dimension. Optimality, in this version of DC*, is defined as the solution with the lowest number of information granules needed to describe the problem. In this section the whole process is described.


First, the prototype collection P is projected onto each dimension d = 1, 2, . . . , n. We denote each projection set as Pd = {(phd, ch) ∈ [md, Md] × C : (p, ch) ∈ P ∧ phd = Πd(p)}, where Πd(p) = Πd(p1, p2, . . . , pn) = pd and h = 1, 2, . . . , nP. Given a projected prototype collection Pd, clustering is performed by operating on the related set of cuts. Informally speaking, a cut is a boundary of an information granule and is formally defined as the midpoint between two prototype projections, over the same feature, labeled with different classes. The formal definition of a cut requires that the prototype projections are sorted. Without loss of generality, it is assumed that h′ < h″ → ph′d < ph″d. The set of cuts is defined as

Td = { tkd : (phd, ch), (ph+1,d, ch+1) ∈ Pd ∧ ch ≠ ch+1 ∧ tkd = (phd + ph+1,d) / 2 }   (3.4)

with k = 1, 2, . . . , nTd. Any subset Sd ⊆ Td of cardinality nSd defines a clustering of the projections in Pd, where each cluster is defined by all projections that are not separated by any cut in Sd (see Fig. 3.5).
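Following definition (3.4), the candidate cuts on a single dimension can be computed as the midpoints between consecutive prototype projections carrying different class labels; the following is a minimal sketch under that assumption.

def candidate_cuts(projections):
    # projections: list of (value, class_label) for one dimension.
    # A cut is the midpoint between two consecutive projections of different classes.
    pts = sorted(projections)
    return [(pts[h][0] + pts[h + 1][0]) / 2.0
            for h in range(len(pts) - 1)
            if pts[h][1] != pts[h + 1][1]]

# Example: a single cut at 0.5 separates the two classes
print(candidate_cuts([(0.2, "A"), (0.4, "A"), (0.6, "B")]))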


Figure 3.5: DC* prototype projections and cuts definition. On the left-hand side figure, dashed lines represent the set of cuts Td. On the right-hand side figure, red lines represent the subsets of cuts Sd ⊆ Td.

The objective of the second DC* clustering step is to select the subset Sd that is


optimal (for each problem feature, simultaneously). In order to define optimality, the concept of hyper-box must be introduced first. To this end, given a subset of cuts Sd, an extended set that adds the boundary points (3.1) in dimension d to the set of cuts is defined as:

S̄d = Sd ∪ {md, Md}

When a subset S̄d of cuts is considered for each dimension d, the Universe of Discourse X is partitioned into hyper-boxes defined as follows:

Bk1,k2,...,kn = { (x1, . . . , xn) ∈ X : skd, skd+1 ∈ S̄d ∧ skd ≤ xd ≤ skd+1 }   (3.5)

(again, it is assumed that skd < skd+1; also s0 = md and snSd = Md). A hyper-box contains zero or more multi-dimensional prototypes in P. A hyper-box is said to be pure if it contains no prototypes or if all the prototypes it contains belong to the same class; otherwise it is said to be impure (Fig. 3.6). A pure and non-empty hyper-box is a surrogate for an information granule: if the prototypes contained in a hyper-box are surrounded by data samples, then most of these samples are also contained in the hyper-box. The objective of the second DC* clustering step is therefore to find, for each dimension d, the subset of cuts Sd such that all the resulting hyper-boxes are pure and their number is minimized.


Figure 3.6: A bi-dimensional problem with ten prototypes of three different classes (square, circle, triangle). The application of two cuts (chosen among the candidate cuts) provides a partition of the input space in four hyper-boxes: three pure hyper-boxes and one impure hyper-box.
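As an illustration of the purity test on which the search relies, the sketch below locates the hyper-box of each prototype from the per-dimension cut sets and checks whether every non-empty hyper-box contains a single class; the names and data layout are assumptions of this sketch.

from bisect import bisect_left
from collections import defaultdict

def hyperbox_index(p, cuts_per_dim):
    # Index of the hyper-box containing prototype p: for each dimension,
    # the position of p[d] among the (sorted) cuts of that dimension.
    return tuple(bisect_left(sorted(cuts), p[d]) for d, cuts in enumerate(cuts_per_dim))

def all_pure(prototypes, labels, cuts_per_dim):
    # True when every non-empty hyper-box contains prototypes of a single class.
    boxes = defaultdict(set)
    for p, c in zip(prototypes, labels):
        boxes[hyperbox_index(p, cuts_per_dim)].add(c)
    return all(len(classes) == 1 for classes in boxes.values())

# Example: a single cut at 0.5 on feature 0 separates the two classes
print(all_pure([(0.2, 0.3), (0.8, 0.4)], ["A", "B"], [[0.5], []]))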


The clustering problem has exponential complexity, as the number of candidate solutions is ∏_d 2^{nTd} ∼ O(2^{n·nP}). To tackle the problem, a strategy based on the A* algorithm is adopted. A* operates an informed search on the search space defined by the set of all possible clustering configurations. Although the computational complexity of A* is still exponential in the worst case, a careful design of all its components can provide a fast search of the optimal solution in most cases. The general structure of A* used in this work is reported in Algorithm 3.2.

Algorithm 3.2 The A* algorithm used for one-dimensional clustering
input:  the state space Σ
input:  the initial state σ0
input:  the goal test τ : Σ → {true, false}
input:  the successor operator ξ : Σ → 2^Σ
input:  the path-cost function g : Σ → N
input:  the heuristic function h : Σ → N
output: the optimal solution σ∗

closed ← ∅ ;       (* the set of all visited states *)
open ← {σ0} ;      (* the queue of states to be visited *)

while open ≠ ∅
    σ ← dequeue(open) ;                       (* pick the most promising state *)
    closed ← closed ∪ {σ} ;                   (* mark the state as visited *)
    if τ(σ) = true then                       (* state is optimal: return it *)
        σ∗ ← σ ;
        return σ∗ ;
    else
        successors ← ξ(σ) ;                   (* state is not optimal: generate successors *)
        successors ← successors \ closed ;    (* avoid duplicates *)
        for each σ′ ∈ successors
            cost ← g(σ′) + h(σ′) ;            (* estimate the cost of the successor *)
            queue(open, σ′, cost) ;           (* add the successor to the queue with priority *)
        end for
    end if
end while

The computational process of A* may end up with a final state or with an empty queue. In the latter case, no solution will be found to the problem². It can be proved that A* is both optimal and complete. Optimality of A* means that the algorithm finds the optimal solution for the problem while performing the minimum number of needed expansions.

² As explained below, this never happens in the DC* computation due to the search space structure.


Completeness means that if an optimal state exists in the search space, then it is returned by A*. The A* procedure is also computationally efficient since it does not expand all possible states (as in a brute-force search) but only those states that might lead to the optimal solution. In particular, the efficiency of A* heavily relies on the adopted heuristic function. The heuristic function can be viewed as a-priori knowledge about the problem (in this case strictly referred to the search space). By definition, given a state in the search space, the heuristic function estimates the cost of the remaining path from that state to a final state. The choice of a good or a bad heuristic can lead A*, respectively, to run quickly and find the optimal solution, or to increase the running time and return sub-optimal solutions (or no solution at all). A common issue is the time required to compute the heuristic, which leads to a trade-off between the accuracy of the heuristic and the time needed to compute its estimation. With a very accurate heuristic, A* will return the optimal solution by expanding only the states needed along the solution path. However, for real problems, a perfect heuristic is almost never available and pushing its quality implies a significant increase in computational time. For this reason, heuristics with a good (not perfect) estimation and low computational time are usually preferred. For guaranteeing completeness and optimality, however, the heuristic function has to be admissible, i.e. its value should not overestimate the cost of the optimal solution. Formally, given a state σ and denoting the optimal state as σ∗, then:

g(σ) + h(σ) ≤ g(σ∗)   (3.6)

At the same time, the heuristic function should be informative enough to properly drive the search process by avoiding the expansion of useless states. At the limit, if for each state σ the heuristic value is null, i.e. h(σ) = 0, the A* algorithm will eventually find the optimal solution, but by performing an inefficient breadth-first search strategy. On the other hand, if the heuristic function could be defined such that equation (3.6) always holds as an equality, A* would expand only the states belonging to the path from the initial state to the optimal state. As a consequence, it is very important to design a heuristic function that is both admissible and informative. In order to apply the A* strategy to solve a specific problem, all the involved items


have to be properly defined. Specifically, these are: the search space and the initial state, the test function, the path-cost function and the heuristic function, as well as the priority queue structure. In the following, all such items are characterized so as to solve the problem of finding the optimal granulation of data starting from the multi-dimensional prototypes derived in the first clustering step.

The search space and the initial state

Given the Universe of Discourse X as defined in (3.1), the search space Σ is defined as the set of all possible partitions (granulated views) of X. Hence, any state σ ∈ Σ is defined by a possible combination of valid cuts over the n dimensions. Specifically, a convenient way to represent a state is through the structure:

σ = (S1, S2, . . . , Sn)   (3.7)

being Sd a subset of cuts in dimension d, i.e. Sd ⊆ Td . The initial state from which A* starts the search is:





σ0 = ∅, ∅, . . . , ∅ |

{z n

}

The test function

The test function indicates to A* whether a given state σ represents a final state (a candidate solution) or not; hence it can be expressed as a function τ : Σ → {TRUE, FALSE} that outputs TRUE if a state is final, FALSE otherwise. In order to give a proper definition of the test function, the concept of final state has to be specified. Given a state σ (hence, by definition, a subset of cuts Sd for each dimension), this state is said to be a final state if and only if the resulting hyper-boxes defined in (3.5) are all pure (see Fig. 3.7). In such a case, the goal is met and the search algorithm terminates. Because of the operating mode of A*, the first goal state found is always optimal.


Figure 3.7: Final state (all pure hyper-boxes) for a bi-dimensional problem.

Successor operator

The successor operator generates a set of states given an input state. Each successor is distinguished from the input state by the presence of one additional cut in one dimension (see Fig. 3.8).


Figure 3.8: State expansion. In the bottom of the figure the successor states for a generic state (figure top) are shown. Red lines depict the Sd sets of cuts.

Formally, the successor operator is a function ξ : Σ → 2^Σ defined as follows: σ′ = (S′1, S′2, ..., S′n) ∈ ξ(σ) ⇔
(a) σ′ ≠ σ
(b) S′i = Si ∨ S′i = Si ∪ {t′}, with t′ > max Si
(c) S′i ≠ Si → ∀ k ≠ i : S′k = Sk

Property (a) means that a state cannot be a successor of itself; property (b) states that the set of valid cuts on a dimension i for a successor has at most one valid cut


(greater than all existing cuts, to avoid duplicate states) more than its corresponding set in the predecessor; property (c) states that if a set of cuts differs from its corresponding set in the predecessor, then all other sets of cuts are equal to their correspondents in the predecessor - i.e. exactly one cut is added in a successor state.
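A minimal sketch of a successor operator consistent with properties (a)-(c) is shown below: exactly one cut is added per successor, and only cuts greater than the largest cut already present on that dimension are considered, so that no state is generated twice.

def successors(state, candidate_cuts_per_dim):
    # state: tuple of sorted tuples of cuts, one per dimension.
    # A successor adds exactly one cut, greater than all cuts already present
    # on that dimension, so that each state is generated only once.
    result = []
    for d, cuts in enumerate(state):
        threshold = max(cuts) if cuts else float("-inf")
        for t in candidate_cuts_per_dim[d]:
            if t > threshold:
                new_state = list(state)
                new_state[d] = cuts + (t,)
                result.append(tuple(new_state))
    return result

# Example: the empty initial state of a 2-dimensional problem has 3 successors
print(successors(((), ()), [(0.3, 0.7), (0.5,)]))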

As a consequence, if nT = Σ_d nTd is the total number of cuts and nσ = Σ_d nSd is the number of cuts in state σ, then the number of successors of state σ is at most nT − nσ. The additive property of the successor operator enables the generation of the state

σT = (T1, T2, . . . , Tn)   (3.8)

named the terminal state - i.e. the state where all cuts are used. By construction, the cuts of the terminal state generate pure hyper-boxes: this guarantees that A* will eventually terminate on a goal state (see Fig. 3.9).


Figure 3.9: Example of DC* search space for a bi-dimensional problem. Highlighted in yellow the optimal solution. Circled in red non-optimal solutions.

Cost function: path-cost and heuristic function

The path-cost function used in A*, indicated as g(σ), simply counts the number of cuts that are used in a state (i.e. g(σ) = nσ), while the heuristic function h(σ) is more complex because it must estimate the minimum number of cuts that are required to reach a goal state from the current state.


To evaluate the heuristic function for a state σ, a relation on all impure hyper-boxes derivable from σ is first defined. Given two (impure) hyper-boxes Bk1,k2,...,kn and Bk′1,k′2,...,k′n, they are connected if there exists a dimension d such that kd = k′d. Informally speaking, such hyper-boxes share the same interval [skd, skd+1] and could therefore potentially be purified by a single cut (it is easy to show that, if the two hyper-boxes are impure, then there must be a cut t belonging to Td that lies in the interval [skd, skd+1]). The sets

C(d, k) = { Bk1,...,kd−1,k,kd+1,...,kn : ki = 1, 2, . . . , i ≠ d }   (3.9)

collect all hyper-boxes that share the same interval [sk , sk+1 ] on dimension d. These sets can be organized in decreasing order of their cardinality in a sequence 







C(d(0), k(0)), C(d(1), k(1)), . . . such that for i < j the inequality |C(d(i), k(i))| ≥ |C(d(j), k(j))| holds (Fig. 3.10).


Figure 3.10: Connected impure hyper-boxes. On the features, the cardinality of each set C(d, k) is indicated.

In order to minimize the number of additional cuts (because of the heuristic admissibility property), it is reasonable to maximize the number of hyper-boxes to be intersected. Given the set B^(0) of all impure hyper-boxes derived from σ, the following sets can be


defined: B^(h+1) = B^(h) \ C(d(h), k(h)) for h = 0, 1, 2, . . .. The set B^(h+1) is therefore defined as the collection of all impure hyper-boxes that cannot be purified by a cut in dimension d(h) within the interval [sk(h), sk(h)+1]. The sequence of sets B^(h) is decreasing in cardinality: therefore there exists a value h∗ such that B^(h∗) ≠ ∅ and B^(h∗+1) = ∅. The value h∗ is the value of the heuristic function (see Fig. 3.11). A simplified version of the heuristic function evaluation steps for a state σ is described in Algorithm 3.3.

Figure 3.11: Heuristic function evaluation for a bi-dimensional state. On the features the number of Connected impure hyper-boxes is indicated.


Algorithm 3.3 The heuristic function computation algorithm of DC* v1.0
input:  the state σ
output: the h(σ) value

the impure function, defined as: impure(σ) = {hb : hb ∈ σ ∧ hb is impure}
the connected function, defined as: conn(impure(σ)) = [{hbi} : hbi ∈ impure(σ) ∧ (hbi, hbi+1) are connected]
the function maxConn(conn(σ)), which returns the set with the highest cardinality in conn(σ)

h ← 0 ;                                  (* the value of the estimated cuts *)
impureHB ← impure(σ) ;                   (* the impure hyper-boxes in the state *)
connHBcollection ← conn(impureHB) ;      (* the list of sets of connected hyper-boxes *)
while impureHB ≠ ∅
    maxHB ← maxConn(connHBcollection) ;  (* picks the set with the highest cardinality *)
    h ← h + 1 ;                          (* adds the cut to purify the connected hyper-boxes *)
    impureHB ← impureHB \ maxHB ;        (* removes the potentially purified hyper-boxes *)
    for connHBset in connHBcollection :
        connHBset ← connHBset \ maxHB ;  (* removes the potentially purified hyper-boxes *)
    end for
end while

Successor selection: the priority queue

The A* priority queue is the structure that collects all the expanded states - i.e. a sequence of states sorted by a priority value. The priority policy is defined by the cost function value f(σ) = g(σ) + h(σ) and regulates the order with which A* visits, and hence expands, the states in the search space - i.e. the lower f(σ), the higher the priority in the queue. At any iteration of the A* search procedure, several states with the same highest priority value may be present in the priority queue. In this situation (quite common, as empirically observed), A* selects one of such states randomly to verify whether it is final and eventually to generate its successors. The selection is random because there is no further information to distinguish between states with the same priority value. Nevertheless, different choices may lead to different final states with different classification accuracies, as well as different interpretability characteristics. To overcome this problem, a selection criterion for the case in which several states have the same highest priority value is introduced, as described in [112]. Since it is not possible to evaluate the classification accuracy nor the interpretability of non-final states, the sorting is performed on the basis of cuts. In other words, when a state successor σ is generated by adding


a new cut t, the distance D(σ) between t and the nearest prototype projection is computed (see Fig. 3.12).

Figure 3.12: In red the last considered cut and the two distances.

States with the same priority value (i.e. f(σ)) are sorted in descending order in the priority queue according to their corresponding values D(σ). Thus, in a sorted set of states with the same priority, the first state has all its valid cuts well separated from all one-dimensional projections. This selection criterion has been inspired by the separation hyper-planes of linear Support Vector Machines, which are defined so as to maximize their distance from the available data in order to provide highly accurate classifications [42]. Hence, the final priority queue structure can be viewed as a two-level priority queue where each state σ is characterized by two different values ⟨f(σ), D(σ)⟩: in case the first priority level has the same value for more than one state, the second priority value D(σ) is exploited to suitably choose the next expansion. As previously explained, the second DC* clustering step provides an optimal partition of the Universe of Discourse X - i.e. a partition composed of the minimum number of hyper-boxes and hence the minimum number of partitions per feature. In the next section the fuzzification process is described, which leads to the definition of the final Fuzzy Rule Base.

3.2.3 Granule Fuzzification and Rule Definition

The last DC* stage is dedicated to the derivation of fuzzy information granules, an operation accomplished by exploiting the optimal partition of the Universe of Discourse provided by the second DC* clustering step. This is achieved by first fuzzifying the one-dimensional granules defined over the problem features and then by aggregating one-dimensional fuzzy sets to form multi-dimensional fuzzy information granules. The fuzzification of one-dimensional information granules is attained by defining Gaussian fuzzy sets for which the overlap level is fixed. Specifically, the cuts resulting


from the second DC* clustering step are exploited to define the points where two adjacent³ Gaussian fuzzy sets overlap with a membership value of 0.5 (see Fig. 3.13).

Figure 3.13: Gaussian fuzzy sets defined over a problem feature

It is easy to show that, for each dimension, these fuzzy sets meet all the interpretability constraints mentioned in sec. 3.1, except for the relation preservation constraint and in particular the proper ordering relation between the concepts mapped by the fuzzy sets. When the interpretability constraints are satisfied, meaningful labels can be assigned to each one-dimensional fuzzy set, like LOW, MEDIUM, HIGH, etc., depending on their relative position over the feature domain. Moreover, multi-dimensional granules can be represented as conjunctions of labels, like 'x1 is LOW AND x2 is HIGH', thus conveying immediately interpretable knowledge that can be used in fuzzy inference. Pure non-empty hyper-boxes defined over the Universe of Discourse are considered information granules because they contain prototypes belonging to a specific class and, as mentioned, prototypes are representative of data samples. Thus, multi-dimensional fuzzy information granules can be formed by combining the defined one-dimensional fuzzy sets, one for each input feature (see Fig. 3.14).

³ Two (one-dimensional Gaussian) fuzzy sets A and B in a collection of Gaussian fuzzy sets are said to be adjacent if there is no fuzzy set in the collection whose prototype (i.e. element with maximum membership value) lies between the prototypes of A and B.



Figure 3.14: Example of a multi-dimensional fuzzy information granule over the Universe of Discourse

Specifically, the semantic interpretation of each information granule is defined by the t-norm composition of all the compounding one-dimensional fuzzy sets, e.g.:

µG = µ_{low}^{attribute1} ∧ µ_{medium}^{attribute2} ∧ . . . ∧ µ_{high}^{attributen}

The final linguistic representation of the derived information granule is a conjunction of soft constraints (one for each attribute), e.g.:

G : attribute1 is low AND . . . AND attributen is high

Once the granulation process is completed, a fuzzy rule-based model composed of classification rules can be built on the basis of the derived fuzzy granules. Since each non-empty information granule is suitably built to contain only prototypes belonging to a particular class, it can be labeled with the corresponding class label, thus providing a collection of fuzzy information granules with class labels (G1, c1), (G2, c2), . . . , (Gn, cn)⁴. Starting from information granules labeled with class information, fuzzy rules can be defined (one for each information granule) having the following form:

IF x is Gi THEN class is ci   (3.10)

⁴ Precisely, more than one information granule can contain prototypes belonging to a particular class, thus more than one information granule can be labeled by the same class label.


This is an abbreviated form of the 0th-order Takagi-Sugeno rule:

IF x is Gi THEN classi,1 = 0, ..., classi,ci−1 = 0, classi,ci = 1, classi,ci+1 = 0, ..., classi,nc = 0

Given an input x, the outputs of the fuzzy rule-based classifier are computed according to the standard formula:

classk(x) = Σ_i Gi(x) δ(ci, k) / Σ_i Gi(x)

being δ the Kronecker symbol. If only one class has to be assigned to input x, then the class with the highest classk(x) is chosen (ties can be broken randomly).
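For illustration, the inference step can be sketched as a normalized weighted vote, where each rule contributes its granule membership Gi(x) to the class it is labeled with; the representation of rules as (membership function, label) pairs is an assumption of this sketch.

def classify(x, rules):
    # rules: list of (granule_membership_function, class_label).
    # Each rule votes for its class with strength G_i(x); scores are normalized.
    scores, total = {}, 0.0
    for membership, label in rules:
        g = membership(x)
        scores[label] = scores.get(label, 0.0) + g
        total += g
    if total == 0.0:
        return None, scores
    scores = {label: s / total for label, s in scores.items()}
    return max(scores, key=scores.get), scores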

3.2.4 Summary

In this section the original version of DC* has been described. In particular, this description concerns the method as first proposed in [28] (including some slight modifications of the A* priority queue, introduced in [112]). The aim is to provide the reader with the general idea underlying DC*, i.e. its way of computing, by focusing on its peculiar characteristics. However, experimental results have shown efficiency weaknesses of this version of DC* when applied to large-complex problems - i.e. DC* was unable to provide a model for the data at hand in a reasonable time. The efficiency of DC* is strictly related to the optimal solution search, the main task tackled by the adoption of the A* algorithm. Due to the nature of the DC* search graph, the required computational time can be - in the worst case - exponential. This has motivated an investigation of the behavior of the A* search algorithm and its principal components, aiming to improve the efficiency of the whole method. Moreover, changes to the granule fuzzification step have been introduced in order to improve both interpretability and accuracy of the final model. All these interventions have led to a new version of DC*, named DC* v2.0, described in detail in the next chapter.


4 DC* v2.0

In this chapter the new version of DC* is presented, named DC* v2.0. The method, even if based on the first proposed version, presents substantial innovations. Specifically, this chapter is structured to illustrate the differences between the two versions of DC*, focusing on the newly introduced features. Sec. 4.1 illustrates the improvements to the second step of DC*, in particular the optimal solution search exploiting the A* algorithm; here, a new priority queue structure and a new heuristic function are described in detail. Sec. 4.3 is dedicated to illustrating the changes in the granule fuzzification phase, also describing the introduction of three new fuzzification approaches. Finally, sec. 4.2 illustrates a different approach based on a Genetic Algorithm, with the aim of further improving the optimal solution search. For details about the way of computing of DC*, the reader is referred to sec. 3.2, because the following sections are strictly focused on the improvements (the formal notation is also based on the previous chapter). In the following, a differentiation is introduced when referring to DC*: "DC* v1.0" will be used to explicitly refer to the initial version of the method, "DC* v2.0" to refer to the newly introduced version, and simply "DC*" when there is no need to differentiate between versions (e.g. for shared features).

4.1 Solution Search with A*

As previously explained (see sec. 3.2.2), the capability of DC* to find the optimal solution - i.e. the best space partition that maximizes the accuracy/interpretability trade-off - heavily relies on the A* search process. However, as mentioned at the end of the previous chapter, DC* has shown some efficiency weakness when dealing with large-complex problems (empirically observed in the experiment reported in sec. 5.2.1). Besides the optimal solution search, A* also plays a fundamental role for the efficiency of the DC* method, because the greatest part of the DC* computational burden relies on the


search of an optimal solution. Therefore, the aim is to improve this crucial step in order to obtain a considerable efficiency gain of the method without compromising the optimality of the final solution. In this section two modifications to the A* structure of DC* v1.0 are presented. In particular, the newly proposed A* priority queue (sec. 4.1.1) and the newly proposed heuristic function (sec. 4.1.2) allow building a different A* structure, adopted in DC* v2.0.

4.1.1 The New Queue

The first introduced modification regards the A* priority queue. It is worth remembering that A* expands states in the search space according to the state priority provided by the queue policy. Of course, this means that the A* priority queue may influence the entire search process in terms of both efficiency and effectiveness. The new proposal is a three-level priority queue that takes into account, in this order, the cost function f(σ) (see sec. 3.2.2), a distance measure between cuts, and the number of involved features. More specifically, when there are several states with the same priority, the next priority level is exploited to sort the states, and so on. As mentioned in sec. 3.2.2 in "Successor selection: the priority queue", a multi-level priority queue is motivated by the fact that, at any iteration of the A* search procedure, several states with the same highest first-level priority value may be present in the priority queue (a quite common case, as empirically observed). Due to the way of computing of A*, no changes are introduced at the first priority level of the queue¹. For the second priority level, instead, a different kind of distance measure is adopted. Exploiting the idea presented in [112], when a state successor σ is generated by adding a valid cut t, the newly adopted measure takes into account the distance between t and its nearest cut². Formally, two distances δl(t) and δr(t) are computed, respectively from t to the left-hand side cut and from t to the right-hand side cut (see Fig. 4.1), and the selected distance is ∆(σ) = min(δl(t), δr(t)), which characterizes the state σ. The bigger the distance ∆(σ), the higher the priority.

¹ It is worth remembering that the A* optimality is obtained by exploiting the cost function f(σ) (sum of the path cost and the heuristic function). For further details about the A* implementation in DC* the reader is referred to sec. 3.2.2.
² This differs from DC* v1.0, in which the considered distance was from t to its nearest prototype projection.


Actually, since the A* priority queue is ordered in an ascending way, there is the need to consider the complement of the distance measure w.r.t. the whole feature. Hence, being the problem normalized in [0, 1], the adopted value is ∆′(σ) = 1 − min(δl(t), δr(t)).


Figure 4.1: The ∆(σ) identification. In red the cut which characterizes the new state σ.

This new kind of distance shifts the focus to the resulting information granules. Considering a number of states with the same cost function value f(σ), they are sorted by taking into account the amplitude of their information granules. This leads to a solution which has the minimum number of cuts (this constraint is ensured by the first priority level, f(σ)) and is composed of information granules as "wide" as possible. Information granules with high amplitude have a twofold rationale: they are preferred to narrow ones due to their generalization capabilities with respect to the underlying data (accuracy improvement) and, since more general information granules are easier to label, they result more interpretable for a human being. Moreover, preferring wider granules instead of small ones can favor clusters that contain more prototypes, leading to a lower number of information granules and, therefore, pushing the expansion toward a minimal solution (a more compact fuzzy rule base), as well as increasing the efficiency. Finally, the third queue priority level is represented by the number of involved features - i.e. the number of features on which at least one cut is defined. This value is strictly related to the interpretability of the final model. Specifically, for each state σ as defined in (3.7), the function fcount(σ) counts the sets Sd such that Sd ≠ ∅ ∧ Sd ∈ σ. In summary, each state σ stored in the priority queue structure of DC* v2.0 is characterized by three different values ⟨f(σ), ∆′(σ), fcount(σ)⟩, each of them computed in the expansion phase of A*. The new queue structure favors states characterized by the minimum number of cuts (first priority level) which provide the widest possible information granules (second priority level) and involve the minimum number of features (third priority level).
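With a standard binary heap, the three-level policy can be realized by using the lexicographically ordered tuple ⟨f(σ), ∆′(σ), fcount(σ)⟩ as priority key; the following is a minimal sketch with Python's heapq (the tie-breaking counter is only an implementation detail added to avoid comparing states).

import heapq
import itertools

counter = itertools.count()   # breaks residual ties without comparing states
open_queue = []

def push(state, f_value, delta_prime, n_features_involved):
    # Lexicographic key: fewest cuts first, then widest granules
    # (delta_prime = 1 - min(dl, dr)), then fewest involved features.
    key = (f_value, delta_prime, n_features_involved)
    heapq.heappush(open_queue, (key, next(counter), state))

def pop():
    key, _, state = heapq.heappop(open_queue)
    return state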


4.1.2 The New Heuristic function

In DC*, the complexity of finding an optimal solution is strictly related to two main, partially related characteristics. On the one hand there is the number of prototypes defined by the user and, on the other hand, there is the data distribution over the problem space. As explained, prototypes are representative of the data and hence are placed in the problem space according to the data distribution. If data with the same class are well grouped together, even a high number of prototypes will not generate too many information granules; on the contrary, if data are sparse or data belonging to different classes are highly overlapped, the number of prototypes plays a fundamental role, heavily influencing the complexity of the search (as depicted in Fig. 4.2).


Figure 4.2: Cut configurations for two different trivial problems with the same number of prototypes. It is possible to see how different data distributions define different cut configurations, leading to different complexity levels.

Therefore, a high number of prototypes may lead to a high complexity in the solution search, which can prevent DC* v1.0 from terminating in a reasonable time. Empirical results have shown the efficiency weakness of DC* v1.0 when applied to particular large-complex problems, caused by its computational burden³. As explained in sec. 3.2.2, the heuristic function plays a fundamental role in the A* search process. To improve the DC* efficiency, the informative power of the heuristic function should be maximized without compromising the admissibility property, while taking into account the computational effort needed to evaluate it. In particular, improving the heuristic function while preserving its admissibility property (hence ensuring the solution optimality) aims at improving the search efficiency.

³ See, e.g., sec. 5.2.1.


Further investigations have pointed out a particular behavior of the heuristic function to which the method inefficiency can be attributed. Specifically, at different search stages the informative power of the heuristic is extremely low, yielding a high number of states with the same constant heuristic value. In other words, this means that a final state is estimated to be equidistant from all the considered states, making the heuristic function useless and transforming the A* search into a breadth-first search.

In order to investigate the heuristic function behavior, DC* v1.0 has been applied to a purpose-built test problem named "k-chessboard". As depicted in Fig. 4.3, the k-chessboard is a particular bi-dimensional problem with 2 classes, shaped to form a square where prototypes of the same class must not be adjacent⁴. The value of k refers to the number of samples along the square side, in other words the square dimension. The higher the value of k, the higher the problem complexity, which requires more information granules, and thus more cuts, to be solved. A prototype corresponds to each data sample - i.e. the number of defined prototypes must be equal to the number of data samples. In this case the optimal solution is also the only possible solution to the problem, which, in particular, is the terminal state (see eq. 3.8), the state composed of all the valid cuts. It is worth remembering that this is the deepest state in the A* search space (see sec. 3.2.2). Such a dataset represents the worst scenario for the DC* search, since the search has to penetrate through the whole search graph, heavily stressing the A* search procedure. Moreover, with the aim of testing the correctness of the heuristic, a visual on-line check during the search process is feasible thanks to the bi-dimensionality of the problem. Applying the k-chessboard problems to DC* and looking at the obtained empirical results (see for example Fig. 4.4), the DC* v1.0 heuristic function shows the expected variation during the search process.

⁴ Considering the orthogonal axes.
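A minimal generator for such a k-chessboard dataset is sketched below; the use of coordinate parity to alternate the two class labels is an assumption of this sketch that reproduces the described layout (orthogonally adjacent points never share a class).

def k_chessboard(k):
    # k x k grid of points in [0, 1]^2; the class alternates like the colors
    # of a chessboard, so orthogonally adjacent points never share a class.
    samples = []
    for i in range(k):
        for j in range(k):
            label = (i + j) % 2
            samples.append(((i / max(k - 1, 1), j / max(k - 1, 1)), label))
    return samples

# Example: the 5-chessboard used in Fig. 4.3 contains 25 labeled points
print(len(k_chessboard(5)))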


Figure 4.3: the “k-chessboard” test dataset with k=5. On the left side the dataset distribution is depicted. On the right side, the solution cut configuration.


Figure 4.4: DC* v1.0 heuristic function variation for the 5-chessboard problem.

The constant behavior of the heuristic function on benchmark datasets can be analyzed by taking into account both the dataset complexity (the data distribution in the problem space) and its dimensionality. Given a state σ, to exploit the additive property of the heuristic (which allows obtaining values greater than 1), the following condition must be satisfied, as explained in sec. 3.2.2: there must be more than one C(d, k) set. In other words, more than one non-connected set of impure hyper-boxes is needed to enable heuristic values greater than 1. Taking into account the way of computing of DC*, it is straightforward that this condition occurs more frequently when the problem has a low dimensionality and a high complexity⁵.

⁵ Referred to as the number of cuts needed to solve the problem, and hence related to the number of prototypes and the data distribution over the Universe of Discourse.


Differently from benchmark problems, the "k-chessboard" test forces the number of cuts needed to purify a state - i.e. the heuristic function value - to be greater than 1, implying a more informative heuristic. This phenomenon does not occur with benchmark problems, in which it frequently happens that a single feature connects all the impure hyper-boxes, hence reducing the informative power of the heuristic function (due to the admissibility property, the impure hyper-boxes are then assumed to be purifiable by means of only one cut). The lack of useful information about the estimated cost needed to reach a final state transforms A* into a simple best-first search with a very shallow penetration of the search space; in particular, due to the search space characteristics (the cost is represented by the number of cuts in the state, see sec. 3.2.2) the best-first search actually operates as a breadth-first search, which is extremely inefficient for real-world problems. It is worth underlining the shape of the DC* search space: the ratio between the search space depth and its maximum width is extremely low, yielding a really flattened search space whose maximal amplitude is wmax = nT! / ((nT/2)!)². The need therefore arises to design a powerful heuristic that can provide a useful estimation. In particular, the aim is to exploit some additional problem information to obtain a heuristic function that can effectively drive the search. The underlying idea of the new heuristic function is to make use of the class information of the prototypes included in an impure hyper-box to optimistically estimate the minimum number of cuts needed to separate prototypes of different classes. Class information can be viewed as additional knowledge about the problem, exploitable to make the heuristic function more informative and hence to better penetrate the search space. By definition, in an impure hyper-box there are at least two prototypes with two different class labels. This means that at least one cut is needed to split the impure hyper-box into two new pure hyper-boxes (i.e. one hyper-box for each class label). Generally speaking, given an impure hyper-box including prototypes with nc different class labels, at least nc different pure hyper-boxes must be derived through the splitting process.

⁶ From this point on, the superscript HB means that the particular value refers to (and is limited to) a single hyper-box.


Given n sets Td^HB ⊆ Td of candidate cuts (one for each feature d = 1, . . . , n) that intersect the impure hyper-box HB⁶, the application of subsets of cuts Sd^HB ⊆ Td^HB splits the hyper-box into a number of hyper-boxes equal to:

nHB(S1^HB, . . . , Sn^HB) = ∏_{d=1}^{n} (|Sd^HB| + 1)   (4.1)

being n the total number of features and |Sd^HB| the cardinality of Sd^HB. To ensure the splitting of an impure hyper-box into pure hyper-boxes, the relation nHB ≥ nc must be satisfied. In particular, due to the admissibility property of the heuristic function, the number of cuts needed to define the minimum number of hyper-boxes is defined as

nSheur^HB = Σ_{d=1}^{n} |Sd^HB|   (4.2)

such that nHB ≥ nc and nHB is minimal. For a better comprehension, Fig. 4.5 shows two different impure state scenarios. For each state, a white hyper-box containing four prototypes of four different classes is depicted. Dashed lines represent the other cuts, limited to the considered hyper-boxes. Considering these hyper-boxes and taking into account the prototype positions in the space, it is straightforward to identify that, in order to purify the hyper-boxes, two cuts are needed in (a) and three cuts in (b). Nevertheless, due to the way the heuristic estimates the needed number of cuts (which does not consider the positions but only the number of different classes), nSheur^HB = 2 for both scenarios (a) and (b). Moreover, the figure shows the connected hyper-boxes, depicted in checked white/gray. These, being intersected by the cuts estimated to purify the white hyper-boxes, are no longer taken into account in order to preserve admissibility. Algorithm 4.1 provides the procedure adopted to evaluate nSheur^HB while satisfying the mentioned constraints. It is noteworthy how cuts are taken into account in a suitable fashion in order to ensure admissibility. More specifically, admissibility is related to the estimation of the number of cuts needed to purify a state, as already mentioned, and the same statement holds when taking into account a single impure hyper-box. On the other hand, the idea underlying the heuristic function is to evaluate the number of hyper-boxes w.r.t. the number of different classes. The number of hyper-boxes is a function of the number of cuts defined over the axes (as shown in eq. 4.1) and, in particular, is sensitive to the distribution of cuts over the axes.


Figure 4.5: Example of two non-goal states. In white, two impure hyper-boxes with four prototypes of different classes are depicted.

other words, considering a given number of cuts, their distribution over the axes may lead to a different number of hyper-boxes. For example, Fig. 4.6 depicts a two-dimensional problem in which the white impure hyper-box is taken into account. The figure shows how the same number of cuts, differently distributed over the problem axes, leads to a different number of built hyper-boxes: six in the left-hand side figure and four in the right-hand side figure.

In the proposed solution, each newly involved cut is considered to be on a different axis of the problem (respecting the constraints $|S_d^{HB}| \leq |T_d^{HB}|$). This ensures that the number of cuts used is minimal, since it maximizes the number of hyper-boxes built per cut.


Figure 4.6: The same impure hyper-box with the same number of cuts differently applied over the axes.

Algorithm 4.1 Computation of the heuristic value for a single impure hyper-box
input: the number of candidate cuts that intersect the hyper-box over each axis, nTd, d = 1, ..., n
input: the number of different classes in the hyper-box, nc
output: the estimation of the minimum number of cuts needed to purify the hyper-box, nSheur

  for d = 1..n do: nSd ← 0 end for
  builtHBoxes ← 1
  nSheur ← 0
  d ← 1
  (∗ repeats the cycle until the number of built hyper-boxes is enough to separate the different classes ∗)
  while not (builtHBoxes ≥ nc) do:
      (∗ adds a cut on the d-th axis, if candidate cuts are still available on it ∗)
      if nSd < nTd then:
          nSd ← nSd + 1
          nSheur ← nSheur + 1
          builtHBoxes ← ∏d=1..n (nSd + 1)
      end if
      (∗ moves to the next axis in a round-robin fashion ∗)
      d ← (d mod n) + 1
  end while
  return nSheur
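A minimal Python sketch of the per-hyper-box estimate of Algorithm 4.1 is reported below. Function and variable names are hypothetical; the round-robin assignment of cuts to the axes follows the description given above, and the guard for exhausted candidate cuts is an extra safety check not present in the pseudocode.

```python
from math import prod

def hyperbox_heuristic(candidate_cuts_per_axis, n_classes):
    """Estimate the minimum number of cuts needed to purify one impure hyper-box:
    cuts are assigned to the axes in a round-robin fashion, never exceeding the
    candidate cuts available on an axis, until the product of (cuts_d + 1) over
    all axes reaches the number of distinct classes in the hyper-box."""
    n_axes = len(candidate_cuts_per_axis)
    cuts_used = [0] * n_axes
    n_cuts, d = 0, 0
    while prod(c + 1 for c in cuts_used) < n_classes:
        if all(u >= t for u, t in zip(cuts_used, candidate_cuts_per_axis)):
            break                                         # no further candidate cut available
        if cuts_used[d] < candidate_cuts_per_axis[d]:     # respect |S_d^HB| <= |T_d^HB|
            cuts_used[d] += 1
            n_cuts += 1
        d = (d + 1) % n_axes                              # move to the next axis
    return n_cuts

# An impure hyper-box crossed by 3 candidate cuts on the first axis and 2 on the
# second, containing prototypes of 4 different classes: 2 cuts are estimated (2x2 boxes).
print(hyperbox_heuristic([3, 2], 4))  # 2
```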


4.2 Solution Search with GA-Guided A*

The underlying idea is that "it is possible to exploit a sub-optimal good solution to drive the algorithm to find the optimal solution", which sounds feasible. As anticipated at the beginning of the section, the sub-optimal good solution is obtained by exploiting a GA. In particular, this approach is used because of its convergence speed, which is useful to quickly find a solution state to be used as PoG, even if not an optimal one. As mentioned for the DCγ implementation (sec. 4.2.1), the efficiency of the GA is paid for with the loss of the optimality of the solution. In the GA-guided A* algorithm this does not represent a problem, because the A* algorithm preserves the optimality of the solution (see Fig. 4.11).

Figure 4.11: The GA-guided A* workflow. The GA, computed before the A* search, identifies a solution in the search space that is adopted as PoG by A*, which then computes the actual optimal solution for the problem.

Inspired by the DCγ approach, with which the chromosome encoding/decoding is shared (introduced in sec. 4.2.1), a new GA is developed. The aim is to find a solution that is as close as possible to the optimal solution state11, as quickly as possible. By the definition of solution state and due to the search space characteristics, there is a level in the search space where the optimal solution σopt is located. Other states

11

A state σ is a solution state if it is composed only of pure hyper-boxes.


with the same number of cuts as σopt lie at the same level. Other possible solutions to the problem can be located in the area below σopt or, at most, at the same level as σopt, in which case they represent other optimal solutions to the problem - i.e. solutions with the minimum number of cuts.

In Fig. 4.12 the feasible solution area in the search space is depicted (gray plus blue area). In particular, to maximize the effectiveness of the GA guide, the solution provided by the GA, and hence used as PoG, should share as many cuts as possible with the optimal solution (the ideal condition is σPoG ≡ σopt). As previously mentioned, this is due to the fact that if σPoG is a child of σopt - i.e. σopt ⊆ σPoG - then the reaching of σopt is accelerated by the PoG attraction action. At the other extreme, a σPoG that does not share any cut with σopt does not provide a useful attraction at all. For example, in Fig. 4.13 two solution states are depicted: the left-hand side state is considered as σPoG and the right-hand side one as σopt. They do not share any cut, and hence σPoG, although a solution, is located in a totally different region of the feasible solution area without providing a useful attraction towards σopt. The PoG effectiveness is maximized if σopt is one of its ancestors; however, a PoG may be useful even if just a subset of the cuts of σopt is present in it - i.e. σopt is partially included in σPoG, σopt ∩ σPoG ≠ ∅, a condition that occurs when there is no direct kinship between the two states but they share an ancestor. Therefore, a PoG ideal region can be defined over the feasible solution area (see Fig. 4.12, blue area). In particular, looking at Fig. 4.12, a couple of considerations have to be made: (i) the higher the PoG is in the feasible solution area, the lower its granularity; (ii) within the PoG ideal region, the closer the PoG is to the optimal solution, the better its effectiveness. Nevertheless, provided that the GA works as expected - i.e. it finds a solution state while trying to minimize its number of cuts - the location of the GA solution in the feasible solution area is strictly problem dependent.



Figure 4.12: In the picture, the feasible solution area (gray plus blue) and the PoG ideal region are depicted.


Figure 4.13: Comparison between two different DC* 2.0 solutions. On the right-hand side, the optimal solution; on the left-hand side, a sub-optimal solution to the problem. It is noteworthy how the two solutions do not share any cut: therefore, they are located in very different places in the search space.

Given the PoG in the search space, the guiding process - i.e. the attraction operated by the PoG during the A* state expansion - is actually carried out through a state comparison. The idea is that, given the PoG state σPoG and a number of states σi having the same f(σ) value, the most promising state is the one closest to σPoG. Therefore, in order to evaluate the closeness to σPoG, a distance measure between states must be introduced. Exploiting the concepts discussed above in this section, the distance measure can reasonably be based on cuts: the more cuts two states share, the closer they are, and vice versa. Formally, the distance measure from the PoG is based


on the Jaccard distance and is defined as follows:

$$\delta_{PoG}(\sigma_{PoG}, \sigma) = 1 - \frac{\left| S_d^{PoG} \cap S_d \right|}{\left| S_d^{PoG} \cup S_d \right|} \in [0, 1] \qquad (4.5)$$

where $S_d$ is the collection of cuts of a state over all the n problem dimensions. Each expanded state σ is characterized by a distance measure from the PoG, δPoG(σPoG, σ). As mentioned, to preserve the A* optimality characteristic and effectively operate an attraction action toward the optimal solution, the distance measure δPoG(σPoG, σ) is placed as second priority in the A* search queue. Therefore, in DC* v2.0 with GA-guided A*, each state in the A* priority queue is characterized by the three values ⟨f(σ), δPoG(σPoG, σ), Δt⟩. This means that A* visits the expanded states taking into account their cost function f(σ) (preserving the solution minimality and exploiting the typical search property of A*) and then makes use of the GA guide only when there is a tie in the f(σ) value between states. In this case, the state with the lower distance from the PoG, δPoG(σPoG, σ), is expanded (due to the effect of the attraction action): the lower the distance value, the more cuts the current state shares with the PoG, and the more promising the path through that state is for reaching the optimal solution (see Fig. 4.14).
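As a minimal illustration of the priority scheme just described, the following Python sketch (hypothetical names; cut configurations are represented as sets of (feature, position) pairs, a representation assumed here and not prescribed by DC*) computes the Jaccard distance of eq. (4.5) and builds the three-valued priority key ⟨f(σ), δPoG, Δt⟩.

```python
def jaccard_distance(cuts_pog, cuts_state):
    """Eq. (4.5): 1 - |intersection| / |union| between two cut configurations."""
    union = cuts_pog | cuts_state
    if not union:                      # both states have no cuts
        return 0.0
    return 1.0 - len(cuts_pog & cuts_state) / len(union)

def priority_key(state_cuts, f_value, delta_t, pog_cuts):
    """Key used by the GA-guided A* priority queue: f(σ) first,
    distance from the PoG second, Δt as the final tie-breaker."""
    return (f_value, jaccard_distance(pog_cuts, state_cuts), delta_t)

# usage: order two states that have the same cost f(σ) = 3
pog    = {("x", 0.4), ("y", 1.2), ("y", 2.0)}
state1 = {("x", 0.4), ("y", 1.2), ("x", 0.9)}
state2 = {("x", 0.1), ("y", 0.7), ("y", 1.5)}
best = min([state1, state2], key=lambda s: priority_key(s, 3, 0.0, pog))
# state1 is preferred: it shares two cuts with the PoG, state2 shares none
```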


Figure 4.14: The search space of DC* v2.0 with GA-guided A*. In the upper part of the search space the explored states are separated from the unexplored states by the frontier (states in the A* priority queue). The optimal solution (in red) and the PoG (in blue) are depicted.


Finally, when more than one state has both the same cost function f(σ) and the same distance from the PoG δPoG(σPoG, σ), the Δt value is exploited to choose the next state to expand.

4.2.2.1 GA-guided A*: GA details

In the following, details about the GA implemented for the GA-guided A* algorithm are provided.

Initial Population The initial GA population is not randomly chosen. Given the chromosome length nch (i.e. the number of all possible cuts Td for d = 1, ..., n), the initial population size is twice the chromosome length, 2·nch, limited to 200 individuals. Each initial individual has only one gene equal to 1, i.e. a configuration in which only one cut is present. Usually, each such configuration therefore appears twice; however, this does not happen when the 200-individual limit is reached.

Individual representation Individuals are represented by chromosomes as defined in eq. (4.3).

Number of generations The number of generations is set to 20. This is due to the fact that preliminary experiments showed that the population evolution actually occurs in the first 20 generations (i.e. it is fair to assume that the GA fitness function has a constant trend after the 20th generation).

Selection Criteria Individuals are selected by Tournament selection.

Crossover and Mutation rates The crossover rate is set to 0.7 and the mutation rate is set to 1/nch.

Fitness Function To explain the design of the fitness function, some considerations have to be made. The final GA solution has to be an individual whose configuration of cuts represents a goal state - i.e. a state in which all the hyper-boxes are pure. On the other hand, for the sake of interpretability, a state with a low number of cuts is desirable (i.e. the lower the number of cuts, the lower the number


of information granules, and the higher the interpretability). Due to the problem nature, those two objectives are in conflict with each other. To tackle the problem, a granularity measure gr(ch) → [0, 1] and a pureness measure p(ch) → [0, 1] are introduced to characterize each GA individual. The former is strictly related to the number of cuts adopted in a state w.r.t. the valid cuts (the relation between the number of cuts and the number of hyper-boxes is discussed in sec. 4.1.2, in the context of the new heuristic function characterization), and is formalized as follows: given a chromosome ch as defined in (4.3),

$$gr(ch) = \frac{\sum_{d=1}^{n} \sum_{k=1}^{n_{T_d}} v_k^d}{\sum_{d=1}^{n} n_{T_d}}$$

The latter, the pureness measure p(ch), is defined taking into account the hyper-box information. In order to allow the hyper-box check, the evaluation of p(ch) requires decoding the chromosome into a state σ. Two functions, countPure(ch) and countImpure(ch), count respectively the number of (non-empty) pure hyper-boxes and the number of impure hyper-boxes of which σ is composed. The p(ch) value is defined as

$$p(ch) = \frac{countPure(ch)}{countPure(ch) + countImpure(ch)}$$

Individuals with p(ch) = 1 are solutions of the problem - i.e. all the non-empty hyper-boxes are pure. Thus, the aim of the fitness function is both to strongly penalize individuals with p(ch) ≠ 1 and to work on individuals that have p(ch) = 1 by minimizing the granularity gr(ch). Therefore, the score function for an individual ch is defined as follows:

$$score_{ch} = \left[ 1 - gr(ch) \right] \, p(ch)^5$$

The use of the exponent for p(ch) is motivated by the need to strongly emphasize the purity requirement in the determination of the score. The fitness function value refers to the entire generation population and is calculated as the average score value. In


particular, at the 20th generation the best individual - i.e. the one with the highest score - is chosen as the GA solution.
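As a rough illustration, the following Python sketch (hypothetical names) computes gr(ch), p(ch) and the resulting score. Decoding a chromosome into a state and counting its pure/impure hyper-boxes is problem-specific and is assumed to be done elsewhere, so the two counts are passed in as plain numbers.

```python
from typing import Sequence

def granularity(chromosome: Sequence[int]) -> float:
    """gr(ch): fraction of activated cuts over all candidate cuts (genes are 0/1)."""
    return sum(chromosome) / len(chromosome)

def pureness(n_pure: int, n_impure: int) -> float:
    """p(ch): share of pure hyper-boxes among the non-empty ones of the decoded state."""
    return n_pure / (n_pure + n_impure)

def score(chromosome: Sequence[int], n_pure: int, n_impure: int) -> float:
    """score_ch = [1 - gr(ch)] * p(ch)^5: impure states are strongly penalized,
    and among goal states (p = 1) fewer cuts means a higher score."""
    return (1.0 - granularity(chromosome)) * pureness(n_pure, n_impure) ** 5

# toy usage: a goal state using 3 of 10 candidate cuts vs. one using 6 of 10
print(score([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], n_pure=4, n_impure=0))  # 0.7
print(score([1, 1, 1, 1, 1, 1, 0, 0, 0, 0], n_pure=7, n_impure=0))  # 0.4
```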

4.3 Granule Fuzzification

In this section, changes in the granule fuzzification phase of DC* are presented. In particular, a different type of fuzzy membership function is first introduced in the DC* method (sec. 4.3.1). Then, based on the membership function characteristics and their underlying semantics, three kinds of granule fuzzification are described (sec. 4.3.2, sec. 4.3.3, sec. 4.3.4). It is worth noticing that the Granule Fuzzification phase can be implemented by different methods without compromising the partitions obtained by DC*. In other words, once the partitions are obtained over the problem features, the granule fuzzification type can be changed according to the application needs as well as the user preferences.

4.3.1 Strong Fuzzy Partitions based on cuts

As seen in sec. 3.2.3, DC* v1.0 uses Gaussian fuzzy sets in order to fuzzify the partitions obtained by the double clustering phase. Gaussian fuzzy sets were adopted to allow the integration of the obtained fuzzy partitions into Neuro-Fuzzy Systems, which require completely differentiable fuzzy sets [54]. However, when this kind of integration is not needed, the choice of Gaussian fuzzy sets is no longer justified. Moreover, as discussed in [115], Gaussian membership functions do not respect the Proper Ordering of linguistic concepts constraint. As a consequence, trapezoidal fuzzy sets are preferred when the modeling process does not involve any gradient-based learning algorithm. In DC* v2.0, Strong Fuzzy Partitions (SFPs) are used to fuzzify the information granules defined over the problem features. Even if they represent a simplification of the concepts underlying fuzzy sets, the choice to adopt SFPs has a twofold motivation. The first one is the simplicity of SFPs: they are widely used because they simplify the modeling process, as they usually require few parameters for their definition. The second motivation is the interpretability of SFPs: the fulfillment of many interpretability constraints (as described in sec. 2.3) is guaranteed if SFPs


are adopted12 . Triangular SFPs (TSFPs) are widely used for modeling interpretable fuzzy systems. They are characterized by the use of triangular fuzzy sets to define a fuzzy partition. Triangular fuzzy sets have a number of desirable properties, which are useful for interpretability (they are normal, convex and continuous) as well as for modeling [133].

The definition of a TSFP with n fuzzy sets is completely characterized by n values that correspond to the prototypes of each fuzzy set: this makes the design of TSFPs very simple. Usually, the prototypes are computed by some algorithm that tries to locate prototypes in order to better represent the available data. Other approaches fix the number of triangular fuzzy sets per features; then the location of prototypes is determined according to some optimization process [31] or through evolutionary algorithms [87, 21]. In some cases, fuzzy partitions are designed after a clustering analysis of multidimensional data. This approach enables the discovery of multidimensional relationships among data, which can be conveniently represented as fuzzy rules [1]. To ensure interpretability, clusters are usually projected on each input feature, where fuzzy sets are defined so as to resemble as much as possible the projected clusters [89, 81]. Often, prototype-based clustering is used (like fuzzy c-means or similar): in these cases the prototypes of multidimensional clusters are projected on each input feature and could serve as prototypes of the fuzzy sets in a partition [5].

However, as described in [116], the simple use of multidimensional prototypes does not give enough information about the span of clusters within the data domain (see Fig. 4.15).

12

SFPs are not strictly necessary for satisfying the above mentioned interpretability constraints.


Figure 4.15: Comparison between a SFP derived from prototypes (left) and a SFP derived from cuts (right). It is noteworthy how the clusters result improperly represented in the prototype-based approach.

For this reason, an alternative approach makes use of cuts, i.e. points of separation between clusters projected onto input features (as described in sec. 3.1). Cuts can be conveniently used to define the bounds of the 0.5-cuts of the fuzzy sets in a fuzzy partition13. More specifically, given a collection of cuts, a SFP can be defined so that the 0.5-cuts of the fuzzy sets in the partition coincide with the intervals bounded by the cuts. Since the 0.5-cut of a fuzzy set is the set of elements that are most representative for the fuzzy set, a SFP based on cuts is a robust representation of the projections of multidimensional clusters on an input feature. As shown in the following, a SFP based on cuts cannot always be defined by triangular fuzzy sets. The consequences of this result impact the flexibility of modeling approaches based on triangular fuzzy sets: imposing the use of this type of fuzzy sets restricts the possibilities of representing multi-dimensional relationships in an interpretable way. In fact, the use of triangular fuzzy sets represents a further bias - which is not motivated by any interpretability requirement - to be added to the structural constraints that are already taken into account while designing a fuzzy model (as known, such constraints ultimately impose the requirement of a balance between interpretability and accuracy). In other words, the flexibility connected to a modeling process based on the employment of SFPs may be restricted by confining the choice of fuzzy sets to the triangular category. As a consequence, interest should

13

The 0.5-cut of a fuzzy set is the (crisp) set of all elements with membership degree greater or equal to 0.5.


be shifted towards a more relevant issue concerning the possibility to define SFPs based on cuts. In the following, the feasibility of this approach is shown by resorting to trapezoidal fuzzy sets. Trapezoidal fuzzy sets are widely used for modeling interpretable fuzzy systems [54, 21, 141, 2, 60]; however, in most cases trapezoidal fuzzy sets require more parameters than triangular fuzzy sets. Such parameters need to be tuned according to some heuristic optimization process like genetic algorithms. Thanks to the DC* way of computing, the introduced granulation approaches do not need free parameters because trapezoidal fuzzy sets are defined given a collection of cuts only. In this way there is no need of further optimization processes beyond the partitioning process that produced the cuts. Below, a formal proof that triangular fuzzy sets cannot always be used to define SFPs given a set of cuts is provided. Then, a procedure to define SFPs based on trapezoidal fuzzy sets is presented.

4.3.1.1 Generation of SFP from cuts

A SFP is a collection14 of fuzzy sets A1, A2, ..., An+1 defined on a Universe of Discourse X = [m, M] ⊆ ℝ such that:

$$\forall x \in X: \quad \sum_{i=1}^{n+1} A_i(x) = 1 \qquad (4.6)$$

A triangular fuzzy set is denoted by T[l, p, r] where:
• l is the leftmost bound of its support;
• p is the element of its core (also called prototype);
• r is the rightmost bound of its support.

14

We assume that the collection is sorted, so that it is legitimate to refer to the i-th fuzzy set in a SFP.


The membership function of a triangular fuzzy set can be conveniently defined as a case-based function:

$$T[l, p, r](x) = \begin{cases} \dfrac{x - l}{p - l}, & x \in \, ]l, p] \\[6pt] \dfrac{x - r}{p - r}, & x \in \, ]p, r[ \\[6pt] 0, & x \leq l \,\vee\, x \geq r \end{cases}$$

A triangular fuzzy set is well-formed if and only if

$$l \leq p \leq r \qquad (4.7)$$

A Triangular Strong Fuzzy Partition (TSFP) is a SFP made of triangular fuzzy sets only15. A TSFP made of n + 1 fuzzy sets is completely characterized by n − 1 parameters pi for i = 2, 3, ..., n. In fact, the triangular fuzzy sets of a TSFP can be defined as T[pi−1, pi, pi+1] for i = 1, 2, ..., n + 1, with the convention that p0 = p1 = m and pn+1 = pn+2 = M. Given an element x ∈ X, at most two fuzzy sets have non-zero membership in a TSFP: these fuzzy sets are said to be adjacent. Furthermore, since triangular fuzzy sets are convex, their α-cuts are intervals. Given the constraint (4.6) of a SFP, it is immediate to verify that the 0.5-cuts of two adjacent fuzzy sets in a TSFP are also adjacent (in the sense of sharing one and only one intersection point). Let t1, t2, ..., tn ∈ X be a sequence of cuts, where ti < ti+1 for i = 1, 2, ..., n − 1. In order to design a SFP based on cuts, each cut corresponds to an intersection point between two adjacent fuzzy sets in a SFP; as a consequence, n cuts correspond to the intersection points of n + 1 fuzzy sets in a SFP. (An intersection point between two fuzzy sets is a point in X where both fuzzy sets have the same non-zero membership, see also Fig. 4.15, right-hand side.) In the following, it is shown that it is not always possible to build a TSFP of n + 1 fuzzy sets given an arbitrary set of n cuts. This is proved by attempting to build a TSFP; the conditions that prevent the definition of well-formed triangular

15

Exceptionally, trapezoidal fuzzy sets can be defined as leftmost and rightmost fuzzy sets. However, this case can be safely ignored in the present argumentation.


fuzzy sets are highlighted. The reader can refer to Fig. 4.16 as an illustrative example of the proof.

Figure 4.16: A sequence of cuts that prevents the generation of a well-formed triangular fuzzy set (red dashed line).

We suppose that a triangular fuzzy set T[li−1, pi−1, ri−1] is defined so that

$$T[l_{i-1}, p_{i-1}, r_{i-1}](t_{i-1}) = 0.5 \quad \text{and} \quad T[l_{i-1}, p_{i-1}, r_{i-1}](t_i) = 0.5$$

The membership values on ti−1 and ti constrain the parameters li−1 and ri−1. In particular, the parameter ri−1 can be obtained by applying the case-based definition of a triangular fuzzy set, resulting in

$$\frac{t_i - r_{i-1}}{p_{i-1} - r_{i-1}} = 0.5 \;\Rightarrow\; r_{i-1} = 2t_i - p_{i-1}$$

The next triangular fuzzy set T[li, pi, ri] must be defined so as to satisfy the constraint (4.6) of a SFP. The parameters of the membership function must therefore be defined as li = pi−1 and pi = ri−1 = 2ti − pi−1,

while ri is defined such that

$$0.5 = \frac{t_{i+1} - r_i}{p_i - r_i}$$

i.e.

$$r_i = 2\,(t_{i+1} - t_i) + p_{i-1}$$

In order to assure well-formedness (4.7), the relation pi ≤ ri must hold. It is easy to show that this relation is true if and only if

$$t_{i+1} \geq 2t_i - p_{i-1} = r_{i-1} \qquad (4.8)$$

Therefore, if the cuts used for partitioning do not verify (4.8), it is not possible to define well-formed triangular fuzzy sets. This result has a strong impact on interpretable fuzzy modeling. In fact, if T denotes the collection of all possible sets of cuts on X, and P the set of all TSFPs, then relation (4.8) states that it is not possible to define a bijective mapping from T to P. On the other hand, an injective mapping from P to T is trivial: given a TSFP, the set of cuts can be defined by selecting all the intersection points between triangular fuzzy sets. Therefore, the set T is richer than P, and thus any algorithm that produces a collection of cuts is potentially more flexible and less biased than an algorithm that produces triangular SFPs.
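As a small illustration, the following Python sketch (hypothetical name) checks relation (4.8) for a sequence of cuts by following the left-to-right construction used in the proof, with the convention p1 = m taken from the TSFP definition above.

```python
def satisfies_relation_48(cuts, m):
    """Check relation (4.8) along the left-to-right construction of a TSFP:
    starting from p_1 = m, each new prototype p_{i+1} = 2*t_i - p_i (the reflection
    of the previous prototype across the cut) must not exceed the next cut t_{i+1}."""
    p = m
    for t_cur, t_next in zip(cuts, cuts[1:]):
        p = 2 * t_cur - p          # prototype forced by the current cut
        if t_next < p:             # violates t_{i+1} >= 2*t_i - p_{i-1}, i.e. relation (4.8)
            return False
    return True

print(satisfies_relation_48([0.2, 0.5, 0.9], m=0.0))   # True
print(satisfies_relation_48([0.2, 0.5, 0.55], m=0.0))  # False: 0.55 < 2*0.5 - 0.4
```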

4.3.1.2 Generation of Trapezoidal SFPs

The aim of this section is to show how it is possible to derive a SFP given a set of cuts, i.e. given an element of T by resorting to trapezoidal fuzzy sets instead of triangular fuzzy sets. In the following, some procedures to derive a SFP made of trapezoidal fuzzy sets given a collection of cuts on X are shown.


First, the definition of a trapezoidal fuzzy set is recalled:

$$T[a, b, c, d](x) = \begin{cases} \dfrac{x - a}{b - a}, & x \in \, ]a, b[ \\[6pt] 1, & x \in [b, c] \\[6pt] \dfrac{x - d}{c - d}, & x \in \, ]c, d[ \\[6pt] 0, & x \leq a \,\vee\, x \geq d \end{cases} \qquad (4.9)$$

A trapezoidal fuzzy set is well-formed if the relations

$$a \leq b \leq c \leq d$$

hold. Any triangular fuzzy set is a trapezoidal fuzzy set when

$$a = l \;\wedge\; b = c = p \;\wedge\; d = r,$$

therefore it is possible to qualify a fuzzy set as trapezoidal even if its actual shape is triangular. A SFP made of trapezoidal fuzzy sets Ai = T[ai, bi, ci, di], i = 1, 2, ..., n + 1, requires that ai+1 = ci and bi+1 = di for i = 1, 2, ..., n, as well as

$$a_1 = b_1 = m \qquad c_{n+1} = d_{n+1} = M \qquad (4.10)$$

In the following sections, three approaches for designing trapezoidal SFPs are presented in detail. The first one, called Constant Slope (sec. 4.3.2), defines trapezoidal fuzzy sets with the same slope (in absolute value). This is the simplest approach as it does not require additional knowledge for the design of a SFP. The second approach, called Variable Fuzziness (sec. 4.3.3), is based on the idea that fuzzy sets with a large support are more imprecise than fuzzy sets with a small support. As a consequence, the slope of the trapezoidal fuzzy sets is defined according to the distance between two adjacent cuts. Finally, the third approach (sec. 4.3.4) extends


the second one by requiring an additional set of Core Points, i.e. points in the domain that must belong to the core of a fuzzy set. This approach can be used when it is known a priori that some points are representative of concepts that must be fully represented by linguistic terms.

4.3.2 Constant Slope

Given a set of cuts t1, t2, ..., tn ∈ X it is possible to define a SFP made of trapezoidal fuzzy sets by applying the following procedure. First, the differences between cuts Δi = ti+1 − ti are computed for i = 0, 1, ..., n, with the convention that t0 = 2m − t1 and tn+1 = 2M − tn. Then, the smallest difference Δimin = min{Δi | i = 0, 1, ..., n} is selected, with the corresponding index imin. (More than one index may verify this relation: in such a case, the first index is selected.) By definition, the interval [timin, timin+1] is the most specific among all intervals [ti, ti+1]. Therefore, the most specific fuzzy set is defined, which is triangular and characterized by the following parameters:

$$b_{i_{min}} = c_{i_{min}} = \frac{t_{i_{min}} + t_{i_{min}+1}}{2}$$

$$a_{i_{min}} = 2t_{i_{min}} - b_{i_{min}} = \frac{3t_{i_{min}} - t_{i_{min}+1}}{2}$$

$$d_{i_{min}} = 2t_{i_{min}+1} - c_{i_{min}} = \frac{3t_{i_{min}+1} - t_{i_{min}}}{2}$$


The slopes of the oblique segments in the triangular fuzzy set have the same magnitude but opposite signs. In particular, the ascending segment has slope

$$\rho^+ = \frac{1}{b_{i_{min}} - a_{i_{min}}} = \frac{1}{t_{i_{min}+1} - t_{i_{min}}}$$

while the descending segment has slope

$$\rho^- = \frac{1}{c_{i_{min}} - d_{i_{min}}} = \frac{1}{t_{i_{min}} - t_{i_{min}+1}} = -\rho^+$$

The slopes ρ+ and ρ− are used to define the remaining fuzzy sets. By construction, the use of these slopes assures that all trapezoidal fuzzy sets are well-formed. In fact, higher slopes (in magnitude) could also be used, while lower slopes may hamper the well-formedness of the trapezoidal fuzzy sets. Given a cut ti, i = 1, 2, ..., n, the following parameters are defined:

$$a_{i+1} = t_i - \frac{1}{2\rho^+} \qquad b_{i+1} = t_i + \frac{1}{2\rho^+}$$

$$c_i = t_i + \frac{1}{2\rho^-} = a_{i+1} \qquad d_i = t_i - \frac{1}{2\rho^-} = b_{i+1}$$

Finally, the leftmost and rightmost fuzzy sets are defined so as to be truncated at the extreme points of X. Therefore

$$a_1 = b_1 = m \qquad c_{n+1} = d_{n+1} = M$$

It is easy to verify that ai ≤ bi and ci ≤ di for each i = 1, 2, ..., n + 1. Well-formedness of the trapezoidal fuzzy sets can thus be checked by verifying that bi ≤ ci for each i = 1, 2, ..., n + 1. Suppose, by contradiction, that bi > ci. By construction, this means that

$$t_{i-1} + \frac{1}{2\rho^+} > t_i + \frac{1}{2\rho^-} = t_i - \frac{1}{2\rho^+}$$

which is equivalent to $\rho^+ < \frac{1}{\Delta_{i-1}}$, which is absurd by definition of $\Delta_{i_{min}}$ (since $\rho^+ = 1/\Delta_{i_{min}}$ and $\Delta_{i_{min}} \leq \Delta_{i-1}$). An example of a CS partition is depicted in Fig. 4.17.

Figure 4.17: Example of a Constant Slope SFP from cuts. Red dots indicate the centers of the partitions and show how those points are well covered by the cores of the fuzzy sets.
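The Constant Slope construction can be summarized by the following Python sketch (hypothetical names, not part of DC*). It returns each fuzzy set as a tuple (a, b, c, d) of trapezoid parameters and treats the leftmost and rightmost sets as truncated at m and M.

```python
def constant_slope_sfp(cuts, m, M):
    """Sketch of the Constant Slope procedure: build trapezoids T[a, b, c, d]
    whose 0.5-cuts are the intervals bounded by the given cuts, all oblique
    segments sharing the slope dictated by the smallest inter-cut distance."""
    ext = [2 * m - cuts[0]] + list(cuts) + [2 * M - cuts[-1]]      # t_0 and t_{n+1}
    width = min(ext[i + 1] - ext[i] for i in range(len(ext) - 1))  # Δ_imin
    half = width / 2.0                                             # 1 / (2 ρ+)
    fuzzy_sets = []
    for i in range(len(cuts) + 1):
        left = cuts[i - 1] if i > 0 else m           # cut (or bound) on the left
        right = cuts[i] if i < len(cuts) else M      # cut (or bound) on the right
        a = left - half if i > 0 else m              # leftmost set is truncated at m
        b = left + half if i > 0 else m
        c = right - half if i < len(cuts) else M     # rightmost set is truncated at M
        d = right + half if i < len(cuts) else M
        fuzzy_sets.append((a, b, c, d))
    return fuzzy_sets

# usage: three cuts on [0, 1] -> four trapezoidal fuzzy sets
for fs in constant_slope_sfp([0.2, 0.5, 0.9], 0.0, 1.0):
    print(tuple(round(v, 3) for v in fs))
```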

4.3.3 Variable Fuzziness

This approach is based on the idea that the fuzziness of a fuzzy set in a partition depends on the amplitude of the interval between two cuts. In particular, the smaller such amplitude, the sharper the related fuzzy sets. Fuzziness can be quantified through the notion of entropy measure [16]; however, it is easy to verify that fuzziness is related to the slopes of the trapezoidal fuzzy sets, so that high slopes lead to sharp fuzzy sets and vice versa. The procedure for generating the trapezoidal fuzzy sets works as follows: for each i = 0, 1, ..., n − 1 the values Δi and Δi+1 are compared and the shortest is selected. (Here, t0 is set to m and tn+1 is set to M to make the selection coherent with the idea underlying this approach.) If Δi ≤ Δi+1, then the descending part of the fuzzy set Ai+1 is defined by a membership function that is highest at the center of Δi and


gets the value 0.5 at ti+1. Formally, this requires that

$$c_{i+1} = \frac{t_i + t_{i+1}}{2}$$

and di+1 = 2ti+1 − ci+1. As a consequence, the ascending part of the fuzzy set Ai+2 is defined accordingly: ai+2 = ci+1 and bi+2 = di+1. By construction, it is verified that bi+2 will be smaller than the midpoint of Δi+1, thus guaranteeing well-formedness of the trapezoidal fuzzy set. If Δi > Δi+1, the scheme is inverted and the ascending part of Ai+2 is defined first by setting

$$b_{i+2} = \frac{t_{i+1} + t_{i+2}}{2}$$

and ai+2 = 2ti+1 − bi+2. Then, the descending part of Ai+1 is defined accordingly: ci+1 = ai+2 and di+1 = bi+2. Finally, the undefined parts of the leftmost and rightmost fuzzy sets are set as in (4.10). In Fig. 4.18 an example of a VF partition is provided.


Figure 4.18: Example of a Variable Fuzziness SFP from cuts. Red dots indicate the centers of the partitions and show how those points are well covered by the cores of the fuzzy sets.
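A corresponding sketch of the Variable Fuzziness construction is given below, under the same assumptions (hypothetical names, trapezoids represented as (a, b, c, d) tuples); the comparison of adjacent intervals uses t0 = m and tn+1 = M as described above.

```python
def variable_fuzziness_sfp(cuts, m, M):
    """Sketch of the Variable Fuzziness procedure: around each cut, the fuzzy
    boundary is shaped by the smaller of the two adjacent intervals, so that
    narrow intervals yield sharp fuzzy sets and wide intervals fuzzy ones."""
    t = [m] + list(cuts) + [M]
    crossings = []                               # (c_i, d_i) = (a_{i+1}, b_{i+1}) at each cut
    for j in range(1, len(t) - 1):
        if t[j] - t[j - 1] <= t[j + 1] - t[j]:   # left interval is the smaller one
            c = (t[j - 1] + t[j]) / 2.0          # core ends at the midpoint of the interval
            d = 2 * t[j] - c                     # reflected across the cut
        else:                                    # right interval is the smaller one
            d = (t[j] + t[j + 1]) / 2.0
            c = 2 * t[j] - d
        crossings.append((c, d))
    fuzzy_sets = []
    for i in range(len(cuts) + 1):
        a, b = (m, m) if i == 0 else crossings[i - 1]
        c, d = (M, M) if i == len(cuts) else crossings[i]
        fuzzy_sets.append((a, b, c, d))
    return fuzzy_sets

for fs in variable_fuzziness_sfp([0.2, 0.5, 0.9], 0.0, 1.0):
    print(tuple(round(v, 3) for v in fs))
```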

4.3.4 Core Points

This approach exploits additional information to define the SFP. In particular, it is assumed that in each interval between two cuts a finite and non-empty set of points Pi ⊂ [ti, ti+1] is available, with the constraint that such points must belong to the core of the corresponding fuzzy set in the partition16. The procedure for generating the trapezoidal fuzzy sets is similar to that defined for Variable Fuzziness. More specifically, for each Pi the minimum and maximum elements are considered, i.e. p_i^min = min Pi and p_i^max = max Pi. Furthermore, the distances between such points and the cuts are considered: δ_i^left = ti − p_{i−1}^max and δ_i^right = p_i^min − ti, for i = 1, 2, ..., n. For each i the values of δ_i^left and δ_i^right are compared: if δ_i^left ≤ δ_i^right then a_{i+1} = c_i = p_{i−1}^max

16

The core of a fuzzy set is the (crisp) set of all elements with full membership.


and b_{i+1} = d_i = 2t_i − c_i; otherwise, d_i = b_{i+1} = p_i^min and c_i = a_{i+1} = 2t_i − b_{i+1}. In Fig. 4.19 an example of a CP partition is depicted.

Figure 4.19: Example of a Core Points SFP from cuts. In this case, dots represent the core points of each partition. They have to be covered by the cores of the respective fuzzy sets because they are fully representative (the needed additional knowledge) of the underlying linguistic concepts.
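Under the same representation, the Core Points construction can be sketched as follows (hypothetical names). The core_points argument (one non-empty list of points per interval) is the additional knowledge required by this approach; in DC* it can be provided by the projected prototypes, as noted below.

```python
def core_points_sfp(cuts, core_points, m, M):
    """Sketch of the Core Points procedure: around each cut, the 0.5 boundary is
    anchored to the closest core point (which must stay inside the core), and the
    opposite support bound is its reflection across the cut.
    core_points[i] holds the points of the i-th interval between cuts."""
    p_max = [max(p) for p in core_points]          # p_i^max per interval
    p_min = [min(p) for p in core_points]          # p_i^min per interval
    crossings = []
    for i, t in enumerate(cuts, start=1):
        if t - p_max[i - 1] <= p_min[i] - t:       # δ_i^left <= δ_i^right
            c = p_max[i - 1]                       # a_{i+1} = c_i = p_{i-1}^max
            d = 2 * t - c
        else:
            d = p_min[i]                           # d_i = b_{i+1} = p_i^min
            c = 2 * t - d
        crossings.append((c, d))
    sets = []
    for i in range(len(cuts) + 1):
        a, b = (m, m) if i == 0 else crossings[i - 1]
        c, d = (M, M) if i == len(cuts) else crossings[i]
        sets.append((a, b, c, d))
    return sets

# usage: DC* can supply the LVQ1 prototypes projected on the feature as core points
print(core_points_sfp([0.4, 0.7], [[0.1, 0.3], [0.5], [0.8, 0.95]], 0.0, 1.0))
```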

The optimal configuration of cuts identified by DC* represents the starting point for a modeling procedure devoted to defining a SFP for each input feature based on trapezoidal fuzzy sets, which corresponds to the last step of DC*. Furthermore, the inherent working engine of DC* is oriented to produce additional pieces of information, namely the prototypes identified by the Data Compression step (sec. 3.2.1). Therefore, DC* represents a suitable procedure to design SFPs based on cuts and core points (in line with the approach described in sec. 4.3.4).

4.4 Summary

This chapter has presented version 2.0 of DC*, a strongly enhanced version of the DC* method regarding both the accuracy/interpretability aspects of the final model and the efficiency of the method when applied to particularly complex problems. Specifically, innovations have been introduced in the second phase of the method, the


optimal solution search (described in sec. 4.1), operating on both the A* priority queue structure and the heuristic function. Moreover, a new approach to the optimal solution search, combining a Genetic Algorithm and A*, has been presented in sec. 4.2. Finally, different fuzzification criteria for the information granules have been proposed in sec. 4.3. Results about the effectiveness of DC* v2.0 are provided in the next chapter.


5 Experimental Results

This chapter aims at showing the DC* capabilities by providing a set of experiments. From a global point of view, the objective of the experimentation is to evaluate the DC* method in terms of both accuracy and interpretability. To this end, the reported experiments show the improvements that have been achieved by enhancing DC* from its original version to the new version. First, in sec. 5.2.1, DC* (with some slight modifications to the originally proposed version1) has been tested on a number of selected benchmark datasets, in a comparative experimentation with the Hierarchical Fuzzy Partitioning (HFP) algorithm (described in sec. 2.7.1.1). Comparative results have shown the capability of DC* to find a better trade-off between interpretability and accuracy, although its efficiency weakness with large datasets has also emerged. These experiments show the potential of DC* and suggest possible paths for improvement, some of which have been considered in this work. First, to tackle the efficiency problem of DC*, significant improvements to the second step of the method have been introduced, leading to DC* v2.0. In particular, in sec. 5.2.2 the performances obtained by means of the newly proposed heuristic function (described in detail in sec. 4.1.2) are compared with those of the previous method version, showing a significant improvement in terms of expanded states without compromising the solution quality. Moreover, a different approach aimed at further improving the method efficiency, based on a GA guide for the A* search (described in sec. 4.2), is compared with the new heuristic function in sec. 5.2.3. Results have shown an appreciable reduction of the computational time needed to find an optimal solution to the problem at hand. Finally, the granule fuzzification stage of DC* has been taken into account. As described in sec. 4.3, this operation is totally independent and can be implemented by different methods without compromising the partitions obtained by DC*. For this reason, in sec. 5.2.4, an experimentation that compares different kinds of interpretable

Those modifications involved the A* priority queue structure (see sec. 4.1.1) and the granule fuzzification made by means of trapezoidal strong fuzzy partitions.


granule fuzzifications has been conducted, leading to the choice of a particular fuzzification approach that maximizes the final model accuracy (while ensuring the fulfillment of the interpretability constraints).

In the next section, information about the datasets adopted in the experiments is provided. Then, the four different experiments, with their respective results, are described in detail.

5.1 Dataset Description

This section is dedicated to describing the datasets adopted in the experimentations. They form a collection of freely available benchmark classification problems, selected from the well-known UCI Machine Learning Repository [8], in which only numerical, pre-classified data without missing values appear. The datasets are heterogeneous in their scope, in order to perform the experimentation process on data showing different characteristics. Tab. 5.1 reports the structural characteristics of the datasets. In particular, Ionosphere and Glass Identification have been modified: in the former, the second feature has been removed due to its constant value over all the dataset samples; in the latter, class “4” has been removed since it is not represented by any sample. In the following, for each dataset, a brief motivation that justifies its adoption in the experimentations is provided. For the dataset descriptions the reader is referred to the UCI Machine Learning Repository [8].


Table 5.1: Dataset characteristics. *The second feature has been removed because it exhibits a constant value. **Class “4” has been removed since it is not represented by any sample.

Dataset                         samples   features   classes
Iris                               150        4         3
Wine                               178       13         3
Wisconsin Breast Cancer            683        9         2
Pima Indians Diabetes              368        8         2
Vertebral Column (2 classes)       310        6         2
Vertebral Column (3 classes)       310        6         3
Ionosphere                         351    33(34)*       2
Cardiotocography (CTG)            2126       21         3
Glass Identification               214        9      6(7)**
Statlog-Shuttle                  58000        9         7

Iris The Iris dataset is one of the best-known datasets in the literature and hence has been selected as a good benchmark problem in order to test the DC* capabilities and compare them with other approaches.

Wine The Wine dataset is another of the best-known datasets in the literature. It has been selected due to its number of features: it is interesting to study how DC* performs on this kind of problem.

Wisconsin Breast Cancer The Wisconsin Breast Cancer (WBC) dataset has been selected mostly for its number of both samples and features, and also because it is a two-class problem.

Pima Indians Diabetes Pima Indians is one of the best-known problems in the literature and hence has been selected as a good benchmark problem for DC*.

Vertebral Column (two and three classes) This dataset has a twofold class labeling. It has been selected to study the DC* behavior when the number of class labels increases over the same problem.

Ionosphere The Ionosphere dataset has been selected due to its high number of features.

Cardiotocography The Cardiotocography (CTG) dataset has been selected because of its number of both features and samples.

Glass Identification Glass is another of the best-known datasets in the literature. It has been selected as a good benchmark problem (due to its complexity)


as well as for its number of classes.

Statlog-Shuttle This dataset has been selected mostly because of its complexity, given by the number of samples and the number of strongly unbalanced classes.

5.2 Experimentations

In the following, experiments regarding the evolution from DC* v1.0 to DC* v2.0 are provided. In sec. 5.2.1, a comparison between DC* and the well-known Hierarchical Fuzzy Partitioning algorithm [61] (described in sec. 2.7.1.1) is described. In sec. 5.2.2, the efficiency performances of the newly proposed heuristic function (described in sec. 4.1.2) are compared with the ones obtained by means of the original heuristic function. In sec. 5.2.3, the capabilities of DC* with GA-Guided A* (introduced in sec. 4.2.2) are compared with the efficiency of DC* with the new heuristic function. Finally, different types of partition fuzzification, based on Strong Fuzzy Partitions from cuts (see sec. 4.3), are evaluated in sec. 5.2.4.

5.2.1 Experiment 1: HFP / DC* comparative experimentation

This experiment aims at comparing two algorithms that are capable of generating fuzzy partitions from data so as to verify a number of interpretability constraints: Hierarchical Fuzzy Partitioning (HFP) and Double Clustering with A* (DC*). In particular, the tested DC* version is 1.0, described in sec. 3.2, except for the queue structure modifications, discussed in sec. 4.1.1, and the partition fuzzification, operated by the Variable Fuzziness method as described in sec. 4.3.3. The idea is to obtain a starting point to evaluate the DC* method, in terms of both accuracy and interpretability, and to focus on its efficiency performance (intended as the capability to provide a model of the data in a reasonable time). Both algorithms exhibit the distinguishing feature of self-determining the number of fuzzy sets in each fuzzy partition, thus relieving the user from the selection of the best granularity level for each input feature. However, the two algorithms adopt very different approaches in generating fuzzy partitions, thus motivating an extensive experimentation to highlight the points of strength and weakness of both. The experimental results show that, while HFP is on the average more efficient, DC* is capable


of generating fuzzy partitions with a better trade-off between interpretability and accuracy, and generally offers greater stability with respect to its hyper-parameters.

Experimental setup

The overall setup of the experimental sessions is described in this section. The aim of the experimentation is to perform a fair comparative test between HFP and DC*. The test involves the use of two freely available tools: WEKA2 3.6.7 [65] and FisPro3 3.4 [62]. The first is a suite of machine learning algorithms for data mining tasks: it is used to perform a stratified ten-fold partition of the datasets involved in the experimentation. The latter is an open source software for fuzzy inference system design and optimization: it includes an implementation of HFP and a tool for the subsequent generation of FRBSs. FisPro is also exploited to carry out the performance evaluation of all the derived FRBSs. Fig. 5.1 depicts the comparative experimentation framework highlighting the role of WEKA and FisPro, together with the contribution of DC*.

2 Weka is available at http://www.cs.waikato.ac.nz/ml/weka/
3 FisPro is available at http://www.inra.fr/mia/M/fispro/FisPro_EN.html


Figure 5.1: DC* - HFP. Comparative experimentation framework.

Ten datasets are involved in the experimental sessions: all of them, as described in sec. 5.1, include numerical, classified data without missing values. The datasets are heterogeneous in their scope, in order to perform the experimentation process on data showing different characteristics. In view of the ultimate validation to be accomplished at the end of the experimentation, the first step consists in a stratified ten-fold partition of the datasets, performed using WEKA: given a seed value, the tool randomly creates the data partitions providing the learning sets and the test sets. For each fold a different experiment is carried out. Each learning set is meant to be processed by both HFP and DC*. After the generation of partitions, the corresponding FRBSs are defined. Since HFP generates a family of partitions, only two partitions are eventually selected for comparison: the most accurate —denoted as “minErr”— and the most interpretable (i.e. the


partition with the smallest number of rules), denoted as “minRules”. Furthermore, HFP requires the specification of a number of hyper-parameters, which have been fixed to their default values. On the other hand, the behavior of DC* mainly depends on the number of prototypes required for the data compression phase (operated by LVQ1), which has been varied for each experimentation, while the following hyper-parameters (defined in sec. 3.2.1) are fixed for all experimental sessions:
• initial learning rate = 0.01;
• maximum number of epochs = 1000;
• shifting threshold value = 10^−4.
For each dataset, several executions of DC* are performed, doubling the number of prototypes at each run. In particular, the first run is computed with a number of prototypes equal to the number of classes (one prototype per class). At each successive run, the number of prototypes is doubled and proportionally distributed among the classes. With few exceptions, the process is iterated until the number of prototypes is at least equal to the number of rules of the most accurate FRBS identified by HFP. To evaluate the accuracy of the final FRBSs — both from HFP and DC* — a module within FisPro is applied on the test sets. Since a ten-fold cross-validation scheme is adopted, the performance for each dataset can be expressed in terms of average values related to ten different classification models.

Results and discussion

A number of considerations can be drawn from the analysis of the experimental results, which are globally illustrated in Tab. 5.2 and Tab. 5.3.


Table 5.2: A general picture of the experimental results (first part). Each column reports the average results (over 10-fold cross validation) ± the standard deviation. For DC* the number p of prototypes used in the first stage is also reported. In bold, the best results in terms of accuracy/interpretability trade-off are highlighted.

Dataset       Algorithm      Rules         Features     mean #MF     Err %
Iris          HFP minErr     8.5±0.81      3±0          2.09±0       5.33±7.20
              HFP minRules   2±0           1±0          2±0          33±14.00
              DC* p=3        3±0           2±0          2±0          30.7±14.93
              DC* p=6        3.6±0.49      2±0          2±0          21.3±13.27
              DC* p=12       3±0           1.6±0.49     2.4±0.49     6.67±6.67
Wine          HFP minErr     39.6±20.29    5.4±1.11     2.31±0.19    7.22±3.72
              HFP minRules   2±0           1±0          2±0          32±5.41
              DC* p=3        2.9±0.3       2±0          2±0          43.3±7.50
              DC* p=6        3.1±0.3       2±0          2±0          23.8±5.97
              DC* p=12       3.5±0.5       2±0          2±0          21.3±11.00
              DC* p=24       4.8±1.25      2.3±0.45     2±0          19.1±4.06
WBC           HFP minErr     34.7±18       4.9±0.83     2.02±0.05    4.85±2.09
              HFP minRules   2±0           1±0          2±0          16.5±6.74
              DC* p=2        2±0           1±0          2±0          13.7±4.84
              DC* p=4        2±0           1±0          2±0          12.8±4.56
              DC* p=8        2±0           1±0          2±0          13.7±4.84
              DC* p=16       2±0           1±0          2±0          13.7±3.49
              DC* p=32       3.1±1.58      1.5±0.67     2±0          6.47±3.03
              DC* p=48       3.5±1.5       1.7±0.64     2±0          6.62±3.56
Pima          HFP minErr     39±16.64      5.8±1.17     2.29±0.15    28±4.55
              HFP minRules   2±0           1±0          2±0          35±4.03
              DC* p=2        2±0           1±0          2±0          35.6±4.08
              DC* p=4        2±0           1±0          2±0          36.9±4.83
              DC* p=8        2.2±0.6       1.1±0.3      2±0          37.3±5.64
              DC* p=16       2.6±0.8       1.4±0.49     2±0          32.6±9.38
              DC* p=32       7.6±2.54      3.3±0.46     2±0          22.2±6.66
Vertebral 2   HFP minErr     5.5±3.61      2.3±0.64     2±0          26±7.66
              HFP minRules   2±0           1±0          2±0          33±10.28
              DC* p=2        2±0           1±0          2±0          34.7±9.53
              DC* p=4        2±0           1±0          2±0          32.2±10.66
              DC* p=8        2±0           1±0          2±0          25.3±8.31

Table 5.3: (cont'd from Tab. 5.2).

Dataset       Algorithm      Rules         Features     mean #MF     Err %
Vertebral 3   HFP minErr     4.6±1.28      2.1±0.3      2.05±0.15    43±8.63
              HFP minRules   2±0           1±0          2±0          47±9.16
              DC* p=3        3±0           2±0          2±0          51.3±7.28
              DC* p=6        3.9±0.3       2±0          2±0          33.4±11.63
Ionosphere    HFP minErr     300±637.02    5±3.02       2±0          16±6.20
              HFP minRules   2±0           1±0          2±0          31±11.43
              DC* p=2        2±0           1±0          2±0          39.7±8.60
              DC* p=4        2±0           1±0          2±0          23.1±10.26
              DC* p=8        2.2±0.6       1.1±0.3      2±0          15.7±12.54
CTG 3         HFP minErr     120.8±95.21   7.9±1.87     2.16±0.08    17.4±2.48
              HFP minRules   2±0           1±0          2±0          20.4±2.50
              DC* p=3        2.7±0.46      2±0          2±0          42.3±6.34
              DC* p=6        3±0           2±0          2±0          26.3±7.46
              DC* p=12       3.3±0.46      2.3±0.46     2±0          13.6±8.50
Glass         HFP minErr     15.4±5.28     5.6±0.92     2.13±0.17    51±15.52
              HFP minRules   2±0           1±0          2±0          57±11.38
              DC* p=6        6±0           3±0          2±0          63.8±15.10
              DC* p=12       8.4±1.11      3.7±0.46     2±0          43.3±13.71
Shuttle       HFP minErr     81.5±22.84    22.83±0.49   2.87±2.76    3.67±1.51
              HFP minRules   2±0           1±0          2±0          21.4±0.43
              DC* p=7        7±0           3±0          2±0          57.9±19.53
              DC* p=14       10±1.18       4±0          2±0          19.6±2.71

For each dataset, the table reports the results related to the pair of FRBSs generated by HFP and selected for comparison (the one providing the best accuracy performance and the one including the lowest number of rules) and to a number of FRBSs generated by DC* (each of them obtained by doubling the number of prototypes at every new generation). It can be verified how the simplest models derived (namely, the 2-rule FRBSs produced by HFP and the FRBSs generated by applying DC* with one prototype per class) are characterized by poor values of accuracy performance. On the other hand, by increasing the structural complexity of the models, it is possible to observe a consequent reduction of the error values. In other words, the well-known effects connected with the accuracy/interpretability trade-off can be recognized in the results reported in the table. To allow a fair comparison, the maximum number of DC* prototypes (standing as


the upper limit on the maximum number of rules to be generated) is set for each dataset by taking into account the number of rules composing the most effective FRBS derived through HFP (wherever possible). In this way, HFP and DC* are oriented to potentially produce models with a similar number of rules, so that the accuracy analysis can be more revealing. In terms of accuracy, DC* outperforms HFP on six datasets. More interestingly, when we turn to consider the accuracy/interpretability trade-off, it can be observed how the DC* method is able to provide the simplest models, exhibiting the smallest number of rules and involved features (the average number of MFs per model is almost the same when both algorithms are applied). Such superiority is verified for each dataset and in some cases the gain in terms of structural complexity is highly appreciable (e.g. the Ionosphere and CTG 3 datasets). In this sense, the models produced by DC* appear to be preferable even when their accuracy performance values are slightly lower than those reported by the HFP-generated FRBSs. As a further remark, it can be noticed how DC* provides better results also in terms of model stability, as shown by the standard deviation values reported to complete the information regarding the average numbers of rules, involved features and MFs. Some additional details, concerning the experiments on particular datasets, can be highlighted. While in a couple of cases (related to the Shuttle and Wine datasets) the iterative application of DC* with a doubled number of prototypes has been stopped to avoid an excessive growth of the computational burden, in some other circumstances the process has been stopped by reason of the huge number of rules generated by HFP. This is the case, for instance, of the Ionosphere and CTG 3 datasets, where DC* is able to overcome HFP in terms of accuracy performance while producing classifiers with a reduced number of rules. Better results from HFP could potentially be achieved by fine-tuning the hyper-parameters. However, this would have involved a heavy trial-and-error process, further complicated by the need to select partitions from the returned sets. Finally, the overall analysis of the obtained results lets us underline some peculiarities of the DC* algorithm. Firstly, it is able to produce FRBSs with reduced features: their number is lower than both the total number of features in each dataset and the number of features characterizing the HFP-generated models. The second peculiarity of DC* consists in the capability to operate the optimization process mentioned in sec. 3.2, producing FRBSs with a number of rules which is significantly


lower than the number of prototypes (corresponding to the maximum number of attainable rules).

Summary The experimental results show that both DC* and HFP exhibit different points of strength that make them valid approaches for generating interpretable FRBSs. In particular, DC* is superior in terms of accuracy/interpretability trade-off, because it is capable of designing very compact FRBSs with classification errors that are only slightly higher or even smaller than those of the most accurate models provided by HFP. Furthermore, DC* requires very few hyper-parameters, the most important one regulating the granularity of the acquired knowledge by fixing the upper bound on the number of rules to define. However, DC* is not very sensitive to this hyper-parameter: this avoids the necessity of fine-tuning and guarantees a descriptive stability of the resulting FRBSs. As concerns HFP, this algorithm shows more flexibility since it is not limited to classification problems and, on the average, it is more efficient than DC* because its computational complexity is polynomial in the number of data samples. On the other hand, since the theoretical computational complexity of DC* is exponential in the worst case (in the number of prototypes times the data dimensionality), in some experiments DC* did not terminate in reasonable time (after more than 30 hours). The efficiency of DC*, which relies strongly on its second step, can be improved by a refinement of the heuristic function used in that stage, where A* is applied.

5.2.2 Experiment 2: The New heuristic function

In this experiment the original heuristic function and the newly proposed heuristic function for the A* search stage of DC* are compared in terms of efficiency. As described in detail in sec. 3.2, DC* relies on A* for the granulation process, whose efficiency is tightly related to the heuristic function used for estimating the costs of candidate solutions. In sec. 4.1.2, a new heuristic function that estimates the path cost to reach a solution state is proposed. In particular, the new heuristic function is capable of exploiting class information, trying to overcome the weakness with complex problems shown by the heuristic function originally used in DC* (described


in sec. 3.2.2). This experiment aims at proving the efficiency improvements obtained by adopting the newly proposed heuristic function. Experimental results show that the proposed heuristic function allows huge savings in terms of computational effort even with complex problems. However, no gain has been observed for two-class problems: as expected, the two heuristics provide the same results in this case, because they compute in the same way.

Experimental setup The experimental objective is to provide a performance comparison between the original and the new proposed heuristic function (details are provided respectively in sec. 3.2.2 and sec. 4.1.2). In particular, as shown in Fig. 5.2, for a fair comparison two versions of A* (one for each heuristic function) are applied on the same prototypes, obtained by the first DC* step (the data compression phase). Performances are evaluated on seven different datasets (a subset of the datasets described in sec. 5.1 as shown in Tab. 5.4). As described, datasets include only numerical, pre-classified data without missing values.


Figure 5.2: The experiment setup framework to test the heuristic functions.

For each dataset, two different numbers of prototypes are adopted in the experiment. It is worth remembering that prototypes are proportionally assigned to the classes, according to the class distribution in the dataset. The key information that shows


the different efficiency of the heuristic functions is the number of states explored by A*, which represents the most expensive operation performed during the execution of the DC* method. Due to the optimality of A*, for each dataset both versions returned the same optimal solution (i.e. the same cut configuration), but through a different number of explored states. As explained in sec. 3.2, DC* finds the optimal solution to the problem - i.e. the one with the minimum number of information granules - thanks to the adoption of A* in the second clustering phase. It is noteworthy that, for this reason, only the numbers of explored states are investigated (without considering the solution performance in terms of accuracy and interpretability) and hence without performing a cross-validation.

Results and discussions

In Tab. 5.4 the summarized results of the experiments are shown. (For the Shuttle dataset with 21 prototypes computed by DC* with the original heuristic function, execution has been stopped after 6 hours and the number of explored states has been reported.)

Table 5.4: Datasets and experimental comparative results. Shuttle with 21 prototypes computed by the original heuristic is incomplete. *The second feature has been removed because it exhibits a constant value. **Class “4” has been removed since it is not represented by any sample.

Dataset (samples / features)              classes   prototypes   explored states (original)   explored states (proposed)   % saving
Iris (150 / 4)                               3          21                   408                       111                  72.79%
                                                        42                   652                       139                  78.68%
Wine (178 / 13)                              3          20                 5,454                     1,026                  81.19%
                                                        40                23,053                     3,451                  85.03%
Breast Cancer Wisconsin (683 / 9)            2          30                    34                        34                   0.00%
                                                        60                    61                        61                   0.00%
Vertebral Column, 3 classes (310 / 6)        3          12                 9,089                       742                  91.84%
                                                        24                53,727                    13,556                  74.77%
Ionosphere (351 / 33(34)*)                   2          10                    97                        97                   0.00%
                                                        20                23,143                    23,143                   0.00%
Glass Identification (214 / 9)            6(7)**         9                10,720                     1,095                  89.79%
                                                        18               257,854                    38,826                  84.94%
Statlog-Shuttle (58,000 / 9)                 7          12               276,842                     5,533                  98.00%
                                                        21            >2,827,876                   120,487                 >95.74%


It is possible to observe that for datasets with two classes there is no gain in efficiency, because the two heuristic functions work in the same way. On the other hand, for datasets with more than two classes, the ability of the newly proposed heuristic to exploit class information is apparent: the proposed heuristic function allows huge savings by exploring a very small number of states compared with the original one. The proposed heuristic function proved to considerably boost the efficiency of DC*, making it a competitive alternative to other well-known algorithms for extracting interpretable knowledge from data. In fact, in the worst-case scenario considered in the experimentation (namely, the case of Statlog-Shuttle data with 21 prototypes), the generation of the information granules required about 15 minutes to complete4.

Summary

The conducted experimentation has shown the significant improvement, in terms of efficiency of the method, obtained by means of the new heuristic function in the A* search process. On the one hand, the experimental results obtained with the comparison between DC* and HFP (see sec. 5.2.1) ensure that DC* is a good candidate for automatically designing fuzzy rule-based classifiers that exhibit high interpretability and good accuracy. It is also easy to tune because it requires the specification of just one hyper-parameter,5 namely the number of prototypes for the first step, which has a clear semantics as it regulates the level of granularity of the derived knowledge base. On the other hand, the results obtained in this experimentation show that the adoption of the new heuristic function overcomes the weakness of DC* in dealing with complex problems. Therefore, DC* can be used both to generate few fuzzy information granules for a rough description of data and, alternatively, to design an accurate classifier through a greater number of fuzzy information granules. However, since it behaves in the same way as the original heuristic function, the new heuristic function does not provide any efficiency improvement for two-class problems. This particular aspect is tackled with the GA-guided A* approach, as described in sec. 4.2 and shown by the results provided in the next experiment.

4 Experiments have been conducted on a virtual machine (VMware) equipped with four x86 vCPUs @ 2.35GHz and 8GB of vRAM.
5 Other hyper-parameters are left to default values because they do not significantly influence the method.



5.2.3 Experiment 3: GA guided A*

In this experimentation, DC* with the GA-guided A* approach (described in sec. 4.2.2) is evaluated. In particular, with the aim of investigating the efficiency gain obtained by adopting this new search approach, a comparison with the performance obtained by DC* with the new heuristic function is provided. As described, the GA-guided A* is a combination of a Genetic Algorithm and the A* search algorithm which should improve the efficiency of the DC* computation while preserving its optimality.

Experimental results show that the GA-guided A* yields a significant additional speed-up in the second step of DC*, the optimal solution search, even when compared with the efficiency of DC* with the new heuristic function.

Experimental setup

This experiment aims to assess the efficiency gain obtained by adopting the GA-guided A* approach (described in sec. 4.2.2) in the second step of DC* (generally described in sec. 3.2.2). To this end, the GA-guided A* approach is compared with the A* search with the new heuristic function (described in sec. 4.1.2 and tested in sec. 5.2.2). It is worth underlining that the GA guide is applied to the A* search with the new heuristic function, hence the comparison focuses on the efficiency improvements obtained by using the GA guide in the search stage.

The approaches are tested on seven different datasets (a subset of the datasets described in sec. 5.1 as shown in Tab. 5.5).



[Diagram for Figure 5.3: for each dataset and number of prototypes, LVQ1 compression (over features A and B) produces the prototypes processed by DC* v2.0; on one branch the GA provides the PoG that guides the GA-guided A* search, on the other branch the A* search runs directly; each branch returns an optimal solution σopt, the number of explored states and the computational time (GA+A* and A*, respectively).]

Figure 5.3: The experiment setup framework to compare the GA-guided A* approach with the A* search.

In Fig. 5.3 the experiment framework is depicted. For each dataset, a number of prototypes is specified and the first clustering step of DC*, the compression phase, is computed. For a fair comparison, both the GA-guided A* approach and the A* search with the new heuristic function are computed on the same prototype positions, hence exploiting the same set of cuts (see sec. 3.2 for details). On the one hand, the GA finds a cut configuration that guides the following A* search (left-hand side of Fig. 5.3); on the other hand, the A* search (with the new heuristic function) is computed directly (right-hand side of Fig. 5.3). Details about the GA parameters are provided in sec. 4.2.2.1. Differently from the comparison between the two heuristic functions, where the computed optimal solutions were composed of exactly the same cuts in the same positions (described in sec. 5.2.2), here the solutions are both optimal (with the same minimum number of cuts) but may have cuts in different positions. This is related to the two different A* priority queue structures: in the A* search approach, the queue is ordered by the cost function value, the distance between cuts, and the number of involved features, ⟨f(σ), Δ(σ), fcount(σ)⟩ (for details see sec. 3.2.2); in the GA-guided A* approach, in order to exploit the PoG attraction, the queue is ordered by the cost function value, the distance from the PoG, and the distance between cuts, ⟨f(σ), δPoG(σPoG, σ), Δt⟩ (for details see sec. 4.2.2). When a number of states share the same cost function value, this difference in the queue leads the A* search approach to prefer states with wide cuts, while the GA-guided A* approach prefers states closer to the PoG, hence possibly reaching two different optimal solution states. Being interested in the computational burden, and in particular in comparing the two approaches on the optimal solution search, the solution cut configuration is not taken into account in the experiment results and hence a cross-validation is not required.
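The following is a minimal sketch of how the two orderings could be realized with a binary heap; the state attributes and the direction of the tie-breakers on the cut distance and on the number of features are illustrative assumptions and do not reproduce the actual DC* implementation.

import heapq

# A state sigma is represented here only by the quantities used for ordering.
# Classic A* queue: order by f(sigma), then prefer states with wide cuts
# (larger distance between cuts), then by the number of involved features.
def classic_key(f_value, cut_distance, n_features):
    # heapq pops the smallest tuple, so the preferred (larger) cut distance
    # is negated; the direction of the tie-breakers is an assumption here.
    return (f_value, -cut_distance, n_features)

# GA-guided A* queue: order by f(sigma), then prefer states closer to the PoG,
# then by the distance between cuts.
def ga_guided_key(f_value, dist_to_pog, cut_distance):
    return (f_value, dist_to_pog, -cut_distance)

frontier = []
heapq.heappush(frontier, (ga_guided_key(3, 0.7, 2.0), "state A"))
heapq.heappush(frontier, (ga_guided_key(3, 0.2, 1.5), "state B"))
print(heapq.heappop(frontier)[1])  # "state B": same f value, but closer to the PoG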


Results and discussion

Aiming at investigating the efficiency of both approaches, and due to their structure, the only measure that gives useful indications is the computational time, even if this can be affected by a number of external factors related to the operating system, the hardware, the running processes, etc.6 Specifically, this choice is driven by the fact that, in order to evaluate the GA-guided A* approach, both the GA and the subsequent A* search computations have to be taken into account, and there is no measure shared by those algorithms except for the required computational time. Therefore, for the GA-guided A* approach the sum of the GA computational time and the A* search time is considered, while for the other approach the computational time required by A* alone is considered. However, referring only to the A* computations (the A* approach and the A* guided by the GA), a comparison on the number of expanded states is possible, highlighting the actual contribution of the GA guide during the search process. In Tab. 5.5 the summarized results of the experiments are shown. Due to the nature of the GA computation, and with the aim of investigating the stability of the results of the GA-guided A* approach, for each dataset ten different runs of the whole GA-guided A* search have been performed over the same prototype positions, hence exploiting the same cut configuration.7 The reported results are the average values of the ten executions.

6 For this reason, experiments are conducted trying to reproduce at best the same machine conditions for the comparison between the approaches.
7 This implies that for the A* search approach the values of the ten different executions are approximately the same.




Each row of the table corresponds to a dataset, and for each dataset the values of the two approaches are reported. The left-hand side of the table reports the measured experimental values, while the right-hand side reports only the comparison values, referred to the GA-guided approach.

As mentioned in sec. 4.2.2.1, preliminary experimentations with the GA have shown that the population evolution (for the adopted datasets) occurs in the first 20 generations. For this reason the total number of generations is limited to 20. In other words, it is assumed that the GA fitness function has a constant trend after the 20th generation, as shown, for example, by the fitness functions depicted in Fig. 5.4.


Table 5.5: Datasets and experimental comparative results between the A* search (referred to as classic A*) and the GA-guided A* approach. Reported values are the averages of ten executions in which prototype positions, for the same dataset, are not changed.

Measured values:

dataset (# prot.)    ALG          expanded    GA (sec)   A* (sec)     tot time (sec)   # cuts final (std)
wine (40)            classic A*   493839      -          17387.666    17387.666        3 (0)
                     GA-guided    76334       27.686     2229.138     2256.824         3 (0)
wbc (60)             classic A*   81507       -          462.618      462.618          3 (0)
                     GA-guided    24081.7     21.108     128.727      149.835          3 (0)
vertebral 3 (24)     classic A*   381063      -          11236.429    11236.429        4 (0)
                     GA-guided    13433.8     5.715      18.069       23.784           4 (0)
shuttle (21)         classic A*   128777      -          971.675      971.675          4 (0)
                     GA-guided    24817.7     8.445      107.989      116.434          4 (0)
iris (42)            classic A*   40452       -          86.449       86.449           4 (0)
                     GA-guided    13751.7     5.071      19.055       24.126           4 (0)
ionosphere (20)      classic A*   20571       -          89.25        89.25            2 (0)
                     GA-guided    4719.7      27.324     20.886       48.21            2 (0)
glass (18)           classic A*   65366       -          212.693      212.693          4 (0)
                     GA-guided    10261.9     6.932      23.309       30.241           4 (0)

Comparison values (GA-guided A* with respect to classic A*):

dataset (# prot.)    expanded gain (abs)   gain %     time gain (abs)   gain %     # cuts GA (std)   sol. distance δPoG (std), min 0, max 1
wine (40)            -417505               -84.54%    -15130.842        -87.02%    4.6 (0.84)        0.83 (0.09)
wbc (60)             -57425.3              -70.45%    -312.783          -67.61%    4.2 (0.92)        0.57 (0.40)
vertebral 3 (24)     -367629.2             -96.47%    -11212.645        -99.79%    5.4 (0.70)        0.32 (0.18)
shuttle (21)         -103959.3             -80.73%    -855.241          -88.02%    5.8 (0.42)        0.70 (0.21)
iris (42)            -26700.3              -66.00%    -62.323           -72.09%    5.5 (0.85)        0.35 (0.22)
ionosphere (20)      -15851.3              -77.06%    -41.04            -45.98%    3 (0.47)          0.71 (0.23)
glass (18)           -55104.1              -84.30%    -182.452          -85.78%    6.5 (0.85)        0.73 (0.12)




Figure 5.4: Fitness function graphs regarding a generic run of each dataset.


In the figure it is possible to observe how the fitness function trend is well stabilized by the 20th generation. Analyzing the obtained results, it is possible to observe a significant saving in the number of expanded states when adopting the GA-guided A* approach (w.r.t. the A* search approach). This allows considering the PoG, identified by means of the GA, an effective way to guide the A* search to an optimal solution, avoiding the expansion of useless states. Looking at the overall efficiency of the method, thus considering the computational time needed by the second step of DC*, the GA-guided A* approach shows a significant time gain compared with the time needed by A* (even with the new heuristic function) to find an optimal solution. The results about the number of cuts lead to some observations. The first one is that, as expected, the final solution obtained by the GA-guided A* is optimal, being composed of the same number of cuts as the one found by the A* approach. Furthermore, the solution identified by the GA and adopted as PoG is always a suboptimal solution, quite stable in the number of cuts, as shown by the standard deviation over the ten executions per dataset. Finally, looking at the last table column, showing the distance of the PoG from the final solution, it is possible to observe that these values are quite high (except for the Vertebral 3 dataset). Taking into account both the computational time gains and the distance values, the GA-guided A* always provides significant computational time gains. In other words, the PoG represents a good solution to the problem of equal values of f(σ) in the A* priority queue, breaking ties and guiding the expansion toward an optimal solution even if the PoG is positioned in a slightly different place in the search space. However, the case of the Vertebral 3 dataset, where the distance is the lowest and the computational time gain is the highest, highlights how a more precise PoG (represented by a solution state closer to the final solution) may lead to huge time savings.

Summary

In this experiment the efficiency improvements to DC* brought by the GA-guided A* approach have been tested in a comparative experimentation with the A* search approach (with the new heuristic function). It is worth remembering that both approaches provide an optimal solution to the problem at hand (even if the solutions may not coincide).




In order to evaluate the efficiency, the computational time needed by the entire search process has been taken into account (as explained in the experiment description). In general, for all the tested datasets, a significant computational time gain has been obtained by adopting the GA-guided A* approach. This makes the GA-guided A* approach the best DC* configuration to tackle large and complex problems with two or more classes.

5.2.4 Experiment 4: Granule Fuzzification

The adoption of triangular fuzzy sets to define Strong Fuzzy Partitions (SFPs) is a common practice in the research community: due to their inherent simplicity, triangular fuzzy sets can be easily derived from data by applying suitable clustering algorithms. However, the choice of triangular fuzzy sets may be limiting for the modeling process. Sec. 4.3 focuses on SFPs built starting from cuts (points of separation between cluster projections on the data dimensions), showing that a SFP based on cuts can always be defined by trapezoidal fuzzy sets. This experiment compares, through a numerical simulation, the accuracy of the mechanisms proposed in sec. 4.3 to derive SFPs from cuts. This verifies the correctness of the different approaches and provides useful indications about the best way to fuzzify the DC* partitions when the method is adopted.
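As a concrete illustration, the following minimal sketch builds a trapezoidal SFP from a set of cuts using a single, constant transition half-width eps around every cut (in the spirit of the Constant Slope variant of sec. 4.3.2); the function names and the choice of eps are illustrative assumptions and do not reproduce the actual DC* procedures.

import numpy as np

def trapezoid(x, a, b, c, d):
    """Trapezoidal membership with support [a, d] and core [b, c]."""
    x = np.asarray(x, dtype=float)
    rise = np.clip((x - a) / (b - a), 0, 1) if b > a else (x >= a).astype(float)
    fall = np.clip((d - x) / (d - c), 0, 1) if d > c else (x <= d).astype(float)
    return np.minimum(rise, fall)

def sfp_from_cuts(cuts, lo, hi, eps):
    """Strong Fuzzy Partition from cuts: every fuzzy set is trapezoidal and
    crosses 0.5 exactly at the cuts; eps is the half-width of the transition
    region placed around each cut."""
    points = [lo] + sorted(cuts) + [hi]
    sets = []
    for k in range(len(points) - 1):
        left, right = points[k], points[k + 1]
        a = lo if k == 0 else left - eps
        b = lo if k == 0 else left + eps
        c = hi if k == len(points) - 2 else right - eps
        d = hi if k == len(points) - 2 else right + eps
        sets.append((a, b, c, d))
    return sets

# Example: two cuts on [0, 10] with transition half-width 0.5
partition = sfp_from_cuts([3.0, 7.0], 0.0, 10.0, eps=0.5)
xs = np.linspace(0.0, 10.0, 101)
total = sum(trapezoid(xs, *p) for p in partition)
print(np.allclose(total, 1.0))   # True: the memberships sum to 1 everywhere

Provided that eps does not exceed half of the smallest gap between consecutive cuts (nor the distance of the extreme cuts from the domain bounds), the memberships sum to 1 everywhere and each fuzzy set crosses 0.5 exactly at the cuts.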

Experimental setup

The objective of this simulation is to evaluate the DC* behavior when different strategies for generating SFPs are adopted. To this aim, a set of synthetically generated datasets is used: one of them (SD1) consists of 200 bi-dimensional examples, the other four datasets (SD2, SD3, SD4, SD5) consist of 400 bi-dimensional examples. In each case, the samples belong to 3 different classes. The datasets are depicted in Fig. 5.5.



Figure 5.5: The synthetically generated datasets adopted for the numerical simulation.

DC* has been employed to process the data. The initial clustering has been performed considering 24 multi-dimensional prototypes for SD1 and 48 multi-dimensional prototypes for SD2–SD5 (the prototypes are proportionally distributed according to the number of samples of each class). The final fuzzy partitions have been derived by alternatively applying the described procedures: Constant Slope (CS) (see sec. 4.3.2), Variable Fuzziness (VF) (see sec. 4.3.3) and Core Points (CP) (see sec. 4.3.4). Additionally, two more strategies have been tested, oriented to the generation of triangular fuzzy partitions:

• in the first case (TSFP, where T stands for triangular), SFPs have been obtained by partially exploiting the information coming from the cuts: the design of the triangular fuzzy sets is such that their core points correspond to the midpoints of the intervals defined by the cuts (e.g. see Fig. 5.6; a small numeric illustration is given after this list);

• in the second case (T0.5-cuts), the triangular fuzzy sets are shaped so that the membership values in t1, ..., tn are set to 0.5 (e.g. see Fig. 5.7). However, as shown in sec. 4.3.1, this mechanism gives no guarantee of deriving a SFP.
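The cut shifting caused by the TSFP strategy can be illustrated with a small computation: in a triangular SFP, two adjacent fuzzy sets cross at 0.5 halfway between their cores, so placing the cores at the interval midpoints moves the crossing away from the original cuts whenever consecutive intervals have different widths. The values below are illustrative only.

# Cores of the triangular fuzzy sets placed at the midpoints of the intervals
# delimited by the cuts: adjacent sets cross at 0.5 halfway between their cores,
# so the crossing drifts away from a cut whenever the two intervals it separates
# have different widths.
lo, hi = 0.0, 10.0
cuts = [2.0, 8.0]
bounds = [lo] + cuts + [hi]
cores = [(a + b) / 2 for a, b in zip(bounds, bounds[1:])]    # [1.0, 5.0, 9.0]
crossings = [(a + b) / 2 for a, b in zip(cores, cores[1:])]  # [3.0, 7.0]
print(list(zip(cuts, crossings)))   # cut 2.0 shifts to 3.0, cut 8.0 shifts to 7.0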

Figure 5.6: Example of a TSFP obtained by fixing the fuzzy set cores at the midpoints between cuts. It is possible to see how the 0.5-cut values are not attained at the original cut positions; the resulting cut shifting is depicted and highlighted.

Figure 5.7: Example of a T0.5-cuts partition obtained by fixing the fuzzy set cores at the midpoints between cuts while preserving the 0.5-cut values at the original cut positions. It is possible to see how the fuzzy sets overlap each other without forming a SFP.

Results and discussion

In Fig. 5.8 the problem space partition computed by DC* for each dataset is depicted. The fuzzification methods are tested exploiting the cut configurations over the problem axes.



Figure 5.8: Solution cuts identified by DC*

Tab. 5.6 reports the performance (in terms of percentage classification error) of DC* for each adopted strategy. It can be verified that for each dataset the best performance is attained by applying the Core Points strategy. In general, resorting to triangular fuzzy partitions leads to a deterioration of the classification error.




Table 5.6: DC* classification error (percentage values) when different strategies are applied to generate fuzzy partitions for each of the five datasets.

             SD1     SD2     SD3     SD4     SD5
CS           17.50   11.75   16.50   13.50    8.00
VF           11.00    7.00   11.00    6.50    3.50
CP            9.00    7.00    8.75    4.75    3.00
TSFP         44.00    9.50   17.75    9.00    4.50
T0.5-cuts    20.50   16.50   18.00   19.50   18.50

More interestingly, Fig. 5.9 depicts the different fuzzy partitions produced by DC* when the above-mentioned strategies are applied (in particular, it shows the configurations related to the clustering processes performed over one of the synthetic datasets, namely SD4). It is important to highlight how the choice of a triangular fuzzy partition forced to express a 0.5 value at the cut points gives rise to a configuration which does not satisfy the SFP conditions. On the other hand, the fuzzy partition provided by the CP approach gives a tangible idea of the fuzziness of the linguistic terms in accordance with the core points provided by DC*: it is apparent that fuzziness is acceptable in the right side of the Universe of Discourse, while crisper linguistic terms are required to discriminate data in the center and left side.



Figure 5.9: Fuzzy partitions obtained by DC* on SD4 dataset through the adoption of different strategies.

Moreover, looking at Fig. 5.10, it is possible to appreciate the difference between the TSFP approach and the CP approach when the SFPs defined over the problem features are composed. The TSFP approach spreads fuzziness, thus uncertainty, over the whole input space (except for the single point where the fuzzy set core is defined). On the other hand, by exploiting problem information that allows the identification of core points over the Universe of Discourse (points where the described concept is maximally represented), the CP approach restricts the fuzziness only where it is actually needed, providing a data representation that effectively respects the data distribution. Furthermore, the adoption of the CP approach preserves the underlying semantics of the data, allowing the representation, over the same Universe of Discourse, of both fuzzier and crisper concepts.
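A minimal sketch of this composition follows, assuming the minimum t-norm is used to combine the one-dimensional memberships into the membership of a two-dimensional granule; the trapezoid shapes below are illustrative and are not taken from SD4.

import numpy as np

def trapezoid(x, a, b, c, d):
    """Trapezoidal membership with support [a, d] and core [b, c]."""
    rise = 1.0 if b == a else float(np.clip((x - a) / (b - a), 0, 1))
    fall = 1.0 if d == c else float(np.clip((d - x) / (d - c), 0, 1))
    return min(rise, fall)

# Illustrative one-dimensional terms: a crisper set on feature A (narrow
# transitions) and a fuzzier set on feature B (wide transitions).
term_A = (2.0, 2.2, 4.8, 5.0)
term_B = (1.0, 3.0, 6.0, 8.0)

def granule_membership(point, set_a, set_b):
    """Membership of a 2-D point in the information granule obtained by
    combining the two one-dimensional terms with the minimum t-norm."""
    return min(trapezoid(point[0], *set_a), trapezoid(point[1], *set_b))

print(granule_membership((3.0, 2.0), term_A, term_B))  # 0.5, limited by feature B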




Figure 5.10: Comparison between bi-dimensional fuzzy sets generated by the TSFP approach (left-hand side) and the CP approach (right-hand side).

Summary

The definition of fuzzy partitions represents a key issue for designing interpretable fuzzy models, since fuzzy partitions are often required to fulfill several interpretability constraints. In this sense, Strong Fuzzy Partitions (SFPs) are commonly adopted as a reliable tool to design interpretable models, and triangular SFPs are often preferred because they can be easily derived through some clustering mechanism performed over the available data. In this experiment a particular approach for defining SFPs, based on cuts, is considered. Specifically, the problem of identifying the proper shape of the fuzzy sets while generating SFPs from cuts is tackled, highlighting how the choice of triangular fuzzy sets represents an additional bias for the modeling process which can be conveniently removed by resorting to trapezoidal fuzzy sets. Through some numerical simulations, it has been shown that the use of trapezoidal fuzzy sets enables the derivation of highly interpretable fuzzy partitions that are more accurate than triangular fuzzy partitions in classification tasks. In particular, three different variants of trapezoidal SFPs have been taken into account: Constant Slope, Variable Fuzziness, and Core Points. The obtained results allow considering the Core Points approach the best way to fuzzify the information granules identified by DC*, as it best exploits the characteristics of a method that works with prototypes and provides a space partition based on cuts.


6 Conclusions

In interpretable fuzzy models and, more in general, in information granulation, finding the right level of granularity for a particular problem is of primary importance. This becomes essential when models are automatically derived from data. As is well known, interpretability introduces a bias in the models; hence, the aim of interpretable fuzzy systems is to provide models with a good trade-off between interpretability and accuracy. In the literature, algorithms (and methods) capable of providing interpretable fuzzy systems can be found, but only a few of these can automatically derive an interpretable fuzzy model from data and, usually, they involve trial-and-error approaches or a-posteriori optimization. In this work the Double Clustering with A* (DC*) has been presented. DC* is a method dedicated to classification problems, capable of providing an interpretable fuzzy model from pre-classified data in a totally automatic fashion (no need for user interaction during the learning phase). Being an instance of the general Double Clustering Framework (DCf), DC* is a combination of two clustering steps. After a data compression phase (the first clustering step), a clustering over each problem dimension follows, which takes into account the class labels (second clustering step, operated by A*) and leads DC* to produce an interpretable fuzzy partition of the problem space based on cuts. Interpretability is ensured in DC* because a number of the most important interpretability constraints are embedded in the algorithm. Due to the search process operated by A*, DC* provides the optimal solution to the problem at hand - i.e. the interpretable fuzzy model with the minimum number of information granules, hence requiring a shorter description in the form of a fuzzy rule base. Its strong points are:

• it requires only one hyper-parameter, strictly related to the granularity of the resulting model and hence easily understandable and tunable by the user;




• it automatically derives the granularity level for each problem feature involved in the final model;

• it operates an automatic feature selection process, due to the optimality of its solution.

In particular, in this work DC* v2.0 has been presented. This is an enhanced version of the method in which some weaknesses concerning the efficiency of the original DC* have been tackled and a new way to fuzzify the information granules has been presented. Experimental results have shown that DC* v2.0 represents a valid competitor for designing interpretable Fuzzy Rule-Based Systems and, more in general, for deriving interpretable fuzzy information granules. As known, the granularity of a model should strictly depend both on its use and on the data distribution - i.e. the granularity of a problem is not given a priori. Due to its way of computing, DC* is able to automatically find the granularity level of the model, taking into account the user-specified granularity level (which depends on the use of the model) and the granularity of the problem. By overcoming the efficiency limits of the original version, it is possible to consider DC* v2.0 as a completely new approach to interpretable fuzzy modeling for real-world problems. Being a method to derive interpretable fuzzy models, DC* v2.0 has an immediate impact on real-world applications that require an interaction between models and human beings - i.e. decision making and support in medical, economic and other areas. In these “sensitive” fields, users must rely on the extracted knowledge, which must be interpretable in order to be revised, maintained and verified by domain experts. Interpretable fuzzy systems allow for a concrete exchange of knowledge between machines and human beings. As mentioned, a critical point of DC* lies in the first clustering step, the data compression. The information granules computed by DC* are strictly related to the result of this stage - i.e. the prototype positions in the problem space. In the current version of DC*, data compression is obtained by the LVQ1 algorithm. As known, LVQ1 has a random initialization of the prototype positions. For particular problems, the random initialization and the way of computing of LVQ1 easily lead to dead units - i.e. prototypes that do not represent surrounding data (because their position is far from the data or outside the data domain). Being the first stage of the entire method, LVQ1 affects the A* search and hence the identified information granules.


As a consequence, for problems with particular data distributions, different runs of DC* on the same data can lead to quite different results. Ongoing investigations aim to tackle this problem of the first clustering step of DC* by finding a different way to compute the data compression. Currently, approaches based on Self Organizing Maps (SOM, see [92]) and Fuzzy C-Means (FCM, see [48, 13]) have been considered and are still in the testing phase. Since these algorithms are not class aware, the data compression is obtained by a composition of them, computing the algorithm separately for each class of the problem. Preliminary experimentations have shown a greater stability of these approaches (w.r.t. the LVQ1 algorithm), which do not seem to suffer from the problem of dead units. However, research on this task should move to the more general clustering topic, taking into account the latest contributions in the literature. Other research directions aim at extending DC* to big-data analysis and to problems different from classification. To enable DC* on big-data analysis, further improvements should be applied to its second clustering step, the search in the space of cut configurations. This enables a twofold way to proceed: to preserve the optimality of the DC* solutions or to compute sub-optimal solutions with a lower computational effort. Possible approaches which preserve the solution optimality are described in the following:

• about the A* priority queue: taking into account the problem features and the projected prototypes, it is possible to give a higher priority to those cut configurations that cluster more prototypes of the same class. This could lead to information granules which contain a bigger number of prototypes and hence provide a solution while exploring a smaller number of states;

• about the A* search space: due to the search space structure, taking into account a state in the search space and its pure hyper-boxes, it is possible to collect the prototypes contained in the pure boxes, remove them from the search space and then recompute the set of cuts. This should heavily prune the offspring of that state, avoiding cuts generated to separate prototypes already separated by another adopted cut. This process, applied to the entire search phase, should increase the efficiency of DC*. Furthermore, cut recomputation could enable finding cut positions which lead to more meaningful information granules, with an impact on the interpretability of the model.




On the other hand, approaches which can improve the DC* efficiency while losing the solution optimality are:

• Beam Search: as well known in the literature, it is possible to apply beam search in the A* search process of DC*. Even if this could provide a suboptimal solution, beam search can yield significant computational savings. Further research should investigate the beam width and, in particular, a tailored way to compute this value taking into account some problem information. Also, the beam width could be dynamic during the search process and could be integrated in the GA-guided A* approach;

• heuristic jump: it is worth remembering that, due to the search space characteristics, the number of cuts in a state directly corresponds to its depth in the search space. This consideration enables the possibility to jump in depth in the search space taking into account the heuristic function value (which estimates the number of cuts needed to purify the state at hand). The jump landing area can be restricted by the cuts in the state from which the jump was done; in other words, landing states should contain the same cuts present in the starting state.

About the application of DC* to unclassified data, this could be accomplished by exploiting the data projections over the problem features. In particular, when there is enough distance between two projections, the prototypes could be considered as members of two different information granules and hence a cut could be positioned with the aim of splitting them. Once the cuts are identified, prototypes should be marked in order to proceed with the second clustering phase. Research should quantify the distance needed to identify a cut, as well as investigate particular aspects deriving from the second clustering step. The way of computing underlying DC* has been identified as a prominent approach to fuzzy information granulation. However, its efficiency problems did not allow DC* to be considered applicable to mid-large problems. DC* v2.0 represents a substantial step toward the efficiency enhancement of DC*. The study of the method has led to ideas which make DC* v2.0 a candidate starting point for further investigations aiming to tackle challenging problems of intelligent data analysis.


Bibliography [1] J. Abonyi, R. Babuska, and F. Szeifert. Modified Gath-Geva fuzzy clustering for identification of Takagi-Sugeno fuzzy models. IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics, 32(5):612–621, January 2002. [2] J M Alonso, L Magdalena, and O Cordon. Embedding HILK in a threeobjective evolutionary algorithm with the aim of modeling highly interpretable fuzzy rule-based classifiers. In ECSC, editor, 2010 4th International Workshop on Genetic and Evolutionary Fuzzy Systems (GEFS), pages 15–20, Mieres, March 2010. IEEE. [3] J.M. M Alonso and L. Magdalena. Generating Understandable and Accurate Fuzzy Rule-Based Systems in a Java Environment. In A M Fanelli, W Pedrycz, and A Petrosino, editors, LECTURE NOTES IN ARTIFICIAL INTELLIGENCE, Lecture Notes in Computer Science, pages 212–219. Springer-Verlag Berlin Heidelberg (ISSN: 0302-9743), Trani, Bari (Italy), 2011. [4] José Maria Alonso and Luis Magdalena. HILK++: an interpretability-guided fuzzy modeling methodology for learning readable and comprehensible fuzzy rule-based classifiers. Soft Computing, 15(10):1959–1980, June 2010. [5] P. Angelov. An approach for fuzzy rule-base adaptation using on-line clustering. International Journal of Approximate Reasoning, 35(3):275–289, March 2004. [6] R Babuska. Fuzzy Modeling for Control. Kluwer, Norwell, MA, 1998. [7] R Babuska. Data-driven fuzzy modeling: Transparency and complexity issues. In Proceedings European Symposium on Intelligent Techniques ESIT’99, Crete, Greece, 1999. ERUDIT. [8] K Bache and M Lichman. {UCI} Machine Learning Repository, 2013.


Bibliography [9] Baranyi, P., Y. Yam, D. Tikk, R. Patton, and P. Baranyi. Trade-off between approximation accuracy and complexity: TS controller design via HOSVD based complexity minimization. In J. Casillas, O. Cordón, F. Herrera, and L. Magdalena, editors, Interpretability Issues in Fuzzy Modeling, pages 249– 277. Springer-Verlag, Heidelberg, 2003. [10] András Bardossy and Lucien Duckstein. Fuzzy Rule-Based Modeling with Application to Geophysical, Biological and Engineering Systems. CRC Press, 1995. [11] Andrzej Bargiela and Andrzej Bargiela Witold Pedrycz. Granular Computing: An Introduction. Springer, 2003. [12] Andreas Bastian. How to handle the flexibility of linguistic variables with applications. International Journal of Uncertainty, Fuzziness and KnowledgeBased Systems, 3(4):463–484, 1994. [13] James C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. August 1981. [14] A Botta, B Lazzerini, F Marcelloni, and Dan C Stefanescu. Context adaptation of fuzzy systems through a multi-objective evolutionary approach based on a novel interpretability index. Soft Computing, 13(5):437–449, September 2009. [15] Zhiqiang Cao and Abraham Kandel. Applicability of some fuzzy implication operators. Fuzzy Sets and Systems, 31:151–186, 1989. [16] B Carse, T C Fogarty, and A Munro. Evolving fuzzy rule based controllers using genetic algorithms. Fuzzy Sets and Systems, 80:273–294, 1996. [17] J Casillas, O Cordón, M J del Jesus, and F Herrera. Genetic tuning of fuzzy rule-based systems integrating linguistic hedges. In Information Sciences, volume 136, pages 1570–1574, 2001. [18] J Casillas, O Cordón, and F Herrera. Can linguistic modeling be as accurate as fuzzy modeling without losing its description to a high degree? Technical report, Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain, 2000. [19] J. Casillas, O. Cordón, F. Herrera, and L. Magdalena. Interpretability Issues in Fuzzy Modeling. Springer, 2003.


Bibliography [20] J Casillas, O Cordón, F Herrera, and J J Merelo. Cooperative coevolution for learning fuzzy rule-based systems. In 5th International Conference on Artificial Evolution, pages 97–108, 2001. [21] Jorge Casillas. Embedded Genetic Learning of Highly Interpretable Fuzzy Partitions. In IFSA/EUSFLAT Conf., pages 1631–1636, 2009. [22] Jorge Casillas, Oscar Cor\-dón, and Francisco Herrera. A methodology to improve ad hoc data-driven linguistic rule learning methods by inducing cooperation among rules. Technical report, Dept. of Computer Science and Artificial Intelligence, University of Granada, 2000. [23] Giovanna Castellano, Anna Maria Fanelli, and Corrado Mencar. A doubleclustering approach for interpretable granulation of data. In IEEE International Conference on Systems, Man and Cybernetics, volume 2, pages 483–487. IEEE, 2002. [24] Giovanna Castellano, Anna Maria Fanelli, and Corrado Mencar. Generation of interpretable fuzzy granules by a double-clustering technique. Archives of Control Science, 12(4):397–410, 2002. [25] Giovanna Castellano, Anna Maria Fanelli, and Corrado Mencar. DCClass: A Tool to Extract Human Understandable Fuzzy Information Granules for Classification. In Proceedings of 4th International Symposium on Advanced Intelligent Systems (SCIS-ISIS 2003), pages 376—-379, 2003. [26] Giovanna Castellano, Anna Maria Fanelli, and Corrado Mencar. Fuzzy granulation of multi-dimensional data by a crisp double-clustering algorithm. In Proceedings of 7th World Multi-Conference on Systemics, Cybernetics and Informatics (SCI 2003), pages 372–377, 2003. [27] Giovanna Castellano, Anna Maria Fanelli, and Corrado Mencar. DCf: A Double Clustering framework for fuzzy information granulation. In Granular Computing, 2005 IEEE International Conference on, pages 397—-400, 2005. [28] Giovanna Castellano, Anna Maria Fanelli, Corrado Mencar, V, and Vito Leonardo Plantamura. Classifying data with interpretable fuzzy granulation. In Proceedings of the 3rd International Conference on Soft Computing and Intelligent Systems and 7th International Symposium on Advanced Intelligent Systems 2006, pages 872–877, Tokyo, Japan, 2006.


Bibliography [29] J L Castro, J J Castro-Schez, and J M Zurita. Learning maximal structure rules in fuzzy logic for knowledge acquisition in expert systems. Fuzzy Sets and Systems, 101(3):331–342, 1999. [30] J L Castro, C J Mantas, and J M Benítez. Interpretation of artificial neural networks by means of fuzzy rules. IEEE Transactions on Neural Networks, 13(1):101–116, 2002. [31] Cheng-Liang Chen, Sheng-Nan Wang, Chung-Tyan Hsieh, and Feng-Yuan Chang. Theoretical analysis of a fuzzy-logic controller with unequally spaced triangular membership functions. Fuzzy Sets and Systems, 101(1):87–108, 1999. [32] Z Chi, H Yan, and T Pham. Fuzzy algorithms with application to image processing and pattern recognition. World Scientific, Singapore, 1996. [33] Zheru Chi, Hong Yan, and Tuan Pham. Fuzzy Algorithms: With Applications to Image Processing and Pattern Recognition. World Scientific, 1996. [34] Mo-Yuen Chow Mo-Yuen Chow, S Altug, and H J Trussell. Heuristic constraints enforcement for training of and knowledge extraction from a fuzzy/neural architecture. I. Foundation, 1999. [35] Oscar Cor\-dón and Francisco Herrera. A general study on genetic fuzzy systems. In J Periaux, G Winter, M Galán, and P Cuesta, editors, Genetic Algorithms in Engineering and Computer Science, pages 33–57. John Wiley and Sons, 1995. [36] Oscar Cor\-dón, Francisco Herrera, and O Cordón. A three-stage evolutionary process for learning descriptive and approximate fuzzy logic controller knowledge bases from examples. International Journal of Approximate Reasoning, 17(4):369–407, 1997. [37] Oscar Cor\-dón, Francisco Herrera, O Cordón, and O Cordon. A proposal for improving the accuracy of linguistic modeling. IEEE Transactions on Fuzzy Systems, 8(3):335–344, June 2000. [38] Oscar Cor\-dón, Francisco Herrera, and Manuel Lozano. A classified review on the combination fuzzy logic-genetic algorithms bibliography: 1989-1995. In E Sanchez, T Shibata, and L Zadeh, editors, Genetic Algorithms and Fuzzy Logic Systems. Soft Computing Perspectives, pages 209–241. World Scientific, 1997.


Bibliography [39] O. Cordón, M. J Del Jesus, F. Herrera, L. Magdalena, P. Villar, and O Cordon. A multiobjective genetic learning process for joint feature selection and granularity and context learning in fuzzy rule-based classification systems. In J Casillas, O Cordon, F Herrera, and L Magdalena, editors, Interpretability Issues in Fuzzy Modeling, pages 79–99. Springer-Verlag, Heidelberg, Heidelberg, 2003. [40] O Cordón, F Herrera, F Hoffmann, and L Magdalena. Genetic fuzzy systems: evolutionary tuning and learning of fuzzy knowledge bases. World Scientific, 2001. [41] Nelson Cowan. The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24:87–114, 2001. [42] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000. [43] José Valente de Oliveira and J Valente de Oliveira. Semantic constraints for membership function optimization. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 29(1):128–138, 1999. [44] Christine Decaestecker and Thierry Van de Merckt. No Title. In MACHINE LEARNING: ECML-95, volume 912 of Lecture Notes in Computer Science, pages 200–217. 1995. [45] Miguel Delgado, Antonio F Gómez-Skarmeta, and Fernando Martín. A fuzzy clustering based rapid-prototyping for fuzzy rule-based modeling. IEEE Transactions on Fuzzy Systems, 5(2):223–233, 1997. [46] D Driankov, H Hellendoorn, and M Reinfrank. An introduction to fuzzy control. Springer-Verlag, Heidelberg, Germany, 1993. [47] C Dujet and N Vincent. Force implication: a new approach to human reasoning. Fuzzy Sets and Systems, 69:53–63, 1995. [48] J. C. Dunn. A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. Journal of Cybernetics, 3(3):32–57, January 1973. [49] A Dvo\vrák. On linguistic approximation in the frame of fuzzy logic deduction. Soft Computing, 3(2):111–116, 1999.


Bibliography [50] J Espinosa and J Vandewalle. Constructing fuzzy models with linguistic integrity from numerical data-AFRELI algorithm, 2000. [51] J. Espinosa and J. Vandewalle. Extracting linguistic fuzzy models from numerical data - AFRELI algorithm. In J. Casillas, O. Cordón, F. Herrera, and L. Magdalena, editors, Interpretability Issues in Fuzzy Modeling, SpringerVerlag, chapter Chapter 3, pages 100–124. Springer-Verlag, Heidelberg, 2003. [52] M J Gacto, R Alcala, and F Herrera. Integration of an Index to Preserve the Semantic Interpretability in the Multiobjective Evolutionary Rule Selection and Tuning of Linguistic Fuzzy Systems. IEEE Transactions on Fuzzy Systems, 18(3):515–531, June 2010. [53] M J Gacto, R Alcala, and F Herrera. Interpretability of linguistic fuzzy rulebased systems: An overview of interpretability measures. Information Sciences, 181(20):4340–4360, March 2011. [54] J.Q. Gan. Extracting Takagi-Sugeno Fuzzy Rules with Interpretable Submodels via Regularization of Linguistic Modifiers. IEEE Transactions on Knowledge and Data Engineering, 21(8):1191–1204, August 2009. [55] Qiong Gao, Ming Li, and Paul Vitányi. Applying MDL to learn best model granularity. Artificial Intelligence, 121(1-2):1–29, 2000. [56] J M Garibaldi, S Musikasuwan, T Ozen, and R I John. A case study to illustrate the use of non-convex membership functions for linguistic terms. In IEEE International Conference on Fuzzy Systems, pages 1403–1408. IEEE, 2004. [57] A F Gómez-Skarmeta and F Jiménez. Fuzzy modeling with hybrid systems. Fuzzy Sets and Systems, 104(2):199–208, 1999. [58] A F Gomez-Skarmeta, F Jimenez, J Ibanez, and Et Al. Pareto-optimality in Fuzzy Modeling. 6th European Congress on Intelligent Techniques and Soft Computing EUFIT98, pages 694–700, 1998. [59] A González and R Pérez. Selection of relevant features in a fuzzy genetic learning algorithm. IEEE Transactions on Systems, Man, and Cybernetics— Part B: Cybernetics, 31(3):417–425, 2001. [60] Serge Guillaume. Designing Fuzzy Inference Systems from Data:. An Interpretability-Oriented Review, IEEE Trans. on Fuzzy Sys., 9(3):426—-443, 2001.


Bibliography [61] Serge Guillaume and Brigitte Charnomordic. Generating an Interpretable Family of Fuzzy Partitions From Data. IEEE Transactions on Fuzzy Systems, 12(3):324–335, June 2004. [62] Serge Guillaume and Brigitte Charnomordic. Learning interpretable fuzzy inference systems with FisPro. Information Sciences, 181(20):4409–4427, October 2011. [63] Serge Guillaume and Brigitte Charnomordic. Parameter optimization of a fuzzy inference system using the FisPro open source software. 2012 IEEE International Conference on Fuzzy Systems, pages 1–8, June 2012. [64] Madan M Gupta and J Qi. Design of fuzzy logic controllers based on generalized {T}-operators. Fuzzy Sets and Systems, 40:473–489, 1991. [65] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009. [66] K M Hangos. Special issue on grey box modelling. International Journal of Adaptive Control and Signal Processing, 9(6), 1995. [67] Christoph S Hermann. Symbolic Reasoning About Numerical Data: A Hybrid Approach. Applied Intelligence, 7(4):339–354, 1997. [68] F Herrera. A learning process for fuzzy control rules using genetic algorithms. Fuzzy Sets and Systems, 100(1-3):143–158, November 1998. [69] Francisco Herrera, Manuel Lozano, and José L Verdegay. Tuning fuzzy controllers by genetic algorithms. International Journal of Approximate Reasoning, 12:299–315, 1995. [70] Francisco Herrera, A Peregrin, Oscar Cor\-dón, Antonio Peregrín, and O Cordón. Applicability of the fuzzy operators in the design of fuzzy logic controllers. Fuzzy Sets and Systems, 86(1):15–41, 1997. [71] T.-P. Hong and J.-B. Chen. Finding relevant attributes and membership functions. Fuzzy Sets and Systems, 103(3):389–404, 1999. [72] H Ichihashi, T Shirai, K Nagasaka, and T Miyoshi. Neuro-fuzzy ID3: a method of inducing fuzzy decision trees with linear programming for maximizing entropy and an algebraic method for incremental learning. Fuzzy Sets and Systems, 81(1):157–167, 1996.


Bibliography [73] H Ishibuchi. Single-objective and two-objective genetic algorithms for selecting linguistic rules for pattern classification problems. Fuzzy Sets and Systems, 89(2):135–150, 1997. [74] H Ishibuchi, K Nozaki, N Yamamoto, and H Tanaka. Selecting fuzzy if-then rules for classification problems using genetic algorithms. IEEE Transactions on Fuzzy Systems, 3(3):260–270, 1995. [75] Hisao Ishibuchi, Ken Nozaki, and Hideo Tanaka. Distributed representation of fuzzy rules and its application to pattern classification. Fuzzy Sets and Systems, 52:21–32, 1992. [76] D P Pancho J M Alonso. Social network analysis of co-fired fuzzy rules, 2013. [77] M Jamei, M Mahfouf, and D A Linkens. Elicitation and fine-tuning of fuzzy control rules using symbiotic evolution. Fuzzy Sets and Systems, 147(1):57–74, October 2004. [78] J.-S.R. Jang. ANFIS: adaptive-network-based fuzzy inference system. IEEE Transactions on Systems, Man, and Cybernetics, 23(3):665–685, 1993. [79] C Z Janikow. Fuzzy decision trees: issues and methods. IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics, 28(1):1–14, January 1998. [80] N Jankowski and V Kadirkamanathan. Statistical Control of RBF-like Networks for Classification. In Wulfram Gerstner, Alain Germond, Martin Hasler, and Jean-Daniel Nicoud, editors, Artificial Neural Networks ICANN97, volume 1327 of Lecture Notes in Computer Science, pages 385–390. 1997. [81] Hans Roubos Janos Abonyi. Interpretable Semi-Mechanistic Fuzzy Models by Clustering, OLS and FIS Model Reduction. In J. Casillas, O. Cordón, F. Herrera, and L. Magdalena, editors, Interpretability Issues in Fuzzy Modeling, pages 221–248. Springer-Verlag, Heidelberg, 2003. [82] F Jimenez, A F Gomez-Skarmeta, H Roubos, R Babuska, F Jiménez, A F Gómez-Skarmeta, and R Babuška. A multi-objective evolutionary algorithm for fuzzy modeling. In Proceedings Joint 9th IFSA World Congress and 20th NAFIPS International Conference (Cat. No. 01TH8569), pages 1222–1228, New York, 2001. IEEE. [83] Y Jin. Fuzzy modeling of high-dimensional systems: complexity reduction and


Bibliography interpretability improvement. IEEE Transactions on Fuzzy Systems, 8(2):212– 221, April 2000. [84] Y Jin, W Von Seelen, and B Sendhoff. On generating FC(3) fuzzy rule systems from data using evolution strategies. IEEE transactions on systems man and cybernetics Part B Cybernetics a publication of the IEEE Systems Man and Cybernetics Society, 29(6):829–845, 1999. [85] Yaochu Jin and Bernhard Sendhoff. Extracting Interpretable Fuzzy Rules from RBF Networks. Neural Processing Letters, 17(2):149–164, April 2003. [86] Stephen C. Johnson. Hierarchical clustering schemes. 32(3):241–254, September 1967.

Psychometrika,

[87] Yau-Tarng Juang, Yun-Tien Chang, and Chih-Peng Huang. Design of fuzzy PID controllers using modified triangular membership functions. Information Sciences, 178(5):1325–1333, 2008. [88] J Kiszka, M Kochanska, and D Sliwinska. The influence of some fuzzy implication operators on the accuracy of a fuzzy model - {P}arts {I} and {II}. Fuzzy Sets and Systems, 15:111–128,223–240, 1985. [89] F. Klawonn and A Keller. Fuzzy clustering and fuzzy rules. Proceedings of the 7th International Fuzzy Systems Association World Congress {(IFSA’97)}, pages 193–198, 1997. [90] A Klose, A Nurnberger, and D Nauck. Some approaches to improve the interpretability of neuro-fuzzy classifiers. In 6th European Congress on Intelligent Techniques and Soft Computing, Aachen, Germany, pages 629–633, 1998. [91] SYOJI KOBASHI, NAOTAKE KAMIURA, YUTAKA HATA, and FUJIO MIYAWAKI. FUZZY INFORMATION GRANULATION ON BLOOD VESSEL EXTRACTION FROM 3D TOF MRA IMAGE. International Journal of Pattern Recognition and Artificial Intelligence, 14(04):409–425, June 2000. [92] T. Kohonen. Self-organizing maps, volume 30 of Information Sciences. Springer Verlag, 2001. [93] R Kowalczyk. On linguistic approximation of subnormal fuzzy sets. In 1998 Conference of the North American Fuzzy Information Processing Society NAFIPS, pages 329–333. IEEE, 1998. [94] A Krone, P Krause, and T Slawinski. A new rule reduction method for finding interpretable and small rule bases in high dimensional search spaces. In 9th

167

Bibliography IEEE International Conference on Fuzzy Systems, San Antonio, TX, USA, pages 693–699. IEEE, 2000. [95] A Krone and H Taeger. Data-based fuzzy rule test for fuzzy modelling. Fuzzy Sets and Systems, 123(3):343–358, 2001. [96] Chuen C Lee. Fuzzy logic in control systems: fuzzy logic controller – {P}arts {I} and {II}. IEEE Transactions on Systems, Man, and Cybernetics, 20(2):404–418,419–435, 1990. [97] H.-M. Lee, C.-M. Chen, J.-M. Chen, and Y.-L. Jou. An efficient fuzzy classifier with feature selection based on fuzzy entropy. IEEE Transactions on Systems, Man, and Cybernetics—Part B: Cybernetics, 31(3):426–432, 2001. [98] S C Lee and E T Lee. Fuzzy Neural Networks. Mathematical Biosciences, 23(1-2):151–177, February 1975. [99] T.Y. Y Lin. Granular computing on binary relations: Part I and II. Rough sets in knowledge discovery, 1:286–318, 1998. [100] Y. Linde, A. Buzo, and R. Gray. An Algorithm for Vector Quantizer Design. IEEE Transactions on Communications, 28(1):84–95, January 1980. [101] P Lindskog. Fuzzy identification from a grey box modeling point of view. In H Hellendoorn and D Driankov, editors, Fuzzy model identification, pages 3–50. Springer-Verlag, Heidelberg, Germany, 1997. [102] J Liska and S S Melsheimer. Complete design of fuzzy logic systems using genetic algorithms. In 3rd IEEE International Conference on Fuzzy Systems, pages 1377–1382. IEEE, 1994. [103] A. Lotfi, H.C. Andersen, and Ah Chung Tsoi. Interpretation preservation of adaptive fuzzy inference systems. International Journal of Approximate Reasoning, 15(4):379–394, November 1996. [104] Luis Magdalena, J M Alonso, and G González-Rodríguez. Looking for a good fuzzy system interpretability index: An experimental approach. International Journal of Approximate Reasoning, 51(1):115–134, December 2009. [105] Luis Magdalena and Felix Monasterio. A fuzzy logic controller with learning through the evolution of its knowledge base. International Journal of Approximate Reasoning, 16(3/4):335–358, 1997.

168

Bibliography [106] E H Mamdani. Applications of fuzzy algorithm for control a simple dynamic plant. In Proceedings of the IEE, pages 121(12):1585—-1588, 1974. [107] E H Mamdani and S Assilian. An experiment in linguistic synthesis with a fuzzy logic controller. International Journal of Man-Machine Studies, 7:1–13, 1975. [108] J G Marin-Blazquez. From approximative to descriptive fuzzy classifiers. IEEE Transactions on Fuzzy Systems, 10(4):484–497, August 2002. [109] J G Marín-Blázquez, Q Shen, and A F Gómez-Skarmeta. From approximative to descriptive models. In 9th IEEE International Conference on Fuzzy Systems, pages 829–834. IEEE, 2000. [110] Corrado Mencar. Interpretability of Fuzzy Information Granules. In Andrzej Bargiela and Witold Pedrycz, editors, Human-Centric Information Processing Through Granular Modelling, volume 182/2009 of Studies in Computational Intelligence, pages 95–118. Springer Berlin / Heidelberg, 2009. [111] Corrado Mencar, Ciro Castiello, Raffaele Cannone, and Anna Maria Fanelli. Interpretability assessment of fuzzy knowledge bases: A cointension based approach. International Journal of Approximate Reasoning, 52(4):501–518, June 2011. [112] Corrado Mencar, Arianna Consiglio, Giovanna Castellano, and Anna Maria Fanelli. Improving the Classification Ability of DC* Algorithm. In Francesco Masulli, Sushmita Mitra, and Gabriella Pasi, editors, Applications of Fuzzy Sets Theory (7th International Workshop on Fuzzy Logic and Applications, WILF 2007, Proceedings), volume 4578, pages 145–151. Springer Berlin / Heidelberg, 2007. [113] Corrado Mencar, Arianna Consiglio, and Anna Maria Fanelli. DCγ : Interpretable Granulation of Data through GA-based Double Clustering. In 2007 IEEE International Fuzzy Systems Conference, pages 1–6. Ieee, June 2007. [114] Corrado Mencar, Arianna Consiglio, and Anna Maria Fanelli. Interpretable Granulation of Medical Data with DC. In 7th International Conference on Hybrid Intelligent Systems (HIS 2007), pages 162–167, Kaiserlautern, September 2007. IEEE. [115] Corrado Mencar and A.M. Anna Maria Fanelli. Interpretability constraints

169

Bibliography for fuzzy information granulation. Information Sciences, 178(24):4585–4618, December 2008. [116] Corrado Mencar, Marco Lucarelli, Ciro Castiello, and Fanelli Anna Maria. Design of Strong Fuzzy Partitions from Cuts. In Proceedings of the 8th conference of the European Society for Fuzzy Logic and Technology, Advances in Intelligent Systems Research, pages 424–431, Paris, France, 2013. Atlantis Press. [117] Jerry M Mendel. Fuzzy logic systems for engineering: a tutorial. In Proceedings of the IEEE, pages 83(3):345—-377. IEEE, 1995. [118] R S Michalski. A theory and methodology of inductive learning. Artificial Intelligence, 20:111–161, 1983. [119] George A. Miller. The magical number seven, plus or minus two: some limits on our capacity for processing information. [120] D. Nauck. Knowledge discovery with NEFCLASS. In KES’2000. Fourth International Conference on Knowledge-Based Intelligent Engineering Systems and Allied Technologies. Proceedings (Cat. No.00TH8516), volume 1, pages 158–161. IEEE, 2000. [121] D Nauck, F Klawonn, and R Kruse. Foundations of Neuro-Fuzzy Systems. John Wiley and Sons, New York, 1997. [122] D Nauck and R Kruse. How the learning of rule weights affects the interpretability of fuzzy systems. In 1998 IEEE International Conference on Fuzzy Systems Proceedings. IEEE World Congress on Computational Intelligence (Cat. No.98CH36228), pages 1235–1240. IEEE, 1998. [123] D Nauck and R Kruse. Obtaining interpretable fuzzy classification rules from medical data. Artificial Intelligence in Medicine, 16(2):149–169, 1999. [124] D. Nauck, U. Nauck, and R. Kruse. Generating classification rules with the neuro-fuzzy system NEFCLASS. In Proceedings of North American Fuzzy Information Processing, pages 466–470. IEEE, 1996. [125] D D Nauck. Measuring interpretability in rule-based classification systems. In Proc. of 12th IEEE International Conference on Fuzzy Systems, volume 1, pages 196–201. IEEE, 2003. [126] Hiroyoshi Nomura, Hisao Hayashi, and Noboru Wakami. A self-tuning method of fuzzy control by descendent method. In Fourth International Fuzzy Systems

170

Bibliography Association World Congress (IFSA’91), Brussels, Belgium, pages 155–158, 1991. [127] Hiroyoshi Nomura, Hisao Hayashi, and Noboru Wakami. A learning method of fuzzy inference rules by descent method. In First IEEE International Conference on Fuzzy Systems (FUZZ-IEEE’92), San Diego, USA, pages 203–210. IEEE, 1992. [128] K Nozaki, H Ishibuchi, and H Tanaka. A Simple But Powerful Heuristic Method for Generating Fuzzy Rules from Numerical Data. Fuzzy Sets and Systems, 86(3):251–270, 1997. [129] R P Paiva and A Dourado. Merging and Constrained Learning for Interpretability in Neuro-Fuzzy Systems. In Proceedings of the First International Workshop on Hybrid Methods for Adaptive Systems, Tenerife, Spain, 2001. EUNITE. [130] C A Peña Reyes and M Sipper. Applying {F}uzzy {CoCo} to breast cancer diagnosis. In Congress on Evolutionary Computation, pages 1168–1175. IEEE Press, 2000. [131] C A Peña Reyes, M Sipper, and C.a. Pena-Reyes. Fuzzy CoCo: a cooperativecoevolutionary approach to fuzzy modeling. IEEE Transactions on Fuzzy Systems, 9(5):727–737, 2001. [132] W. Pedrycz and A. Bargiela. Granular clustering: a granular signature of data. IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics, 32(2):212–224, January 2002. [133] Witold Pedrycz. Why triangular membership functions? Systems, 64(1):21–30, 1994.

Fuzzy Sets and

[134] Witold Pedrycz. Fuzzy Modelling: Paradigms and Practice. Kluwer Academic Press, 1996. [135] M A Potter and K A De Jong. Cooperative coevolution: an architecture for evolving coadapted subcomponents. Evolutionary Computation, 8(1):1–29, 2000. [136] J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465– 471, September 1978.

171

Bibliography [137] H. Roubos and M. Setnes. Compact and transparent fuzzy models and classifiers through iterative complexity reduction. IEEE Transactions on Fuzzy Systems, 9(4):516–524, 2001. [138] Stuart J Russell and Peter Norvig. Artificial Intelligence, A Modern Approach - 2nd Edition.pdf, 2003. [139] T Saaty and M S Ozdemir. Why the magic number seven plus or minus two. Mathematical and Computer Modelling, 38(3-4):233–244, August 2003. [140] M Setnes, R Babuska, and H B Verbruggen. Rule-based modeling: precision and transparency. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews), 28(1):165–169, 1998. [141] M. Setnes and H. Roubos. Transparent fuzzy modeling using fuzzy clustering and GAs. In 18th International Conference of the North American Fuzzy Information Processing Society - NAFIPS (Cat. No.99TH8397), pages 198– 202. IEEE, July 1999. [142] J J Shann and H C Fu. A fuzzy neural network for rule acquiring on fuzzy control systems. Fuzzy Sets and Systems, 71:345–357, 1995. [143] R Silipo and M Berthold. Discriminative power of input features in a fuzzy model. In D Hand, J Kok, and M Berthold, editors, Advances in Intelligent Data Analysis, volume LNCS 1642, pages 87–98. Springer-Verlag, Heidelberg, Germany, 1999. [144] R Silipo and M Berthold. Input features’ impact on fuzzy decision processes. IEEE Transactions on Systems, Man, and Cybernetics—Part B: Cybernetics, 30(6):821–834, 2000. [145] T Söderström and P Stoica. System identification. Prentice Hall, NJ, USA, 1989. [146] T. Sudkamp, A. Knapp, and J. Knapp. Effect of rule representation in rule base reduction. In J. Casillas, O. Cordón, F. Herrera, and L. Magdalena, editors, Interpretability Issues in Fuzzy Modeling, pages 303–324. SpringerVerlag, Heidelberg, 2003. [147] M Sugeno and T Yasukawa. A fuzzy-logic-based approach to qualitative modeling. IEEE Transactions on Fuzzy Systems, 1(1):7, February 1993. [148] Michio Sugeno and G T Kang. Structure identification of fuzzy model. Fuzzy Sets and Systems, 28(1):15–33, 1988.


[149] Hideyuki Takagi and Isao Hayashi. NN-driven fuzzy reasoning. International Journal of Approximate Reasoning, 5(3):191–212, 1991.
[150] Hideyuki Takagi, Noriyuki Suzuki, Toshiyuki Koda, and Yoshihiro Kojima. Neural networks designed on approximate reasoning architecture and their applications. IEEE Transactions on Neural Networks, 3(5):752–760, 1992.
[151] Tomohiro Takagi and Michio Sugeno. Fuzzy identification of systems and its application to modeling and control. IEEE Transactions on Systems, Man, and Cybernetics, 15(1):116–132, 1985.
[152] D. Tikk, T. D. Gedeon, and K. W. Wong. A feature ranking algorithm for fuzzy modelling problems. In J. Casillas, O. Cordón, F. Herrera, and L. Magdalena, editors, Interpretability Issues in Fuzzy Modeling, pages 176–192. Springer-Verlag, Heidelberg, 2003.
[153] Enrique Trillas and Lorenzo Valverde. On implication and indistinguishability in the setting of fuzzy logic. In J. Kacprzyk and R. R. Yager, editors, Management Decision Support Systems Using Fuzzy Logic and Possibility Theory, pages 198–212. Verlag TÜV Rheinland, 1985.
[154] J. Valente de Oliveira. On the optimization of fuzzy systems using bio-inspired strategies. In Proceedings of the 1998 IEEE International Conference on Fuzzy Systems (IEEE World Congress on Computational Intelligence), pages 1229–1234, Anchorage, AK, 1998. IEEE.
[155] José Valente de Oliveira. Towards neuro-linguistic modeling: constraints for optimization of membership functions. Fuzzy Sets and Systems, 106(3):357–380, September 1999.
[156] V. Vanhoucke and R. Silipo. Interpretability in multidimensional classification. In Interpretability Issues in Fuzzy Modeling, pages 193–217. Springer-Verlag, 2003.
[157] L.-X. Wang and J. M. Mendel. Generating fuzzy rules by learning from examples. IEEE Transactions on Systems, Man, and Cybernetics, 22(6):1414–1427, 1992.
[158] Li-Xin Wang. Adaptive Fuzzy Systems and Control: Design and Analysis. Prentice-Hall, 1994.
[159] N. Xiong and L. Litz. Fuzzy modeling based on premise optimization. In 9th IEEE International Conference on Fuzzy Systems, San Antonio, TX, USA, pages 859–864. IEEE, 2000.
[160] Y. Y. Yao. Granular computing using neighborhood systems. In Advances in Soft Computing: Engineering Design and Manufacturing, pages 539–553. Springer-Verlag, London, 1999.
[161] G. G. Yen. Quantitative measures of the accuracy, comprehensibility, and completeness of a fuzzy expert system. In Proceedings of the 2002 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE'02), IEEE World Congress on Computational Intelligence, volume 1, pages 284–289, Honolulu, Hawaii, 2002. IEEE.
[162] Y. Yoshinari, Witold Pedrycz, and K. Hirota. Construction of fuzzy models through clustering techniques. Fuzzy Sets and Systems, 54:157–165, 1993.
[163] L. A. Zadeh. The concept of a linguistic variable and its application to approximate reasoning - II. Information Sciences, 8(1):199–249, 1975.
[164] L. A. Zadeh. Fuzzy sets and information granularity. In M. M. Gupta, R. K. Ragade, and R. R. Yager, editors, Advances in Fuzzy Set Theory and Applications, pages 3–18. North Holland, The Netherlands, 1979.
[165] Lotfi A. Zadeh. The concept of a linguistic variable and its applications to approximate reasoning - Parts I, II and III. Information Sciences, 8-9:43–80, 199–249, 301–357, 1975.
[166] Lotfi A. Zadeh. Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets and Systems, 90(2):111–127, September 1997.
[167] Lotfi A. Zadeh. Toward human level machine intelligence - is it achievable? The need for a paradigm shift. IEEE Computational Intelligence Magazine, 3(3):11–22, August 2008.
[168] Lotfi A. Zadeh. Outline of a new approach to the analysis of complex systems and decision processes. IEEE Transactions on Systems, Man, and Cybernetics, SMC-3(1):28–44, 1973.
[169] Shang-Ming Zhou and John Q. Gan. Low-level interpretability and high-level interpretability: a unified view of data-driven interpretable fuzzy system modelling. Fuzzy Sets and Systems, 159(23):3091–3131, December 2008.

