Systematic mapping study on granular computing

Saber Salehi a, Ali Selamat a,b,* and Hamido Fujita c

a UTM-IRDA Digital Media Centre of Excellence, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor, Malaysia
b Faculty of Computing, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor, Malaysia
c Iwate Prefectural University, 152-52 Sugo Takizawa, Iwate, 020-0193 Japan
Abstract

Granular computing (GrC) has attracted many researchers as a new and rapidly growing paradigm of information processing. In this paper, we apply a systematic mapping study to classify GrC research and discover relevant derivations that specify its research strength and quality. Our search scope is limited to Science Direct and IEEE Transactions papers published between January 2012 and August 2014. We defined four classification schemes to map the selected studies: focus area, contribution type, research type and framework. Mapping the selected studies shows that almost half of the research focus area belongs to the category of data analysis. In addition, most of the selected papers propose solutions in the research type scheme. The distribution of papers between the tool, method and enhancement categories of contribution type is almost equal. Moreover, 39% of the relevant papers belong to the rough set framework. The results show that cluster analysis has been less exploited to discover granules for classification. We therefore applied four clustering algorithms to three datasets from the UCI repository to compare the forms of information granules, and then classified the patterns and assigned them to specific classes based on their geometry and belongings. The clustering algorithms are DBSCAN, c-means, k-means and GAk-means, and the information granules are compared by coverage, misclassification and accuracy. The experimental results mostly show the GAk-means algorithm to be superior, and the c-means algorithm inferior, to the other clustering algorithms.

Keywords: Granular computing; Granular classifier; Clustering algorithm; Systematic mapping
1. Introduction

Granular computing (GrC), proposed by Zadeh [1, 2] and Lin [3], has emerged as a unified and coherent platform for constructing, describing, and processing information granules. It is growing rapidly as a new paradigm of information processing and is attracting many practitioners and researchers. It covers any methodologies, tools, theories and techniques that solve complex problems using information granules [4]. Although the term GrC is relatively new [5], the ideas and concepts related to GrC are not [4]. GrC identifies the essential commonalities between surprisingly diversified problems and the technologies used there, which can be cast into a unified framework known as a granular world. The outcomes
* Corresponding author: Ali Selamat (email: [email protected]), Faculty of Computing, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor, Malaysia
of GrC are achieved through the interaction of information granules and the external world, which could be another granular or numeric world, by collecting the necessary granular information [6]. A granule is a clump of objects drawn together by indistinguishability, similarity or functionality [3]. Granules and their relationships are employed to find solutions to the problem at hand. Different granularity levels, i.e., sizes of the information granules, are formed by decomposing and composing granules. GrC is a simple classical view of concepts related to the notion of granules. Different concept or rule levels are revealed based on the defined granularity levels [7]. GrC has become an effective framework for the design and implementation of efficient and intelligent information processing systems for various real-life decision-making applications [3]. Several reasoning formalisms have been applied to GrC, such as interval mathematics [8, 9], fuzzy sets [6, 10, 11], rough sets [12, 13], cluster analysis [14, 15] and hybrid models [16, 17]. GrC has appeared in various research areas such as image segmentation [18, 19], data analysis [12, 20] and granular data interaction [21]. It has also been combined with learning algorithms [14, 16] to form information granules. The following three criteria align GrC [4]: i) Information granules are the important constituents of knowledge processing and representation [22, 23]. ii) The size of the information granules (the level of granularity) is an essential factor in defining the overall strategy for solving the problem [24]. The main difficulty in GrC is finding the level of granularity of information, which is assigned based on the density of patterns in the universe [4]. iii) A hierarchy of information granules supports an important aspect of the perception of phenomena and delivers a tangible way of dealing with complexity by focusing on the most essential facets of the problem [19].

A systematic mapping study produces a structure of the research reports in a specific topic area at a high level of granularity to perceive what evidence is available on the topic [25]. It is mostly designed to indicate the quantity of studies in a specific research area. The aim of a systematic mapping study is to discover the strengths and gaps of the relevant studies by categorizing and mapping the results in a visual summary [25]. In fact, future research can be directed by the results of the mapping study to a suitable and appropriate area [26]. In the first step of this study, we apply a systematic mapping study [26, 27] to identify and classify GrC research published between 2012 and 2014 in order to find the research gaps for future work. In the next step, we implement the framework, among the defined frameworks, that has received little consideration based on the mapping of the relevant studies, to design the granular structure.

The rest of the paper is organized as follows. Section 2 explains our research method to identify the research gaps in GrC along with current research. Section 3 discusses the classification schemes of GrC from various aspects. Section 4 elaborates the results of the mapping and discusses the research questions. Section 5 presents the design of the granular structure using four clustering algorithms, i.e., DBSCAN, k-means, c-means and GAk-means. In Section 6, we include some datasets from the UCI machine learning repository [28] to evaluate the performance of the clustering algorithms.
Finally, section 7 points out the concluding remarks.

2. Research method

This section divides our research method (Fig. 1) into four subsections: research questions, research strategy, defining a classification scheme, and mapping relevant studies. Fig. 1 shows how the four research questions in this study are answered by defining the relevant classification schemes.
[Fig. 1. Current research method diagram: the research questions (RQ1-RQ4) are answered through the classification schemes (framework of granular models, research focus area, contribution type and research type), followed by mapping of the relevant studies.]
2.1. Research questions

This study summarizes the empirical evidence with the aim of identifying the research gaps in GrC and then filling these gaps. We defined the following research questions (RQs) to achieve this aim:

RQ1: What are the existing frameworks for GrC?
RQ2: Which areas are often considered by researchers using GrC?
RQ3: Which types of contributions have interested researchers so far?
RQ4: To which research types do the selected journal papers relate?

2.2. Research strategy

We developed a straightforward research strategy to determine the impact of relevant studies in GrC. The following three factors have influenced our research strategy: i) search strings, ii) search scope and iii) search method. Table 1 presents the search strings used for the concept of GrC. We extracted the search strings from the titles and keywords of papers. The search scope of this study is based on automatic searches limited to Science Direct (SD) and IEEE Transactions (IEEET) studies published between January 2012 and August 2014. We performed our search on the recent state of the art in the literature. The scope was then extended to other publications according to the related works and references of the articles found by the automatic searches, again limited to publications between January 2012 and August 2014. In the following section, we describe the search method based on the search strings and search scope.
Table 1. Search strings used for automatic searches

Concept: Granular computing
Alternatives used: granular computing OR granular classifier OR hyperbox OR granule OR granulation OR multigranulation OR multi-granulation

[Fig. 2. Study selection process: automatic searches build the initial set of publications (search based on keywords, 2012-2014; 96 papers), screen the primary studies (title, keywords and abstract of relevant journal proceedings) and extend them by related works and references (2012-2014; 33 papers); manual searches filter on keywords, read the abstract, introduction and conclusion, and apply the inclusion/exclusion criteria (inclusion: 119, exclusion: 10), yielding the relevant studies used for mapping.]
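The automatic screening step can be reproduced with a simple keyword filter. The sketch below is a minimal illustration, with a hypothetical paper record, of checking the Table 1 search strings against the title, keywords and abstract; it is not the authors' actual tooling.

```python
import re

# Search strings from Table 1, matched case-insensitively (an assumption).
SEARCH_TERMS = [
    "granular computing", "granular classifier", "hyperbox",
    "granule", "granulation", "multigranulation", "multi-granulation",
]
PATTERN = re.compile("|".join(re.escape(t) for t in SEARCH_TERMS), re.IGNORECASE)

def matches_search_strings(paper: dict) -> bool:
    """Screen a paper by its title, keywords and abstract, as in the automatic search."""
    text = " ".join(paper.get(field, "") for field in ("title", "keywords", "abstract"))
    return PATTERN.search(text) is not None

# Hypothetical record; the real studies came from Science Direct and IEEE Transactions.
paper = {"title": "A design of granular fuzzy classifier",
         "keywords": "granular computing; fuzzy sets", "abstract": "..."}
print(matches_search_strings(paper))  # True
```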
Fig. 2 shows the current research strategy, divided into automatic and manual methods. The automatic search confines the papers to between January 2012 and August 2014 using the given search strings. We performed the automatic search to find the relevant primary studies in GrC. We screened each paper against the set of keywords (Table 1) through the title, keywords and abstract of the journals. Then, the set of relevant papers was extended by the related works and references of the papers selected in the automatic searches. We confined the results of the manual search in the same way as the automatic search. In the next step, we refined the relevant papers by reading the abstract, introduction and conclusion. Then, the refined papers were evaluated based on the inclusion and exclusion criteria defined as follows:
Inclusion:
• Papers that apply a GrC technique to solve a problem in a new construction or as an extension of existing techniques
• Studies that arrange the existing studies on the GrC topic in order to analyze and discuss them
• Papers that implement existing GrC techniques or frameworks in order to present results and personal experience

Exclusion:
• Papers that only mention the GrC search strings (refer to Table 1) in the abstract or conclusion
• Papers that are not available in full text
• Papers presented as an introduction to a workshop
• Papers that contain the strings mentioned in Table 1 but are not related to GrC techniques in a Computer Science related area

Finally, the relevant studies are mapped through the defined classification schemes (refer to section 3) to identify the existing gaps in GrC. Moreover, based on the mapping results (refer to section 4), we implemented (refer to section 5) a less considered framework for designing the granular structure. Fig. 2 shows the 96 papers identified by the automatic searches based on the keywords defined in Table 1. The number of papers was then extended by a manual search to 123 papers. Based on the inclusion/exclusion criteria, we identified 113 papers for inclusion and 10 papers for exclusion.

2.3. Defining a classification scheme

The relevant studies in GrC are categorized based on the classification schemes of Petersen et al. [25]. The contribution type category of Petersen et al. [25] was modified to define the existing research focus areas in GrC. Furthermore, we added the framework category to the Petersen et al. [25] classification schemes. We classified the articles from four perspectives: focus area, contribution type, research type, and framework. These categories are adopted to map the studies in order to find the existing gaps in GrC. Each of the four categories is presented in section 3.

2.4. Mapping of relevant studies

The relevant studies were mapped to the specific categories defined in section 3. The results of the mapping are given in section 4.

3. Classification schemes

The classification schemes of Petersen et al. [25] are followed, with some modifications, to classify the selected relevant studies from the following four perspectives: focus area, contribution type, research type, and framework.

3.1. Research focus area

The selected studies in GrC were categorized according to four research focus areas from a broader perspective (RQ2). The four categories of research focus areas are defined based on the unique research topics they address. The areas of research in GrC are briefly described in the following subsections.
[Fig. 3. Research focus area categories in GrC: data analysis (human-centric way, interval-based evolving modeling), concept formation and learning, interaction, and segmentation (histogram-thresholding based, neighborhood based, clustering based, neuro-fuzzy based).]
I. Data analysis

GrC is a rich framework for data analysis. A human-centric way [12, 29] and interval-based evolving modeling [4, 30] are used to represent collections of information granules for spatiotemporal data and for heterogeneous data in time-varying systems, respectively.

A Human-centric Way: A human-centric way of data analysis often deals with data established by the user and distributed in space and time [12, 29], and is concerned with representing the data in an interpretable way. The data and relationships in the granular way of data analysis are defined in the spatial and temporal domains through a collection of information granules. There are compelling cases in various applications that consider spatiotemporal data distributions in time and space [4]. Shifting from machine-centered to human-centered approaches is considered one of the trends in GrC research [4, 31].

Interval-based Evolving Modeling: Interval-based evolving modeling (IBeM) is an adaptive and flexible procedure to deal with heterogeneous data in time-varying systems with non-stationary granular data streams [4, 30]. IBeM algorithms envelop uncertainty by accumulating the values associated with granules and rules. A granular way of data analysis is a rich framework for discovering the essence of the structure of the data. System and environment changes are tracked by evolving and updating rules in terms of IBeM learning algorithms [32, 33].
II. Concept formation and learning

From a GrC point of view, we consider concept formation and learning as one of the research focus area categories. There are relations between granulations and classifications, and between granules and concepts. Concepts have been considered essential units of human thought in various disciplines: philosophy, cognitive science, inductive learning and machine learning [34, 35]. GrC is a simple classical view of concepts, which is related to the notion of granules. Several issues must be considered in order to make a concept-learning algorithm effective and its results meaningful. One issue is the selection of a set of meaningful basic concepts, from which target concepts can be expressed and interpreted. Another issue is the design of strategies for learning; different strategies may lead to different descriptions of the target concepts [9, 36, 37].

III. Interaction

Interaction of objects is a fundamental requirement for performing computations when modeling complex systems [38]. The notion of a highly interactive granular system refers to a system in which intrastep interactions [39] with the external as well as the internal environment take place. This interaction can occur between defined objects of soft computing approaches, machine learning or data mining techniques [40]. One of the well-known approaches to interactive granular systems is Interactive Rough GrC (IRGC), which is used to model interactive computations [41, 42].
IV. Segmentation

The use of GrC for segmentation is one of the growing areas of interest among researchers [43, 44]. GrC has been applied to segment data such as images, words, knowledge, and signals [19]. Color image and video segmentation techniques in GrC can be classified as histogram-thresholding based [45], neighborhood based [44, 46], clustering based [19] and neuro-fuzzy based [43].

Histogram Thresholding Based: Histogram thresholding is a simple and popular technique for image segmentation that tries to find the valleys and peaks in a histogram [43]. The underlying assumption in histogram thresholding is that the dominating peaks in the image histogram can be identified [43, 45]. Hence, the segmentation task is reduced to finding thresholds dissecting the image histogram [47]. However, a major drawback of histogram thresholding techniques is the lack of use of spatial relationships amongst the pixels [48].

Clustering Based: Clustering-based approaches are used to form collections of information granules. Semantically meaningful constructions of individual pixels are drawn together based on their proximity (location) to construct information granules [4]. DBSCAN [49], Fuzzy c-means (FCM) and Fuzzy k-means (FKM) [50] are popular clustering algorithms used in GrC.

Neuro-fuzzy Based: A neuro-fuzzy system integrates the processing capabilities of neural networks with the readability of fuzzy rule-based systems [43]. Neural networks and fuzzy systems have dynamic and parallel processing capabilities to estimate input-output functions. Fuzzy information granulation theory and the fuzzy logic tool [1, 51] are the underlying factors for formalizing the information granules in a neuro-fuzzy system.

Neighborhood Based: Uniformity is a general criterion in the neighborhood-based approach for segmenting regions in an image [44]. Infinitesimal granules are formalized mathematically by the neighborhood-based model, which led to the invention of calculus, topology and nonstandard analysis. Each object in a neighborhood system is assigned a finite or infinite family of subsets, which is referred to as a neighborhood [38, 43]. Seed point initialization is a problem for neighborhood system methods when examining the areas and pixels [43].
3.2. Research type

In order to classify the research approaches in GrC (RQ4), various research types are considered in this study. Brief definitions of the research types proposed by Wieringa et al. [52] are presented below.

Definition 1 (Solution proposal). A solution proposal addresses a problem by proposing a novel solution or a significant extension of an existing technique. The benefits of the proposed solution are highlighted through an example or thorough argumentation and reasoning.

Definition 2 (Validation research). Novel techniques are implemented and investigated by validation research. Validation research examines a solution proposal that has not yet been applied in practice. It is conducted by means of experiments, prototypes, simulations, mathematical analysis, etc.

Definition 3 (Evaluation research). Evaluation research examines a solution that has already been applied in practice, whereas validation research concerns solutions not yet practically applied. The methods investigated in evaluation research are implemented to solve a problem in an empirical study, and usually case studies or field studies are used to present the results.

Definition 4 (Conceptual proposal). A conceptual proposal is an arrangement that represents things that already exist. It presents them in a novel way; however, it does not address a particular problem-solving task. Conceptual proposals may include taxonomies, theoretical frameworks, etc.

Definition 5 (Experience paper). An experience paper consists of a personal experience report from one or more real-life projects. The author's report usually elaborates the process and achievements of the project.

Definition 6 (Opinion paper). An opinion paper considers the suitability or unsuitability of a specific technique or tool based on the personal opinion of the author. Such personal opinions sometimes indicate the direction of a tool's or technique's development.

3.3. Contribution type

In this study, the contribution types in GrC and granular classification are divided into five categories (RQ3):

Tool: This type of contribution concentrates on providing a tool for GrC or granular classification. A tool or prototype can be integrated with other frameworks.

Model: This type of contribution investigates relationships, compares proposed techniques, discusses existing challenges, or classifies papers.

Metric: This refers to contributions that propose metrics to measure the effectiveness of GrC approaches.

Enhancement: This type of contribution focuses on hybridizing an existing GrC framework with optimization methods in order to overcome the demerits or limitations of the GrC framework.
Technique: This type of contribution focuses on proposing a new technique to discover information granules. It shows how a specific technique can be used in GrC technology.

3.4. Framework of granular models

Information granules can be defined to represent granular objects and handled in several different reasoning formalisms, such as sets (interval analysis) [9], fuzzy sets [6], rough sets [12], shadowed sets [53], cluster analysis [14], decision trees [54], neighborhood systems [55] and hybrid frameworks [16] (RQ1).

Interval analysis: Two different areas concentrate on interval-valued data, namely symbolic data analysis and interval analysis [56]. Interval analysis is applied to capture the key part of the GrC structure by set (interval) calculus. Sets construct areas of the feature space based on the high homogeneity of the patterns [57, 58]. Interval analysis introduces intervals as a fundamental means of representing real data, thus providing methods for the numeric processing of these intervals. Interval-number algebra and interval-set algebra are classified as concrete models of GrC [59].

Fuzzy sets: Fuzzy sets are sets whose patterns have degrees of membership. Fuzzy sets can be used to generate information granules in GrC [60, 61]. In addition, they can be used to cope with points located outside the information granules created by the other frameworks. In fact, the membership function is used as a solution to the problem of allocating degrees of belongingness [51, 57, 58].

Rough sets: Rough set theory is often considered one of the fundamental techniques of GrC for dealing with the vagueness of information in the data mining field [62]. In this view, granules are formed by means of rough inclusions as the classes of objects close to a specified center of a granule to a given degree [63-65].

Shadowed sets: Shadowed sets intend to capture and quantify the uncertainty that comes with the construction of any information granule [53, 66]. Shadowed sets are algorithmically induced from fuzzy sets and assume three values, which can be interpreted as full membership, full exclusion, and uncertainty. Shadowed sets are conceptually close to rough sets in spite of the differences in their mathematical foundations [66].

Cluster analysis: Cluster analysis is a major data analysis technique widely applied in emerging areas of data mining. The quality of a clustering result depends on both the similarity measure used by the method and its implementation, and on its ability to discover some or all of the hidden patterns [50, 57]. Clustering algorithms build the seeds of information granules, which are designed as the centers of the clusters, to grow the granular structures [15, 19, 58, 67]. The design of the granular structure using various clustering algorithms is detailed in section 5.

Decision trees: A top-down or a bottom-up method can be applied for granulation. A decision tree is an intuitive knowledge representation method. It uses tree structures to create decision sets as an efficient classifier. Decision tree algorithms usually take a top-down greedy approach: they choose the best attribute as the current attribute, and then recursively expand the branches of the decision tree until a certain terminal condition is satisfied [54]. There are two main problems [38] in constructing decision trees, where different solutions lead to various classification methods. The first problem is the attribute selection for making new
branches in the tree, and the second problem is pruning to reduce the tree. During the construction of a decision tree, a critical step in the context of GrC is to select the node attributes such that the tree has the smallest number of branches [54].

Hybrids: Optimizing the collection of information granules is an important issue in the context of any hybrid system [58]. Each individual technique of a hybrid system should symbiotically overcome the demerits or limitations of the other techniques rather than escalate the issues. This hybridization is performed through a fusion of learning and optimization techniques such as Genetic Algorithms (GA) [17, 45], Particle Swarm Optimization (PSO) [33, 58, 68] and Neural Networks [16].

4. Mapping and discussion of research questions

In this section, the volume of published papers in the various forums is considered (refer to Table 2). The percentages of publications in the defined classification schemes are also presented and analyzed (refer to Fig. 4). Then, the existing gaps and the emphases of existing research in GrC are mapped and discussed (refer to Figs. 5 and 6). Finally, the highly cited papers among the included papers are presented in Table 3.

Table 2 gives the number of selected papers in the different publications based on the Table 1 search strings. Table 2 shows that more papers were published in Science Direct journals than in IEEE Transactions journals. The Information Sciences journal has the most relevant publications compared to the other journals. Most of the GrC papers were published in 2013. Fig. 4 (a) to Fig. 4 (d) respectively show overviews of the four classification schemes in GrC research: research focus area, research type, contribution type, and framework of granular models. Fig. 4 (a) shows that almost half of the research focus area belongs to the category of data analysis. As mentioned above, we applied manual and automated searches to select the GrC papers from IEEET and SD journals. Fig. 4 (b) shows that most of the selected papers belong to the solution proposal and validation research categories of the research type scheme. Fig. 4 (c) shows that the distribution of papers between the tool, method and enhancement categories of contribution type is almost equal. Fig. 4 (d) shows that 39% of the relevant papers belong to the rough set framework. Almost 42% of the selected papers use the fuzzy set and hybrid frameworks of GrC. Less than 20% of the papers apply the other frameworks to solve the existing problems.
Table 2. Overview of the publication forums for the selected papers

Year (paper number): 2012 (38), 2013 (46), 2014 (37)

Science Direct (100): Applied Soft Computing (11), Engineering Applications of AI (1), European Journal of Operational Research (5), Expert Systems with Applications (7), Fuzzy Sets and Systems (4), Information Fusion (2), Information Sciences (28), International Journal of Approximate Reasoning (13), Journal of Network and Computer Applications (1), Neurocomputing (3), Artificial Intelligence in Medicine (2), Journal of the Egyptian Mathematical Society (1), Knowledge-Based Systems (9), Theoretical Computer Science (2), AASRI Procedia (1), Computers & Operations Research (1), Neural Networks (2), Computer Methods in Applied Mechanics and Engineering (1), Pattern Recognition (3), Mechanical Systems and Signal Processing (1), Computer Methods in Applied Mechanics (1)

IEEE Transactions (13): Fuzzy Systems (3), Industry Applications (1), Systems, Man, and Cybernetics (1), Knowledge and Data Engineering (3), Neural Networks and Learning Systems (2), Cybernetics (2), Information Forensics and Security (1)

Other publications (6): IEEE conference (3), International Journal of General Systems (1), Springer (2)
[Fig. 4. Distribution of (a) research focus area, (b) research type, (c) contribution type and (d) framework of granular model]
Fig. 5. Map of research focus area on GrC
Figs. 5 and 6 map the research focus area categories and the existing frameworks in GrC against the research and contribution types, respectively. An outline of these mappings reveals the existing research gaps as well as the emphases of existing research in GrC. Fig. 5 shows that the majority of research papers are dedicated to proposing models in the data analysis category of research focus area. Moreover, the combination of enhancement contributions with the concept formation and learning category, and the combination of method contributions with data analysis, are underdeveloped.
Fig. 6. Map of framework on GrC
As mentioned in Fig. 4 (b), most of the papers fall into the solution proposal category of research type. Among these papers, most considered the data analysis and concept formation and learning categories. Fig. 6 shows that there is a big gap in using the decision tree and neighborhood system frameworks with the existing contribution types and research types. In addition, there is a gap in combining the cluster analysis framework with the existing research and contribution types. On the other hand, most of the researchers proposed models and methods using the fuzzy set and rough set frameworks. Table 3 shows the highly cited papers among the selected papers in GrC research based on Google Scholar. The paper of Skowron et al. [42] in the Information Sciences journal has the most citations, with 62. Moreover, the paper of Qian et al. [65] in the International Journal of Approximate Reasoning was cited 26 times within eight months.
Table 3. Highly cited papers among the selected papers

Rank | Paper | Year | Times cited
1 | Skowron et al. [42] | 2012 | 62
2 | Wei et al. [69] | 2012 | 31
3 | Liang et al. [70] | 2012 | 31
4 | Zhang et al. [32] | 2012 | 29
5 | Lin et al. [71] | 2012 | 29
6 | Qian et al. [65] | 2014 | 26
7 | Zhu et al. [72] | 2012 | 24
8 | Escobar et al. [73] | 2013 | 21
9 | Wang et al. [30] | 2012 | 19
10 | Chen et al. [7] | 2012 | 17
11 | Yao et al. [4] | 2013 | 17
5. Design of the granular structure by clustering algorithms

Among the many works on granular models, cluster analysis has received less attention for discovering a certain category of granular classifiers, referred to here as hyperboxes (see Fig. 4 (d)). In this section, we define the geometry of the information granules for designing effective classifiers with four popular clustering algorithms: DBSCAN [49], k-means [50], c-means [50] and GAk-means [74]. The structure of the hyperboxes is created by a clustering algorithm to build the granular structures. The hyperboxes are applied to discover clusters and noise points. The underlying concept is that the typical density of patterns inside a created cluster is higher than outside of that cluster, according to some predefined probability density requirements. Two global parameters are required to build hyperboxes, namely the radius of the EPS-neighborhood (EPS) and the minimum number of points (MinPoints) in an EPS-neighborhood of a point. It is difficult to create an explicit relationship between 'MinPoints' and 'EPS' [49]. The current study relates these two significant parameters by following the design heuristic presented by Ester et al. [49].

5.1. Relevant definitions

We introduce some relevant definitions, which are used to construct the granular structure.

Definition 1 (Neighborhood). A neighborhood is composed of homogeneous points, located within a predefined distance. The neighborhood shape is determined by the distance function, such as the Tchebyshev or Euclidean distance, between two points 'p' and 'q'.

Definition 2 (Seed_point). A point is called a seed_point whenever it has more points than 'MinPoints' within a defined 'EPS' value, as presented in Fig. 7. The neighborhoods of seed_points must represent the same class.

Definition 3 (EPS-neighborhood). The EPS-neighborhood of a point 'p' wrt. 'EPS' and 'MinPoints' is defined as follows:
∀p, q ∈ D: N_EPS(p) = {q ∈ D | d(p, q) ≤ EPS}

where 'D' is any dataset.
In fact, a point 'q' in a cluster 'C' is placed inside the EPS-neighborhood of point 'p'. The EPS-neighborhood of 'p' is illustrated in Fig. 8.
Fig. 7. Hyperboxes of seed_point, border point, noise point and EPS
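Definition 3 translates directly into code. The following minimal sketch (NumPy, Euclidean distance as in formula (1) of section 5.2) computes the EPS-neighborhood of a point p over a dataset D; the function name is our own.

```python
import numpy as np

def eps_neighborhood(D: np.ndarray, p: np.ndarray, eps: float) -> np.ndarray:
    """Return all points q in D with d(p, q) <= EPS (Definition 3)."""
    # Euclidean distance from p to every point in D, cf. formula (1).
    distances = np.sqrt(((D - p) ** 2).sum(axis=1))
    return D[distances <= eps]

D = np.array([[0.0, 0.0], [0.1, 0.1], [0.9, 0.9]])
print(eps_neighborhood(D, D[0], eps=0.2))  # the two nearby points
```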
Definition 4 (Directly density-reachable). A point 'p' is directly density-reachable from a point 'q' wrt. EPS and MinPoints if:

1) p ∈ N_EPS(q)
2) |N_EPS(q)| ≥ MinPoints (core point condition).

Fig. 8 (a) shows that pairs of core points are symmetric; however, a core point and a border point in a hyperbox are asymmetric.

Definition 5 (Density-reachable). A density-reachable point in any dataset space is defined as follows:

∀p, q: If a chain of points p_1, ..., p_n exists, where p_1 = q and p_n = p, and each point p_{i+1} in the chain is directly density-reachable from p_i, then the point 'p' is density-reachable from the point 'q' with regard to 'EPS' and 'MinPoints'.
Density-reachability extends direct density-reachability: the relation is transitive, but it is not symmetric, as depicted in Fig. 8 (b).

Definition 6 (Density-connected). A density-connected point in 'D' is defined as follows:

∀p, q, o ∈ D: If 'p' and 'q' are density-reachable from 'o', then the point 'p' is density-connected to the point 'q' with regard to 'EPS' and 'MinPoints'.

Density-connectedness is symmetric for pairs of points, as shown in Fig. 8 (c).

Definition 7 (Density-based cluster). A density-based cluster in any dataset space is defined by the following conditions:

(1) ∀p, q: if p ∈ C and 'q' is density-reachable from 'p' with respect to 'EPS' and 'MinPoints', then q ∈ C. (Maximality)
(2) ∀p, q ∈ C: 'p' is density-connected to 'q' with regard to 'EPS' and 'MinPoints'. (Connectivity)
Definition 8 (Hyperbox). A hyperbox is an object that comprises a seed_point, 'EPS' and 'MinPoints'. The seed_point is placed at the center of the hyperbox, and the border of the hyperbox is at distance 'EPS' from the seed_point. Moreover, the hyperbox includes at least 'MinPoints' objects within the EPS-neighborhood radius. In fact, each cluster is constructed by a grouping of hyperboxes in which the seed_points of the hyperboxes are density-connected.
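Definition 8 suggests a simple data structure. A minimal sketch follows; the field and method names are our own, not the authors'.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Hyperbox:
    """A hyperbox per Definition 8: a seed_point at the center, an EPS radius,
    and at least MinPoints patterns inside; it represents a single class."""
    seed_point: np.ndarray
    eps: float
    label: int

    def contains(self, q: np.ndarray) -> bool:
        # A pattern belongs to the hyperbox if it lies in the
        # EPS-neighborhood of the seed_point (Definition 3).
        return float(np.sqrt(((self.seed_point - q) ** 2).sum())) <= self.eps
```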
Fig. 8. Relevant notations used in the primary structure: (a) 'p' is directly density-reachable from 'q' but 'q' is not directly density-reachable from 'p', (b) 'p' is density-reachable from 'q' and 'q' is not density-reachable from 'p', (c) 'p' and 'q' are density-connected to each other by 'o'
Definition 9 (Noise point). A noise point does not belong to any created cluster, as presented in Fig. 7.

Definition 10 (Border point). A point is called a border point whenever it is density-reachable from another seed_point located inside the hyperbox but is not a seed_point itself, as presented in Fig. 7.

5.2. Clustering algorithms

We use clustering algorithms to build the seeds of the hyperboxes, which are designed as the centers of the clusters. In fact, interval analysis is applied to capture the key part of the classifier's structure. Sets construct areas of the feature space based on the high homogeneity of the patterns. The classifiers establish the geometry of the feature space, and the decision boundaries are interpreted as the classification process and analyzed according to the criteria presented in section 6.1.
I. DBSCAN and k-means

In this section, the procedure used in the DBSCAN and k-means algorithms is described. The aim of this phase is to build hyperboxes for discovering the information granules, creating clusters and noise points in a dataset according to the noise point and density-based cluster definitions. First, the training datasets are split based on class label. The hyperbox-finding procedure looks at the EPS-neighborhood of each pattern in the training patterns and returns all density-reachable patterns from that pattern with regard to 'EPS' and 'MinPoints'. This process yields a new hyperbox 'Hi', which contains the points in the EPS-neighborhood. If there is no cluster yet for the specified class label, a new cluster 'C' is created for that class label; otherwise, the new hyperbox is added to the existing cluster. Each hyperbox represents only a single class. In the next step, all the unvisited points (border points) in 'Hi' are checked against the EPS-neighborhood definition. If the EPS-neighborhood of a point in 'Hi' contains more than 'MinPoints' patterns, then a new hyperbox 'Hi+1' is created and added to cluster 'C' in order to build it. This recursive call is performed iteratively to gather all directly density-reachable points until no new point remains to be added to the current cluster 'C'. The DBSCAN algorithm extends its hyperbox-finding procedure to all points that have not yet been visited, creating new hyperboxes for the different classes and new clusters. In this research, formulas (1) and (2) are used as the distance formulas in the DBSCAN [49] and k-means [50] clustering algorithms, respectively:

d(x, y) = √( Σ_{i=1}^{n} (x_i − y_i)² )    (1)

d(x, y) = Σ_{i=1}^{n} |x_i − y_i|²    (2)

where 'x' and 'y' denote two pattern vectors, 'x_i' and 'y_i' denote their attributes, and 'n' is the number of attributes.
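A condensed sketch of the hyperbox-finding expansion loop described above is given below. This is our reading of the procedure, not the authors' code; the class-label bookkeeping is omitted and border points are handled in a simplified way.

```python
import numpy as np

def build_hyperboxes(X: np.ndarray, eps: float, min_points: int):
    """Grow density-based clusters of seed_points for one class (cf. Definitions 4-8)."""
    visited = np.zeros(len(X), dtype=bool)
    clusters = []
    for i in range(len(X)):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = [j for j in range(len(X))
                     if np.sqrt(((X[i] - X[j]) ** 2).sum()) <= eps]
        if len(neighbors) < min_points:
            continue                       # noise or border point: no new hyperbox
        cluster = [i]                      # X[i] becomes the seed_point of a hyperbox
        queue = list(neighbors)
        while queue:                       # gather all density-reachable points
            j = queue.pop()
            if visited[j]:
                continue
            visited[j] = True
            nbrs_j = [k for k in range(len(X))
                      if np.sqrt(((X[j] - X[k]) ** 2).sum()) <= eps]
            if len(nbrs_j) >= min_points:  # X[j] seeds a further hyperbox in 'C'
                cluster.append(j)
                queue.extend(nbrs_j)
        clusters.append(cluster)
    return clusters
```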
II. C-means

The underlying step of using a clustering algorithm to discover clusters in GrC is to build the seeds of the hyperboxes, designed as the centers of the clusters. Formula (5) is used as the distance formula to create hyperboxes in c-means [50]. This formula is separated into two parts: the first part is the k-means formula and the second part is a membership function. The membership degree of a point 'x' is denoted by 'u_i' (refer to formula (4)). To obtain 'u_i', we use formula (2) to find the distance of the point 'x' to the seed points of each cluster; the k-means algorithm is applied to discover the seeds. Next, the hyperbox-finding procedure looks at the EPS-neighborhood of each pattern in the training patterns and returns all density-reachable patterns from that pattern with regard to 'EPS' and 'MinPoints'. This process yields a new hyperbox 'Hi', and the cluster 'C' contains the points in the EPS-neighborhood.

avg(x, C_i) = (d(x, S_1) + d(x, S_2) + ... + d(x, S_n)) / n    (3)

u_i = 1 / Σ_{j=1}^{c} (avg(x, C_i) ÷ avg(x, C_j))²    (4)

d(x, y) = Σ_{i=1}^{n} |x_i − y_i|² × u_i    (5)

where 'x' and 'y' denote two pattern vectors, 'x_i' and 'y_i' denote their attributes, and 'n' is the number of attributes.
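Formulas (3)-(5) can be sketched as follows, assuming seed points already discovered by k-means; the helper names are our own, and the squared distance of formula (2) is used throughout.

```python
import numpy as np

def avg_dist(x: np.ndarray, seeds: np.ndarray) -> float:
    """Formula (3): mean squared distance from x to the seed points of one cluster."""
    return float(np.mean([np.sum((x - s) ** 2) for s in seeds]))

def membership(x: np.ndarray, cluster_seeds: list, i: int) -> float:
    """Formula (4): membership degree u_i of x in cluster C_i."""
    a_i = avg_dist(x, cluster_seeds[i])
    return 1.0 / sum((a_i / avg_dist(x, c_j)) ** 2 for c_j in cluster_seeds)

def weighted_distance(x: np.ndarray, y: np.ndarray, u_i: float) -> float:
    """Formula (5): squared distance scaled by the membership degree."""
    return float(np.sum(np.abs(x - y) ** 2) * u_i)
```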
III. GAk-means

Maulik and Bandyopadhyay [74] proposed the GAk-means algorithm. In this study, the k-means algorithm is applied in the first step to discover the seed points. The discovered seed points play the role of chromosomes for the genetic algorithm. Then, mutation and crossover are applied to the selected chromosomes. We assumed rates of 60% and 20% for the total rate and token rate, respectively. We calculate the fitness value of the selected chromosomes in each iteration based on Maulik and Bandyopadhyay [74].
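A rough sketch of the GAk-means loop as we read it is given below. The rates follow the values stated above, the fitness function is a simplified stand-in for the clustering metric of [74], and all names and operator details are our own assumptions.

```python
import numpy as np

def fitness(centroids: np.ndarray, X: np.ndarray) -> float:
    # Inverse of the total within-cluster squared distance: larger is better
    # (a simplified stand-in for the fitness of Maulik and Bandyopadhyay [74]).
    d = np.min(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
    return 1.0 / (d.sum() + 1e-12)

def gak_means(X, init_centroids, generations=50, pop=20,
              crossover_rate=0.6, mutation_rate=0.2,
              rng=np.random.default_rng(0)):
    """Evolve k-means centroids (chromosomes) with crossover and mutation."""
    k, dim = init_centroids.shape
    # Population of chromosomes seeded around the k-means centroids.
    population = init_centroids + rng.normal(0, 0.1, size=(pop, k, dim))
    for _ in range(generations):
        scores = np.array([fitness(c, X) for c in population])
        population = population[np.argsort(scores)[::-1]]   # elitist selection
        for i in range(pop // 2, pop):                      # refill the weaker half
            a, b = population[rng.integers(0, pop // 2, size=2)]
            child = np.where(rng.random((k, dim)) < crossover_rate, a, b)
            mask = rng.random((k, dim)) < mutation_rate
            population[i] = child + mask * rng.normal(0, 0.1, size=(k, dim))
    return population[0]                                    # best centroid set
```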
6. Experimental studies

This section aims at obtaining and comparing the best forms of information granules produced by the aforementioned clustering algorithms (section 5.2) to classify the patterns and define their geometry and belonging to a specific class. In the following subsections, we describe the experimental datasets and the performance evaluation criteria used in this study. Finally, we analyze the performance of the clustering algorithms based on the evaluation criteria. We have analyzed the effect of the proposed model on the testing datasets. As the experimental results in the tables of Appendices A-D show, two thirds or more of the patterns in the training datasets are included as seed_points for classification. For example, Table 4 shows 398 training patterns for the Wisconsin dataset, while Table 8 in Appendix A shows between 254 and 387 seed_points in the training phase of the k-means clustering algorithm for various 'MinPts' and 'EPS' values.

6.1. Machine learning datasets

In this study, we have used three machine learning datasets from the UCI repository related to classification problems [28]. Table 4 details each dataset by its area, number of classes (#Class), number of attributes (#Att), number of examples (#Ex), number of training patterns (#Train) and number of testing patterns (#Test). We divided each dataset into training and testing parts by randomly assigning 70% and 30% of the patterns to the training and testing phases, respectively.

Table 4. Datasets summary

Dataset | Area | #Class | #Att | #Ex | #Train | #Test
Glass | Physical | 7 | 9 | 214 | 150 | 64
Ecoli | Life | 8 | 8 | 336 | 235 | 101
Wisconsin | Life | 2 | 30 | 569 | 398 | 171
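The 70/30 split can be reproduced as below. This is a sketch under the assumption that loading the UCI files yields a feature matrix X and a label vector y; the function name is our own.

```python
import numpy as np

def split_70_30(X: np.ndarray, y: np.ndarray, seed: int = 0):
    """Randomly assign 70% of the patterns to training and 30% to testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(0.7 * len(X))
    train, test = idx[:cut], idx[cut:]
    return X[train], y[train], X[test], y[test]
```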
I. Glass Datasets

Glass datasets are used in criminological investigations. Identifying glass as a "float" type or not is an important issue at the scene of a crime. Thus, we consider Glass classification in two settings: the glass is of type "float" or "not a float", and the glass is of type "non_float" or "not a non_float", as illustrated in Table 5.

Table 5. Glass categories for classification

Glass category | #Ex | Type of Glass
Float | 87 | building_windows_float_processed, vehicle_windows_float_processed
Not a float | 127 | building_windows_non_float_processed, vehicle_windows_non_float_processed, containers, tableware, headlamps
Non_float | 76 | building_windows_non_float_processed, vehicle_windows_non_float_processed
Not a non_float | 138 | building_windows_float_processed, vehicle_windows_float_processed, containers, tableware, headlamps
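Recasting the seven Glass types into the two binary tasks of Table 5 amounts to relabeling. A sketch follows, assuming the usual UCI class codes (1 and 3 float-processed windows, 2 and 4 non-float-processed windows, 5-7 containers, tableware and headlamps); verify the codes against the dataset documentation.

```python
import numpy as np

# Assumed UCI Glass class codes: 1, 3 = float-processed windows;
# 2, 4 = non-float-processed windows; 5-7 = containers, tableware, headlamps.
FLOAT_TYPES = {1, 3}
NON_FLOAT_TYPES = {2, 4}

def relabel(y: np.ndarray, positive: set) -> np.ndarray:
    """Map the multi-class labels onto one of the binary tasks of Table 5."""
    return np.isin(y, list(positive)).astype(int)

y = np.array([1, 2, 5, 3, 7])
print(relabel(y, FLOAT_TYPES))      # float / not a float       -> [1 0 0 1 0]
print(relabel(y, NON_FLOAT_TYPES))  # non_float / not non_float -> [0 1 0 0 0]
```

The same relabeling applies to the two Ecoli categorizations of Table 6 below.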
II. Wisconsin Datasets

The Wisconsin dataset [28] is related to life and medical research. Its features are computed from the breast mass in order to diagnose cancer. It has two class types, "malignant" and "benign".
III. Ecoli Datasets

The Ecoli dataset [28] contains protein localization sites and is used in the life area. Table 4 indicates eight classes for the Ecoli dataset. We used two categorizations for classification, "Cytoplasm or Not a Cytoplasm" and "Inner membrane or Not an Inner membrane", as presented in Table 6.

Table 6. Ecoli categories for classification

Ecoli category | #Ex | Type of Ecoli
Cytoplasm | 143 | cytoplasm (cp)
Not a Cytoplasm | 193 | periplasm (pp), inner membrane without signal sequence (im), inner membrane (imU), inner membrane lipoprotein (imL), inner membrane (imS), outer membrane (om), outer membrane lipoprotein (omL)
Inner membrane | 116 | inner membrane without signal sequence (im), inner membrane (imU), inner membrane lipoprotein (imL), inner membrane (imS)
Not an Inner membrane | 220 | cytoplasm (cp), periplasm (pp), outer membrane (om), outer membrane lipoprotein (omL)
6.2. Evaluation Criteria

The performance of the k-means, c-means, DBSCAN and GAk-means structures was evaluated by the classification error and the coverage percentage. The classification error is the number of patterns that do not belong to any hyperbox, known as misclassified patterns, divided by the whole number of patterns in the structure (refer to formula (6)). Formula (7) quantifies the percentage of patterns covered by each clustering structure: the number of patterns covered by the clustering structure divided by the whole number of patterns.

α(%) = (number of misclassified patterns in the primary structure / number of patterns in the primary structure) × 100    (6)

Cov(%) = (number of patterns covered by the primary/secondary structure / number of patterns) × 100    (7)

In this research, the accuracy formula is used to measure the classification performance of the proposed technique. Classification accuracy is calculated from the false negatives (FN) and false positives (FP); FP represents the percentage of legitimate data and FN the percentage of non-legitimate data [75].

Accuracy = ((TP + TN) / number of patterns) × 100    (9)
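Formulas (6), (7) and (9) translate directly into code. A minimal sketch follows, under our reading that `covered` flags the patterns belonging to the primary structure; the function names are our own.

```python
import numpy as np

def misclassification(y_true, y_pred, covered) -> float:
    """Formula (6): misclassified fraction among patterns in the primary structure."""
    wrong = np.sum((y_true != y_pred) & covered)
    return 100.0 * wrong / covered.sum()

def coverage(covered) -> float:
    """Formula (7): percentage of patterns covered by the clustering structure."""
    return 100.0 * covered.sum() / len(covered)

def accuracy(y_true, y_pred) -> float:
    """Formula (9): (TP + TN) / number of patterns, for a binary task."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return 100.0 * (tp + tn) / len(y_true)
```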
6.3. Experimental results

In this part of the study, we present and analyze the results of the k-means, c-means, DBSCAN and GAk-means clustering algorithms used in GrC. We ran the proposed model 10 times, each run based on a new training-testing split. The compared models used the same number of runs, and their average results are presented for various MinPts and EPS values. Appendices A-D present the results of k-means, DBSCAN, c-means and GAk-means on the Wisconsin, Ecoli and Glass datasets. Furthermore, Figs. 9-13 show the best results for these datasets based on the mentioned evaluation criteria.
Fig. 9 shows that the best result for the "float" or "not a float" category of the Glass dataset belongs to the formation of the GAk-means clustering algorithm, with 'MinPoint' and 'EPS' values of '2' and '0.6'. In addition, c-means has the worst result among the tested clustering algorithms for this category. Fig. 10 shows that the best accuracy result in the "non-float" and "not a non-float" category of the Glass dataset occurred for the DBSCAN clustering algorithm, with 'MinPoint' and 'EPS' values of '2' and '0.1'. In addition, DBSCAN and GAk-means have the best coverage and misclassification error results, with 'MinPoint' and 'EPS' values of '2' and '0.1' for DBSCAN, '2' and '0.6' for GAk-means in the coverage results, and '3' and '0.6' for GAk-means in the misclassification results.
Fig. 9. The best coverage (a), misclassification error (b) and accuracy mean and deviation (c) results of the k-means, c-means, DBSCAN and GAk-means clustering methods for different MinPts and EPS values for the float/not a float category of the Glass dataset
Fig. 10. The best coverage (a), misclassification error (b) and accuracy mean and deviation (c) results of the k-means, c-means, DBSCAN and GAk-means clustering methods for different MinPts and EPS values for the non-float/not a non-float category of the Glass dataset
Fig. 11 shows the best coverage, misclassification and accuracy results for the "cytoplasm" and "not a cytoplasm" category of the Ecoli dataset for the mentioned clustering algorithms. Unlike the accuracy results for the Glass dataset, the best accuracy for the Ecoli dataset occurred for the c-means clustering algorithm, with 'MinPoint' and 'EPS' values of '2' and '0.1'. In addition, the best coverage and misclassification results belong to the DBSCAN clustering algorithm, with 'MinPoint' and 'EPS' values of '2' and '0.9' for the coverage result and '4' and '0.95' for the misclassification error result. Fig. 12 shows that the best coverage, misclassification and accuracy results occurred for the k-means clustering algorithm for the "inner membrane" and "not an inner membrane" categories of the Ecoli dataset. Furthermore, the GAk-means algorithm revealed the worst results among the tested clustering algorithms.
Fig. 11. The best coverage (a), misclassification error (b) and accuracy mean and deviation (c) results of the k-means, c-means, DBSCAN and GAk-means clustering methods for different MinPts and EPS values for the cytoplasm/not a cytoplasm category of the Ecoli dataset
Fig. 12. The best coverage (a), misclassification error (b) and accuracy mean and deviation (c) results of the k-means, c-means, DBSCAN and GAk-means clustering methods for different MinPts and EPS values for the inner membrane/not an inner membrane category of the Ecoli dataset
Fig. 13 shows the best coverage, misclassification and accuracy results for the Wisconsin dataset for the different clustering algorithms. Fig. 13 (a) and Fig. 13 (b) show that the best coverage and misclassification results belong to the DBSCAN and k-means clustering algorithms, respectively. In addition, the formation with 'MinPoint' and 'EPS' values of '2' and '0.6' yields the best accuracy, for the GAk-means clustering algorithm. Table 7 summarizes the best clustering algorithms for the various machine learning datasets.
Fig. 13. The best coverage (a), misclassification (b) and accuracy mean and deviation (c) results of the k-means, c-means, DBSCAN and GAk-means clustering methods for different MinPts and EPS values for the Wisconsin dataset
Table 7. Best clustering algorithms based on coverage, misclassification error and accuracy results in GrC

Dataset | Coverage | Misclassification error | Accuracy
Glass (float/not a float) | GAk-means | GAk-means | GAk-means
Glass (non_float/not a non_float) | GAk-means & DBSCAN | GAk-means & DBSCAN | DBSCAN
Ecoli (Cytoplasm/Not a Cytoplasm) | DBSCAN | DBSCAN | c-means
Ecoli (Inner/Not an Inner) | k-means | k-means | k-means
Wisconsin | k-means | k-means | GAk-means
7. Conclusion

This paper applied a systematic mapping study to categorize GrC research and discover relevant derivations that specify the research strength and quality of this field. We supported our research strategy with 113 relevant published articles. The results indicate the following points: i) almost half of the research focus area belongs to the category of data analysis in the research focus area scheme, ii) most of the selected articles belong to the solution proposal and validation research categories in the research type scheme, iii) the distribution of articles between the tool, method and enhancement categories of contribution type is almost equal; and iv) 39% of the relevant articles belong to the rough set framework. Furthermore, the mapping results show a big gap in using decision trees and neighborhood systems with the existing contribution types and research types. Moreover, there is a gap in combining cluster analysis with the existing contribution types. Among the many works on granular models, less attention has been paid to cluster analysis for discovering a certain category of granular classifiers, referred to here as hyperboxes. Therefore, we defined the geometry of the information granules to design effective classifiers with four popular clustering algorithms, i.e., DBSCAN, k-means, c-means and GAk-means. We applied these algorithms to the Wisconsin, Ecoli and Glass datasets. The best clustering algorithms based on the coverage, misclassification error and accuracy results for the various machine learning datasets are summarized in Table 7.

Acknowledgements

The Universiti Teknologi Malaysia (UTM) and Ministry of Education Malaysia under research university grants 00M19 and 01G72 are hereby acknowledged for some of the facilities that were utilized during the course of this research work.

References

[1] L.A. Zadeh, Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic, Fuzzy Sets and Systems, 90 (1997) 111-127. [2] L.A. Zadeh, Fuzzy logic = computing with words, Fuzzy Systems, IEEE Transactions on, 4 (1996) 103-111. [3] S.K. Pal, S.K. Meher, Natural computing: A problem solving paradigm with granular information processing, Applied Soft Computing, 13 (2013) 3944-3955. [4] J.T. Yao, A.V. Vasilakos, W. Pedrycz, Granular computing: perspectives and challenges, Cybernetics, IEEE Transactions on, 43 (2013) 1977-1989. [5] J. Yao, A ten-year review of granular computing, in: Granular Computing, 2007. GRC 2007. IEEE International Conference on, IEEE, 2007, pp. 734-734. [6] W. Pedrycz, Allocation of information granularity in optimization and decision-making models: Towards building the foundations of Granular Computing, European Journal of Operational Research, 232 (2014) 137-145. [7] H. Chen, T. Li, D. Ruan, Maintenance of approximations in incomplete ordered decision systems while attribute values coarsening or refining, Knowledge-Based Systems, 31 (2012) 140-161. [8] V.G. Kaburlasos, A. Kehagias, Fuzzy inference system (FIS) extensions based on lattice theory, (2013). [9] M.G. Cimino, B. Lazzerini, F. Marcelloni, W. Pedrycz, Genetic interval neural networks for granular data regression, Information Sciences, 257 (2014) 313-330. [10] W. Lu, W. Pedrycz, X. Liu, J. Yang, P. Li, The modeling of time series based on fuzzy information granules, Expert Systems with Applications, 41 (2014) 3799-3808. [11] F.J. Cabrerizo, E. Herrera-Viedma, W.
Pedrycz, A method based on PSO and granular computing of linguistic information to solve group decision making problems defined in heterogeneous contexts, European Journal of Operational Research, 230 (2013) 624-633. [12] A. Albanese, S.K. Pal, A. Petrosino, Rough Sets, Kernel Set, and Spatiotemporal Outlier Detection, Knowledge and Data Engineering, IEEE Transactions on, 26 (2014) 194-207. [13] X. Zhang, D. Miao, Quantitative information architecture, granular computing and rough set models in the doublequantitative approximation space of precision and grade, Information Sciences, 268 (2014) 147-168. [14] X.-Q. Tang, P. Zhu, Hierarchical clustering problems and analysis of fuzzy proximity relation on granular space, Fuzzy Systems, IEEE Transactions on, 21 (2013) 814-824. [15] M.A. Sanchez, O. Castillo, J.R. Castro, P. Melin, Fuzzy granular gravitational clustering algorithm for multivariate data, Information Sciences, 279 (2014) 498-511. [16] R. Davtalab, M.H. Dezfoulian, M. Mansoorizadeh, Multi-level fuzzy min-max neural network classifier, (2014).
[17] D. Sánchez, P. Melin, Optimization of modular granular neural networks using hierarchical genetic algorithms for human recognition using the ear biometric measure, Engineering Applications of Artificial Intelligence, 27 (2014) 41-56. [18] D. Sanchez-Valdes, A. Alvarez-Alvarez, G. Trivino, Linguistic description about circular structures of the Mars' surface, Applied Soft Computing, 13 (2013) 4738-4749. [19] P. Lingras, A. Elagamy, A. Ammar, Z. Elouedi, Iterative meta-clustering through granular hierarchy of supermarket customers and products, Information Sciences, 257 (2014) 14-31. [20] H. Li, S. Yang, H. Liu, Study of Qualitative Data Cluster Model based on Granular Computing, AASRI Procedia, 4 (2013) 329-333. [21] D.K. Chiu, T.W. Lui, NHOP: A nested associative pattern for analysis of consensus sequence ensembles, Knowledge and Data Engineering, IEEE Transactions on, 25 (2013) 2314-2324. [22] Y. Yao, The art of granular computing, in: Rough Sets and Intelligent Systems Paradigms, Springer, 2007, pp. 101-112. [23] Y. Yao, Perspectives of granular computing, in: Granular Computing, 2005 IEEE International Conference on, IEEE, 2005, pp. 85-90. [24] J. Yao, Information granulation and granular relationships, in: Granular Computing, 2005 IEEE International Conference on, IEEE, 2005, pp. 326-329. [25] K. Petersen, R. Feldt, S. Mujtaba, M. Mattsson, Systematic mapping studies in software engineering, in: 12th International Conference on Evaluation and Assessment in Software Engineering, 2008, pp. 1. [26] S. Keele, Guidelines for performing systematic literature reviews in software engineering, Technical report, EBSE Technical Report EBSE-2007-01, 2007. [27] D. Budgen, M. Turner, P. Brereton, B. Kitchenham, Using mapping studies in software engineering, in: Proceedings of PPIG, 2008, pp. 195-204. [28] K. Bache, M. Lichman, UCI machine learning repository, URL http://archive.ics.uci.edu/ml, 19 (2013). [29] V.G. Kaburlasos, T. Pachidis, A lattice-computing ensemble for reasoning based on formal fusion of disparate data types, and an industrial dispensing application, Information Fusion, 16 (2014) 68-83. [30] L. Wang, X. Yang, J. Yang, C. Wu, Relationships among generalized rough sets in six coverings and pure reflexive neighborhood system, Information Sciences, 207 (2012) 66-78. [31] J. Yao, I. Global, Novel developments in granular computing: applications for advanced human reasoning and soft computation, Information Science Reference, 2010. [32] Z. Zhang, C. Guo, A method for multi-granularity uncertain linguistic group decision making with incomplete weight information, Knowledge-Based Systems, 26 (2012) 111-119. [33] S. Mingli, W. Pedrycz, Granular neural networks: concepts and development schemes, IEEE Transactions on Neural Networks and Learning Systems, 24 (2013) 542-553. [34] Y. Yao, X. Deng, A granular computing paradigm for concept learning, in: Emerging Paradigms in Machine Learning, Springer, 2013, pp. 307-326. [35] A.R. Solis, G. Panoutsos, Granular computing neural-fuzzy modelling: A neutrosophic approach, Applied Soft Computing, 13 (2013) 4010-4021. [36] H. Hu, L. Pang, D. Tian, Z. Shi, Perception granular computing in visual haze-free task, Expert Systems with Applications, 41 (2014) 2729-2741. [37] B. Huang, Y.-l. Zhuang, H.-x. Li, Information granulation and uncertainty measures in interval-valued intuitionistic fuzzy information systems, European Journal of Operational Research, 231 (2013) 162-170. [38] F. Seifi, H. Ahmadi, M.
Kangavari, Twins Decision Tree Classification: A Sophisticated Approach to Decision Tree Construction, in: ICCSA, 2007, pp. 337-341. [39] A. Skowron, J. Bazan, M. Wojnarski, Interactive rough-granular computing in pattern recognition, in: Pattern Recognition and Machine Intelligence, Springer, 2009, pp. 92-97. [40] Y. Yao, L. Zhao, A measurement theory view on the granularity of partitions, Information Sciences, 213 (2012) 113. [41] T. Yang, Q. Li, B. Zhou, Related family: A new method for attribute reduction of covering information systems, Information Sciences, 228 (2013) 175-191. [42] A. Skowron, J. Stepaniuk, R. Swiniarski, Modeling rough granular computing based on approximation spaces, Information Sciences, 184 (2012) 20-43. [43] A.V. Nandedkar, An Interactive Colour Video Segmentation: A Granular Computing Approach, in: Electrical Engineering and Intelligent Systems, Springer, 2013, pp. 135-146. [44] X. Yang, J. Yang, Neighborhood System and Rough Set in Incomplete Information System, in: Incomplete Information System and Rough Set Theory, Springer, 2012, pp. 101-130. [45] H.S. Bhatt, S. Bharadwaj, R. Singh, M. Vatsa, Recognizing surgically altered face images using multiobjective evolutionary algorithm, Information Forensics and Security, IEEE Transactions on, 8 (2013) 89-100. [46] S.K. Pal, S.K. Meher, S. Dutta, Class-dependent rough-fuzzy granular space, dispersion index and classification, Pattern Recognition, 45 (2012) 2690-2707. [47] A. Rosenfeld, L.S. Davis, Iterative histogram modification, in, IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC 345 E 47TH ST, NEW YORK, NY 10017-2394, 1978, pp. 300-302. [48] M.K. Kundu, S.K. Pal, Thresholding for edge detection using human psychovisual phenomena, Pattern Recognition Letters, 4 (1986) 433-441. [49] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Kdd, 1996, pp. 226-231.
25
[50] T. Velmurugan, Performance based analysis between k-Means and Fuzzy C-Means clustering algorithms for connection oriented telecommunication data, Applied Soft Computing, 19 (2014) 134-146. [51] Y.W. Lim, S.U. Lee, On the color image segmentation algorithm based on the thresholding and the fuzzy c-means techniques, Pattern Recognition, 23 (1990) 935-952. [52] R. Wieringa, N. Maiden, N. Mead, C. Rolland, Requirements engineering paper classification and evaluation criteria: a proposal and a discussion, Requirements Engineering, 11 (2006) 102-107. [53] P. Grzegorzewski, Fuzzy number approximation via shadowed sets, Information Sciences, 225 (2013) 35-46. [54] P. Hońko, Relational pattern updating, Information Sciences, 189 (2012) 208-218. [55] L. Wang, X. Liu, W. Qiu, Nearness approximation space based on axiomatic fuzzy sets, International Journal of Approximate Reasoning, 53 (2012) 200-211. [56] A.M. San Roque, C. Maté, J. Arroyo, Á. Sarabia, iMLP: Applying multi-layer perceptrons to interval-valued data, Neural Processing Letters, 25 (2007) 157-169. [57] W. Pedrycz, B.-J. Park, S.-K. Oh, The design of granular classifiers: A study in the synergy of interval calculus and fuzzy sets in pattern recognition, Pattern Recognition, 41 (2008) 3720-3735. [58] S. Salehi, A. Selamat, R. Masinchi, H. Fujita, The Synergistic Combination of Particle Swarm Optimization and Fuzzy Sets to Design Granular Classifier, Knowledge-Based Systems, (2015). [59] Y. Yao, Interval sets and interval-set algebras, in: Cognitive Informatics, 2009. ICCI'09. 8th IEEE International Conference on, IEEE, 2009, pp. 307-314. [60] B.Q. Hu, Three-way decisions space and three-way decisions, Information Sciences, (2014). [61] W. Pedrycz, R. Al-Hmouz, A. Morfeq, A.S. Balamash, Building granular fuzzy decision support systems, Knowledge-Based Systems, 58 (2014) 3-10. [62] P. Hońko, Rough-Granular Computing Based Relational Data Mining, in: Advances on Computational Intelligence, Springer, 2012, pp. 290-299. [63] M. Ye, X. Wu, X. Hu, D. Hu, Anonymizing classification data using rough set theory, Knowledge-Based Systems, 43 (2013) 82-94. [64] C. Liu, D. Miao, J. Qian, On multi-granulation covering rough sets, International Journal of Approximate Reasoning, 55 (2014) 1404-1418. [65] Y. Qian, H. Zhang, Y. Sang, J. Liang, Multigranulation decision-theoretic rough sets, International Journal of Approximate Reasoning, 55 (2014) 225-237. [66] W. Pedrycz, G. Vukovich, Granular computing with shadowed sets, International Journal of Intelligent Systems, 17 (2002) 173-197. [67] S.-B. Roh, W. Pedrycz, T.-C. Ahn, A design of granular fuzzy classifier, Expert Systems with Applications, (2014). [68] W. Pedrycz, B. Russo, G. Succi, Knowledge transfer in system modeling and its realization through an optimal allocation of information granularity, Applied Soft Computing, 12 (2012) 1985-1995. [69] W. Wei, J. Liang, Y. Qian, A comparative study of rough sets for hybrid data, Information Sciences, 190 (2012) 116. [70] J. Liang, F. Wang, C. Dang, Y. Qian, An efficient rough feature selection algorithm with a multi-granulation view, International Journal of Approximate Reasoning, 53 (2012) 912-926. [71] G. Lin, Y. Qian, J. Li, NMGRS: Neighborhood-based multigranulation rough sets, International Journal of Approximate Reasoning, 53 (2012) 1080-1093. [72] W. Zhu, F.-Y. Wang, The fourth type of covering-based rough sets, Information Sciences, 201 (2012) 80-92. [73] J.W. Escobar, R. Linfati, P. 
Toth, A two-phase hybrid heuristic algorithm for the capacitated location-routing problem, Computers & Operations Research, 40 (2013) 70-79. [74] U. Maulik, S. Bandyopadhyay, Genetic algorithm-based clustering technique, Pattern recognition, 33 (2000) 14551465. [75] S. Salehi, A. Selamat, M. Bostanian, Enhanced genetic algorithm for spam detection in email, in: Software Engineering and Service Science (ICSESS), 2011 IEEE 2nd International Conference on, IEEE, 2011, pp. 594-597.
Appendix A: k-means

A.1. Wisconsin datasets

Table 8. Result of k-means method for the Wisconsin training datasets (mean±standard deviation)

MinPoints | EPS | Hyperbox No | Coverage (%) | Misclassification (%)
2 | 19.3 | 332.933±5.1699 | 97.1356±0.6791 | 0.00719±0.0017
2 | 20.8 | 356.933±3.6599 | 98.6599±0.5401 | 0.00336±0.0013
2 | 23.4 | 387.2±1.90438 | 99.8157±0.1111 | 0.22951±0.1384
3 | 19.3 | 291.866±6.7415 | 96.3986±0.6783 | 0.00904±0.0017
3 | 20.8 | 325.733±4.1063 | 98.7604±0.3722 | 0.00310±0.3487
3 | 23.4 | 374.133±4.3949 | 99.7822±0.1553 | 0.18794±0.1531
4 | 19.3 | 254.8±7.53834 | 95.7118±1.0559 | 0.01077±0.0026
4 | 20.8 | 298.666±6.7197 | 97.8223±0.7146 | 0.00546±0.0017
4 | 23.4 | 360.2±4.08574 | 99.7654±0.1708 | 0.16716±0.1558
Table 9. Result of k-means method for the Wisconsin testing datasets (mean±standard deviation)

MinPoints | EPS | Coverage (%) | Misclassification (%) | Accuracy (%)
2 | 19.3 | 97.1149±0.8665 | 0.01686±0.005 | 73.2163±2.7858
2 | 20.8 | 98.9083±0.7951 | 0.00637±0.0046 | 83.5477±3.4023
2 | 23.4 | 99.6101±0.4088 | 0.00227±0.0023 | 94.3469±1.7038
3 | 19.3 | 95.3605±1.4711 | 0.02712±0.0086 | 67.9531±3.8507
3 | 20.8 | 98.1286±0.9356 | 0.01093±0.0054 | 81.2085±3.1228
3 | 23.4 | 99.6881±0.4198 | 0.00181±0.0024 | 93.4112±1.4866
4 | 19.3 | 94.8147±1.4773 | 0.03031±0.0086 | 61.3644±4.3104
4 | 20.8 | 98.2456±0.8804 | 0.01025±0.0051 | 76.9979±3.0845
4 | 23.4 | 99.6881±0.3615 | 0.00181±0.0021 | 92.0857±2.1016
A.2. Glass

A.2.1. Float

Table 10. Result of k-means method for the float vs. not-float category of the Glass training datasets (mean±standard deviation)

MinPoints | EPS | Hyperbox No | Coverage (%) | Misclassification (%)
2 | 0.007 | 133.866±1.5434 | 90.9171±1.0358 | 0.06095±0.0069
2 | 0.001 | 136.066±1.7688 | 91.9015±1.1086 | 0.05434±0.0074
2 | 0.1 | 142.133±1.7838 | 96.1520±1.2121 | 0.02582±0.0081
2 | 0.9 | 146.866±1.1469 | 99.0603±0.41 | 0.00629±0.0027
3 | 0.007 | 131.4±3.46025 | 89.6643±2.1476 | 0.06936±0.0144
3 | 0.001 | 133.866±2.8015 | 90.7829±1.8086 | 0.06185±0.0121
3 | 0.1 | 140.533±1.6679 | 95.3019±1.2253 | 0.03152±0.0082
3 | 0.9 | 145.733±1.1813 | 99.2393±0.4818 | 0.00509±0.0032
4 | 0.007 | 128.866±1.7838 | 88.8142±1.1921 | 0.07506±0.008
4 | 0.001 | 131.133±2.3342 | 89.5749±1.4261 | 0.06996±0.0095
4 | 0.1 | 139.266±1.611 | 94.6308±0.9169 | 0.03603±0.0061
4 | 0.9 | 143.933±1.1234 | 99.0603±0.4777 | 0.00629±0.0032
Table 11. Result of k-means method for the float vs. not-float category of the Glass testing datasets (mean±standard deviation)

MinPoints | EPS | Coverage (%) | Misclassification (%) | Accuracy (%)
2 | 0.007 | 88.0208±3.5477 | 0.18717±0.0554 | 73.125±6.01647
2 | 0.001 | 89.8958±2.4782 | 0.15787±0.0387 | 82.3958±3.9308
2 | 0.1 | 95.9375±2.898 | 0.06347±0.0452 | 94.375±3.64434
2 | 0.9 | 99.0625±0.9547 | 0.01464±0.0149 | 98.4375±1.7116
3 | 0.007 | 88.2291±3.6916 | 0.18391±0.0576 | 73.6458±5.3784
3 | 0.001 | 90.4166±3.4169 | 0.14973±0.0533 | 81.7708±3.3104
3 | 0.1 | 93.6458±2.5811 | 0.09928±0.0403 | 92.9166±2.6679
3 | 0.9 | 98.6458±1.1219 | 0.02115±0.0175 | 97.9166±1.7738
4 | 0.007 | 87.8125±2.6269 | 0.19042±0.041 | 74.0625±3.9446
4 | 0.001 | 90.7291±2.9901 | 0.14485±0.0467 | 82.3958±3.1492
4 | 0.1 | 93.6458±2.0937 | 0.09928±0.0327 | 92.9166±1.9654
4 | 0.9 | 99.0625±1.1121 | 0.01464±0.0173 | 97.9166±1.9487
A.2.2. Non-float

Table 12. Result of k-means method for the non-float vs. not-non-float category of the Glass training datasets (mean±standard deviation)

MinPoints | EPS | Hyperbox No | Coverage (%) | Misclassification (%)
2 | 0.007 | 127.2±3.05941 | 87.9641±1.5 | 0.08077±0.01
2 | 0.001 | 132.866±1.7461 | 90.4697±1.2055 | 0.06395±0.008
2 | 0.1 | 142.0±1.82574 | 95.7941±1.2843 | 0.02822±0.0086
2 | 0.9 | 146.533±0.4988 | 98.9708±0.3348 | 0.0069±0.00224
3 | 0.007 | 121.866±2.5525 | 85.8612±1.4387 | 0.09488±0.0096
3 | 0.001 | 129.0±2.42212 | 89.6643±2.0765 | 0.06936±0.0139
3 | 0.1 | 140.066±1.9136 | 94.8545±1.1667 | 0.03452±0.0078
3 | 0.9 | 146.866±0.6182 | 99.0156±0.3348 | 0.00659±0.0022
4 | 0.007 | 121.666±2.6749 | 86.7113±1.2055 | 0.08918±0.008
4 | 0.001 | 125.6±2.6783 | 87.9194±1.7672 | 0.08107±0.0118
4 | 0.1 | 138.8±1.75878 | 94.4965±1.0454 | 0.03693±0.007
4 | 0.9 | 146.0±1.36626 | 98.8813±0.5293 | 0.00750±0.0035
Table 13. Result of k-means method for the non-float vs. not-non-float category of the Glass testing datasets (mean±standard deviation)

MinPoints | EPS | Coverage (%) | Misclassification (%) | Accuracy (%)
2 | 0.007 | 86.0416±3.3495 | 0.21809±0.0523 | 74.7916±4.5237
2 | 0.001 | 88.2291±3.9473 | 0.18391±0.0616 | 81.1458±4.2491
2 | 0.1 | 95.1041±2.0465 | 0.07649±0.0319 | 92.8125±1.875
2 | 0.9 | 99.2708±0.7795 | 0.01139±0.0121 | 96.7708±1.559
3 | 0.007 | 85.3125±2.5387 | 0.22948±0.0396 | 76.4583±2.5811
3 | 0.001 | 87.1875±4.0824 | 0.20018±0.0637 | 81.4583±5.4086
3 | 0.1 | 94.6875±2.4738 | 0.08300±0.0386 | 93.0208±2.7282
3 | 0.9 | 99.1666±0.7795 | 0.01301±0.0121 | 96.3541±2.3292
4 | 0.007 | 82.9166±2.9901 | 0.26692±0.0467 | 72.9166±4.8635
4 | 0.001 | 87.9166±3.3978 | 0.18879±0.053 | 80.625±4.375
4 | 0.1 | 94.0625±2.434 | 0.09276±0.038 | 92.8125±2.4071
4 | 0.9 | 98.9583±1.2325 | 0.01627±0.0192 | 96.6666±2.2726
A.3. Ecoli

A.3.1. Cytoplasm

Table 14. Result of k-means method for the cytoplasm vs. not-cytoplasm category of the Ecoli training datasets (mean±standard deviation)

MinPoints | EPS | Hyperbox No | Coverage (%) | Misclassification (%)
2 | 0.5 | 230.466±1.1469 | 98.9503±0.4071 | 0.00446±0.0017
2 | 0.6 | 231.6±1.35646 | 99.1205±0.3283 | 0.00374±0.0013
2 | 0.8 | 233.733±0.4422 | 99.6595±0.2304 | 0.00144±0.803
3 | 0.5 | 228.733±1.1234 | 98.6949±0.5027 | 0.00555±0.0021
3 | 0.6 | 230.933±1.8061 | 99.3474±0.3763 | 0.00277±0.0016
3 | 0.8 | 233.6±0.87939 | 99.7446±0.2085 | 0.00108±0.8671
4 | 0.5 | 225.933±1.5691 | 98.2978±0.712 | 0.00723±0.003
4 | 0.6 | 229.066±1.7688 | 99.3758±0.3055 | 0.00265±0.0012
4 | 0.8 | 232.466±1.2036 | 99.6595±0.2779 | 0.00144±0.0011
Table 15. Result of k-means method for the cytoplasm vs. not-cytoplasm category of the Ecoli testing datasets (mean±standard deviation)

MinPoints | EPS | Coverage (%) | Misclassification (%) | Accuracy (%)
2 | 0.5 | 98.5478±1.0140 | 0.01437±0.0100 | 75.0494±2.2922
2 | 0.6 | 99.6039±0.6049 | 0.00392±0.0059 | 81.1880±4.2470
2 | 0.8 | 99.3399±0.4667 | 0.00653±0.0046 | 89.4389±1.7955
3 | 0.5 | 98.1518±0.9473 | 0.01829±0.0093 | 73.4653±3.0074
3 | 0.6 | 99.1419±0.7109 | 0.00849±0.007 | 80.6600±3.6489
3 | 0.8 | 99.2739±0.764 | 0.00718±0.0075 | 87.3267±2.7102
4 | 0.5 | 98.2178±1.2099 | 0.01764±0.0119 | 74.9174±4.7643
4 | 0.6 | 99.1419±0.8756 | 0.00849±0.0086 | 81.7161±3.149
4 | 0.8 | 99.5379±0.6121 | 0.00457±0.006 | 88.0527±4.3604
A.3.2. Inner membrane

Table 16. Result of k-means method for the Inner membrane vs. not-Inner-membrane category of the Ecoli training datasets (mean±standard deviation)

MinPoints | EPS | Hyperbox No | Coverage (%) | Misclassification (%)
2 | 0.5 | 230.066±1.7307 | 98.7801±0.4071 | 0.00518±0.0017
2 | 0.6 | 231.2±1.64113 | 99.3191±0.5106 | 0.00289±0.0021
2 | 0.8 | 231.866±1.2578 | 99.5176±0.3055 | 0.00205±0.0012
3 | 0.5 | 226.866±1.7461 | 98.6098±0.5027 | 0.00591±0.0021
3 | 0.6 | 229.266±1.6519 | 99.2339±0.4712 | 0.00325±0.002
3 | 0.8 | 233.266±0.9285 | 99.6027±0.244 | 0.00168±0.001
4 | 0.5 | 223.4±1.92527 | 98.2978±0.3806 | 0.00723±0.0016
4 | 0.6 | 227.466±1.5434 | 99.3191±0.4339 | 0.00289±0.0018
4 | 0.8 | 230.6±1.58324 | 99.6311±0.3055 | 0.00156±0.0012
Table 17. Result of k-means method for the Inner membrane vs. not-Inner-membrane category of the Ecoli testing datasets (mean±standard deviation)

MinPoints | EPS | Coverage (%) | Misclassification (%) | Accuracy (%)
2 | 0.5 | 98.7458±0.9878 | 0.01241±0.0097 | 79.0758±6.4861
2 | 0.6 | 99.2739±0.6731 | 0.00718±0.0066 | 84.4884±5.4069
2 | 0.8 | 99.6699±0.5903 | 0.00326±0.0058 | 94.8514±2.0831
3 | 0.5 | 98.3498±1.2348 | 0.01633±0.0122 | 77.4257±6.2952
3 | 0.6 | 99.2079±0.6467 | 0.00784±0.0064 | 83.8283±6.6434
3 | 0.8 | 99.7359±0.4378 | 0.00261±0.0043 | 94.0594±1.6567
4 | 0.5 | 97.7557±1.375 | 0.02221±0.0136 | 68.3168±11.478
4 | 0.6 | 98.6798±1.1807 | 0.01306±0.0116 | 85.2805±3.291
4 | 0.8 | 99.4719±0.6121 | 0.00522±0.006 | 93.4653±4.3654
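For orientation, the k-means runs summarized in Tables 8-17 could in principle be reproduced with any standard implementation. The following is a minimal sketch assuming scikit-learn and a NumPy array X holding one of the UCI datasets; the hyperbox (granule) construction driven by the MinPoints and EPS parameters, and the coverage and misclassification measures defined in the body of the paper, are specific to the granular classifier and are not reproduced here. The number of repeated runs is an illustrative choice.

```python
# Minimal k-means sketch (assumes scikit-learn; X is an (n_samples, n_features)
# NumPy array holding a UCI dataset such as Wisconsin).
import numpy as np
from sklearn.cluster import KMeans

def kmeans_centres(X, n_clusters, n_runs=15, seed=0):
    """Repeat k-means n_runs times (an illustrative choice) and return the
    centres of each run, so per-run statistics such as the mean and standard
    deviation reported in the tables can be computed over repetitions."""
    all_centres = []
    for r in range(n_runs):
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed + r)
        km.fit(X)
        all_centres.append(km.cluster_centers_)
    return all_centres
```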
Appendix B: DBSCAN

B.1. Wisconsin

Table 18. Result of DBSCAN method for the Wisconsin training datasets (mean±standard deviation)

MinPoints | EPS | Hyperbox No | Coverage (%) | Misclassification (%)
2 | 2.1 | 333.933±6.2765 | 97.0183±0.6347 | 0.00748±0.0015
2 | 2.15 | 365.6±3.73809 | 99.0954±0.3646 | 0.02309±0.0774
2 | 2.2 | 374.8±2.97097 | 99.7654±0.1441 | 0.20881±0.1472
3 | 2.1 | 295.8±6.76461 | 96.3986±0.8033 | 0.00904±0.002
3 | 2.15 | 337.333±7.309 | 98.9781±0.3239 | 0.02338±0.0773
3 | 2.2 | 385.6±1.81842 | 99.7152±0.1553 | 0.25058±0.1247
4 | 2.1 | 260.733±7.5583 | 95.5443±0.6186 | 0.01118±0.0015
4 | 2.15 | 311.133±8.539 | 98.7101±0.2404 | 0.00323±0.0281
4 | 2.2 | 360.666±4.6856 | 99.6984±0.188 | 0.18815±0.1528
Table 19. Result of DBSCAN method for the Wisconsin testing datasets (mean±standard deviation)

MinPoints | EPS | Coverage (%) | Misclassification (%) | Accuracy (%)
2 | 2.1 | 97.0370±1.3415 | 0.01731±0.0078 | 74.5418±3.9075
2 | 2.15 | 99.2202±0.5912 | 0.00455±0.0034 | 87.1734±2.3534
2 | 2.2 | 99.7660±0.2864 | 0.00136±0.0016 | 94.113±1.7925
3 | 2.1 | 95.9843±1.1669 | 0.02347±0.0068 | 68.8888±3.8507
3 | 2.15 | 99.0253±0.6974 | 0.00569±0.004 | 84.0935±2.3411
3 | 2.2 | 99.922±0.1987 | 0.54666±0.0011 | 94.1130±1.6187
4 | 2.1 | 95.5165±1.4271 | 0.02620±0.0083 | 63.4307±4.731
4 | 2.15 | 98.8693±0.7539 | 0.00660±0.0044 | 80.6627±1.9155
4 | 2.2 | 99.6881±0.2917 | 0.00181±0.0017 | 92.7094±0.9275
B.2. Glass

B.2.1. Float

Table 20. Result of DBSCAN method for the float vs. not-float category of the Glass training datasets (mean±standard deviation)

MinPoints | EPS | Hyperbox No | Coverage (%) | Misclassification (%)
2 | 0.3 | 134.466±2.3342 | 91.3198±1.8577 | 0.05825±0.0124
2 | 0.32 | 135.2±1.75878 | 91.4988±1.0004 | 0.05705±0.0067
2 | 0.8 | 144.733±1.2364 | 98.2102±0.8003 | 0.01200±0.0053
2 | 0.95 | 146.666±0.7888 | 98.8366±0.2967 | 0.00779±0.0019
3 | 0.3 | 134.2±2.8095 | 91.3646±1.6415 | 0.05795±0.011
3 | 0.32 | 136.6±2.15406 | 92.2147±1.2686 | 0.05224±0.0085
3 | 0.8 | 145.4±1.35646 | 98.6129±0.9944 | 0.00930±0.0066
3 | 0.95 | 145.4±1.58324 | 99.1051±0.4692 | 0.00599±0.0031
4 | 0.3 | 131.266±3.0652 | 89.9328±1.4498 | 0.06756±0.0097
4 | 0.32 | 132.0±2.68328 | 89.7538±1.9676 | 0.06876±0.0132
4 | 0.8 | 142.0±1.78885 | 97.3153±1.2967 | 0.01801±0.0087
4 | 0.95 | 143.533±1.0873 | 98.9261±0.3287 | 0.00719±0.0022
Table 21. Result of DBSCAN method for the float vs. not-float category of the Glass testing datasets (mean±standard deviation)

MinPoints | EPS | Coverage (%) | Misclassification (%) | Accuracy (%)
2 | 0.3 | 89.2708±3.2207 | 0.16763±0.0503 | 75.4166±4.0931
2 | 0.32 | 88.9583±3.5843 | 0.17252±0.056 | 81.3541±3.972
2 | 0.8 | 98.2291±1.6002 | 0.02766±0.025 | 96.875±3.01903
2 | 0.95 | 99.5833±0.6909 | 0.0065±0.0107 | 98.6458±1.8807
3 | 0.3 | 89.7916±4.1874 | 0.15950±0.0654 | 74.8958±5.5159
3 | 0.32 | 89.4791±3.3978 | 0.16438±0.053 | 82.8125±3.1766
3 | 0.8 | 97.8125±1.5934 | 0.03417±0.0248 | 96.0416±2.9017
3 | 0.95 | 98.9583±1.0925 | 0.01627±0.017 | 97.6041±2.543
4 | 0.3 | 88.8541±3.6473 | 0.17414±0.0569 | 76.4583±4.3625
4 | 0.32 | 90.2083±4.4365 | 0.15299±0.0693 | 83.125±5.07752
4 | 0.8 | 96.25±2.78341 | 0.05858±0.0434 | 94.1666±3.4923
4 | 0.95 | 99.375±0.76546 | 0.00976±0.0119 | 97.9166±2.5301
B.2.2. Non-float

Table 22. Result of DBSCAN method for the non-float vs. not-non-float category of the Glass training datasets (mean±standard deviation)

MinPoints | EPS | Hyperbox No | Coverage (%) | Misclassification (%)
2 | 0.24 | 124.266±2.4073 | 86.8903±2.1062 | 0.08797±0.0141
2 | 0.3 | 134.4±1.6248 | 91.4988±1.1667 | 0.05705±0.0078
2 | 0.8 | 145.933±0.9977 | 98.6129±0.713 | 0.00930±0.0047
2 | 1.0 | 147.8±0.83266 | 99.5525±0.3164 | 0.00299±0.0021
3 | 0.24 | 119.933±2.5157 | 85.4138±1.7061 | 0.09788±0.0114
3 | 0.3 | 131.6±2.9844 | 90.3802±1.7096 | 0.06455±0.0114
3 | 0.8 | 144.266±1.9136 | 97.8075±1.2367 | 0.01470±0.0083
3 | 1.0 | 148.333±0.4714 | 99.5525±0.3164 | 0.00299±0.0021
4 | 0.24 | 118.333±3.0037 | 84.6084±1.6705 | 0.10329±0.0112
4 | 0.3 | 129.0±1.96638 | 89.3064±1.4387 | 0.07176±0.0096
4 | 0.8 | 143.533±1.4544 | 97.9417±1.2121 | 0.01380±0.0081
4 | 1.0 | 148.333±0.5962 | 99.6420±0.3348 | 0.0024±0.00224
Table 23. Result of DBSCAN method for the non-float vs. not-non-float category of the Glass testing datasets (mean±standard deviation)

MinPoints | EPS | Coverage (%) | Misclassification (%) | Accuracy (%)
2 | 0.24 | 84.0625±3.083 | 0.24901±0.0481 | 72.7083±4.4512
2 | 0.3 | 89.1666±3.8045 | 0.16926±0.0594 | 84.8958±4.1405
2 | 0.8 | 98.3333±1.4508 | 0.02603±0.0226 | 95.0±2.14997
2 | 1.0 | 99.4791±0.7365 | 0.00813±0.0115 | 98.75±1.02062
3 | 0.24 | 86.1458±3.0118 | 0.21646±0.047 | 73.6458±4.8029
3 | 0.3 | 87.7083±3.7355 | 0.19205±0.0583 | 83.6458±4.5237
3 | 0.8 | 97.8125±1.5934 | 0.03417±0.0248 | 94.1666±2.0144
3 | 1.0 | 99.4791±0.7365 | 0.00813±0.0115 | 98.6458±1.1219
4 | 0.24 | 83.125±3.70634 | 0.26366±0.0579 | 71.0416±4.6654
4 | 0.3 | 88.3333±2.9017 | 0.18228±0.0453 | 83.75±4.93446
4 | 0.8 | 97.7083±2.0465 | 0.03580±0.0319 | 94.6875±2.2678
4 | 1.0 | 99.2708±0.7795 | 0.01139±0.0121 | 98.6458±0.7795
B.3. Ecoli

B.3.1. Cytoplasm

Table 24. Result of DBSCAN method for the cytoplasm vs. not-cytoplasm category of the Ecoli training datasets (mean±standard deviation)

MinPoints | EPS | Hyperbox No | Coverage (%) | Misclassification (%)
2 | 0.8 | 229.0±1.63299 | 98.7233±0.5602 | 0.00543±0.0023
2 | 0.9 | 232.066±0.4422 | 99.2340±0.2304 | 0.00325±0.803
2 | 0.95 | 232.133±0.8844 | 99.4325±0.2537 | 0.00241±0.001
3 | 0.8 | 224.8±2.13541 | 98.0708±0.6925 | 0.00820±0.0029
3 | 0.9 | 231.866±1.3597 | 99.4609±0.3283 | 0.00229±0.0013
3 | 0.95 | 233.866±0.8055 | 99.7162±0.2006 | 0.00120±0.5324
4 | 0.8 | 221.8±1.46969 | 97.7304±0.6531 | 0.00965±0.0027
4 | 0.9 | 230.2±1.51437 | 99.4893±0.3543 | 0.00217±0.0015
4 | 0.95 | 233.2±1.16619 | 99.6027±0.2893 | 0.00168±0.0012
Table 25. Result of DBSCAN method for the cytoplasm vs. not-cytoplasm category of the Ecoli testing datasets (mean±standard deviation)

MinPoints | EPS | Coverage (%) | Misclassification (%) | Accuracy (%)
2 | 0.8 | 98.0198±1.3527 | 0.01960±0.0133 | 71.5511±3.9515
2 | 0.9 | 99.7359±0.764 | 0.00261±0.0075 | 83.6303±4.2295
2 | 0.95 | 99.6699±0.4667 | 0.00326±0.0046 | 89.3729±2.6187
3 | 0.8 | 97.0297±1.7338 | 0.02940±0.0171 | 68.6468±4.3932
3 | 0.9 | 99.1419±0.6121 | 0.00849±0.006 | 82.5742±3.2698
3 | 0.95 | 99.3399±0.5903 | 0.00653±0.0058 | 89.1749±2.9695
4 | 0.8 | 96.6996±1.6032 | 0.03267±0.0158 | 70.0989±3.197
4 | 0.9 | 98.8118±0.8244 | 0.01176±0.0081 | 82.1121±3.4764
4 | 0.95 | 99.7359±0.4378 | 0.00261±0.0043 | 88.8448±2.4908
B.3.2. Inner membrane

Table 26. Result of DBSCAN method for the Inner membrane vs. not-Inner-membrane category of the Ecoli training datasets (mean±standard deviation)

MinPoints | EPS | Hyperbox No | Coverage (%) | Misclassification (%)
2 | 0.8 | 227.866±1.7838 | 98.6098±0.611 | 0.00591±0.0025
2 | 0.9 | 231.8±1.10754 | 99.3474±0.263 | 0.00277±0.0011
2 | 0.95 | 233.333±0.5962 | 99.6027±0.244 | 0.00168±0.001
3 | 0.8 | 223.066±1.7307 | 97.9857±0.6493 | 0.00856±0.0027
3 | 0.9 | 230.133±1.5434 | 99.3474±0.4071 | 0.00277±0.0017
3 | 0.95 | 232.333±1.2995 | 99.6595±0.2304 | 0.00144±0.803
4 | 0.8 | 219.266±1.9136 | 97.2481±0.6189 | 0.01170±0.0026
4 | 0.9 | 228.466±1.6275 | 99.2339±0.3543 | 0.00325±0.0015
4 | 0.95 | 230.866±1.2578 | 99.6311±0.3055 | 0.00156±0.0012
Table 27. Result of DBSCAN method for the Inner membrane vs. not-Inner-membrane category of the Ecoli testing datasets (mean±standard deviation)

MinPoints | EPS | Coverage (%) | Misclassification (%) | Accuracy (%)
2 | 0.8 | 97.4917±1.44 | 0.02482±0.0142 | 69.3729±8.9836
2 | 0.9 | 99.4059±0.7047 | 0.00588±0.0069 | 88.0527±7.9657
2 | 0.95 | 99.6039±0.485 | 0.00392±0.0048 | 94.3234±2.028
3 | 0.8 | 97.6237±1.437 | 0.02352±0.0142 | 64.6864±11.735
3 | 0.9 | 99.1419±1.0765 | 0.00849±0.0106 | 86.3366±7.2324
3 | 0.95 | 99.4719±0.6121 | 0.00522±0.006 | 94.1914±2.9117
4 | 0.8 | 96.8976±1.5281 | 0.03071±0.0151 | 61.9141±10.192
4 | 0.9 | 99.3399±0.5903 | 0.00653±0.0058 | 86.9966±6.6846
4 | 0.95 | 99.6039±0.6049 | 0.00392±0.0059 | 94.0594±2.6812
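Unlike the other algorithms, the EPS and MinPoints columns of Tables 18-27 correspond directly to the two parameters of DBSCAN [49]. A minimal sketch, assuming scikit-learn and the same array X as above (the coverage, misclassification and accuracy computations on top of the clusters are again omitted):

```python
# Minimal DBSCAN sketch (assumes scikit-learn; X as above). EPS and MinPoints
# in Tables 18-27 map onto the eps and min_samples parameters.
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_granules(X, eps, min_points):
    """Cluster X with DBSCAN and return one point set per density granule."""
    labels = DBSCAN(eps=eps, min_samples=min_points).fit_predict(X)
    # Label -1 marks noise, i.e. points not covered by any density granule.
    return [X[labels == k] for k in sorted(set(labels)) if k != -1]

# Example: one Wisconsin setting from Table 18 (EPS=2.1, MinPoints=2).
# granules = dbscan_granules(X, eps=2.1, min_points=2)
```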
Appendix C: c-means

C.1. Wisconsin

Table 28. Result of c-means method for the Wisconsin training datasets (mean±standard deviation)

MinPoints | EPS | Hyperbox No | Coverage (%) | Misclassification (%)
2 | 18.0 | 298.8±3.96988 | 93.3667±0.7212 | 0.01666±0.0018
2 | 20.0 | 346.4±5.23832 | 98.1406±0.7212 | 0.00466±0.0018
2 | 22.0 | 374.8±2.85657 | 99.5476±0.188 | 0.12607±0.1526
3 | 18.0 | 250.0±5.51361 | 91.2562±1.1392 | 0.02196±0.0028
3 | 20.0 | 307.6±5.64269 | 98.1406±0.3408 | 0.00466±0.5929
3 | 22.0 | 353.6±3.00665 | 99.3969±0.3408 | 0.12644±0.1522
4 | 18.0 | 200.4±14.7594 | 86.9346±1.3853 | 0.03282±0.0034
4 | 20.0 | 275.6±8.11418 | 97.0351±0.5816 | 0.00744±0.0014
4 | 22.0 | 330.8±1.72046 | 99.0451±0.4322 | 0.00239±0.001
Table 29. Result of c-means method for the Wisconsin testing datasets (mean±standard deviation)

MinPoints | EPS | Coverage (%) | Misclassification (%) | Accuracy (%)
2 | 18.0 | 91.228±1.22668 | 0.05129±0.0071 | 58.4794±4.6343
2 | 20.0 | 96.4912±1.6121 | 0.02050±0.0094 | 78.3624±2.8649
2 | 22.0 | 99.6491±0.2864 | 0.00204±0.0016 | 91.8128±1.1695
3 | 18.0 | 85.0292±0.4678 | 0.08754±0.0027 | 47.3683±2.9356
3 | 20.0 | 97.1929±1.1339 | 0.01640±0.0066 | 74.152±3.72435
3 | 22.0 | 99.5321±0.4376 | 0.00273±0.0025 | 88.1870±1.7504
4 | 18.0 | 84.5613±2.2375 | 0.09027±0.013 | 42.5730±2.7027
4 | 20.0 | 96.9590±0.8594 | 0.01777±0.005 | 71.4619±1.936
4 | 22.0 | 99.6491±0.4678 | 0.00204±0.0027 | 87.8362±2.0724
C.2. Glass

C.2.1. Float

Table 30. Result of c-means method for the float vs. not-float category of the Glass training datasets (mean±standard deviation)

MinPoints | EPS | Hyperbox No | Coverage (%) | Misclassification (%)
2 | 0.006 | 131.2±2.61278 | 89.7538±1.4387 | 0.06876±0.0096
2 | 0.01 | 134.2±2.63818 | 90.7829±1.6705 | 0.06185±0.0112
2 | 0.04 | 137.733±1.8427 | 92.7068±1.3615 | 0.04894±0.0091
3 | 0.006 | 129.333±1.8135 | 88.9037±1.4879 | 0.07446±0.0099
3 | 0.01 | 132.2±2.2271 | 90.0223±1.2936 | 0.06695±0.0086
3 | 0.04 | 136.466±1.4996 | 91.7225±1.1922 | 0.05554±0.008
4 | 0.006 | 127.666±2.0221 | 88.7247±1.4947 | 0.07566±0.01
4 | 0.01 | 132.066±1.9136 | 90.2907±1.0358 | 0.06515±0.0069
4 | 0.04 | 135.666±1.3984 | 91.0514±0.9385 | 0.06005±0.0062
Table 31. Result of c-means method for the float vs. not-float category of the Glass testing datasets (mean±standard deviation)

MinPoints | EPS | Coverage (%) | Misclassification (%) | Accuracy (%)
2 | 0.006 | 89.5833±3.1595 | 0.16275±0.0493 | 74.8958±4.6514
2 | 0.01 | 90.3125±2.8067 | 0.15136±0.0438 | 82.2916±3.4548
2 | 0.04 | 92.5±2.62698 | 0.11718±0.041 | 92.0833±2.5173
3 | 0.006 | 87.3958±2.4517 | 0.19693±0.0383 | 72.2916±3.8045
3 | 0.01 | 90.3125±2.6269 | 0.15136±0.041 | 81.7708±2.8905
3 | 0.04 | 91.1458±2.1092 | 0.13834±0.0329 | 90.7291±2.17
4 | 0.006 | 88.3333±2.7282 | 0.18228±0.0426 | 72.6041±4.6304
4 | 0.01 | 88.9583±2.6434 | 0.17252±0.0413 | 82.3958±4.8902
4 | 0.04 | 91.1458±2.185 | 0.13834±0.0341 | 90.625±1.97642
C.2.2. Non-float

Table 32. Result of c-means method for the non-float vs. not-non-float category of the Glass training datasets (mean±standard deviation)

MinPoints | EPS | Hyperbox No | Coverage (%) | Misclassification (%)
2 | 0.005 | 130.0±2.73252 | 88.7247±1.9327 | 0.07566±0.0129
2 | 0.01 | 134.2±1.75878 | 90.6487±1.1871 | 0.06275±0.0079
2 | 0.1 | 139.866±2.1561 | 94.6755±1.3075 | 0.03572±0.0087
3 | 0.005 | 124.333±3.1763 | 86.8008±1.8284 | 0.08858±0.0122
3 | 0.01 | 132.133±3.3638 | 90.2907±1.943 | 0.06515±0.013
3 | 0.1 | 139.133±1.7074 | 94.4965±0.9863 | 0.03693±0.0066
4 | 0.005 | 122.933±2.8859 | 86.9798±1.9574 | 0.08737±0.0131
4 | 0.01 | 130.666±2.3285 | 89.8433±1.1459 | 0.06816±0.0076
4 | 0.1 | 142.466±1.2578 | 96.0625±1.0064 | 0.02642±0.0067
Table 33. Result of c-means method for the non-float vs. not-non-float category of the Glass testing datasets (mean±standard deviation)

MinPoints | EPS | Coverage (%) | Misclassification (%) | Accuracy (%)
2 | 0.005 | 85.9375±3.9114 | 0.21972±0.0611 | 78.3333±4.7002
2 | 0.01 | 91.25±2.95363 | 0.13671±0.0461 | 88.75±3.57217
2 | 0.1 | 94.0625±2.2243 | 0.09276±0.0347 | 93.4375±2.3662
3 | 0.005 | 86.1458±4.2644 | 0.21646±0.0666 | 75.5208±4.7621
3 | 0.01 | 88.9583±4.5452 | 0.17252±0.071 | 87.0833±5.1811
3 | 0.1 | 93.9583±2.1998 | 0.09439±0.0343 | 92.8125±2.2678
4 | 0.005 | 84.6875±4.8479 | 0.23925±0.0757 | 76.0416±5.7527
4 | 0.01 | 89.7916±2.8451 | 0.15950±0.0444 | 88.5416±3.001
4 | 0.1 | 95.1041±2.4116 | 0.07649±0.0376 | 92.6041±2.5173
C.3. Ecoli

C.3.1. Cytoplasm

Table 34. Result of c-means method for the cytoplasm vs. not-cytoplasm category of the Ecoli training datasets (mean±standard deviation)

MinPoints | EPS | Hyperbox No | Coverage (%) | Misclassification (%)
2 | 0.55 | 231.333±1.1352 | 99.2056±0.3763 | 0.00337±0.0016
2 | 0.7 | 233.2±0.83266 | 99.3758±0.263 | 0.00265±0.0011
2 | 0.8 | 232.333±0.6992 | 99.6027±0.244 | 0.00168±0.001
3 | 0.55 | 229.8±1.90438 | 98.9219±0.6189 | 0.00458±0.0026
3 | 0.7 | 232.066±0.8537 | 99.4609±0.3284 | 0.00229±0.0013
3 | 0.8 | 233.6±0.7118 | 99.6311±0.2123 | 0.00156±0.0298
4 | 0.55 | 228.2±1.93907 | 98.7233±0.7288 | 0.00542±0.003
4 | 0.7 | 230.933±1.0624 | 99.4609±0.3633 | 0.00229±0.0015
4 | 0.8 | 233.0±1.03279 | 99.6311±0.1446 | 0.00156±0.1528
Table 35. Result of c-means method for the cytoplasm vs. not-cytoplasm category of the Ecoli testing datasets (mean±standard deviation)

MinPoints | EPS | Coverage (%) | Misclassification (%) | Accuracy (%)
2 | 0.55 | 98.4818±0.7975 | 0.01502±0.0078 | 79.3399±3.3108
2 | 0.7 | 99.2739±0.4378 | 0.00718±0.0043 | 84.1584±2.632
2 | 0.8 | 99.5379±0.6121 | 0.00457±0.006 | 90.0989±2.5047
3 | 0.55 | 98.2838±1.2242 | 0.01698±0.0121 | 77.8217±5.2739
3 | 0.7 | 99.2739±0.764 | 0.00718±0.0075 | 84.8844±2.6681
3 | 0.8 | 99.6039±0.485 | 0.00392±0.0048 | 89.1749±3.1406
4 | 0.55 | 98.8778±0.7975 | 0.01110±0.0078 | 78.6798±4.5854
4 | 0.7 | 99.2079±0.8244 | 0.00784±0.0081 | 83.3003±3.2309
4 | 0.8 | 99.5379±0.4939 | 0.00457±0.0048 | 88.1188±3.4677
C.3.2. Inner membrane

Table 36. Result of c-means method for the Inner membrane vs. not-Inner-membrane category of the Ecoli training datasets (mean±standard deviation)

MinPoints | EPS | Hyperbox No | Coverage (%) | Misclassification (%)
2 | 0.45 | 229.133±1.4996 | 98.7801±0.4358 | 0.00518±0.0018
2 | 0.6 | 231.066±1.2892 | 99.2623±0.2892 | 0.00313±0.0012
2 | 0.7 | 232.0±1.63299 | 0.76595±1.1955 | 0.42227±0.005
3 | 0.45 | 224.333±1.7384 | 98.2127±0.6628 | 0.00760±0.0028
3 | 0.6 | 229.666±1.8856 | 99.3191±0.3028 | 0.00289±0.0012
3 | 0.7 | 231.0±1.26491 | 99.3758±0.263 | 0.00265±0.0011
4 | 0.45 | 221.2±2.31516 | 97.6737±0.9434 | 0.00989±0.004
4 | 0.6 | 226.733±1.5691 | 99.1772±0.3632 | 0.00349±0.0015
4 | 0.7 | 229.2±1.55777 | 99.4609±0.3951 | 0.00229±0.0016
Table 37. Result of c-means method for the Inner membrane vs. not-Inner-membrane category of the Ecoli testing datasets (mean±standard deviation)

MinPoints | EPS | Coverage (%) | Misclassification (%) | Accuracy (%)
2 | 0.45 | 97.6237±1.2934 | 0.02352±0.0128 | 74.1253±6.0481
2 | 0.6 | 99.2739±0.5678 | 0.01176±0.0119 | 86.6006±3.1902
2 | 0.7 | 99.4719±0.6121 | 0.00522±0.006 | 90.759±5.8816
3 | 0.45 | 97.6237±1.4818 | 0.02352±0.0146 | 69.7689±8.1233
3 | 0.6 | 99.0759±0.764 | 0.00914±0.0075 | 85.6105±3.8578
3 | 0.7 | 99.4719±0.6121 | 0.00522±0.006 | 89.4389±5.2846
4 | 0.45 | 96.9636±1.6355 | 0.03005±0.0161 | 67.7227±7.2559
4 | 0.6 | 99.0099±0.723 | 0.00980±0.0071 | 82.9042±8.9326
4 | 0.7 | 99.2079±0.8244 | 0.00784±0.0081 | 89.8349±5.3191
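Since fuzzy c-means assigns graded memberships rather than crisp labels, the granules behind Tables 28-37 are derived from a membership matrix. The following is a hand-rolled NumPy sketch of the standard c-means update equations, included only as an illustration of the algorithm; it is not the exact implementation used for the experiments, and the fuzzifier m, iteration budget and tolerance below are illustrative choices.

```python
# Minimal fuzzy c-means sketch (pure NumPy; X is (n_samples, n_features)).
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.random((c, len(X)))
    u /= u.sum(axis=0)                      # membership columns sum to 1
    for _ in range(n_iter):
        um = u ** m
        # Prototype update: membership-weighted means of the data.
        centres = um @ X / um.sum(axis=1, keepdims=True)
        # Distances of every sample to every prototype (small epsilon
        # avoids division by zero when a sample coincides with a centre).
        d = np.linalg.norm(X[None, :, :] - centres[:, None, :], axis=2) + 1e-12
        # Membership update: u_ik proportional to d_ik^(-2/(m-1)).
        u_new = d ** (-2.0 / (m - 1.0))
        u_new /= u_new.sum(axis=0)
        if np.abs(u_new - u).max() < tol:
            u = u_new
            break
        u = u_new
    return centres, u
```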
Appendix D: GAk-means

D.1. Wisconsin

Table 38. Result of GAk-means method for the Wisconsin training datasets (mean±standard deviation)

MinPoints | EPS | Hyperbox No | Coverage (%) | Misclassification (%)
2 | 22.0 | 380.4±1.85472 | 92.7135±1.042 | 0.01830±0.0026
2 | 24.0 | 389.8±1.16619 | 96.6331±0.6666 | 0.00845±0.0016
2 | 30.0 | 397.0±0.63245 | 99.0451±0.4871 | 0.00239±0.0012
3 | 22.0 | 360.8±5.6 | 91.1557±1.2652 | 0.02221±0.0031
3 | 24.0 | 380.2±2.13541 | 95.4773±0.6926 | 0.01135±0.0017
3 | 30.0 | 397.0±0.63245 | 99.0954±0.3408 | 0.00226±0.5457
4 | 22.0 | 337.8±3.24961 | 91.1557±1.7363 | 0.02221±0.0043
4 | 24.0 | 372.2±4.44522 | 95.9798±0.8557 | 0.01009±0.0021
4 | 30.0 | 396.2±0.74833 | 98.7436±0.3178 | 0.00315±0.9689
Table 39. Result of GAk-means method for the Wisconsin testing datasets (mean±standard deviation)

MinPoints | EPS | Coverage (%) | Misclassification (%) | Accuracy (%)
2 | 22.0 | 93.3332±2.3274 | 0.03897±0.0136 | 85.3800±3.7718
2 | 24.0 | 96.1403±0.9501 | 0.02256±0.0055 | 91.2279±1.654
2 | 30.0 | 99.1812±0.5963 | 0.00477±0.0034 | 98.7134±0.4376
3 | 22.0 | 92.9824±1.6121 | 0.04103±0.0094 | 83.9765±2.5784
3 | 24.0 | 97.7777±1.2487 | 0.01298±0.0073 | 93.8011±2.1117
3 | 30.0 | 99.0643±0.4678 | 0.00546±0.0027 | 97.4268±0.7932
4 | 22.0 | 93.8011±1.1459 | 0.03624±0.0067 | 84.0935±1.936
4 | 24.0 | 96.0233±1.5865 | 0.02324±0.0092 | 91.2280±2.667
4 | 30.0 | 99.2982±0.2339 | 0.00409±0.0013 | 98.3625±0.7758
D.2. Glass

D.2.1. Float

Table 40. Result of GAk-means method for the float vs. not-float category of the Glass training datasets (mean±standard deviation)

MinPoints | EPS | Hyperbox No | Coverage (%) | Misclassification (%)
2 | 0.15 | 139.266±1.2892 | 94.4518±1.1354 | 0.03723±0.0076
2 | 0.3 | 141.533±1.1469 | 97.2706±0.9944 | 0.01831±0.0066
2 | 0.6 | 144.533±1.3097 | 99.5078±0.6232 | 0.00330±0.0041
3 | 0.15 | 136.133±2.2764 | 92.1699±1.7615 | 0.05254±0.0118
3 | 0.3 | 138.6±2.05912 | 96.1968±1.289 | 0.02551±0.0086
3 | 0.6 | 145.0±1.46059 | 99.4630±0.4384 | 0.00359±0.0029
4 | 0.15 | 135.6±2.55081 | 91.6778±1.8306 | 0.05584±0.0122
4 | 0.3 | 137.933±2.0154 | 95.0782±1.379 | 0.03302±0.0092
4 | 0.6 | 142.4±1.30639 | 99.3288±0.6003 | 0.00450±0.004
Table 41. Result of GAk-means method for the float vs. not-float category of the Glass testing datasets (mean±standard deviation)

MinPoints | EPS | Coverage (%) | Misclassification (%) | Accuracy (%)
2 | 0.15 | 93.6458±3.3978 | 0.09927±0.053 | 93.125±3.31701
2 | 0.3 | 97.7083±2.1245 | 0.03580±0.0331 | 96.4583±2.7043
2 | 0.6 | 99.6875±0.625 | 0.00488±0.0097 | 98.75±0.84625
3 | 0.15 | 90.8333±3.3689 | 0.14322±0.0526 | 90.625±3.60843
3 | 0.3 | 95.7291±2.17 | 0.06672±0.0339 | 94.0625±2.8067
3 | 0.6 | 99.5833±0.896 | 0.00650±0.0139 | 98.5416±1.6601
4 | 0.15 | 91.0416±2.7043 | 0.13996±0.0422 | 90.4166±3.118
4 | 0.3 | 93.9583±1.8807 | 0.09439±0.0293 | 92.6041±2.8792
4 | 0.6 | 99.1666±1.2586 | 0.01301±0.0196 | 97.7083±1.6989
D.2.2. Non-float

Table 42. Result of GAk-means method for the non-float vs. not-non-float category of the Glass training datasets (mean±standard deviation)

MinPoints | EPS | Hyperbox No | Coverage (%) | Misclassification (%)
2 | 0.1 | 134.4±1.81842 | 90.2013±1.1701 | 0.06575±0.0078
2 | 0.3 | 142.533±1.586 | 97.3601±1.3075 | 0.01771±0.0087
2 | 0.6 | 144.066±1.2364 | 99.3735±0.754 | 0.00420±0.005
3 | 0.1 | 133.4±2.18479 | 89.6643±1.4663 | 0.06936±0.0098
3 | 0.3 | 141.866±1.586 | 97.0916±1.3121 | 0.01951±0.0088
3 | 0.6 | 143.333±1.3498 | 99.5078±0.713 | 0.00330±0.0047
4 | 0.1 | 133.2±2.10396 | 90.0223±1.6596 | 0.06695±0.0111
4 | 0.3 | 138.333±1.1352 | 95.1677±1.0738 | 0.03242±0.0072
4 | 0.6 | 143.733±1.436 | 99.6867±0.4819 | 0.0021±0.00323
Table 43. Result of GAk-means method for the non-float vs. not-non-float category of the Glass testing datasets (mean±standard deviation)

MinPoints | EPS | Coverage (%) | Misclassification (%) | Accuracy (%)
2 | 0.1 | 90.7291±2.3844 | 0.14485±0.0372 | 90.625±2.48694
2 | 0.3 | 97.2916±1.7554 | 0.04231±0.0274 | 95.7291±3.4453
2 | 0.6 | 99.4791±0.9316 | 0.00813±0.0145 | 98.3333±0.896
3 | 0.1 | 89.375±3.75 | 0.16601±0.0585 | 89.1666±3.5843
3 | 0.3 | 95.7291±2.8221 | 0.06672±0.044 | 93.8541±2.4517
3 | 0.6 | 99.4791±0.7365 | 0.00813±0.0115 | 97.6041±1.6989
4 | 0.1 | 89.1666±2.8221 | 0.16926±0.044 | 88.2291±3.4169
4 | 0.3 | 94.7916±2.1092 | 0.08137±0.0329 | 93.2291±2.185
4 | 0.6 | 99.1666±1.3819 | 0.01301±0.0215 | 97.3958±2.398
D.3. Ecoli

D.3.1. Cytoplasm

Table 44. Result of GAk-means method for the cytoplasm vs. not-cytoplasm category of the Ecoli training datasets (mean±standard deviation)

MinPoints | EPS | Hyperbox No | Coverage (%) | Misclassification (%)
2 | 0.3 | 225.266±1.12348 | 95.8864±0.87438 | 0.01749±0.00372
2 | 0.4 | 228.333±1.49071 | 97.5886±0.78721 | 0.01025±0.00334
2 | 0.6 | 232.8±0.74833 | 98.9219±0.488 | 0.00458±0.002
3 | 0.3 | 223.6±2.47116 | 95.8864±1.2591 | 0.01749±0.0053
3 | 0.4 | 223.8±1.10754 | 96.9644±0.6381 | 0.01291±0.0027
3 | 0.6 | 230.533±1.1469 | 98.3545±0.5573 | 0.00699±0.0023
4 | 0.3 | 218.266±2.3513 | 94.1843±1.095 | 0.02474±0.0046
4 | 0.4 | 223.333±2.0221 | 96.7659±0.7086 | 0.01375±0.003
4 | 0.6 | 230.266±1.1813 | 98.8084±0.6628 | 0.00506±0.0028
Table 45. Result of GAk-means method for the cytoplasm vs. not-cytoplasm category of the Ecoli testing datasets (mean±standard deviation)

MinPoints | EPS | Coverage (%) | Misclassification (%) | Accuracy (%)
2 | 0.3 | 95.1815±1.34627 | 0.04770±0.01333 | 47.6567±4.9023
2 | 0.4 | 96.7656±1.82446 | 0.03201±0.01806 | 56.4356±4.27772
2 | 0.6 | 98.6138±0.792 | 0.01372±0.0078 | 67.2606±5.6295
3 | 0.3 | 94.3894±2.2481 | 0.05554±0.0222 | 47.1946±4.5107
3 | 0.4 | 96.7656±1.0519 | 0.03201±0.0104 | 54.1253±3.9383
3 | 0.6 | 98.9438±0.9193 | 0.01045±0.009 | 66.9306±5.7709
4 | 0.3 | 92.7392±2.3055 | 0.07188±0.0228 | 44.8844±4.1484
4 | 0.4 | 95.9735±2.1225 | 0.03986±0.021 | 55.1814±3.935
4 | 0.6 | 98.4818±1.1917 | 0.01502±0.0117 | 66.0065±4.6251
D.3.2. Inner membrane

Table 46. Result of GAk-means method for the Inner membrane vs. not-Inner-membrane category of the Ecoli training datasets (mean±standard deviation)

MinPoints | EPS | Hyperbox No | Coverage (%) | Misclassification (%)
2 | 0.5 | 229.0±1.41421 | 97.8723±0.7923 | 0.00904±0.0033
2 | 0.6 | 230.466±1.5434 | 97.9006±0.6855 | 0.00892±0.0029
2 | 0.7 | 232.733±0.6798 | 98.9503±0.5991 | 0.00446±0.0025
3 | 0.5 | 227.266±1.0624 | 97.3332±0.8143 | 0.01134±0.0034
3 | 0.6 | 230.2±1.46969 | 97.8439±0.8716 | 0.00917±0.0037
3 | 0.7 | 228.066±1.6918 | 98.0141±0.7559 | 0.00844±0.0032
4 | 0.5 | 224.533±1.5434 | 96.5389±0.9561 | 0.01472±0.004
4 | 0.6 | 228.533±1.3097 | 97.7871±0.52 | 0.00941±0.0022
4 | 0.7 | 230.4±1.14309 | 98.4113±0.2893 | 0.00675±0.0012
Table 47. Result of GAk-means method for the Inner membrane vs. not-Inner-membrane category of the Ecoli testing datasets (mean±standard deviation)

MinPoints | EPS | Coverage (%) | Misclassification (%) | Accuracy (%)
2 | 0.5 | 98.0198±1.3035 | 0.01960±0.0129 | 47.7887±5.4884
2 | 0.6 | 98.3498±0.781 | 0.01633±0.0077 | 57.8217±8.2742
2 | 0.7 | 99.0099±0.8855 | 0.0098±0.0087 | 64.9504±5.7596
3 | 0.5 | 96.2376±1.4551 | 0.03724±0.0144 | 46.2045±5.7807
3 | 0.6 | 97.2277±1.3623 | 0.02744±0.0134 | 57.5577±5.4574
3 | 0.7 | 97.2277±1.4095 | 0.02744±0.0139 | 63.3663±8.8925
4 | 0.5 | 95.7095±2.4431 | 0.04247±0.0241 | 46.1385±7.1011
4 | 0.6 | 97.0297±1.8076 | 0.02940±0.0178 | 56.6996±5.8235
4 | 0.7 | 98.4818±1.1917 | 0.01502±0.0117 | 62.9702±7.9685
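GAk-means follows the genetic clustering scheme of Maulik and Bandyopadhyay [74], in which each chromosome encodes k candidate centroids and the fitness rewards low within-cluster scatter. The following is a toy sketch under those assumptions; the population size, mutation rate and scale, and the absence of elitism are illustrative choices, not the settings used to produce Tables 38-47.

```python
# Toy GA-clustering sketch in the spirit of [74]: centroid-encoded
# chromosomes with an SSE-based fitness (pure NumPy; requires k >= 2).
import numpy as np

def sse(X, centres):
    """Sum of squared... rather, summed distances of samples to their
    nearest centroid (within-cluster scatter used in the fitness)."""
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return d.min(axis=1).sum()

def ga_kmeans(X, k, pop_size=20, gens=50, pm=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n, f = X.shape
    # Initial population: each chromosome is k centroids drawn from the data.
    pop = X[rng.integers(0, n, (pop_size, k))]
    for _ in range(gens):
        fit = np.array([1.0 / (1.0 + sse(X, ch)) for ch in pop])
        # Fitness-proportional (roulette-wheel) selection.
        parents = pop[rng.choice(pop_size, pop_size, p=fit / fit.sum())]
        # One-point crossover at a centroid boundary.
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):
            cut = rng.integers(1, k)
            children[i, cut:], children[i + 1, cut:] = (
                parents[i + 1, cut:].copy(), parents[i, cut:].copy())
        # Gaussian mutation on a small fraction of centroid coordinates.
        mask = rng.random(children.shape) < pm
        pop = children + mask * rng.normal(0.0, 0.1, children.shape)
    best = pop[np.argmax([1.0 / (1.0 + sse(X, ch)) for ch in pop])]
    return best
```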