RESEARCH ARTICLE
An improved DBSCAN algorithm based on cell-like P systems with promoters and inhibitors Yuzhen Zhao, Xiyu Liu*, Xiufeng Li College of Business, Shandong Normal University, Jinan, 250014, China *
[email protected]
a1111111111 a1111111111 a1111111111 a1111111111 a1111111111
OPEN ACCESS Citation: Zhao Y, Liu X, Li X (2018) An improved DBSCAN algorithm based on cell-like P systems with promoters and inhibitors. PLoS ONE 13(12): e0200751. https://doi.org/10.1371/journal. pone.0200751 Editor: Manabu Sakakibara, Tokai University, JAPAN
Abstract Density-based spatial clustering of applications with noise (DBSCAN) algorithm can find clusters of arbitrary shape, while the noise points can be removed. Membrane computing is a novel research branch of bio-inspired computing, which seeks to discover new computational models/framework from biological cells. The obtained parallel and distributed computing models are usually called P systems. In this work, DBSCAN algorithm is improved by using parallel evolution mechanism and hierarchical membrane structure in cell-like P systems with promoters and inhibitors, where promoters and inhibitors are utilized to regulate parallelism of objects evolution. Experiment results show that the proposed algorithm performs well in big cluster analysis. The time complexity is improved to O(n), in comparison with conventional DBSCAN of O(n2). The results give some hints to improve conventional algorithms by using the hierarchical framework and parallel evolution mechanism in membrane computing models.
Received: November 22, 2017 Accepted: May 3, 2018 Published: December 17, 2018 Copyright: This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication. Data Availability Statement: The Iris database is available from the UC Irvine Machine Learning Repository: http://archive.ics.uci.edu/ml. Funding: This work was supported by National Natural Science Foundation of China (No.61806114 to Y. Zhao, No. 61472231 to X. Liu, No. 61876101 to X. Liu, No. 61602282, No. 61402187, No. 61502283, No. 61802234, No.61703251, http:// www.nsfc.gov.cn/), China Postdoctoral Science Foundation (No.2018M642695 to Y. Zhao, http://jj. chinapostdoctor.org.cn/V1/Program3/Default. aspx) and the Natural Science Foundation of
1 Introduction Cluster analysis is the process of partitioning dataset into several clusters, with intra-cluster data being similar, and inter-cluster data being dissimilar. Cluster analysis is widely used in the fields of business intelligence [1, 2], Web search [3, 4], security [5, 6], biology [7, 8] and so on [9, 10] to discover implicit pattern or knowledge. As one subfield of data mining, cluster analysis can also be used as a stand-alone tool to obtain the data distribution, observe the characteristics of each cluster, deeply analyse special clusters, compress data (a cluster obtained by cluster analysis can be seen as a group) and so on. Further more, it can also be used as a preprocessing step for other algorithms, that is, these algorithms operate on the clusters or selected attributes [11]. The density-based spatial clustering of applications with noise (short for DBSCAN) algorithm is known as a density-based clustering algorithm, which clusters data points with large enough density [12] and achieves many significant improvements [13–20]. DBSCAN algorithm can recognize clusters of arbitrary shape, even the oval clusters and the “s” shape clusters, further more, the noise points can be removed from clusters. However, for big data
PLOS ONE | https://doi.org/10.1371/journal.pone.0200751 December 17, 2018
1 / 17
An improved DBSCAN algorithm based on cell-like P systems with promoters and inhibitors
Shandong Province (No. ZR2016AQ21, http:// www.sdnsf.gov.cn/portal/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.
processing, particular for big data cluster analysis, the improvement on computational efficiency of DBSCAN is needed. Cell-like P systems with promoters and inhibitors are abstracted based on the structure and function of the living cell, which have three main components, the membrane structure, multisets of objects evolving in a synchronous maximally parallel manner, and evolution rules. Objects in P systems evolve in a maximum parallel mechanism, regulated by promoters and inhibitors, such that the systems perform an efficient computation [21]. Therefore, cell-like P systems with promoters and inhibitors are a kind of suitable tool to improve the computational efficiency of DBSCAN. In this work, DBSCAN algorithm is improved by using parallel evolution mechanism and hierarchical structure in cell-like P systems with promoters and inhibitors. As a result, a so called DBSCAN-CPPI algorithm is obtained. Specifically, core objects from the dataset are parallel detected and regulated by a set of promoters and inhibitors. As well, n + 1 membranes are used to store the detected results, and a specific output membrane is used to output the clustering result. Experimental results based on Iris database of UC Irvine Machine Learning Repository [22] and the banana database show that the proposed algorithm performs well in data clustering, which achieves accuracy 81.33% (as well as the conventional DBSCAN), while the cost of time is reduced from O(n2) to O(n).
2 Preliminaries In this section, some basic concepts and notions in DBSCAN and cell-like P systems with promoters and inhibitors are recalled [12, 23].
2.1 The DBSCAN algorithm Density-based spatial clustering of applications with noise, shortly known as DBSCAN, is a density-based clustering algorithm, which clusters data points having large enough density. neighborhood: The neighborhood of an object is the space within the radius ( > 0) centered at this object. Core object: An object q is a core object if the number of objects in its neighborhood is greater than or equal to the threshold MinPts. Directly density-reachable: An object p is directly density-reachable from a core object q if and only if object p is in the neighborhood of object q. Density-reachable: Object p is density-reachable from object q if and only if there is a sequence p1, p2, . . ., pn such that p1 = q, pn = p, and each pi+1 is directly density-reachable from pi. noise: An object is a noise point if it does not belong to any cluster of the dataset. The general procedure of DBSCAN is as follows. Input: the dataset containing n objects, the neighborhood radius , the density threshold MinPts Step 1. All objects in the dataset are marked as “unvisited”. Step 2. An unvisited object p is chosen randomly, the mark of this object p is changed to “visited”, and the number of objects in the neighborhood of p is counted to check whether p is a core object. If p is not a core object, it is marked as a noise point; otherwise, a new cluster C is built and the object p is added to this cluster. The objects, which are in the neighborhood of p and do not belong to other clusters, are added to this cluster, too. Step 3. For each unvisited object p0 in cluster C, if p0 is unvisited, the mark of p0 is changed to “visited”, and the number of objects in the neighborhood of p is counted to check whether
PLOS ONE | https://doi.org/10.1371/journal.pone.0200751 December 17, 2018
2 / 17
An improved DBSCAN algorithm based on cell-like P systems with promoters and inhibitors
p0 is a core object. If p0 is a core object, objects, which are in the neighborhood of p0 and do not belong to other cluster, are added to this cluster C. Step 4. Steps 2 and 3 are repeated until all objects are visited. Output: the clustering result Since the dissimilarity is measured by the distance between two objects, the algorithm can be applied to various types of objects.
2.2 Cell-like P systems with promoters and inhibitors Biological systems, such as cells, tissues, and human brains, have deep computational intelligences. Biologically inspired computing, or bio-inspired computing in short, focuses on abstracting computing ideas from biological systems to construct computing models and algorithms [24–29]. Membrane computing is a novel research branch of bio-inspired computing, initiated by Gh. Păun in 2002, which seeks to discover new computational models from the study of biological cells, particularly of the cellular membranes [23, 30]. The obtained models are distributed and parallel bio-inspired computing devices, usually called P systems. There are three mainly investigated P systems, cell-like P systems [23], tissue P systems [31], and neural-like P systems [32] (and their variants, see e.g. [33–40]). It has been proved that many P systems are universal, that is, they are able to do what a Turing machine can do efficiently [41–46]. The parallel evolution mechanism of variants of P systems has been found to perform well in doing computation, even solving computational hard problems [47–51]. A cell-like P system with promoters and inhibitors consists of three main components: the hierarchical membrane structure, objects and evolution rules. By membranes, a cell-like P system with promoters and inhibitors is divided into separated regions. Objects (information carriers) and evolution rules (by which objects can evolve to new objects) present in these regions. Objects are represented by symbols from an alphabet or strings of symbols. Evolution rules are executed in a non-deterministic and maximally parallel way in each membrane. The definition of a cell-like P system with promoters and inhibitors is as follows. P ¼ ðO; m; w1 ; . . . ; wm ; R1 ; . . . ; Rm ; r; iout Þ; where – O is the alphabet which includes all objects of the system. – μ is a rooted tree (the membrane structure). – wi describes the initial objects in membrane i, symbol λ denotes the empty string, and it shows that there is no object in membrane i. – Ri is the set of rules in membrane i with the form of uα ! v, where u is a string composed of objects in O, and v is a string over {ahere, aout, ainj|a 2 O, 1 j t} (ahere means object a remains in membrane i in which here can be omitted; aout means object a goes into the outer layer membrane, and ainj means object a goes into the inner layer membrane j), α 2 {z, ¬z0 } is a promoter or an inhibitor. A rule can be executed only when promoter z appears and cannot be executed when inhibitor z0 appears. – ρ defines the partial order relationship of the rules, i.e., higher priority rule means the rule should be executed with higher priority. – iout is the membrane where the computation result is placed. In the system, rules are executed in non-deterministic maximally parallel manner in each membrane. That is, at any step, if more than one rule can be executed but the objects in the membrane can only support some of them, a maximal number of rules will be executed. Each
PLOS ONE | https://doi.org/10.1371/journal.pone.0200751 December 17, 2018
3 / 17
An improved DBSCAN algorithm based on cell-like P systems with promoters and inhibitors
P system contains a global clock as the timer, and the execution time of one rule is set to a time unit. The computation halts if no rule can be executed in the whole system. The computational results are represented by the types and numbers of specified objects in a specified membrane. Because objects in a P system evolve in maximally parallel, the system computes very efficiently. For more details one can refer to [23].
3 The improved DBSCAN algorithm based on cell-like P systems with promoters and inhibitors In this section, the DBSCAN algorithm is improved by using parallel evolution mechanism and hierarchical membrane structure in cell-like P systems promoters and inhibitors, where promoters and inhibitors are utilized to regulate parallelism of objects evolution. The obtained algorithm is shortly called DBSCAN-CPPI. Before introducing DBSCAN-CPPI, two matrices, called the distance matrix and dissimilarity matrix, are defined. Assume the dataset with n objects is X = {x1, x2, , xn}, and Euclidean distance is used to define their dissimilarity. The distance matrix D0nn between any two objects is defined as follows. 0 0 1 f11 f120 f1n0 B 0 C B f21 f220 f2n0 C C; D0nn ¼ B ð1Þ B C @ A fn10 fn20 fnn0 where fij0 is the distance between xi and xj. The dissimilarity matrix, denoted by Dnn, can be obtained from the distance matrix D0nn . If all elements in D0nn are integers, Dnn ¼ D0nn ; otherwise, the element fij of matrix Dnn is obtained by multiplying fij0 for 100 times and rounding off, thus getting a natural number. The dissimilarity matrix Dnn is as follows. 0 1 f11 f12 f1n B C B f21 f22 f2n C C: Dnn ¼ B ð2Þ B C @ A fn1 fn2 fnn
3.1 The cell-like P system for improving DBSCAN In general, for a clustering problem with n points, the dissimilarity matrix Dnn, a neighborhood radius and a density threshold MinPts, a membrane structure with n + 3 membranes labelled by 0, 1, . . ., n + 2 is used as the framework for DBSCAN-CPPI, which is shown in Fig 1. The dataset of objects to be dealt with is placed in membrane 0. Each point will be determined whether it is a core object or not in a parallel manner, using parallel evolution mechanism in cell-like P systems. The determined results of the n objects are stored in membranes 1, 2, . . ., n, respectively. After that, using maximum parallel mechanism, determined results of the n objects can be read/moved into target membranes by using evolution rules. The clustering result is stored in membrane n + 2. Hence, comparing with conventional DBSCAN algorithm, the time consumption of determining whether an object is a core object can be reduced by reading results in membrane 0.
PLOS ONE | https://doi.org/10.1371/journal.pone.0200751 December 17, 2018
4 / 17
An improved DBSCAN algorithm based on cell-like P systems with promoters and inhibitors
Fig 1. Membrane structure for the improved DBSCAN algorith. https://doi.org/10.1371/journal.pone.0200751.g001
The cell-like P system with promoters and inhibitors for DBSCAN-CPPI is as follows. P ¼ ðO; m; w0 ; w1 ; . . . ; wnþ2 ; R0 ; R1 ; . . . ; Rnþ2 ; r; iout Þ; where; – O = {xi, ai, Wij, W0 ij, bi, cij, Ai, θ, θij, φi, φn+1, E|1 i, j n}; – μ = [0[1]1[2]2. . .[n+2]n+2]0; – w0 = θ, w1 = . . . = wn+2 = λ; – iout = n + 2; – ρ = {ri > rj|i < j}; – R0 is the set of rules in membrane 0: 8 r1 > > > > > > > > > > > > > > > > r2 > > > < r3 > > > > > > > > > r4 > > > > > r5 > > > > > : r6
PLOS ONE | https://doi.org/10.1371/journal.pone.0200751 December 17, 2018
f
f
iðiþ1Þ ¼ fxi ! ai Wi1fi1 . . . Wiðiiði 1Þ1Þ Wiðiþ1Þ . . . Winfin
W 0 i1 . . . W 0 iði
1Þ
W 0 iðiþ1Þ . . . W 0 in j1 i ng
¼ fWij Wij0 ! lj1 i; j ng ¼ fW 0 tij ! bi cij j1 i; j n; 1 t g ¼ fbMinPts bti ! Ai j1 i n; 0 t n i
Ming
¼ fbi ! lj1 i ng ¼ fWij ! lj1 i; j ng
5 / 17
An improved DBSCAN algorithm based on cell-like P systems with promoters and inhibitors
Generally, r1, r2. . ., r6 are used to find all core objects and their neighbors. Initially, x1, x2, . . ., xn are placed into the membrane 0, and the system starts its computation. With xi in membrane 0, r1 generates fij copies of Wij and copies of W0 ij, where is the radius of neighborhood and fij represents the dissimilarity between xi and xj. The value of fij can be computed from Dnn and the value of is set by the user. After the execution of r1, Wij and W0 ij are generated such that r2 can be used. It has the following two cases: • If fij , then after using r2 there are fij − copies of Wij. In this case, the Wij remaining will be consumed in one step with parallel using r6 in membrane 0. It means xj is out of the radius of neighborhood of xi. • If fij < , then after the application of r2 there are − fij copies of Wij0 left in membrane 0. This means xj is in the radius of neighborhood of xi. In this case, r3 is applied to generate bi and cij. Objects bi work as a counter which count the number of points in the neighborhood of xi, and objects cij are used to mark xj is in the neighborhood of xi. The value of MinPts is initially set to define the minimal number of neighbors that a core object should has. If there are more than or equal to MinPts copies of bi in membrane 0, which means the number of neighbors of xi is enough to let it become a core object, then r4 can be used to generate Ai to distinguish the core object xi from the others. If the number of bi is less than MinPts, then xi is not a core object and bi will be consumed by r5. 8 r7 ¼ fyAi ai ! yii ðai Þini j1 i ng > > > > > > > r8 ¼ fðat cjt Þyij :ðAt Þ ! ðat Þini j1 i; j; t ng > > > > < r9 ¼ fðAt at cjt Þyij ! yit ðat Þini j1 i; j; t ng > > > > > > > r10 ¼ fyij ! lj1 i; j n; i 6¼ jg > > > > : r11 ¼ fyii ! yj1 i ng
Rules r7, r8. . ., r11 are used to separate objects to different clusters. Object Ai is chosen arbitrarily as a core object to built a new cluster i. With using r8, its neighbors aj that are not belonging to other clusters are put into membrane i. If there are other core objects in its neighborhood, this process is repeated. When there is no object that belongs to cluster i, another core object Aj is chosen arbitrarily to build another cluster j. Object θ is an auxiliary variable used to control the cycles.
( r ¼ fða Þ ! ða Þ 12 i y i inðnþ1Þ j1 i ng r13 ¼ fy ! ðφ1 bÞin1 ðφ1 bÞin2 . . . ðEφ1 bÞinðnþ1Þ g
The remaining objects are put into membrane n + 1 as noise points by using r12. Objects β and φ1 are placed into membranes 1 to n + 1 accordingly. – R1, R2, . . ., Rn are the sets of rules in membranes 1, 2, . . ., n:
PLOS ONE | https://doi.org/10.1371/journal.pone.0200751 December 17, 2018
6 / 17
An improved DBSCAN algorithm based on cell-like P systems with promoters and inhibitors
Each membrane i, 1 i n, has the following set of rules 8 r14 ¼ fðφi bÞai ! φiþ1 bai j1 i ng > > > < r15 ¼ fðφi Þ:ai ! φiþ1 j1 i ng > > > : r16 ¼ fðφnþ1 bÞ ! ðbÞinðnþ2Þ g Object β is a string and ai in current membrane will be added to the end of string β. Object φi is an auxiliary object used to control the cycles. – Rn+1 is the set of rules in membrane n + 1: 8 r17 ¼ fðEφi bÞai ! Eφiþ1 bai j 1 i ng > > > < r18 ¼ fðφi Þ:ai ! φiþ1 j 1 i ng > > > : r19 ¼ fðEφnþ1 bÞ ! ðEbÞinðnþ2Þ g Object ai in membrane n + 1 is the noise point, and E is added at the beginning of the string. – Rn+2 is the set of rules in membrane n + 2, which is empty. Membrane n + 2 is used to output the final cluster result, which has no rule inside.
3.2 An example An example is used to show how the system works. Four data points (1, 1), (1, 2), (3, 2), (3, 3) are considered. Let = 2 and MinPts = 1. In this example, the square Euclidean distance is chosen as the distance measure. The dissimilarity matrix D44 is as follows. 0 1 0 1 5 8 B C B1 0 4 5C C D44 ¼ B ð3Þ B 5 4 0 1 C: @ A 8 5 1 0 The computational process is shown in Table 1. The four data points are divided into two clusters by the P system.
3.3 Time complexity analysis In this subsection, the time cost in the worst case of DBSCAN-CPPI is analyzed. Initially, 6 steps are needed to find all core objects and their neighbors by using r1 to r6 in a maximal parallel manner. 3 steps are needed to put a core object and its neighbors into the corresponding cluster. In the worst case, the n objects are all core objects. In this case, it needs 3n steps to separate the n objects to different clusters. Subsequently, 2 steps (using r10 and r11) are needed to remove the auxiliary objects, and 2 steps are needed to find the noise points and activate the rules in membranes 1, 2, . . ., n + 1. Till now, the time cost is 6 + 3n + 2 + 1 + 1 = 3n + 10 steps. The rules in membranes 1, 2, . . ., n + 1 are executed in a parallel manner. By using r17 and r18, object ai is added to the string β in its corresponding membrane i, which costs n steps. After that, with using r19, string β is passed into the output membrane n + 2, which costs 1 step. Hence, it needs n + 1 steps to output the result. The time complexity is (3n + 10) + (n + 1) = 4n + 11, which is O(n).
PLOS ONE | https://doi.org/10.1371/journal.pone.0200751 December 17, 2018
7 / 17
An improved DBSCAN algorithm based on cell-like P systems with promoters and inhibitors
Table 1. The computational process of the example. step
membrane 0
membrane 1
membrane 3
0
θ, x1, x2, x3, x4(r1)
1
5 8 y; a1 ; W12 ; W13 ; W14 ; 0 0 0 W12 2; W13 2; W14 2; 4 5 a2 ; W21 ; W23 ; W24 ; 0 0 0 W21 2; W23 2; W24 2; 5 4 a3 ; W31 ; W32 ; W34 ; 0 0 0 W31 2; W32 2; W34 2; 8 5 a4 ; W41 ; W42 ; W43 ; 0 0 0 W41 2; W42 2; W43 2ðr2 Þ
2
3 6 0 y; a1 ; W13 ; W14 ; W12 ; 2 3 0 a2 ; W23 ; W24 ; W21 ; 3 2 0 a3 ; W31 ; W32 ; W34 ; 6 3 0 a4 ; W41 ; W42 ; W43 ðr3 Þ 3 6 y; a1 ; W13 ; W14 ; b1 ; c12 ; 2 3 a2 ; W23 ; W24 ; b2 ; c21 ; 3 2 a3 ; W31 ; W32 ; b3 ; c34 ; 6 3 a4 ; W41 ; W42 ; b4 ; c43 ðr4 Þ
4
3 6 y; a1 ; W13 ; W14 ; A1 ; c12 ; 2 3 a2 ; W23 ; W24 ; A2 ; c21 ; 3 2 a3 ; W31 ; W32 ; A3 ; c34 ; 6 3 a4 ; W41 ; W42 ; A4 ; c43 ðr6 Þ
5
θ, a1, A1, c12, a2, A2, c21, a3, A3, c34, a4, A4, c43(r7)
6
θ11, c12, a2, A2, c21, a3, A3, c34, a4, A4, c43(r9)
a1
7
θ11, θ12, c21, a3, A3, c34, a4, A4, c43(r10)
a1, a2
8
θ11, c21, a3, A3, c34, a4, A4, c43(r11)
a1, a2
9
θ, c21, a3, A3, c34, a4, A4, c43(r7)
a1, a2
10
θ33, c21, c34, a4, A4, c43(r9)
a1, a2
a3
11
θ33, θ34, c21, c43(r10)
a1, a2
a3, a4
12
θ33, c21, c43(r11)
a1, a2
a3, a4
13
θ, c21, c43(r13)
a1, a2
a3, a4
14
c21, c43
a1, a2, φ1, β(r14)
a3, a4, φ1, β(r15)
15
c21, c43
a1, a2, φ2, βa1(r14)
a3, a4, φ2, β(r15)
16
c21, c43
a1, a2, φ3, βa1a2(r15)
a3, a4, φ3, β(r14)
17
c21, c43
a1, a2, φ4, βa1a2(r15)
a3, a4, φ4, βa3(r14)
18
c21, c43
a1, a2, φ5, βa1a2(r16)
a3, a4, φ5, βa3a4(r16)
19
c21, c43
a1, a2,
a3, a4
membrane 6
βa1a2, βa3a4
https://doi.org/10.1371/journal.pone.0200751.t001
Some comparisons results between DBSCAN-CPPI and the conventional/improved DBSCAN algorithm are shown in Table 2.
4 Experiments and analysis 4.1 Illustrative experiment Take eighteen data points (4, 5), (3.7, 7), (4.5, 8), (4.5, 3), (5, 4), (5, 6), (5.5, 8), (6, 2.8), (6, 4), (6, 5.5), (6.5, 2), (7, 3), (10, 7), (10, 12), (11, 6), (11, 8), (12, 6.5), (12.5, 8) shown in Fig 2 as an example. Let = 5 and MinPts = 5.
PLOS ONE | https://doi.org/10.1371/journal.pone.0200751 December 17, 2018
8 / 17
An improved DBSCAN algorithm based on cell-like P systems with promoters and inhibitors
Table 2. Comparisons results of time complexity of some proposed DBSCAN algorithms. algorithm
time complexity
DBSCAN [12]
O(n2)
Rough-DBSCAN [13]
O(n + k2)
DBSCAN using a pruning technique on bit vectors [14]
O(k n mk + (1 − p) (n − 1) m)
A prototype-based modified DBSCAN [15]
max{O(n T), O(K0 t q m)}
G-DBSCAN [16]
O(n2)
BDE-DBSCAN [17]
O(nlogn)
SS-DBSCAN [18]
O(nlogn)
DBSCAN based on grid cell [19]
O(n + mk2)
DBSCAN with Spark [20]
O(n + Km)
DBSCAN-CPPI
O(n)
https://doi.org/10.1371/journal.pone.0200751.t002
Fig 2. The data points waiting for being clustered. https://doi.org/10.1371/journal.pone.0200751.g002
The conventional DBSCAN algorithm is used to cluster the data points firstly. Two clusters are gained as shown in Fig 3. The proposed DBSCAN-CPPI is tested to cluster the same data points, which obtains the same result as with conventional DBSCAN.
4.2 Applied experiments In this subsection, the Iris database and the banana database are used as experiments. The Iris database. The Iris database of UC Irvine Machine Learning Repository [22] is used to test DBSCAN-CPPI. This database contains 150 records. The 150 records are
PLOS ONE | https://doi.org/10.1371/journal.pone.0200751 December 17, 2018
9 / 17
An improved DBSCAN algorithm based on cell-like P systems with promoters and inhibitors
Fig 3. The two clusters formed by the conventional algorithm. https://doi.org/10.1371/journal.pone.0200751.g003
numbered orderly from 1 to 150. Each record contains four Iris properties values and the corresponding Iris species. All records are divided into three species, data from 1 to 50, data from 51 to 100 and data from 101 to 150, respectively. In the experiments, the value of is set to be 17 and MinPts is with value 5. The proposed DBSCAN-CPPI is tested by clustering the Iris database. The cluster result is shown in Table 3. In this work, the cluster accuracy is defined by the ratio between the number of records which are correctly clustered and the total number of records in the database. The cluster accuracy obtained by the proposed DBSCAN-CPPI is 81.33%, which is as good as the conventional DBSCAN. The banana database. The database consisting of two banana shaped clusters (shown in Fig 4) is used to test DBSCAN-CPPI. Such database contains 1000 records which are numbered from 1 to 1000. Each record contains 2 property values, and all records are separated into clusters, data from 1 to 500 and data from 501 to 1000, respectively. The value of is set to Table 3. The 3 clusters and noise points on Iris database using DBSCAN-CPPI algorith. Cluster
Serial number of data in the corresponding cluster
1
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,43,44,45,46,47,48,49,50
2
51,52,53,54,55,56,57,59,60,62,64,66,67,68,70,72,74,75,76,77,78,79,80,81,82,83,85,86,87,89,90,91,92,93,95,96,97,98,100
3
71,73,84,102,103,104,105,111,112,113,114,116,117,121,122,124,125,126,127,128,129,130,133,134,137,138,139,140,141,142,143,144,145,146,147,148,149,150
Noise points
23,42,58,61,63,65,69,88,94,99,101,106,107,108,109,110, 115,118,119,120,123,131,132,135,136
https://doi.org/10.1371/journal.pone.0200751.t003
PLOS ONE | https://doi.org/10.1371/journal.pone.0200751 December 17, 2018
10 / 17
An improved DBSCAN algorithm based on cell-like P systems with promoters and inhibitors
Fig 4. The banana shaped database. https://doi.org/10.1371/journal.pone.0200751.g004
be 26 and the value of MinPts is 10. The cluster result is shown in Fig 5 (yellow points are noise points, blue points and red points represent the two clusters, respectively) and the accuracy is 87.00% which is as good as the conventional DBSCAN.
4.3 Algorithm analysis In this subsection, the sensitivity and clustering quality of DBSCAN-CPPI, comparing with the classic k-means algorithm are donsidered. Sensitivity analysis. In the initialization of DBSCAN-CPPI, it needs to set the values of and MinPts, which are usually set by experiences. In the following, the relationships between the different values of the two parameters and the accuracy are analyzed. The results are shown in Figs 6 and 7. From Figs 6 and 7, it is found that DBSCAN-CPPI is sensitive to the values of the two parameters. With the simulation results, the best result of the Iris database is obtained when = 17 and MinPts = 3,4,5,6,7. The best result of the banana database is obtained when = 26 and MinPts = 2, 3, . . ., 14. Clustering quality analysis. We compare the clustering quality of DBSCAN-CPPI with k-means algorithm on Iris database. The cluster result of k-means algorithm on Iris database is shown in Table 4 with cluster accuracy 89.33%. In the cluster result by k-means algorithm, thirteen objects, which should be clustered in cluster 3, are placed to cluster 2; two objects belonging to cluster 2 are clustered in cluster 3. While, with DBSCAN-CPPI, no object is clustered in wrong clusters.
PLOS ONE | https://doi.org/10.1371/journal.pone.0200751 December 17, 2018
11 / 17
An improved DBSCAN algorithm based on cell-like P systems with promoters and inhibitors
Fig 5. The 2 clusters and noise points with DBSCAN algorithm. https://doi.org/10.1371/journal.pone.0200751.g005
Fig 6. The cluster accuracy of different parameter values in the Iris database obtained by DBSCAN-CPPI. https://doi.org/10.1371/journal.pone.0200751.g006
PLOS ONE | https://doi.org/10.1371/journal.pone.0200751 December 17, 2018
12 / 17
An improved DBSCAN algorithm based on cell-like P systems with promoters and inhibitors
Fig 7. The cluster accuracy of different parameter values in the banana database obtained by DBSCAN-CPPI. https://doi.org/10.1371/journal.pone.0200751.g007
Table 4. The 3 clusters with k-means algorithm. Cluster
Serial number of data in the corresponding cluster
1
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50
2
51,52,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,102,107, 114,115,120,122,124,127,128,134,139,143,147,150
3
53,78,101,103,104,105,106,108,109,110,111,112,113,116,117,118,119,121,123,125,126,129,130,131,132,133,135,136,137,138,140,141,142,144,145,146,148,149
https://doi.org/10.1371/journal.pone.0200751.t004
The k-means algorithm is also used to deal with banana database. The cluster result is shown in Fig 8 (yellow points are the points being separated to wrong clusters). The cluster accuracy is 75.10%. The accuracy of DBSCAN-CPPI on banana database is 11.9% higher than k-means algorithm accuracy. The k-means algorithm divides the “two bananas” from the middle and more points are misclassified, and DBSCAN-CPPI algorithm sets 124 points as noise points and only 6 points are misclassified.
PLOS ONE | https://doi.org/10.1371/journal.pone.0200751 December 17, 2018
13 / 17
An improved DBSCAN algorithm based on cell-like P systems with promoters and inhibitors
Fig 8. The 2 clusters with k-means algorithm on banana database. https://doi.org/10.1371/journal.pone.0200751.g008
5 Conclusions In this work, an improved DBSCAN algorithm, named DBSCAN-CPPI is proposed by using parallel evolution mechanism and hierarchical membrane structure in cell-like P systems promoters and inhibitors. The time complexity is improved to O(n), in comparison with conventional DBSCAN of O(n2). Experimental results, based on Iris database and banana database, show that 1. DBSCAN-CPPI performs well on these two databases, it can find clusters of arbitrary shape, the cluster results are better especially when the clusters are not spherical-shaped; 2. DBSCAN-CPPI is suitable for big cluster analysis due to the low time complexity. The results give some hints to improve conventional algorithms by using the hierarchical framework and parallel evolution mechanism in membrane computing models. For further research, it is of interests to use neural-like membrane computing models, see e.g. [52–55], to improve DBSCAN algorithm. A possible way is to use the memory mechanism in neural computing models to store some potential cluster results, and then select the best one as computing result. Also, some other algorithms can be improved by using parallel evolution mechanism and hierarchical membrane structure [56, 57].
Author Contributions Writing – original draft: Yuzhen Zhao. Writing – review & editing: Xiyu Liu, Xiufeng Li.
PLOS ONE | https://doi.org/10.1371/journal.pone.0200751 December 17, 2018
14 / 17
An improved DBSCAN algorithm based on cell-like P systems with promoters and inhibitors
References 1.
Kou G, Peng Y, Wang G. Evaluation of clustering algorithms for financial risk analysis using MCDM methods. Information Sciences, 2014, 275(11):1–12. https://doi.org/10.1016/j.ins.2014.02.137
2.
Durante F, Pappadà R, Torelli N. Clustering of financial time series in risky scenarios. Advances in Data Analysis and Classification, 2014, 8(4):359–376. https://doi.org/10.1007/s11634-013-0160-4
3.
Katariya K, Aluvalu R. Agglomerative clustering in web usage mining: a survey. International Journal of Computer Applications, 2014, 89(8):24–27. https://doi.org/10.5120/15523-4306
4.
Chawla S. A novel approach of cluster based optimal ranking of clicked URLs using genetic algorithm for effective personalized web search. Applied Soft Computing, 2016, 46:90–103. https://doi.org/10. 1016/j.asoc.2016.04.042
5.
Ahmed M, Mahmood AN. Novel approach for network traffic pattern analysis using clustering-based collective anomaly detection. Annals of Data Science, 2015, 2(1):1–20. https://doi.org/10.1007/s40745015-0035-y
6.
Rodrı´guez S D, Barletta DA, Wilderjans TF, Bernik DL. Fast and efficient food quality control using electronic noses: adulteration detection achieved by unfolded cluster analysis coupled with time-window selection. Food Analytical Methods, 2014, 7(10):2042–2050. https://doi.org/10.1007/s12161-0149841-7
7.
Gollapalli P, Hanumanthappa M, Pattar S. Cluster analysis of protein-protein interaction network of mycobacterium tuberculosis during host infection. Advances in Bioresearch, 2015, 6(5):38–46.
8.
Li Z, Qiao Z, Zheng W, Ma W. Network cluster analysis of protein-protein interaction network-identified biomarker for type 2 diabetes. Diabetes Technology and Therapeutics, 2015, 17(7):475–481. https:// doi.org/10.1089/dia.2014.0204 PMID: 25879401
9.
Bearth A, Cousin ME, Siegrist M. Poultry consumers’ behaviour, risk perception and knowledge related to campylobacteriosis and domestic food safety. Food Control, 2014, 44:166–176. https://doi.org/10. 1016/j.foodcont.2014.03.055
10.
Selemetas N, Phelan P, O’Kiely P, Waal T. Cluster analysis of fasciolosis in dairy cow herds in Munster province of Ireland and detection of major climatic and environmental predictors of the exposure risk. Geospatial Health, 2015, 9(2):271–279. https://doi.org/10.4081/gh.2015.349 PMID: 25826308
11.
Han J, Kamber M. Data Mining: Concepts and Techniques, Elsevier, Amsterdam, US, 2006.
12.
Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd. 1996, 96(34): 226–231.
13.
Viswanath P, Babu VS. Rough-DBSCAN: A fast hybrid density based clustering method for large data sets. Pattern Recognition Letters, 2009, 30(16):1477–1488. https://doi.org/10.1016/j.patrec.2009.08. 008
14.
Mimaroglu S, Aksehirli E. Improving DBSCAN’s execution time by using a pruning technique on bit vectors. Pattern Recognition Letters, 2011, 32(13):1572–1580. https://doi.org/10.1016/j.patrec.2011.06. 003
15.
Edla DR, Jana PK, Member IS. A prototype-based modified DBSCAN for gene clustering. Procedia Technology, 2012, 6:485–492. https://doi.org/10.1016/j.protcy.2012.10.058
16.
Andrade G, Ramos G, Madeira D, Sachetto R, Ferreira R, Rocha L. G-DBSCAN: A GPU accelerated algorithm for density-based clustering. Procedia Computer Science, 2013, 18:369–378. https://doi.org/ 10.1016/j.procs.2013.05.200
17.
Karami A, Johansson R. Choosing DBSCAN parameters automatically using differential evolution. International Journal of Computer Applications, 2014, 91(7):1–11. https://doi.org/10.5120/15890-5059
18.
Zhang L. Stable saturation density of DBSCAN algorithm. Application Research of Computers, 2014, 31(07):1972–1975.
19.
Liu S, Meng D, Wang X. DBSCAN algorithm based on grid cell. Journal of Jilin University (Engineering and Technology Edition), 2014, 44(4):1135–1139.
20.
Han D, Agrawal A, Liao W, Choudhary A. A novel scalable DBSCAN algorithm with Spark. 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2016:1393–1402.
21.
Păun Gh. A quick introduction to membrane computing. Journal of Logic and Algebraic Programming, 2010, 79(1):291–294.
22.
http://archive.ics.uci.edu/ml
23.
Păun Gh. Computing with membranes. Journal of Computer and System Sciences, 2000, 61(1):108– 143. https://doi.org/10.1006/jcss.1999.1693
PLOS ONE | https://doi.org/10.1371/journal.pone.0200751 December 17, 2018
15 / 17
An improved DBSCAN algorithm based on cell-like P systems with promoters and inhibitors
24.
Liu X, Xue J. Spatial cluster analysis by the Bin-Packing problem and DNA computing technique. Discrete Dynamics in Nature and Society, 2013, 2013(5187):845–850.
25.
Liu X, Xiang L, Wang X. Spatial cluster analysis by the Adleman-Lipton DNA computing model and flexible grids. Discrete Dynamics in Nature and Society, 2012, 2012(1-4):132–148.
26.
Zeng X, Lin W, Guo M, Zou Q. A comprehensive overview and evaluation of circular RNA detection tools. PLOS Computational Biology, 2017. https://doi.org/10.1371/journal.pcbi.1005420 PMID: 28594838
27.
Zeng X, Liao Y, Liu Y, Zou Q. Prediction and validation of disease genes using HeteSim Scores. IEEE/ ACM Transactions on Computational Biology and Bioinformatics, 2017, 14(3):687–695. https://doi.org/ 10.1109/TCBB.2016.2520947 PMID: 26890920
28.
Zeng X, Zhang X, Song T, Pan L. Spiking neural P systems with thresholds. Neural Computation, 2014, 26(7):1340–1361. https://doi.org/10.1162/NECO_a_00605 PMID: 24708366
29.
Song T, Wang X, Li X, Zheng P. A programming triangular DNA origami for doxorubicin loading and delivering to target ovarian cancer cells. Oncotarget, 2017.12.28. online.
30.
Păun Gh, Rozenberg G, Salomaa A. The Oxford Handbook of Membrane Computing, Oxford University Press, Oxford, UK, 2010.
31.
Martı´n-Vide C, Păun Gh, Pazos J, Rodrı´gues-Pato´n A. Tissue P systems. Theoretical Computer Science, 2003, 296(2):295–326. https://doi.org/10.1016/S0304-3975(02)00659-X
32.
Ionescu M, Păun Gh, Yokomori T. Spiking neural P systems. Fundamenta Informaticae, 2006, 71(2, 3):279–308.
33.
Song T, Wang X. Homogenous spiking neural P systems with inhibitory synapses. Neural Processing Letters, 2015, 42(1):199–214. https://doi.org/10.1007/s11063-014-9352-y
34.
Song T, Pan L. Spiking neural P systems with rules on synapses working in maximum spikes consumption strategy. IEEE Transactions on Nanobioscience, 2015, 14(1): 38–44. https://doi.org/10.1109/TNB. 2014.2367506 PMID: 25389243
35.
Zeng X, Zhang X, Pan L. Homogeneous spiking neural P systems. Fundamenta Informaticae, 2009, 97 (1):275–294.
36.
Zhang X, Wang B, Pan L. Spiking neural P systems with a generalized use of rules. Neural Computation, 2014, 26(12):2925–2943. https://doi.org/10.1162/NECO_a_00665 PMID: 25149700
37.
Zhang X, Zeng X, Luo B, Pan L. On some classes of sequential spiking neural p systems. Neural Computation, 2014, 26(5):974–997. https://doi.org/10.1162/NECO_a_00580 PMID: 24555456
38.
Cabarle F, Adorna HN, Jiang M, Zeng X. Spiking neural P systems with scheduled synapses. IEEE Transactions on Nanobioscience, 2017. https://doi.org/10.1109/TNB.2017.2762580 PMID: 29035221
39.
Peng H, Yang J, Wang J, Wang T, Sun Z, Song X et al. Spiking neural P systems with multiple channels. Neural Networks the Official Journal of the International Neural Network Society, 2017, 95 (66):66–71. https://doi.org/10.1016/j.neunet.2017.08.003 PMID: 28892672
40.
Zhao Y, Liu X, Wang W. Spiking neural P systems with neuron division and dissolution. PLOS One, https://doi.org/10.1371/journal.pone.0162882, 2016.
41.
Song T, Zou Q, Liu X, Zeng X. Asynchronous spiking neural P systems with rules on synapses. Neurocomputing, 2015, 151(1):1439–1445. https://doi.org/10.1016/j.neucom.2014.10.044
42.
Song T, Xu J, Pan L. On the universality and non-universality of spiking neural P systems with rules on synapses. IEEE Transactions on Nanobioscience, 2015, 14(8):960–966. https://doi.org/10.1109/TNB. 2015.2503603 PMID: 26625420
43.
Wang X, Song T, Gong F, Zheng P. On the computational power of spiking neural P systems with selforganization. Scientific Reports, 2016 Jun; 6, 27624; https://doi.org/10.1038/srep27624 PMID: 27283843
44.
Song T, Liu X, Zeng X. Asynchronous spiking neural P systems with anti-spikes. Neural Processing Letters, 2015, 42(3):633–647. https://doi.org/10.1007/s11063-014-9378-1
45.
Zeng X, Xu L, Liu X, Pan L. On languages generated by spiking neural P systems with weights. Information Sciences, 2014, 278(10):423–433. https://doi.org/10.1016/j.ins.2014.03.062
46.
Zhang X, Pan L, Paun A. On the universality of axon P systems. IEEE Transactions on Neural Networks and Learning Systems, 2017, 26(11):2816–2829. https://doi.org/10.1109/TNNLS.2015.2396940
47.
Peng H, Shi P, Wang J, Riscos-Nu´ñez A, Pe´rez-Jime´nez M. Multiobjective fuzzy clustering approach based on tissue-like membrane systems. Knowledge-Based Systems, 2017, 125:74–82. https://doi. org/10.1016/j.knosys.2017.03.024
48.
Ju Y, Zhang S, Ding N, Zeng X, Zhang X. Complex network clustering by a multi-objective evolutionary algorithm based on decomposition and membrane structure. Scientific Reports, https://doi.org/10.1038/ srep33870, 2016.
PLOS ONE | https://doi.org/10.1371/journal.pone.0200751 December 17, 2018
16 / 17
An improved DBSCAN algorithm based on cell-like P systems with promoters and inhibitors
49.
Liu X, Li Z, Liu J, Liu L, Zeng X. Implementation of arithmetic operations with time-free spiking neural P systems. IEEE Transactions on Nanobioscience, 2015, 14(6):617–624. https://doi.org/10.1109/TNB. 2015.2438257 PMID: 26335555
50.
Liu X, Zhao Y, Sun M. An improved Apriori algorithm based on an evolution-communication tissue-like P system with promoters and inhibitors. Discrete Dynamics in Nature and Society, 2017, 2017(1):1–11. https://doi.org/10.1155/2017/6978146
51.
Liu X, Xue J. A cluster splitting technique by Hopfield networks and P systems on simplices. Neural Processing Letters, 2017:1–24.
52.
Song T, Pan L. Spiking neural P systems with rules on synapses working in maximum spiking strategy. IEEE Transactions on Nanobioscience, 2015, 14(4):465–477. https://doi.org/10.1109/TNB.2015. 2402311
53.
Song T, Pan L. Spiking neural P systems with request rules. Neurocomputing, 2016, 193(12):193–200. https://doi.org/10.1016/j.neucom.2016.02.023
54.
Song T, Zheng P, Wong M L D, Wang X. Design of logic gates using spiking neural P systems with homogeneous neurons and astrocytes-like control. Information Sciences, 2016, 372:380–391. https:// doi.org/10.1016/j.ins.2016.08.055
55.
Song T, Gong F, Liu X, Zhao Y, Zhang X. Spiking neural P systems with white hole neurons. IEEE Transactions on Nanobioscience, 2016, https://doi.org/10.1109/TNB.2016.2598879
56.
Zhang X, Tian Y, Cheng R, Jin Y. A decision variable clustering based evolutionary algorithm for largescale many-objective optimization. IEEE Transactions on Evolutionary Computation, 2016, in press. https://doi.org/10.1109/TEVC.2016.2600642
57.
Tian Y, Cheng R, Zhang X, Cheng F, Jin Y. An indicator based multi-objective evolutionary algorithm with reference point adaptation for better versatility, IEEE Transactions on Evolutionary Computation, 2017, in press. https://doi.org/10.1109/TEVC.2017.2749619
PLOS ONE | https://doi.org/10.1371/journal.pone.0200751 December 17, 2018
17 / 17