FAUM: Fast Autonomous Unsupervised Multidimensional classification∗
Hugo Javier Curti, Rubén Sergio Wainschenker
Instituto de Investigación en Tecnología Informática Avanzada (INTIA), Facultad de Ciencias Exactas, Universidad Nacional del Centro de la Provincia de Buenos Aires, Campus Universitario, Paraje Arroyo Seco s/n, Tandil, Buenos Aires, Argentina
Abstract
This article presents Faum: Fast Autonomous Unsupervised Multidimensional, an automatic clustering algorithm that can discover natural groupings in unlabeled data. Faum is aimed to optimize the resources provided by a modern computer to process big datasets. The present algorithm can find disjoint spherical symmetrical clusters in a deterministic way and without the indication of the number of clusters to find, iterations or initializations. Faum is remarkably fast compared to K-Means when a big multidimensional set of data is processed. Since Faum has an average O(N) space and time complexity, it can process datasets of several hundred megabytes size in less than a minute on a standard laptop computer. Furthermore, Faum is not sensitive to outliers and may be used by itself or to provide the whole initialization for a deterministic K-Means processing.

Keywords: Big data, Fast clustering, Autonomous clustering
1. Introduction
Obtaining information from satellite imagery requires ordering and grouping a big amount of multidimensional data. Clustering techniques implemented on computers provide acceptable solutions to achieve this task.
The techniques in use today can obtain good results; however, they tend to be slow and may require human intervention in many steps of the process. Furthermore, some of these techniques strongly depend on specific initializations and contain non-deterministic steps, leading to a long trial-and-error process until an adequate result is found [23, 24].

∗ Corresponding author at: INTIA, Universidad Nacional del Centro de la Provincia de Buenos Aires, Paraje Arroyo Seco s/n, Tandil, Buenos Aires, Argentina. Phone: +54-249-438-5683 ext. 2350 - 2353.
Email addresses: [email protected] (Hugo Javier Curti), [email protected] (Rubén Sergio Wainschenker)
The purpose of this work is to present FAUM: Fast Autonomous Unsupervised Multidimensional, an automatic clustering algorithm that can discover natural groupings in unlabeled big data by efficiently generating different multidimensional histograms, defined as Hyper-histograms, and choosing one to obtain the information. Faum is aimed to optimize the resources provided by a modern computer. When the present research started, the idea was to develop a deterministic K-Means [30] initialization method, without providing the number of clusters to find and also applicable to satellite imagery. During the testing phase, it was found that the algorithm can also be extended to a full clustering algorithm on its own. Moreover, it can be easily generalized to process any big multidimensional data set, with remarkable timing performance compared to K-Means. Although the algorithm is autonomous, it can be fine-tuned by adjusting some parameters.

Faum treats the clustering process as a succession of steps. Each step extracts relevant information from the input dataset, generating a new, smaller dataset to be used in the next step. The aim of this article is to present the first two steps of Faum clustering, which obtains acceptable results with datasets that contain disjoint spherical symmetrical clusters, similar to K-Means. At this point, Faum can be considered a linear clustering method. More steps are currently under research and will be presented in the future. These steps might allow Faum to approach non-linear clusterings, or even include other non-linear techniques in use today, like COLL, MEAP, among others [44, 45], that might take advantage of the dataset size reduction.

Section 2 presents a classification and a brief summary of clustering techniques in use today. In section 3, the proposed new method is presented, described and analyzed in terms of computational and space complexity. Section 4 presents two implementations as proofs of concept, together with the achieved results and timings under different scenarios. The obtained results are interpreted in section 5. Finally, section 6 states the general conclusions and the direction of future work derived from this research.
2. Background
Obtaining information from satellite imagery requires the ordering and grouping of big amounts of data. Data in this analysis is the radiation received from the satellite for each band, visible and infrared, represented by n dots or pixels. If each pixel of the image is described in a space of d dimensions, where each dimension represents a band, each group of points in this feature space can be associated with a type of land cover. Clustering satellite images means to group similar points in this feature space to distinguish different land covers. On the other hand, classifying satellite images means to associate a point with a known, specific land cover. Clustering may also be named as Unsupervised Classification. When working with big datasets, the use of computers is indispensable for implementing either clustering or classification [23, 24]. Clustering algorithms can be divided into the three general groups described below:
• Hierarchical clustering
• Partitioning-based clustering
• Grid-based clustering
2.1. Hierarchical clustering

Hierarchical clustering algorithms recursively find nested clusters, either in agglomerative mode or in divisive mode. The agglomerative mode starts with each data point in its own cluster and merges the most similar pair of clusters successively to form a cluster hierarchy; thus, conforming a bottom-up strategy. The divisive mode begins with all the data points in one cluster and recursively divides each cluster into smaller ones: a top-down strategy. Examples of the former strategies are AGNES and DIANA [27]. On the other hand, Genie [17], CURE [21] and DenPEHC [49, 34] are also examples of hierarchical clustering.
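For illustration only, the agglomerative, bottom-up strategy can be sketched with an off-the-shelf linkage routine; the use of scipy and of Ward linkage below is an assumption of this example, not one of the surveyed methods.

```python
# Minimal sketch of the agglomerative (bottom-up) strategy described above:
# every point starts in its own cluster and the closest pairs are merged
# until a full hierarchy is built. The choice of scipy and Ward linkage is
# an assumption of this illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.3, (20, 2)),   # two well separated
                    rng.normal(5.0, 0.3, (20, 2))])  # 2-D groups

hierarchy = linkage(points, method="ward")               # bottom-up merge tree
labels = fcluster(hierarchy, t=2, criterion="maxclust")  # cut into 2 clusters
print(labels)
```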
2.2. Partitioning-based clustering

Partitioning-based clustering algorithms find all the clusters simultaneously as a partition of the data. The partition is based on similarity, or dissimilarity, functions between the points, and does not impose a hierarchical structure. The simplest and most popular partitioning-based algorithm is K-Means and its variants. Another modern partitioning-based technique is Affinity Propagation and its variants [14, 45, 39]. Both techniques are briefly described below.
2.2.1. K-Means

The K-Means algorithm finds a partition minimizing the squared error between the empirical mean of a cluster and the points in said cluster. The goal of K-Means is to minimize the summation of the squared error over all k clusters. Minimizing this objective function is known to be an NP-hard problem, even for k = 2 [8]. Thus K-Means, which is a greedy algorithm, can only converge on a local minimum, even though recent studies show that K-Means could converge on the global optimum, with high probability, when clusters are well separated [31]. K-Means starts with an initial partition with k clusters and assigns patterns to clusters to reduce the squared error. Since the squared error always decreases when the number of clusters k increases, it can only be minimized for a fixed number of clusters k. The main steps of the classic K-Means algorithm [25] are described as follows:

1. Initialize the number of clusters, the distance function, the end criteria, and the centroid of each cluster. The centroids may be either manually specified or randomly set.
2. For each point, compute the distance to each centroid and assign the point to the cluster represented by the nearest centroid.
3. Compute the new centroid of each cluster as the mass center of the cluster.
4. Repeat steps 2 and 3 until the end criteria are met.

The main advantages of K-Means are that it can achieve good results, it is simple to implement, and its space complexity is O(N). On the other hand, K-Means highly depends on the initialization parameters, being remarkably sensitive to the presence of outliers. Additionally, the number of clusters and the end criteria must be defined in advance. Furthermore, there is an implicit assumption of spherical symmetrical point distribution in each cluster [23, 24]. Finally, the computational complexity of K-Means is O(n.k.l), being n the number of points, k the number of clusters and l the number of iterations [26]. To enhance K-Means, some extensions have been created. Examples of these extensions are Isodata, Forgy, Fuzzy C-Means, K-Medoids, Dbscan, K-Means++ among others [50, 25, 40, 9, 29].
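As an illustration of the four classic steps, a minimal sketch follows; the random initialization, the Euclidean distance and the stability-based end criterion are assumptions of the example, not the configuration used later in this article.

```python
# Minimal sketch of the classic K-Means loop (steps 1-4 above). The random
# initialization, Euclidean distance and stability end criterion are
# assumptions of this illustration; empty clusters are not handled.
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]   # step 1
    for _ in range(iters):                                          # step 4
        # Step 2: assign every point to the nearest centroid.
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # Step 3: recompute each centroid as the mass center of its cluster.
        new = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):          # end criterion reached
            break
        centroids = new
    return centroids, labels
```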
2.2.2. Affinity Propagation (AP)

The main idea behind Affinity Propagation is to locate points that are in the middle of concentrated areas of points in the feature space. These points are called exemplars. AP neither depends on the initialization nor requires knowing in advance the number of clusters to find. The algorithm works by passing messages between every point, gathering information to determine which are the exemplar candidates. The Similarity Matrix S, sized (n, n), being n the total number of points in the dataset, must be provided as input data. Each S(i, k) position holds information about the similarity between points i and k. The negative squared error, i.e., the negative squared Euclidean distance, is commonly used as similarity measure. The diagonal positions S(k, k) contain a preference value initialized in zero and updated during the run. Finally, points with higher preference value have more probability to be exemplars. During the run, messages are passed between points to interchange information used to update S(k, k) and to build two other matrices: the Responsibility Matrix R and the Availability Matrix A, both sized (n, n). R(i, k) values quantify how well-suited Xk is to serve as the exemplar for Xi, relative to other candidate exemplars for Xi. A(i, k) values represent how appropriate it would be for Xi to choose Xk as its exemplar, relative to other points' preference for Xk as an exemplar [14]. MEAP is an extension of AP that performs better on more complicated structures, like non-linear clusters or multiple objective functions [45, 4, 5, 37, 36]. The space and time computational complexity of these algorithms is O(N²) [16, 45, 39, 33], making their usage with big data difficult. For instance, the full satellite imagery used in the test described in section 4.2.3 would need hundreds of Tera Bytes available in the main memory of the computer.
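A minimal sketch, assuming the usual negative squared Euclidean distance as similarity, makes the quadratic memory cost of the (n, n) matrix explicit; the toy sizes are assumptions of the example.

```python
# The (n, n) Similarity Matrix S required by Affinity Propagation, using the
# negative squared Euclidean distance. The point is that memory grows as
# n^2 (8 bytes per entry in float64), which is what makes the technique hard
# to apply to full satellite scenes.
import numpy as np

def similarity_matrix(points):
    diff = points[:, None, :] - points[None, :, :]
    return -(diff ** 2).sum(axis=-1)              # S[i, k], shape (n, n)

n, d = 1000, 6                                    # assumed toy sizes
S = similarity_matrix(np.random.default_rng(0).normal(size=(n, d)))
print(S.shape, S.nbytes / 1e6, "MB")              # 8 MB for only 1000 points
```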
2.3. Grid-based clustering

The grid-based clustering approach differs from the other clustering algorithms in that it focuses on the value space that surrounds the data points, and not on the data points themselves. In general, a typical grid-based clustering algorithm consists of the following five basic steps [20]:

1. Create the grid structure, i.e., partition the data space into a finite number of cells.
2. Calculate the cell density for each cell.
3. Sort the cells according to their densities.
4. Identify cluster centers.
5. Traverse neighbor cells.

Grid-based, also known as Density-based, approaches are popular for mining clusters in a large multidimensional space wherein clusters are regarded as denser regions than their surroundings [22]. The most important advantages of grid-based clustering are its significant reduction of the computational complexity, especially for clustering really large datasets, and its tolerance to outliers. As an example of grid-based clustering, STING and CLIQUE are described here.

STING (STatistical INformation Grid-based clustering method) was proposed in [46] to cluster spatial databases. The algorithm can be used to facilitate several kinds of spatial queries. The spatial area is divided into rectangular cells, represented by a hierarchical structure. Let the root of the hierarchy be at level 1, its children at level 2, etc. The number of layers could be obtained by changing the number of cells that form a higher level cell. A cell in level i corresponds to the union of the areas of its children in level i + 1. In the STING algorithm, each cell has 4 children and each child corresponds to one quadrant of the parent cell. Only two-dimensional spaces are considered in this algorithm. Some related work can be found in [47].

CLIQUE [2] is a scalable clustering algorithm designed to find subspaces in the data with high density clusters. CLIQUE does not undergo the problem of high dimensionality since it estimates the density only in a low dimensional subspace.
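A minimal sketch of the five generic steps on 2-D points is shown below; the cell size, the density threshold and the 8-neighbor traversal are assumptions of the illustration and do not reproduce STING or CLIQUE.

```python
# Minimal sketch of the five generic grid-based steps listed above, on 2-D
# points. Cell size, density threshold and neighbour rule are assumptions.
import numpy as np
from collections import defaultdict

def grid_cluster(points, cell=1.0, min_density=3):
    cells = defaultdict(int)
    for p in np.asarray(points):                       # 1. build the grid and
        cells[tuple((p // cell).astype(int))] += 1     # 2. count cell density
    dense = [c for c, n in sorted(cells.items(), key=lambda kv: -kv[1])
             if n >= min_density]                      # 3. sort cells by density
    labels, next_label = {}, 0
    for c in dense:                                    # 4. densest unlabeled cell
        if c in labels:                                #    starts a new cluster
            continue
        labels[c] = next_label
        stack = [c]
        while stack:                                   # 5. traverse neighbour cells
            x, y = stack.pop()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (x + dx, y + dy)
                    if nb in cells and cells[nb] >= min_density and nb not in labels:
                        labels[nb] = next_label
                        stack.append(nb)
        next_label += 1
    return labels                                      # dense cell -> cluster id
```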
2.3.1. Uses of histograms in clustering

A histogram partitions the feature space into buckets or bins. In each bin, data distribution is often assumed uniform and recorded using simple statistic data. The distribution of each bin can also be approximated using more complex functions and statistical data. Histograms are used to capture relevant information about the data in a concise representation [48]. Classically, histograms for multi-dimensional spaces, defined as Hyper-histograms, are built either by projecting the data to each dimension and building one histogram per dimension, or by directly partitioning the feature space into Hyper-Bins and using grid-based clustering to group the Hyper-Bins into clusters [43, 15, 18]. For the histogram to show correct information, the use of an optimal bin size is highly important. Historically, the bin size for a histogram was determined using statistical methods, like those described in [38, 41, 13, 42].
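For reference, classical one-dimensional rules of this family (Sturges, Scott, Freedman-Diaconis) can be tried off the shelf; whether these are exactly the rules of [38, 41, 13, 42] is not asserted here.

```python
# Classical one-dimensional bin-size rules, shown only to illustrate the idea
# of statistically determined bin widths mentioned above.
import numpy as np

data = np.random.default_rng(0).normal(loc=128, scale=20, size=100_000)
for rule in ("sturges", "scott", "fd"):          # fd = Freedman-Diaconis
    edges = np.histogram_bin_edges(data, bins=rule)
    print(f"{rule:8s} -> {len(edges) - 1:5d} bins of width {edges[1] - edges[0]:.3f}")
```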
3. Proposed method

To find clusters in big multidimensional data spaces, i.e., four or more dimensions, as is the case with satellite imagery multi-band data, efficiently in computational terms, Faum treats the clustering process as a succession of steps. Each step extracts relevant information from the input dataset, generating a new smaller dataset to be used in the next step. The first steps work with bigger datasets, and tend to have a low computational complexity. On the other hand, the last steps tend to have a higher computational complexity, but work with the smaller datasets generated by the previous steps. Therefore, the whole process complexity is kept bounded. The first version of Faum presented in this article proposes two steps:

1. To build a Hyper-histogram using certain criteria to determine a good Hyper-Bin size. The Hyper-histogram should reflect the high density regions to gather information about the potential clusters present in it. This step is called Zero Order Clustering in this work.
2. To find clusters grouping the Hyper-Bins by proximity. It is important in this step to determine the proximity criteria, and then, choose a distance function, considering the multidimensional nature of the space [30, 1]. This step is called First Order Clustering in this work.

Both steps are detailed in sections 3.1 and 3.2, respectively. In section 3.3, the relation between Faum input parameters is theoretically analyzed. The fine-tuning purpose is stated in section 3.4. Finally, section 3.5 shows a brief computational and space complexity analysis [3]. With the aforementioned steps, Faum can handle linearly separable clusters, like K-Means, but keeping linear space and time computational complexity (O(N)). Furthermore, if the parameters are tuned to be conservative, Faum will generate a bigger number of clusters. These clusters can be considered intermediate and used as the input for a Second Order Clustering Step in a future Faum version. See section 6.
3.1. Zero Order Clustering

The aim of the Zero Order Clustering step is to efficiently find higher density zones, and reduce the dataset size for the next step. To find higher density zones in a discrete dataset, the Zero Order Clustering step divides the space into uniform discrete parts, which should be big enough to represent the local density, and small enough to show the shape of the density distribution; thus, conforming a grid-based approach. In this work Zero Order Clustering is achieved by defining the obtained uniform parts as Hyper-Bins, finding their adequate size and generating a Hyper-histogram. At this point, it is important to consider that the Hyper-histogram is built by a computer, using a discrete model.
3.1.1. Hyper-Bin size determination criteria

For this work, the Hyper-Bin was defined as a hyper-cube.¹ The statistical methods described in [38, 41, 13, 42] were tried, and different results were found. None of them proved to be appropriate for this work on satellite imagery, mainly because they produced either small Hyper-Bins that did not reflect the high density regions, or big ones that masked different high density regions into one; thus, hiding the shape of the density distribution. Consequently, the development of empirical methods was conducted.

The first decision taken is to use power of two sized Hyper-Bins with equal sides: hyper-cubes. Restricting possible sizes to powers of two values allows a faster Hyper-Bin determination for every data point in a digital computer, since the operation may be implemented by bit shifting techniques, as opposed to the much slower standard arithmetical divisions used in other cases. This restriction reduces the Hyper-Bin size possibilities to s = log2 M, with M the maximum data value. In the special case of Landsat imagery, s = 8 (bits per pixel band) for Landsat 5 or Landsat 7 and s = 16 for Landsat 8. These relatively small values of s make it feasible to try each possibility when an efficient algorithm to build the Hyper-histogram, like the one explained in section 3.1.2, is used. The Hyper-Bin size may be searched in a top-down approach, starting with the biggest side size of 2^s, or a bottom-up approach, starting with a Hyper-Bin side size of 1.

In this work the Hyper-Bin cardinality is defined as the count of data points contained in it.² For every proposed criterion in this work, the 0 cardinality Hyper-Bins are neither considered nor represented in any way in the implementations to reduce the space complexity [3] of the problem. In the top extreme of Hyper-Bin side size of 2^s, only one Hyper-Bin is obtained, containing all the data points in it, i.e. cardinality n, where n is the total data-point count. On the other hand, in the bottom extreme of Hyper-Bin side size of 1, many low cardinality Hyper-Bins are expected, being the n Hyper-Bins of cardinality 1 the worst case. Both extreme cases are the least adequate to obtain information about the clusters. The middle way Hyper-histograms tend to depict the shape of the distribution better, showing more information and emphasizing the high density regions. Figure 1 shows a simple graphical example applied to a two-dimension dataset. It can be appreciated that side size 1 bins do not represent the local density, while side size 4 bins tend to unify close modes. In this work, every Hyper-histogram is generated starting with the bigger Hyper-Bin size, until an adequate size is found.

Two empirical methods to find a suitable Hyper-Bin size are presented. Both methods should be considered initial ideas that deserve a deeper study, even though good results were achieved with them. New methods may be conceived in the future. The decision about which method should be used depends on each particular problem. See section 3.4 for a more detailed explanation.

¹ The same size is used for every dimension. This definition allows covering all the space without superposition.
² The cardinality concept may correlate with the density concept since the Hyper-Bin size is constant for a given Hyper-histogram.
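A minimal sketch of the power-of-two addressing, assuming non-negative integer band values: with a Hyper-Bin side size of 2^p, the address of a data point is obtained by shifting each coordinate p bits to the right instead of dividing. Function and variable names are illustrative only.

```python
# Power-of-two Hyper-Bin addressing as described above: with side size 2**p,
# the Hyper-Bin address of a data point with non-negative integer coordinates
# is obtained by a right shift of p bits per coordinate, not a division.

def hyper_bin_address(point, p):
    """point: tuple of integer band values; Hyper-Bin side size is 2**p."""
    return tuple(x >> p for x in point)

# A Landsat-like 6-band pixel with a Hyper-Bin side size of 8 (p = 3):
print(hyper_bin_address((17, 255, 64, 3, 128, 200), 3))   # -> (2, 31, 8, 0, 16, 25)
```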
Figure 1: Simple example of different bin side sizes 4, 2 and 1 respectively, and the corresponding histogram in a two-dimension dataset. The Plural Bin Count and the Cardinality Dispersion are reported under each histogram.
The plural Hyper-Bin count empirical method. When generating multiple Hyper-histograms of the same discrete fixed dataset, using different Hyper-Bin sizes, it can be observed that, when the Hyper-Bin size is reduced, the total count of populated Hyper-Bins increases, while the cardinality of the Hyper-Bins tends to reduce. Particularly, the amount of Hyper-Bins with cardinality 1 increases. These cardinality 1 Hyper-Bins may neither represent the local density nor permit to reduce the dataset size; thus, being harmful for the purpose of this step. It seems that when an important amount of cardinality 1 Hyper-Bins are present, the Hyper-histogram cannot represent the high density areas correctly. In this work a Plural Hyper-Bin is defined as a Hyper-Bin whose cardinality is greater than or equal to two. The Plural Hyper-Bin count method consists in determining the Hyper-Bin size that maximizes the amount of Plural Hyper-Bins present in the Hyper-histogram. In Figure 1, the Plural Hyper-Bin count is shown under each histogram. For example, Figure 2 shows the amount of plural Hyper-Bins present in each Hyper-histogram generated with two-powered Hyper-Bin side sizes for the Landsat 7 225-86 2000-01-19 image. According to this method, the Hyper-histogram with Hyper-Bin side size 2 should be used. Particularly, the relation between cardinality 1 Hyper-Bins and total populated Hyper-Bins rises to 98.33% when the Hyper-Bin side size 1 is used. Hence, no information about local densities can be obtained.
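A minimal sketch of this criterion, assuming integer data points and the bit-shifted addressing of section 3.1.1; the helper names and the toy dataset are illustrative only.

```python
# Sketch of the Plural Hyper-Bin count criterion: for every power-of-two side
# size 2**p, count the Hyper-Bins holding at least two points and keep the
# size that maximizes that count. Names and toy data are assumptions.
from collections import Counter

def plural_bin_count(points, p):
    hist = Counter(tuple(x >> p for x in point) for point in points)
    return sum(1 for card in hist.values() if card >= 2)

def best_side_by_plural_count(points, s):
    # s = log2(M), with M the maximum data value (s = 8 for 8-bit bands).
    counts = {2 ** p: plural_bin_count(points, p) for p in range(s + 1)}
    return max(counts, key=counts.get), counts

pts = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (200, 200)]
print(best_side_by_plural_count(pts, 8))
```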
Figure 2: Amount of plural Hyper-Bins (green bars) and total populated Hyper-Bins (purple line) in each of the Hyper-histograms generated with two-powered Hyper-Bin side sizes.

The Cardinality Dispersion empirical method. The second method consists in determining the Hyper-Bin size by maximizing the Cardinality Dispersion. In this work the Cardinality Dispersion of a Hyper-histogram is defined as the total number of different cardinality values in it, excluding cardinality 0. For instance, in a space with homogeneously distributed points, every Hyper-Bin should have the same cardinality, then the Cardinality Dispersion is one. Studying the relation between the Hyper-Bin size and the Cardinality Dispersion, it can be observed that extreme values tend to have lower Cardinality Dispersion. It seems that when bigger Hyper-Bin sizes are used, the Cardinality Dispersion is low because there are few populated Hyper-Bins. Whereas, when smaller Hyper-Bin sizes are used, all the populated Hyper-Bins tend to have uniform, small values of cardinality; thus, decreasing the Cardinality Dispersion again. As in the case of the Plural Hyper-Bin count method, those Hyper-Bins with small values do not seem to represent the local density. In Figure 1, the Cardinality Dispersion is shown under each histogram. Figure 3 shows the Cardinality Dispersion for the same Hyper-histograms used in Figure 2. According to this method, the Hyper-histogram with Hyper-Bin side size 8 should be used.

Figure 3: Cardinality Dispersion in each of the Hyper-histograms generated with two-powered Hyper-Bin side sizes.
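The Cardinality Dispersion criterion can be sketched the same way; again the names and toy data are assumptions of the example.

```python
# Sketch of the Cardinality Dispersion criterion: the number of distinct
# cardinality values among the populated Hyper-Bins, maximized over the
# candidate power-of-two side sizes.
from collections import Counter

def cardinality_dispersion(points, p):
    hist = Counter(tuple(x >> p for x in point) for point in points)
    return len(set(hist.values()))      # cardinality 0 bins are never stored

def best_side_by_dispersion(points, s):
    disp = {2 ** p: cardinality_dispersion(points, p) for p in range(s + 1)}
    return max(disp, key=disp.get), disp

pts = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (200, 200)]
print(best_side_by_dispersion(pts, 8))
```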
3.1.2. Hyper-histogram generation

Since the proposed methods require the generation of many Hyper-histograms for the same data point set, it is important to use an efficient algorithm to create them, using all the resources the computer and the operating system offer. Firstly, the data point set file, e.g., the Landsat image, is mapped in the main memory of the process, avoiding the need of physical main memory to load the whole data point set. The operating system loads each part on demand. If there is enough physical memory, only the first read requires access to the much slower secondary memory, i.e., a hard disk. If there is not enough physical memory, more accesses may be required, but the program may still function although it is slower. Without this memory mapping based implementation, the program could not function when there is not enough physical memory to store the whole data point set.

Secondly, to generate the Hyper-histogram, the sparse matrix concept applied to a histogram [7] is used. Only the cardinalities of the Hyper-Bins that actually contain points are positively stored. The disadvantage of sparse matrices is that it is not trivial to obtain access to their elements with O(1), i.e., constant, computational complexity, which is easily achieved in normal dense matrices where the position of each element depends exclusively on its coordinates. To solve this inconvenience, a hash table structure [28] is used to implement the sparse matrix, storing the cardinality of each Hyper-Bin. The Hyper-Bin address is computed from the data point value, and the hash table key is computed from the Hyper-Bin address. Both computations are performed using extremely fast bit shifting operations, and the hash table allows really fast access, of almost constant complexity, to the cardinality using the hash table key. Consequently, the cardinality can be accessed either from the data point value or from the Hyper-Bin address, using a fast function with almost constant complexity.

Finally, starting with an empty hash table, the Hyper-histogram is populated traversing the data points once. For each data point, the cardinality of its Hyper-Bin is searched on the hash table. If it is found, then its cardinality is incremented, and if it is not found, zero cardinality is assumed and the Hyper-Bin is initialized with cardinality one. The hash table is the implementation of the Hyper-histogram used as departure for the First Order Clustering algorithm described in the next section. The computational complexity of said algorithm is O(n), i.e., linear, being n the point data count, and many Hyper-histograms with different Hyper-Bin sizes may be constructed concurrently. The space complexity is O(m), being m the non-zero cardinality Hyper-Bin count for each concurrently constructed Hyper-histogram.
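A minimal sketch of this single-pass construction, assuming a raw band-interleaved file with one byte per band; Python's mmap and dict stand in for the memory mapping and the hash table described above, and the file layout and names are assumptions of the example.

```python
# Sketch of the single-pass Hyper-histogram construction described above:
# the raw data file is memory mapped and a dict (hash table) keyed by the
# bit-shifted Hyper-Bin address stores the cardinality of the populated
# Hyper-Bins only.
import mmap
from collections import defaultdict

def hyper_histogram(path, dims, p):
    """dims: bands per data point; Hyper-Bin side size is 2**p."""
    hist = defaultdict(int)                 # sparse: only populated Hyper-Bins
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
        n = len(data) // dims               # total data-point count
        for i in range(n):
            sample = data[i * dims:(i + 1) * dims]     # one pixel (bytes)
            address = tuple(b >> p for b in sample)    # bit-shifted address
            hist[address] += 1              # O(1) average, O(n) for the pass
    return dict(hist)

# Hypothetical usage: hist = hyper_histogram("scene_bands.raw", dims=6, p=3)
```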
3.2. First Order Clustering

As a result of the Zero Order Clustering step, the list of populated Hyper-Bins and their cardinality is obtained. This list conforms a new dataset that contains information about the original dataset but whose size is orders of magnitude smaller on average, allowing a more complex algorithm to be applied. The main idea behind the First Order Clustering step is to find the modes present on the Hyper-histogram generated on the previous step, preferably preserving overlapped modes.

In this work, a novel algorithm to find the number and conformation of clusters in the dataset is presented.³ This algorithm is deterministic and unattended although it can be fine-tuned. It will find spherical, symmetrical clusters from the modes present in the Hyper-histogram. In the case of overlapped modes, this method will try to generate a group of clusters preserving information for a potential next step. When modes are isolated, good results are achieved, similar to a correctly initialized K-Means, but with less computer complexity. The proposed First Order Clustering algorithm assumes that the most populated Hyper-Bins are candidates to be cluster seeds, and that they conform a cluster with the Hyper-Bins that constitute their neighborhood, considering the high dimension number scenario.

A Hyper-Bin belongs to the same neighborhood as the seed Hyper-Bin if the distance between them is smaller than a constant value c, a natural number measured in Hyper-Bin units. To measure the distance, an appropriate function must be selected. There are many functions that comply with the distance definition. The following distance functions may be cited as examples: Euclidean, Manhattan, Chebyshev, Minkowski in general, Mahalanobis, Jaccard, Spearman and Hamming [35]. In this work, the Chebyshev distance was chosen since it is really fast to compute and reflects both the hyper-cube shape of the Hyper-Bins and the discrete nature of the method better. Equation 1 shows the Chebyshev distance definition between two Hyper-Bins addressed as X̄ and Ȳ multidimensional vectors respectively, where d is the dimension of the vectors.

D_{Ch}(\bar{X}, \bar{Y}) = \max_{i=1}^{d} |x_i - y_i|    (1)

³ At this point, any clustering method such as those described in [9, 29, 22, 14, 45, 39] could potentially take advantage of the reduced dataset and be used to find the clusters, allowing treating bigger datasets with less computer power. See section 6.

The next sections present a description and a proposal of implementation of the algorithm used to generate the clusters from the Hyper-histogram.
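Equation 1 translates directly into code; the small function below is just that definition applied to two Hyper-Bin addresses.

```python
# Equation 1 written out directly: the Chebyshev distance between two
# Hyper-Bin addresses is the largest per-dimension absolute difference.
def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

print(chebyshev((3, 7, 2), (5, 6, 2)))   # -> 2
```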
3.2.1. First Order Clustering algorithm description

This algorithm takes the Hyper-Bin with the biggest cardinality and assumes it is the seed of a cluster. Next, it searches all the populated Hyper-Bins that conform its neighborhood, i.e., those Hyper-Bins that are closer to the seed than the fixed distance c. Afterwards, the algorithm takes the next biggest cardinality Hyper-Bin that is more than 2c far from the previously chosen cluster seeds. This Hyper-Bin is considered a new cluster seed and its neighborhood is constructed.⁴ These steps are repeated until every Hyper-Bin is either inside a cluster or nearer than 2c from at least one cluster seed. Unclustered Hyper-Bins may either conform one-Hyper-Bin clusters by themselves or be included in the cluster whose seed is the closest to them, considering that the distance cannot be larger than 2c. The first solution preserves more information for a possible Second Order Clustering Step, whereas the second solution gathers a better final clustering. A step by step algorithm description is stated below, and Figure 4 shows a graphical example.

⁴ The 2c value is neither arbitrary nor constitutes a new parameter. It is derived from the fact that neighborhoods should not overlap. Should every Hyper-Bin in the cluster be c far at most from its seed, two seeds must be more than 2c far from each other to ensure that a given Hyper-Bin can be included in at most one neighborhood, according to the definition.

Figure 4: Graphical representation of the First Order Clustering algorithm. The Hyper-histogram is 2D projected for simplicity. Darker Hyper-Bins have higher cardinality.

i. From the Hyper-histogram, generate the List L containing all the Hyper-Bin addresses in descending cardinality order. Shown as State A in Figure 4.
ii. Initialize an empty List U of Hyper-Bin addresses discarded as cluster seed. Shown as State A in Figure 4.
iii. Initialize an empty List C of cluster seed Hyper-Bin addresses. Shown as State A in Figure 4.
iv. Initialize an empty Sparse Matrix M to keep the cluster assigned to every Hyper-Bin address. Shown as State A in Figure 4.
v. Remove the first Hyper-Bin address from the List L, assuming it is the new cluster seed, insert it into the List C and insert the Hyper-Bin in the Matrix M. Shown as States B and E in Figure 4.
vi. Traverse the List L. For every Hyper-Bin that is closer than the fixed distance c to the seed of the cluster found in Step v, remove its address from the List L and insert it into the Matrix M as a member of this cluster. Leave the rest of the Hyper-Bin addresses in the List L. Shown as States C and F in Figure 4.
vii. Check that the first Hyper-Bin of the List L is at least 2c distant from every Hyper-Bin of the List C. If this condition is not satisfied for at least one Hyper-Bin stored in the List C of cluster seeds, remove this first Hyper-Bin address from the List L and add it to the List U of discarded Hyper-Bin addresses. Shown as State D in Figure 4.
viii. Repeat Step vii with the new first element of the List L until the condition is satisfied for every element on the List C or the List L is empty.
ix. Repeat Steps v to viii until the List L is empty.
x. Traverse the List U removing every Hyper-Bin address and comparing it to every cluster seed address in the List C to find the nearest one, adding the Hyper-Bin to the Matrix M in the selected cluster. Shown as State G in Figure 4.

Steps i. to ix. conform the main clustering algorithm, whereas step x. conforms the optional post-clustering step. Figure 5 shows a flowchart representation of the algorithm with the optional post-clustering step highlighted in blue dotted line.

Figure 5: Flowchart of the First Order Clustering algorithm with the optional post-clustering step highlighted in blue dotted line.
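A simplified sketch of steps i. to x. follows. It operates on the Hyper-histogram dict produced by Zero Order Clustering and uses plain Python lists and dicts instead of the ordered binary tree, dynamic arrays and hash table of section 3.2.2, so it illustrates the logic rather than the article's implementation; the boundary choices (<= versus <) are assumptions.

```python
# Simplified sketch of steps i-x above, on a dict mapping Hyper-Bin address
# to cardinality. Plain lists and dicts replace the optimized structures of
# section 3.2.2; boundary comparisons are assumptions of this illustration.

def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def first_order_clustering(hist, c, post_cluster=True):
    L = sorted(hist, key=hist.get, reverse=True)   # i.   descending cardinality
    U = []                                         # ii.  discarded seed candidates
    C = []                                         # iii. cluster seed addresses
    M = {}                                         # iv.  address -> cluster index
    while L:
        head = L.pop(0)
        # vii./viii. a head closer than 2c to an existing seed is discarded.
        if any(chebyshev(head, seed) <= 2 * c for seed in C):
            U.append(head)
            continue
        cluster = len(C)                           # v.   head becomes a new seed
        C.append(head)
        M[head] = cluster
        # vi.  every remaining Hyper-Bin within distance c joins the cluster.
        for b in [b for b in L if chebyshev(b, head) <= c]:
            M[b] = cluster
        L = [b for b in L if b not in M]           # ix.  repeat until L is empty
    if post_cluster:                               # x.   optional post-clustering
        for b in U:
            M[b] = min(range(len(C)), key=lambda j: chebyshev(b, C[j]))
    return C, M, U

# Hypothetical tiny example:
# seeds, assignment, discarded = first_order_clustering(
#     {(0, 0): 9, (0, 1): 4, (1, 1): 3, (5, 5): 7}, c=1)
```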
3.2.2. First Order Clustering algorithm implementation

As in the previous phase, many considerations were taken to keep computational and space complexity as low as possible.

The List L is implemented as an ordered binary tree using cardinality as strict ordering criteria. In every node of the tree, a dynamic array of Hyper-Bin addresses with the associated cardinality is stored as data. The structure can be built with O(m.log(m)) complexity, where m is the number of non-zero cardinality Hyper-Bins from the Hyper-histogram resulting from Zero Order Clustering. The space complexity of the structure that implements the List L is O(m), and the Hyper-histogram may be destroyed as soon as the List L is built.

The List C is implemented as a dynamic array of seed Hyper-Bin addresses. This array has O(k) computational complexity, being k the number of clusters found. The space complexity is also O(k).

The List U is implemented as a dynamic array of discarded seed Hyper-Bin addresses. This array has O(m) computational and space complexity in the worst case.

The Sparse Matrix M is implemented as a hash table using Hyper-Bin addresses as the keys, with the same functions used for the Hyper-histogram in Zero Order Clustering, and the seed Hyper-Bin addresses as the data. This hash table has O(m) computational and space complexity.

The general algorithm computational complexity can be estimated as O(m) per round in the worst case, with one round per found cluster; thus, giving a computational complexity of O(m.k), being m the non-zero cardinality Hyper-Bin count and k the number of clusters found. The worst case is k = m. Therefore the final First Order Clustering worst computational complexity is O(m²).

3.2.3. Neighborhood radius c determination criteria

As described in section 3.2, First Order Clustering depends on a fixed distance c that is computed using the Chebyshev distance function. This distance represents the Chebyshev radius of the neighborhood, centered at the seed Hyper-Bin address and measured in Hyper-Bin units. Figure 6 shows 2D examples with different c values.

Figure 6: 2D examples of different values of Chebyshev radius distance c neighborhoods centered at the seed S.

Both Hyper-Bin side size and distance c will directly affect the size and the number of clusters to be found. Since First Order Clustering complexity is strongly dependent on the number of Hyper-Bins, and bigger Hyper-Bin side sizes tend to give smaller Hyper-Bin counts, it may be convenient to use a bigger Hyper-Bin side size and a smaller distance c value. Using this criterion for the first approach, the Hyper-Bin side size is determined using one of the criteria presented in section 3.1.1, and the distance c is fixed to value 1. Once this result is obtained, a fine-tuning may be performed to adjust the number and size of the obtained clusters. A theoretical base to automate this fine-tuning is described in the next section. Furthermore, an algorithm to automate this process is under research while writing the present article. See section 6.

3.3. Relation between Hyper-Bin side size and distance c

The number of clusters to find may be adjusted by fine-tuning Hyper-Bin side size and distance c. Similar clusters are obtained using smaller Hyper-Bin