FAUM: Fast Autonomous Unsupervised Multidimensional classification∗
Hugo Javier Curti, Rubén Sergio Wainschenker
Instituto de Investigación en Tecnología Informática Avanzada (INTIA), Facultad de Ciencias Exactas, Universidad Nacional del Centro de la Provincia de Buenos Aires, Campus Universitario, Paraje Arroyo Seco s/n, Tandil, Buenos Aires, Argentina
Abstract
This article presents Faum: Fast Autonomous Unsupervised Multidimensional, an automatic clustering algorithm that can discover natural groupings in unlabeled data. Faum is aimed to optimize the resources provided by a modern computer to process big datasets. The present algorithm can find disjoint spherical symmetrical clusters in a deterministic way and without the indication of the number of clusters to find, iterations or initializations. Faum is remarkably fast compared to K-Means when a big multidimensional set of data is processed. Since Faum has an average O(N) space and time complexity, it can process datasets of several hundred megabytes size in less than a minute on a standard laptop computer. Furthermore, Faum is not sensitive to outliers and may be used by itself or to provide the whole initialization for a deterministic K-Means processing.

Keywords: Big data, Fast clustering, Autonomous clustering
1. Introduction
Obtaining information from satellite imagery requires ordering and grouping a big amount of multidimensional data. Clustering techniques implemented on computers provide acceptable solutions to achieve this task.
The techniques in use today can obtain good results; however, they tend to be slow and may require human intervention in many steps of the process. Furthermore, some of these techniques strongly depend on specific initializations and contain non-deterministic steps, leading to a long trial-and-error process until an adequate result is found [23, 24].

∗ Corresponding author at: INTIA, Universidad Nacional del Centro de la Provincia de Buenos Aires, Paraje Arroyo Seco s/n, Tandil, Buenos Aires, Argentina. Phone: +54-249-438-5683 ext. 2350 - 2353.
Email addresses: [email protected] (Hugo Javier Curti), [email protected] (Rubén Sergio Wainschenker)
The purpose of this work is to present FAUM: Fast Autonomous Unsupervised Multidimensional, an automatic clustering algorithm that can discover natural groupings in unlabeled big data by efficiently generating different multidimensional histograms, defined as Hyper-histograms, and choosing one to obtain the information. Faum is aimed to optimize the resources provided by a modern computer. When the present research started, the idea was to develop a deterministic K-Means [30] initialization method, without providing the number of clusters to find and also applicable to satellite imagery. During the testing phase, it was found that the algorithm can also be extended to a full clustering algorithm on its own. Moreover, it can be easily generalized to process any big multidimensional data set, with remarkable timing performance compared to K-Means. Although the algorithm is autonomous, it can be fine-tuned by adjusting some parameters.

Faum treats the clustering process as a succession of steps. Each step extracts relevant information from the input dataset, generating a new, smaller dataset to be used in the next step. The aim of this article is to present the first two steps of Faum clustering, which obtains acceptable results with datasets that contain disjoint spherical symmetrical clusters, similar to K-Means. At this point, Faum can be considered a linear clustering method. More steps are currently under research and will be presented in the future. These steps might allow Faum to approach non-linear clusterings, or even include other non-linear techniques in use today, like COLL, MEAP, among others [44, 45], that might take advantage of the dataset size reduction.

Section 2 presents a classification and a brief summary of clustering techniques in use today. In section 3, the proposed new method is presented, described and analyzed in terms of computational and space complexity. Section 4 presents two implementations as proofs of concept, together with the achieved results and timings under different scenarios. The obtained results are interpreted in section 5. Finally, section 6 states the general conclusions and the direction of future work derived from this research.
2. Background
Obtaining information from satellite imagery requires the ordering and grouping of big amounts of data. Data in this analysis is the radiation received from the satellite for each band, visible and infrared, represented by n dots or pixels. If each pixel of the image is described in a space of d dimensions, where each dimension represents a band, each group of points in this feature space can be associated with a type of land cover. Clustering satellite images means to group similar points in this feature space to distinguish different land covers. On the other hand, classifying satellite images means to associate a point with a known, specific land cover. Clustering may also be named as Unsupervised Classification. When working with big datasets, the use of computers is indispensable for implementing either clustering or classification [23, 24]. Clustering algorithms can be divided into the three general groups described below:
• Hierarchical clustering
• Partitioning-based clustering
• Grid-based clustering
2.1. Hierarchical clustering

Hierarchical clustering algorithms recursively find nested clusters, either in agglomerative mode or in divisive mode. The agglomerative mode starts with each data point in its own cluster and merges the most similar pair of clusters successively to form a cluster hierarchy; thus, conforming a bottom-up strategy. The divisive mode begins with all the data points in one cluster and recursively divides each cluster into smaller ones: a top-down strategy. Examples of the former strategies are AGNES and DIANA [27]. On the other hand, Genie [17], CURE [21] and DenPEHC [49, 34] are also examples of hierarchical clustering.
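For illustration only, the agglomerative, bottom-up strategy can be sketched with an off-the-shelf linkage routine; the use of scipy and of Ward linkage below is an assumption of this example, not one of the surveyed methods.

```python
# Minimal sketch of the agglomerative (bottom-up) strategy described above:
# every point starts in its own cluster and the closest pairs are merged
# until a full hierarchy is built. The choice of scipy and Ward linkage is
# an assumption of this illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.3, (20, 2)),   # two well separated
                    rng.normal(5.0, 0.3, (20, 2))])  # 2-D groups

hierarchy = linkage(points, method="ward")               # bottom-up merge tree
labels = fcluster(hierarchy, t=2, criterion="maxclust")  # cut into 2 clusters
print(labels)
```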
2.2. Partitioning-based clustering

Partitioning-based clustering algorithms find all the clusters simultaneously as a partition of the data. The partition is based on similarity, or dissimilarity, functions between the points, and does not impose a hierarchical structure. The simplest and most popular partitioning-based algorithm is K-Means and its variants. Another modern partitioning-based technique is Affinity Propagation and its variants [14, 45, 39]. Both techniques are briefly described below.
2.2.1. K-Means

The K-Means algorithm finds a partition minimizing the squared error between the empirical mean of a cluster and the points in said cluster. The goal of K-Means is to minimize the summation of the squared error over all k clusters. Minimizing this objective function is known to be an NP-hard problem, even for k = 2 [8]. Thus K-Means, which is a greedy algorithm, can only converge on a local minimum, even though recent studies show that K-Means could converge on the global optimum, with high probability, when clusters are well separated [31]. K-Means starts with an initial partition with k clusters and assigns patterns to clusters to reduce the squared error. Since the squared error always decreases when the number of clusters k increases, it can only be minimized for a fixed number of clusters k. The main steps of the classic K-Means algorithm [25] are described as follows:

1. Initialize the number of clusters, the distance function, the end criteria, and the centroid of each cluster. The centroids may be either manually specified or randomly set.
2. For each point, compute the distance to each centroid and assign the point to the cluster represented by the nearest centroid.
3. Compute the new centroid of each cluster as the mass center of the cluster.
4. Repeat steps 2 and 3 until the end criteria are met.

The main advantages of K-Means are that it can achieve good results, it is simple to implement, and its space complexity is O(N). On the other hand, K-Means highly depends on the initialization parameters, being remarkably sensitive to the presence of outliers. Additionally, the number of clusters and the end criteria must be defined in advance. Furthermore, there is an implicit assumption of spherical symmetrical point distribution in each cluster [23, 24]. Finally, the computational complexity of K-Means is O(n.k.l), being n the number of points, k the number of clusters and l the number of iterations [26]. To enhance K-Means, some extensions have been created. Examples of these extensions are Isodata, Forgy, Fuzzy C-Means, K-Medoids, Dbscan, K-Means++ among others [50, 25, 40, 9, 29].
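As an illustration of the four classic steps, a minimal sketch follows; the random initialization, the Euclidean distance and the stability-based end criterion are assumptions of the example, not the configuration used later in this article.

```python
# Minimal sketch of the classic K-Means loop (steps 1-4 above). The random
# initialization, Euclidean distance and stability end criterion are
# assumptions of this illustration; empty clusters are not handled.
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]   # step 1
    for _ in range(iters):                                          # step 4
        # Step 2: assign every point to the nearest centroid.
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # Step 3: recompute each centroid as the mass center of its cluster.
        new = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):          # end criterion reached
            break
        centroids = new
    return centroids, labels
```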
2.2.2. Affinity Propagation (AP)

The main idea behind Affinity Propagation is to locate points that are in the middle of concentrated areas of points in the feature space. These points are called exemplars. AP neither depends on the initialization nor requires knowing in advance the number of clusters to find. The algorithm works by passing messages between every point, gathering information to determine which are the exemplar candidates. The Similarity Matrix S, sized (n, n), being n the total number of points in the dataset, must be provided as input data. Each S(i, k) position holds information about the similarity between points i and k. The negative squared error, i.e., the negative squared Euclidean distance, is commonly used as similarity measure. The diagonal positions S(k, k) contain a preference value initialized in zero and updated during the run. Finally, points with higher preference value have more probability to be exemplars. During the run, messages are passed between points to interchange information used to update S(k, k) and to build two other matrices: the Responsibility Matrix R and the Availability Matrix A, both sized (n, n). R(i, k) values quantify how well-suited Xk is to serve as the exemplar for Xi, relative to other candidate exemplars for Xi. A(i, k) values represent how appropriate it would be for Xi to choose Xk as its exemplar, relative to other points' preference for Xk as an exemplar [14]. MEAP is an extension of AP that performs better on more complicated structures, like non-linear clusters or multiple objective functions [45, 4, 5, 37, 36]. The space and time computational complexity of these algorithms is O(N²) [16, 45, 39, 33], making their usage with big data difficult. For instance, the full satellite imagery used in the test described in section 4.2.3 would need hundreds of Tera Bytes available in the main memory of the computer.
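A minimal sketch, assuming the usual negative squared Euclidean distance as similarity, makes the quadratic memory cost of the (n, n) matrix explicit; the toy sizes are assumptions of the example.

```python
# The (n, n) Similarity Matrix S required by Affinity Propagation, using the
# negative squared Euclidean distance. The point is that memory grows as
# n^2 (8 bytes per entry in float64), which is what makes the technique hard
# to apply to full satellite scenes.
import numpy as np

def similarity_matrix(points):
    diff = points[:, None, :] - points[None, :, :]
    return -(diff ** 2).sum(axis=-1)              # S[i, k], shape (n, n)

n, d = 1000, 6                                    # assumed toy sizes
S = similarity_matrix(np.random.default_rng(0).normal(size=(n, d)))
print(S.shape, S.nbytes / 1e6, "MB")              # 8 MB for only 1000 points
```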
2.3. Grid-based clustering

The grid-based clustering approach differs from the other clustering algorithms in that it focuses on the value space that surrounds the data points, and not on the data points themselves. In general, a typical grid-based clustering algorithm consists of the following five basic steps [20]:

1. Create the grid structure, i.e., partition the data space into a finite number of cells.
2. Calculate the cell density for each cell.
3. Sort the cells according to their densities.
4. Identify cluster centers.
5. Traverse neighbor cells.

Grid-based, also known as Density-based, approaches are popular for mining clusters in a large multidimensional space wherein clusters are regarded as denser regions than their surroundings [22]. The most important advantages of grid-based clustering are its significant reduction of the computational complexity, especially for clustering really large datasets, and its tolerance to outliers. As an example of grid-based clustering, STING and CLIQUE are described here.

STING (STatistical INformation Grid-based clustering method) was proposed in [46] to cluster spatial databases. The algorithm can be used to facilitate several kinds of spatial queries. The spatial area is divided into rectangular cells, represented by a hierarchical structure. Let the root of the hierarchy be at level 1, its children at level 2, etc. The number of layers could be obtained by changing the number of cells that form a higher level cell. A cell in level i corresponds to the union of the areas of its children in level i + 1. In the STING algorithm, each cell has 4 children and each child corresponds to one quadrant of the parent cell. Only two-dimensional spaces are considered in this algorithm. Some related work can be found in [47].

CLIQUE [2] is a scalable clustering algorithm designed to find subspaces in the data with high density clusters. CLIQUE does not undergo the problem of high dimensionality since it estimates the density only in a low dimensional subspace.
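A minimal sketch of the five generic steps on 2-D points is shown below; the cell size, the density threshold and the 8-neighbor traversal are assumptions of the illustration and do not reproduce STING or CLIQUE.

```python
# Minimal sketch of the five generic grid-based steps listed above, on 2-D
# points. Cell size, density threshold and neighbour rule are assumptions.
import numpy as np
from collections import defaultdict

def grid_cluster(points, cell=1.0, min_density=3):
    cells = defaultdict(int)
    for p in np.asarray(points):                       # 1. build the grid and
        cells[tuple((p // cell).astype(int))] += 1     # 2. count cell density
    dense = [c for c, n in sorted(cells.items(), key=lambda kv: -kv[1])
             if n >= min_density]                      # 3. sort cells by density
    labels, next_label = {}, 0
    for c in dense:                                    # 4. densest unlabeled cell
        if c in labels:                                #    starts a new cluster
            continue
        labels[c] = next_label
        stack = [c]
        while stack:                                   # 5. traverse neighbour cells
            x, y = stack.pop()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (x + dx, y + dy)
                    if nb in cells and cells[nb] >= min_density and nb not in labels:
                        labels[nb] = next_label
                        stack.append(nb)
        next_label += 1
    return labels                                      # dense cell -> cluster id
```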
2.3.1. Uses of histograms in clustering

A histogram partitions the feature space into buckets or bins. In each bin, data distribution is often assumed uniform and recorded using simple statistic data. The distribution of each bin can also be approximated using more complex functions and statistical data. Histograms are used to capture relevant information about the data in a concise representation [48]. Classically, histograms for multi-dimensional spaces, defined as Hyper-histograms, are built either by projecting the data to each dimension and building one histogram per dimension, or by directly partitioning the feature space into Hyper-Bins and using grid-based clustering to group the Hyper-Bins into clusters [43, 15, 18]. For the histogram to show correct information, the use of an optimal bin size is highly important. Historically, the bin size for a histogram was determined using statistical methods, like those described in [38, 41, 13, 42].
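For reference, classical one-dimensional rules of this family (Sturges, Scott, Freedman-Diaconis) can be tried off the shelf; whether these are exactly the rules of [38, 41, 13, 42] is not asserted here.

```python
# Classical one-dimensional bin-size rules, shown only to illustrate the idea
# of statistically determined bin widths mentioned above.
import numpy as np

data = np.random.default_rng(0).normal(loc=128, scale=20, size=100_000)
for rule in ("sturges", "scott", "fd"):          # fd = Freedman-Diaconis
    edges = np.histogram_bin_edges(data, bins=rule)
    print(f"{rule:8s} -> {len(edges) - 1:5d} bins of width {edges[1] - edges[0]:.3f}")
```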
3. Proposed method

To find clusters in big multidimensional data spaces, i.e., four or more dimensions, as is the case with satellite imagery multi-band data, efficiently in computational terms, Faum treats the clustering process as a succession of steps. Each step extracts relevant information from the input dataset, generating a new smaller dataset to be used in the next step. The first steps work with bigger datasets, and tend to have a low computational complexity. On the other hand, the last steps tend to have a higher computational complexity, but work with the smaller datasets generated by the previous steps. Therefore, the whole process complexity is kept bounded. The first version of Faum presented in this article proposes two steps:

1. To build a Hyper-histogram using certain criteria to determine a good Hyper-Bin size. The Hyper-histogram should reflect the high density regions to gather information about the potential clusters present in it. This step is called Zero Order Clustering in this work.
2. To find clusters grouping the Hyper-Bins by proximity. It is important in this step to determine the proximity criteria, and then, choose a distance function, considering the multidimensional nature of the space [30, 1]. This step is called First Order Clustering in this work.

Both steps are detailed in sections 3.1 and 3.2, respectively. In section 3.3, the relation between Faum input parameters is theoretically analyzed. The fine-tuning purpose is stated in section 3.4. Finally, section 3.5 shows a brief computational and space complexity analysis [3]. With the aforementioned steps, Faum can handle linearly separable clusters, like K-Means, but keeping linear space and time computational complexity (O(N)). Furthermore, if the parameters are tuned to be conservative, Faum will generate a bigger number of clusters. These clusters can be considered intermediate and used as the input for a Second Order Clustering Step in a future Faum version. See section 6.
3.1. Zero Order Clustering

The aim of the Zero Order Clustering step is to efficiently find higher density zones, and reduce the dataset size for the next step. To find higher density zones in a discrete dataset, the Zero Order Clustering step divides the space into uniform discrete parts, which should be big enough to represent the local density, and small enough to show the shape of the density distribution; thus, conforming a grid-based approach. In this work Zero Order Clustering is achieved by defining the obtained uniform parts as Hyper-Bins, finding their adequate size and generating a Hyper-histogram. At this point, it is important to consider that the Hyper-histogram is built by a computer, using a discrete model.
3.1.1. Hyper-Bin size determination criteria

For this work, the Hyper-Bin was defined as a hyper-cube.¹ The statistical methods described in [38, 41, 13, 42] were tried, and different results were found. None of them proved to be appropriate for this work on satellite imagery, mainly because they produced either small Hyper-Bins that did not reflect the high density regions, or big ones that masked different high density regions into one; thus, hiding the shape of the density distribution. Consequently, the development of empirical methods was conducted.

The first decision taken is to use power of two sized Hyper-Bins with equal sides: hyper-cubes. Restricting possible sizes to powers of two values allows a faster Hyper-Bin determination for every data point in a digital computer, since the operation may be implemented by bit shifting techniques, as opposed to the much slower standard arithmetical divisions used in other cases. This restriction reduces the Hyper-Bin size possibilities to s = log2 M, with M the maximum data value. In the special case of Landsat imagery, s = 8 (bits per pixel band) for Landsat 5 or Landsat 7 and s = 16 for Landsat 8. These relatively small values of s make it feasible to try each possibility when an efficient algorithm to build the Hyper-histogram, like the one explained in section 3.1.2, is used. The Hyper-Bin size may be searched in a top-down approach, starting with the biggest side size of 2^s, or a bottom-up approach, starting with a Hyper-Bin side size of 1.

In this work the Hyper-Bin cardinality is defined as the count of data points contained in it.² For every proposed criterion in this work, the 0 cardinality Hyper-Bins are neither considered nor represented in any way in the implementations to reduce the space complexity [3] of the problem. In the top extreme of Hyper-Bin side size of 2^s, only one Hyper-Bin is obtained, containing all the data points in it, i.e. cardinality n, where n is the total data-point count. On the other hand, in the bottom extreme of Hyper-Bin side size of 1, many low cardinality Hyper-Bins are expected, being the n Hyper-Bins of cardinality 1 the worst case. Both extreme cases are the least adequate to obtain information about the clusters. The middle way Hyper-histograms tend to depict the shape of the distribution better, showing more information and emphasizing the high density regions. Figure 1 shows a simple graphical example applied to a two-dimension dataset. It can be appreciated that side size 1 bins do not represent the local density, while side size 4 bins tend to unify close modes. In this work, every Hyper-histogram is generated starting with the bigger Hyper-Bin size, until an adequate size is found.

Two empirical methods to find a suitable Hyper-Bin size are presented. Both methods should be considered initial ideas that deserve a deeper study, even though good results were achieved with them. New methods may be conceived in the future. The decision about which method should be used depends on each particular problem. See section 3.4 for a more detailed explanation.

¹ The same size is used for every dimension. This definition allows covering all the space without superposition.
² The cardinality concept may correlate with the density concept since the Hyper-Bin size is constant for a given Hyper-histogram.
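A minimal sketch of the power-of-two addressing, assuming non-negative integer band values: with a Hyper-Bin side size of 2^p, the address of a data point is obtained by shifting each coordinate p bits to the right instead of dividing. Function and variable names are illustrative only.

```python
# Power-of-two Hyper-Bin addressing as described above: with side size 2**p,
# the Hyper-Bin address of a data point with non-negative integer coordinates
# is obtained by a right shift of p bits per coordinate, not a division.

def hyper_bin_address(point, p):
    """point: tuple of integer band values; Hyper-Bin side size is 2**p."""
    return tuple(x >> p for x in point)

# A Landsat-like 6-band pixel with a Hyper-Bin side size of 8 (p = 3):
print(hyper_bin_address((17, 255, 64, 3, 128, 200), 3))   # -> (2, 31, 8, 0, 16, 25)
```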
Figure 1: Simple example of different bin side sizes 4, 2 and 1 respectively, and the corresponding histogram in a two-dimension dataset. The Plural Bin Count and the Cardinality Dispersion are reported under each histogram.
The plural Hyper-Bin count empirical method. When generating multiple Hyper-histograms of the same discrete fixed dataset, using different Hyper-Bin sizes, it can be observed that, when the Hyper-Bin size is reduced, the total count of populated Hyper-Bins increases, while the cardinality of the Hyper-Bins tends to reduce. Particularly, the amount of Hyper-Bins with cardinality 1 increases. These cardinality 1 Hyper-Bins may neither represent the local density nor permit to reduce the dataset size; thus, being harmful for the purpose of this step. It seems that when an important amount of cardinality 1 Hyper-Bins are present, the Hyper-histogram cannot represent the high density areas correctly. In this work a Plural Hyper-Bin is defined as a Hyper-Bin whose cardinality is greater than or equal to two. The Plural Hyper-Bin count method consists in determining the Hyper-Bin size that maximizes the amount of Plural Hyper-Bins present in the Hyper-histogram. In Figure 1, the Plural Hyper-Bin count is shown under each histogram. For example, Figure 2 shows the amount of plural Hyper-Bins present in each Hyper-histogram generated with two-powered Hyper-Bin side sizes for the Landsat 7 225-86 2000-01-19 image. According to this method, the Hyper-histogram with Hyper-Bin side size 2 should be used. Particularly, the relation between cardinality 1 Hyper-Bins and total populated Hyper-Bins rises to 98.33% when the Hyper-Bin side size 1 is used. Hence, no information about local densities can be obtained.
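A minimal sketch of this criterion, assuming integer data points and the bit-shifted addressing of section 3.1.1; the helper names and the toy dataset are illustrative only.

```python
# Sketch of the Plural Hyper-Bin count criterion: for every power-of-two side
# size 2**p, count the Hyper-Bins holding at least two points and keep the
# size that maximizes that count. Names and toy data are assumptions.
from collections import Counter

def plural_bin_count(points, p):
    hist = Counter(tuple(x >> p for x in point) for point in points)
    return sum(1 for card in hist.values() if card >= 2)

def best_side_by_plural_count(points, s):
    # s = log2(M), with M the maximum data value (s = 8 for 8-bit bands).
    counts = {2 ** p: plural_bin_count(points, p) for p in range(s + 1)}
    return max(counts, key=counts.get), counts

pts = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (200, 200)]
print(best_side_by_plural_count(pts, 8))
```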
Figure 2: Amount of plural Hyper-Bins (green bars) and total populated Hyper-Bins (purple line) in each of the Hyper-histograms generated with two-powered Hyper-Bin side sizes.

The Cardinality Dispersion empirical method. The second method consists in determining the Hyper-Bin size by maximizing the Cardinality Dispersion. In this work the Cardinality Dispersion of a Hyper-histogram is defined as the total number of different cardinality values in it, excluding cardinality 0. For instance, in a space with homogeneously distributed points, every Hyper-Bin should have the same cardinality, then the Cardinality Dispersion is one. Studying the relation between the Hyper-Bin size and the Cardinality Dispersion, it can be observed that extreme values tend to have lower Cardinality Dispersion. It seems that when bigger Hyper-Bin sizes are used, the Cardinality Dispersion is low because there are few populated Hyper-Bins. Whereas, when smaller Hyper-Bin sizes are used, all the populated Hyper-Bins tend to have uniform, small values of cardinality; thus, decreasing the Cardinality Dispersion again. As in the case of the Plural Hyper-Bin count method, those Hyper-Bins with small values do not seem to represent the local density. In Figure 1, the Cardinality Dispersion is shown under each histogram. Figure 3 shows the Cardinality Dispersion for the same Hyper-histograms used in Figure 2. According to this method, the Hyper-histogram with Hyper-Bin side size 8 should be used.

Figure 3: Cardinality Dispersion in each of the Hyper-histograms generated with two-powered Hyper-Bin side sizes.
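The Cardinality Dispersion criterion can be sketched the same way; again the names and toy data are assumptions of the example.

```python
# Sketch of the Cardinality Dispersion criterion: the number of distinct
# cardinality values among the populated Hyper-Bins, maximized over the
# candidate power-of-two side sizes.
from collections import Counter

def cardinality_dispersion(points, p):
    hist = Counter(tuple(x >> p for x in point) for point in points)
    return len(set(hist.values()))      # cardinality 0 bins are never stored

def best_side_by_dispersion(points, s):
    disp = {2 ** p: cardinality_dispersion(points, p) for p in range(s + 1)}
    return max(disp, key=disp.get), disp

pts = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (200, 200)]
print(best_side_by_dispersion(pts, 8))
```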
3.1.2. Hyper-histogram generation

Since the proposed methods require the generation of many Hyper-histograms for the same data point set, it is important to use an efficient algorithm to create them, using all the resources the computer and the operating system offer. Firstly, the data point set file, e.g., the Landsat image, is mapped in the main memory of the process, avoiding the need of physical main memory to load the whole data point set. The operating system loads each part on demand. If there is enough physical memory, only the first read requires access to the much slower secondary memory, i.e., a hard disk. If there is not enough physical memory, more accesses may be required, but the program may still function although it is slower. Without this memory mapping based implementation, the program could not function when there is not enough physical memory to store the whole data point set.

Secondly, to generate the Hyper-histogram, the sparse matrix concept applied to a histogram [7] is used. Only the cardinalities of the Hyper-Bins that actually contain points are positively stored. The disadvantage of sparse matrices is that it is not trivial to obtain access to their elements with O(1), i.e., constant, computational complexity, which is easily achieved in normal dense matrices where the position of each element depends exclusively on its coordinates. To solve this inconvenience, a hash table structure [28] is used to implement the sparse matrix, storing the cardinality of each Hyper-Bin. The Hyper-Bin address is computed from the data point value, and the hash table key is computed from the Hyper-Bin address. Both computations are performed using extremely fast bit shifting operations, and the hash table allows really fast access, of almost constant complexity, to the cardinality using the hash table key. Consequently, the cardinality can be accessed either from the data point value or from the Hyper-Bin address, using a fast function with almost constant complexity.

Finally, starting with an empty hash table, the Hyper-histogram is populated traversing the data points once. For each data point, the cardinality of its Hyper-Bin is searched on the hash table. If it is found, then its cardinality is incremented, and if it is not found, zero cardinality is assumed and the Hyper-Bin is initialized with cardinality one. The hash table is the implementation of the Hyper-histogram used as departure for the First Order Clustering algorithm described in the next section. The computational complexity of said algorithm is O(n), i.e., linear, being n the point data count, and many Hyper-histograms with different Hyper-Bin sizes may be constructed concurrently. The space complexity is O(m), being m the non-zero cardinality Hyper-Bin count for each concurrently constructed Hyper-histogram.
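A minimal sketch of this single-pass construction, assuming a raw band-interleaved file with one byte per band; Python's mmap and dict stand in for the memory mapping and the hash table described above, and the file layout and names are assumptions of the example.

```python
# Sketch of the single-pass Hyper-histogram construction described above:
# the raw data file is memory mapped and a dict (hash table) keyed by the
# bit-shifted Hyper-Bin address stores the cardinality of the populated
# Hyper-Bins only.
import mmap
from collections import defaultdict

def hyper_histogram(path, dims, p):
    """dims: bands per data point; Hyper-Bin side size is 2**p."""
    hist = defaultdict(int)                 # sparse: only populated Hyper-Bins
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
        n = len(data) // dims               # total data-point count
        for i in range(n):
            sample = data[i * dims:(i + 1) * dims]     # one pixel (bytes)
            address = tuple(b >> p for b in sample)    # bit-shifted address
            hist[address] += 1              # O(1) average, O(n) for the pass
    return dict(hist)

# Hypothetical usage: hist = hyper_histogram("scene_bands.raw", dims=6, p=3)
```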
3.2. First Order Clustering

As a result of the Zero Order Clustering step, the list of populated Hyper-Bins and their cardinality is obtained. This list conforms a new dataset that contains information about the original dataset but whose size is orders of magnitude smaller on average, allowing a more complex algorithm to be applied. The main idea behind the First Order Clustering step is to find the modes present on the Hyper-histogram generated on the previous step, preferably preserving overlapped modes.

In this work, a novel algorithm to find the number and conformation of clusters in the dataset is presented.³ This algorithm is deterministic and unattended although it can be fine-tuned. It will find spherical, symmetrical clusters from the modes present in the Hyper-histogram. In the case of overlapped modes, this method will try to generate a group of clusters preserving information for a potential next step. When modes are isolated, good results are achieved, similar to a correctly initialized K-Means, but with less computer complexity. The proposed First Order Clustering algorithm assumes that the most populated Hyper-Bins are candidates to be cluster seeds, and that they conform a cluster with the Hyper-Bins that constitute their neighborhood, considering the high dimension number scenario.

A Hyper-Bin belongs to the same neighborhood as the seed Hyper-Bin if the distance between them is smaller than a constant value c, a natural number measured in Hyper-Bin units. To measure the distance, an appropriate function must be selected. There are many functions that comply with the distance definition. The following distance functions may be cited as examples: Euclidean, Manhattan, Chebyshev, Minkowski in general, Mahalanobis, Jaccard, Spearman and Hamming [35]. In this work, the Chebyshev distance was chosen since it is really fast to compute and reflects both the hyper-cube shape of the Hyper-Bins and the discrete nature of the method better. Equation 1 shows the Chebyshev distance definition between two Hyper-Bins addressed as X̄ and Ȳ multidimensional vectors respectively, where d is the dimension of the vectors.

D_{Ch}(\bar{X}, \bar{Y}) = \max_{i=1}^{d} |x_i - y_i|    (1)

³ At this point, any clustering method such as those described in [9, 29, 22, 14, 45, 39] could potentially take advantage of the reduced dataset and be used to find the clusters, allowing treating bigger datasets with less computer power. See section 6.

The next sections present a description and a proposal of implementation of the algorithm used to generate the clusters from the Hyper-histogram.
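Equation 1 translates directly into code; the small function below is just that definition applied to two Hyper-Bin addresses.

```python
# Equation 1 written out directly: the Chebyshev distance between two
# Hyper-Bin addresses is the largest per-dimension absolute difference.
def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

print(chebyshev((3, 7, 2), (5, 6, 2)))   # -> 2
```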
3.2.1. First Order Clustering algorithm description

This algorithm takes the Hyper-Bin with the biggest cardinality and assumes it is the seed of a cluster. Next, it searches all the populated Hyper-Bins that conform its neighborhood, i.e., those Hyper-Bins that are closer to the seed than the fixed distance c. Afterwards, the algorithm takes the next biggest cardinality Hyper-Bin that is more than 2c far from the previously chosen cluster seeds. This Hyper-Bin is considered a new cluster seed and its neighborhood is constructed.⁴ These steps are repeated until every Hyper-Bin is either inside a cluster or nearer than 2c from at least one cluster seed. Unclustered Hyper-Bins may either conform one-Hyper-Bin clusters by themselves or be included in the cluster whose seed is the closest to them, considering that the distance cannot be larger than 2c. The first solution preserves more information for a possible Second Order Clustering Step, whereas the second solution gathers a better final clustering. A step by step algorithm description is stated below, and Figure 4 shows a graphical example.

⁴ The 2c value is neither arbitrary nor constitutes a new parameter. It is derived from the fact that neighborhoods should not overlap. Should every Hyper-Bin in the cluster be c far at most from its seed, two seeds must be more than 2c far from each other to ensure that a given Hyper-Bin can be included in at most one neighborhood, according to the definition.

Figure 4: Graphical representation of the First Order Clustering algorithm. The Hyper-histogram is 2D projected for simplicity. Darker Hyper-Bins have higher cardinality.

i. From the Hyper-histogram, generate the List L containing all the Hyper-Bin addresses in descending cardinality order. Shown as State A in Figure 4.
ii. Initialize an empty List U of Hyper-Bin addresses discarded as cluster seed. Shown as State A in Figure 4.
iii. Initialize an empty List C of cluster seed Hyper-Bin addresses. Shown as State A in Figure 4.
iv. Initialize an empty Sparse Matrix M to keep the cluster assigned to every Hyper-Bin address. Shown as State A in Figure 4.
v. Remove the first Hyper-Bin address from the List L, assuming it is the new cluster seed, insert it into the List C and insert the Hyper-Bin in the Matrix M. Shown as States B and E in Figure 4.
vi. Traverse the List L. For every Hyper-Bin that is closer than the fixed distance c to the seed of the cluster found in Step v, remove its address from the List L and insert it into the Matrix M as a member of this cluster. Leave the rest of the Hyper-Bin addresses in the List L. Shown as States C and F in Figure 4.
vii. Check that the first Hyper-Bin of the List L is at least 2c distant from every Hyper-Bin of the List C. If this condition is not satisfied for at least one Hyper-Bin stored in the List C of cluster seeds, remove this first Hyper-Bin address from the List L and add it to the List U of discarded Hyper-Bin addresses. Shown as State D in Figure 4.
viii. Repeat Step vii with the new first element of the List L until the condition is satisfied for every element on the List C or the List L is empty.
ix. Repeat Steps v to viii until the List L is empty.
x. Traverse the List U removing every Hyper-Bin address and comparing it to every cluster seed address in the List C to find the nearest one, adding the Hyper-Bin to the Matrix M in the selected cluster. Shown as State G in Figure 4.

Steps i. to ix. conform the main clustering algorithm, whereas step x. conforms the optional post-clustering step. Figure 5 shows a flowchart representation of the algorithm with the optional post-clustering step highlighted in blue dotted line.

Figure 5: Flowchart of the First Order Clustering algorithm with the optional post-clustering step highlighted in blue dotted line.
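A simplified sketch of steps i. to x. follows. It operates on the Hyper-histogram dict produced by Zero Order Clustering and uses plain Python lists and dicts instead of the ordered binary tree, dynamic arrays and hash table of section 3.2.2, so it illustrates the logic rather than the article's implementation; the boundary choices (<= versus <) are assumptions.

```python
# Simplified sketch of steps i-x above, on a dict mapping Hyper-Bin address
# to cardinality. Plain lists and dicts replace the optimized structures of
# section 3.2.2; boundary comparisons are assumptions of this illustration.

def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def first_order_clustering(hist, c, post_cluster=True):
    L = sorted(hist, key=hist.get, reverse=True)   # i.   descending cardinality
    U = []                                         # ii.  discarded seed candidates
    C = []                                         # iii. cluster seed addresses
    M = {}                                         # iv.  address -> cluster index
    while L:
        head = L.pop(0)
        # vii./viii. a head closer than 2c to an existing seed is discarded.
        if any(chebyshev(head, seed) <= 2 * c for seed in C):
            U.append(head)
            continue
        cluster = len(C)                           # v.   head becomes a new seed
        C.append(head)
        M[head] = cluster
        # vi.  every remaining Hyper-Bin within distance c joins the cluster.
        for b in [b for b in L if chebyshev(b, head) <= c]:
            M[b] = cluster
        L = [b for b in L if b not in M]           # ix.  repeat until L is empty
    if post_cluster:                               # x.   optional post-clustering
        for b in U:
            M[b] = min(range(len(C)), key=lambda j: chebyshev(b, C[j]))
    return C, M, U

# Hypothetical tiny example:
# seeds, assignment, discarded = first_order_clustering(
#     {(0, 0): 9, (0, 1): 4, (1, 1): 3, (5, 5): 7}, c=1)
```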
3.2.2. First Order Clustering algorithm implementation

As in the previous phase, many considerations were taken to keep computational and space complexity as low as possible.

The List L is implemented as an ordered binary tree using cardinality as strict ordering criteria. In every node of the tree, a dynamic array of Hyper-Bin addresses with the associated cardinality is stored as data. The structure can be built with O(m.log(m)) complexity, where m is the number of non-zero cardinality Hyper-Bins from the Hyper-histogram resulting from Zero Order Clustering. The space complexity of the structure that implements the List L is O(m), and the Hyper-histogram may be destroyed as soon as the List L is built.

The List C is implemented as a dynamic array of seed Hyper-Bin addresses. This array has O(k) computational complexity, being k the number of clusters found. The space complexity is also O(k).

The List U is implemented as a dynamic array of discarded seed Hyper-Bin addresses. This array has O(m) computational and space complexity in the worst case.

The Sparse Matrix M is implemented as a hash table using Hyper-Bin addresses as the keys, with the same functions used for the Hyper-histogram in Zero Order Clustering, and the seed Hyper-Bin addresses as the data. This hash table has O(m) computational and space complexity.

The general algorithm computational complexity can be estimated as O(m) per round in the worst case, with one round per found cluster; thus, giving a computational complexity of O(m.k), being m the non-zero cardinality Hyper-Bin count and k the number of clusters found. The worst case is k = m. Therefore the final First Order Clustering worst computational complexity is O(m²).

3.2.3. Neighborhood radius c determination criteria

As described in section 3.2, First Order Clustering depends on a fixed distance c that is computed using the Chebyshev distance function. This distance represents the Chebyshev radius of the neighborhood, centered at the seed Hyper-Bin address and measured in Hyper-Bin units. Figure 6 shows 2D examples with different c values.

Figure 6: 2D examples of different values of Chebyshev radius distance c neighborhoods centered at the seed S.

Both Hyper-Bin side size and distance c will directly affect the size and the number of clusters to be found. Since First Order Clustering complexity is strongly dependent on the number of Hyper-Bins, and bigger Hyper-Bin side sizes tend to give smaller Hyper-Bin counts, it may be convenient to use a bigger Hyper-Bin side size and a smaller distance c value. Using this criterion for the first approach, the Hyper-Bin side size is determined using one of the criteria presented in section 3.1.1, and the distance c is fixed to value 1. Once this result is obtained, a fine-tuning may be performed to adjust the number and size of the obtained clusters. A theoretical base to automate this fine-tuning is described in the next section. Furthermore, an algorithm to automate this process is under research while writing the present article. See section 6.

3.3. Relation between Hyper-Bin side size and distance c

The number of clusters to find may be adjusted by fine-tuning Hyper-Bin side size and distance c. Similar clusters are obtained using smaller Hyper-Bin