Efficient Implementation of Self-Organizing Map for Sparse Input Data
Josué Melka and Jean-Jacques Mariage


1. Introduction
   - SOM Description
   - Standard vs. Batch Versions
   - Motivation for Sparse SOM
2. Contributions
3. Experiments


Presentation: The SOM Network

- Self-Organizing Map (Kohonen 1982): an artificial neural network trained by unsupervised competitive learning
- Produces a low-dimensional map of the input space
- Many applications: commonly used for data projection, clustering, etc.


The SOM Map

The map: units (= nodes/neurons) on a lattice.
Associated with each node:
- a weight vector
- a position in the map

http://www.lohninger.com


Example: MNIST Handwritten Dataset

[Figures: random sampling of MNIST digits; trained SOM map (12x16 units)]


The SOM Training Algorithm

Within the main loop:

1. Compute the distance between the input and each weight vector:

       d_k(t) = \|x(t) - w_k(t)\|^2    (1)

2. Find the node whose weight is closest to the input (the BMU):

       d_c(t) = \min_k d_k(t)    (2)

3. Update the BMU and its neighbors to be closer to the input.
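
As a minimal sketch of steps 1 and 2 in NumPy (illustrative only, not the authors' implementation; the array names are assumptions):

```python
import numpy as np

def find_bmu(W, x):
    """Steps 1-2: squared distances (1) to every unit, then the BMU index (2).

    W : (M, D) array, one weight vector per map unit
    x : (D,)   input vector
    """
    d = np.sum((W - x) ** 2, axis=1)   # d_k(t) = ||x(t) - w_k(t)||^2
    return int(np.argmin(d))           # c = argmin_k d_k(t)
```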


Standard Algorithm: The Learning Rule

Update the weight vectors at each time step, for a random sample x:

    w_k(t+1) = w_k(t) + \alpha(t) h_{ck}(t) [x(t) - w_k(t)]    (3)

- \alpha(t) is the decreasing learning rate
- h_{ck}(t) is the neighborhood function, for example the Gaussian:

    h_{ck}(t) = \exp\left( - \frac{\|r_k - r_c\|^2}{2\sigma(t)^2} \right)

[Figure: Gaussian neighborhood]
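
A minimal sketch of rule (3) with the Gaussian neighborhood, again in NumPy and purely illustrative; `grid` (the map coordinates r_k) and the function name are assumptions:

```python
import numpy as np

def online_update(W, grid, x, alpha, sigma):
    """One step of the standard rule (3): pull the BMU and its neighbors toward x.

    W     : (M, D) weight vectors, updated in place
    grid  : (M, 2) map coordinates r_k of the units
    alpha : learning rate alpha(t)
    sigma : neighborhood radius sigma(t)
    """
    c = np.argmin(np.sum((W - x) ** 2, axis=1))   # BMU, as in find_bmu above
    # Gaussian neighborhood h_ck(t) = exp(-||r_k - r_c||^2 / (2 sigma^2))
    h = np.exp(-np.sum((grid - grid[c]) ** 2, axis=1) / (2.0 * sigma ** 2))
    W += alpha * h[:, None] * (x - W)             # w_k += alpha * h_ck * (x - w_k)
```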


The Batch Algorithm

The BMUs and their neighbors are updated once at the end of each epoch, with the average of all the samples that trigger them:

    w_k(t_f) = \frac{\sum_{t'=t_0}^{t_f} h_{ck}(t') \, x(t')}{\sum_{t'=t_0}^{t_f} h_{ck}(t')}    (4)
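
A minimal sketch of one epoch of rule (4), assuming the same dense NumPy setup as the previous sketches (names are illustrative):

```python
import numpy as np

def batch_epoch(W, grid, X, sigma):
    """One epoch of the batch rule (4): each unit becomes the neighborhood-weighted
    average of the samples that activate it or its neighbors."""
    M, D = W.shape
    num = np.zeros((M, D))    # numerator:   sum_t' h_ck(t') x(t')
    den = np.zeros(M)         # denominator: sum_t' h_ck(t')
    for x in X:
        c = np.argmin(np.sum((W - x) ** 2, axis=1))   # BMU of this sample
        h = np.exp(-np.sum((grid - grid[c]) ** 2, axis=1) / (2.0 * sigma ** 2))
        num += h[:, None] * x
        den += h
    touched = den > 0
    W[touched] = num[touched] / den[touched][:, None]  # apply (4) once, at epoch end
    return W
```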


Motivation: Reduce the Computing Time

Computing time depends on:

1. the T training iterations
2. the M nodes in the network
3. the D dimensions of the vectors

The issue: large sparse datasets
Many real-world datasets are sparse and high-dimensional, but existing SOM implementations cannot exploit sparseness to save time.


Datasets Examples: Dense vs. Sparse

MNIST dataset (dim: 780, density: 19.22%)
News-20 dataset (dim: 62061, density: 0.13%)


Overcoming the Dimensionality Problem

A popular option: reduce the model space
Use space-reduction techniques that are less sensitive to dimensionality (such as SVD, Random Mapping, etc.).

But is this the only way?
Let f be the fraction of non-zero values, and d = D × f.
Can we reduce the SOM complexity from O(TMD) to O(TMd)?
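
As a rough worked example with the News-20 figures above (D = 62061, density f ≈ 0.13%):

    d = D \times f \approx 62061 \times 0.0013 \approx 81,
    \qquad
    \frac{O(TMD)}{O(TMd)} = \frac{D}{d} \approx 770

so exploiting sparsity could shrink the dimension-dependent work by nearly three orders of magnitude, provided every step of the algorithm can be made to run over the non-zeros only.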


1. Introduction
2. Contributions
   - Sparse Algorithms
   - Parallelism
3. Experiments


Compressed Sparse Rows Format

https://op2.github.io/PyOP2/linear_algebra.html

Sufficient to reduce the computing time?
CSR supports efficient linear algebra operations. But:
- node weights must be dense for training
- not all operations produce sparse output
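
A small illustration with scipy.sparse (assumed tooling for this sketch, not necessarily what the authors use): CSR stores only the non-zeros, and the sparse-dense product needed later for the BMU search is cheap, but the weight matrix itself stays dense.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy dataset in CSR format: only non-zero values are kept, with their column
# indices and the per-row extents (indptr).
X = csr_matrix(np.array([[0., 3., 0., 0., 1.],
                         [2., 0., 0., 0., 0.]]))
print(X.data)      # [3. 1. 2.]
print(X.indices)   # [1 4 0]
print(X.indptr)    # [0 2 3]

W = np.random.rand(6, 5)   # node weights: must stay dense during training

# Sparse-dense product used in the BMU search: cost scales with nnz(x), not D.
dots = X @ W.T             # dense (n_samples, n_units) result
print(dots.shape)          # (2, 6)
```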


Speeding Up the BMU Search

The squared Euclidean distance (1) is equivalent to:

    d_k(t) = \|x(t)\|^2 - 2 (w_k(t) \cdot x(t)) + \|w_k(t)\|^2    (5)

Batch version
This change suffices to make the sparse batch SOM efficient, since \|w_k\|^2 can be computed once per epoch.

Standard version
Writing the learning rule (3) as w_k(t+1) = (1 - \beta(t)) w_k(t) + \beta(t) x(t), with \beta(t) = \alpha(t) h_{ck}(t): by storing \|w_k(t)\|^2, we can compute \|w_k(t+1)\|^2 efficiently:

    \|w_k(t+1)\|^2 = (1 - \beta(t))^2 \|w_k(t)\|^2 + \beta(t)^2 \|x(t)\|^2 + 2 \beta(t) (1 - \beta(t)) (w_k(t) \cdot x(t))    (6)
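
A sketch of how (5) and (6) might be used with a sparse input, caching the squared weight norms; the function names and the (values, indices) representation of x are assumptions:

```python
import numpy as np

def bmu_sparse(W, w_sq, x_data, x_idx):
    """BMU search via decomposition (5), with ||w_k||^2 cached in w_sq.

    W      : (M, D) dense weight matrix
    w_sq   : (M,)   cached squared norms ||w_k||^2
    x_data : non-zero values of the sparse input x
    x_idx  : their column indices
    """
    dots = W[:, x_idx] @ x_data                        # w_k . x, over nnz(x) only
    d = float(x_data @ x_data) - 2.0 * dots + w_sq     # equation (5)
    return int(np.argmin(d)), dots

def updated_norm(w_sq_k, x_sq, dot_k, beta):
    """Equation (6): refresh ||w_k||^2 after the online update, beta = alpha * h_ck."""
    return ((1.0 - beta) ** 2 * w_sq_k
            + beta ** 2 * x_sq
            + 2.0 * beta * (1.0 - beta) * dot_k)
```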


Sparse-Som: Speeding Up the Update Phase

We express the learning rule (3) as (Natarajan 1997):

    w_k(t+1) = (1 - \beta(t)) \left[ w_k(t) + \frac{\beta(t)}{1 - \beta(t)} x(t) \right]    (7)

Don't update entire weight vectors
We keep the scalar coefficient separately, so we update only the components affected by x(t).

Numerical stability
To avoid numerical-stability issues, we use double-precision floating point and rescale the weights when needed.
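
A sketch of the factored update (7): storing each weight vector as scale × vec makes the decay a single scalar multiplication, so only the non-zero components of x are touched. The class name and rescaling threshold are illustrative, not taken from the authors' code:

```python
import numpy as np

class ScaledWeight:
    """Weight vector stored as scale * vec, so rule (7) costs O(nnz(x)) per update."""

    def __init__(self, dim, rng):
        self.scale = 1.0
        self.vec = rng.random(dim)   # double precision, as noted on the slide

    def update(self, x_data, x_idx, beta):
        assert 0.0 < beta < 1.0, "rule (7) requires beta in (0, 1)"
        # w <- (1 - beta) * [w + beta / (1 - beta) * x], applied lazily:
        self.scale *= (1.0 - beta)                        # decay: one scalar operation
        self.vec[x_idx] += (beta / self.scale) * x_data   # touch only nnz(x) entries
        if self.scale < 1e-100:                           # rescale to stay stable
            self.vec *= self.scale
            self.scale = 1.0

    def dense(self):
        return self.scale * self.vec
```

Rescaling keeps `scale` away from underflow; the slide notes that double precision is used for the same reason.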


Parallel Approaches for SOM

- Specialized hardware (historical)
- Massively parallel computing
- Shared-memory multiprocessing

http://www.nersc.gov/users/computational-systems/cori


How to Split the Computation

- Network partitioning divides the neurons
- Data partitioning dispatches the input data


What We Did

Sparse-Som: hard to parallelize
- not adapted to data partitioning
- too much latency with network partitioning

Sparse-BSom: much simpler
- adapted to both data and network partitioning
- less synchronization needed

Another issue specific to sparseness
Memory-access latency, due to the non-linear access pattern to the weight vectors.
Mitigation: improve processor cache locality by accessing the weight vectors in the inner loop.


1. Introduction
2. Contributions
3. Experiments
   - Evaluation
   - Speed benchmark
   - Quality test


The Evaluation

We trained SOM networks on various datasets with the same parameters, running each test 5 times, then measured their performance (speed and quality).

Our speed baseline
Somoclu (Wittek et al. 2017) is a massively parallel batch SOM implementation that uses the classical (dense) algorithm.


Datasets Characteristics

dataset    classes  features  samples (train / test)  density % (train / test)
rcv1          53      47236       15564 / 518571           0.14 /   0.14
news20        20      62061       15933 /   3993           0.13 /   0.13
sector       105      55197        6412 /   3207           0.29 /   0.30
mnist         10        780       60000 /  10000          19.22 /  19.37
usps          10        256        7291 /   2007         100.00 / 100.00
letter        26         16       15000 /   5000         100.00 / 100.00
protein        3        357       17766 /   6621          29.00 /  26.06
dna            3        180        2000 /   1186          25.34 /  25.14
satimage       6         36        4435 /   2000          98.99 /  98.96


Training Parameters

Each network:
- 30 × 40 units grid, rectangular lattice
- t_max = 10 × N_samples (equivalently, K_epochs = 10)
- α(t) = 1 − (t / t_max)
- Gaussian neighborhood, with a radius decreasing linearly from 15 to 0.5


Speed Benchmark

4 datasets used (2 sparse and 2 dense), to test both:
- Serial mode (Sparse-Som vs. Sparse-BSom)
- Parallel mode (Sparse-BSom vs. Somoclu)

Hardware and system specifications:
- Intel Xeon E5-4610: 4 sockets of 6 cores each, clocked at 2.4 GHz, 2 threads per core
- Linux Ubuntu 16.04 (64-bit), GCC 5.4


Results: Serial Performance

[Bar chart: elapsed time in seconds for Sparse-Som vs. Sparse-BSom on usps, mnist, sector, news20 and rcv1.]


Results: Parallel Performance

[Log-scale plots: elapsed time in seconds vs. number of CPU cores (1 to 32), for Somoclu and Sparse-BSom, on sector, news20, mnist and usps.]


Quality Evaluation: Methodology

Metrics used:

1. Average Quantization Error:

       Q = \frac{1}{N} \sum_{i=1}^{N} \|x_i - w_c\|    (8)

2. Topological Error

3. Precision and Recall in classification
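
A sketch of how metric (8) and the topological error could be computed; the topological-error definition used here (fraction of samples whose two closest units are not adjacent on the lattice) is the usual one and is assumed, not stated on the slide:

```python
import numpy as np

def quantization_error(W, X):
    """Average quantization error (8): mean distance from each sample to its BMU."""
    q = 0.0
    for x in X:
        q += np.sqrt(np.sum((W - x) ** 2, axis=1)).min()
    return q / len(X)

def topological_error(W, X, grid):
    """Fraction of samples whose two best-matching units are not map neighbors."""
    errors = 0
    for x in X:
        d = np.sum((W - x) ** 2, axis=1)
        c1, c2 = np.argsort(d)[:2]                    # first and second BMU
        if np.max(np.abs(grid[c1] - grid[c2])) > 1:   # not adjacent (8-neighborhood)
            errors += 1
    return errors / len(X)
```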


Results: Average Quantization Error

dataset     Sparse-Som       Sparse-BSom
rcv1        0.825 ± 0.001    0.816 ± 0.001
news20      0.905 ± 0.000    0.901 ± 0.001
sector      0.814 ± 0.001    0.772 ± 0.003
mnist       4.400 ± 0.001    4.500 ± 0.008
usps        3.333 ± 0.002    3.086 ± 0.006
protein     2.451 ± 0.000    2.450 ± 0.001
dna         4.452 ± 0.006    3.267 ± 0.042
satimage    0.439 ± 0.001    0.377 ± 0.001
letter      0.357 ± 0.001    0.345 ± 0.002


Results: Topological Error

dataset     Sparse-Som       Sparse-BSom
rcv1        0.248 ± 0.007    0.353 ± 0.010
news20      0.456 ± 0.020    0.604 ± 0.014
sector      0.212 ± 0.012    0.514 ± 0.017
mnist       0.369 ± 0.005    0.268 ± 0.003
usps        0.150 ± 0.006    0.281 ± 0.011
protein     0.505 ± 0.006    0.448 ± 0.007
dna         0.099 ± 0.004    0.278 ± 0.023
satimage    0.103 ± 0.005    0.239 ± 0.015
letter      0.160 ± 0.003    0.269 ± 0.008


Results: Prediction Evaluation

(two rows per dataset: training split, then test split)

                    Sparse-Som                  Sparse-BSom
                    precision    recall         precision    recall
rcv1      (train)   79.2 ± 0.5   79.3 ± 0.6     81.3 ± 0.4   82.1 ± 0.3
          (test)    73.7 ± 0.4   70.6 ± 0.5     76.6 ± 0.4   72.6 ± 0.5
news20    (train)   64.2 ± 0.5   62.8 ± 0.5     50.3 ± 0.9   49.6 ± 0.8
          (test)    60.0 ± 1.7   55.4 ± 1.3     47.8 ± 1.2   43.6 ± 1.2
sector    (train)   77.2 ± 0.9   73.2 ± 0.9     58.4 ± 0.5   56.0 ± 1.0
          (test)    73.3 ± 0.8   61.3 ± 1.8     60.9 ± 1.3   44.8 ± 1.0
mnist     (train)   93.5 ± 0.2   93.5 ± 0.2     91.5 ± 0.2   91.5 ± 0.2
          (test)    93.4 ± 0.2   93.4 ± 0.2     91.7 ± 0.2   91.7 ± 0.2
usps      (train)   95.9 ± 0.2   95.9 ± 0.2     95.6 ± 0.2   95.6 ± 0.2
          (test)    91.4 ± 0.3   90.7 ± 0.3     92.4 ± 0.5   91.5 ± 0.4
protein   (train)   56.7 ± 0.2   57.5 ± 0.2     56.7 ± 0.4   57.6 ± 0.3
          (test)    49.8 ± 0.7   51.2 ± 0.6     50.7 ± 0.7   52.1 ± 0.6
dna       (train)   90.9 ± 0.6   90.8 ± 0.5     88.5 ± 0.6   88.5 ± 0.5
          (test)    77.7 ± 1.5   69.6 ± 2.1     81.9 ± 2.9   30.3 ± 1.7
satimage  (train)   92.3 ± 0.4   92.4 ± 0.3     92.5 ± 0.4   92.6 ± 0.4
          (test)    87.6 ± 0.3   85.4 ± 0.4     88.7 ± 0.5   86.3 ± 0.5
letter    (train)   83.8 ± 0.3   83.7 ± 0.3     81.9 ± 0.3   81.7 ± 0.4
          (test)    81.5 ± 0.5   81.1 ± 0.5     80.2 ± 0.3   79.8 ± 0.5


Summary: Main Benefits

Sparse-Som and Sparse-BSom run much faster than their classical "dense" counterparts on sparse data.

Advantages of each version:
- Sparse-Som: the maps seem to have a better organization
- Sparse-BSom: highly parallelizable, more memory-efficient (single precision)


Thank you!
