Efficient prototype selection supported by subspace partitions

Dr. Joel Luís Carbonera

IBM Research - Rio de Janeiro – Brazil

Dr. Mara Abel

UFRGS - Porto Alegre – Brazil

Introduction

Prototype selection - Overview

Classification based on machine learning – traditional pipeline:

• Learning: Training dataset (labeled instances) → Learning → Classifier
• Classification: Unlabeled instances + Classifier → Classification → Classified instances

© 2017 International Business Machines Corporation


Prototype selection - Overview

Classification based on machine learning – pipeline including prototype selection:

• Learning: Training dataset (labeled instances) → Prototype selection → Selected prototypes → Learning → Classifier
• Classification: Unlabeled instances + Classifier → Classification → Classified instances

• Prototype selection:
  – Selects a smaller set of prototypes from the total available data that can still support a machine learning application with similar performance.
  – Prototypes can be synthetic instances.
• Trade-off:
  – Reduction rate vs. classification quality.



Available approaches
• Preserving the boundaries between different classes in the dataset.
  – Border instances provide relevant information for distinguishing different classes.

[Figure: original dataset → prototype selection → selected dataset]



Drawbacks of preserving the boundaries
• Identifying the boundaries has quadratic time complexity:
  – It is necessary to compare each pair of instances in the dataset.
  – This cost can be too high for large datasets.



PSSP algorithm

Main assumptions
• We can extract a synthetic prototype from a given set of instances that are similar to each other.
• The resulting prototype is able to abstract the information represented by the original set of instances.

[Figure: instances of the original dataset and the prototype extracted from them]



General strategy
• P ← ∅
• For each label L in the dataset:
  – S ← find the sets of objects with label L that are most similar to each other.
  – For each set s_i ∈ S:
    • prot_i ← extract the prototype that abstracts the instances in s_i.
    • Include prot_i in P.
• Return P
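The strategy above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: `group_similar` is a hypothetical callable standing in for the grouping step, and the prototype of a group is assumed to be its per-dimension mean.

```python
from collections import defaultdict
from statistics import fmean


def centroid(points):
    """Per-dimension mean of a non-empty list of equal-length tuples."""
    return tuple(fmean(dim) for dim in zip(*points))


def select_prototypes(instances, labels, group_similar):
    """Return a list of (prototype, label) pairs.

    group_similar is a hypothetical callable (an assumption of this sketch):
    it receives the instances of one label and returns a list of groups of
    mutually similar instances.
    """
    by_label = defaultdict(list)
    for x, y in zip(instances, labels):
        by_label[y].append(x)

    prototypes = []  # P <- empty set
    for label, members in by_label.items():
        for group in group_similar(members):
            # One synthetic prototype abstracts each group.
            prototypes.append((centroid(group), label))
    return prototypes
```

Passing `lambda members: [members]` as `group_similar` treats each class as a single group and therefore yields one prototype per label.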


Subspace partition
• Considering:
  – A subspace SS of the whole dataset D (where SS ⊆ D), selected by arbitrary criteria.
  – Each dimension d_i ∈ SS can be split into a set of intervals.
• A subspace partition is a set of intervals, one for each dimension d_i ∈ SS, defining a specific multidimensional region of the subspace.

[Figure: representation of a dataset with 28 data objects in a 2D space, with 5 subspace partitions.]
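To make the definition concrete, the sketch below maps a point to the subspace partition that contains it, identified by one interval index per dimension of SS. It assumes equal-width intervals between the minimum and maximum of each dimension; the slides do not state how interval bounds are chosen, and the function and parameter names are illustrative.

```python
def partition_key(point, subspace, mins, maxs, n_intervals):
    """Return the tuple of interval indices that locates `point` in the subspace.

    subspace    : indices of the dimensions that form the subspace SS
    mins, maxs  : per-dimension minimum and maximum over the dataset
    n_intervals : number of equal-width intervals per dimension (assumption)
    """
    key = []
    for d in subspace:
        width = (maxs[d] - mins[d]) / n_intervals
        if width == 0:
            # Constant dimension: every point falls in interval 0.
            key.append(0)
            continue
        idx = int((point[d] - mins[d]) / width)
        # The maximum value is placed in the last interval.
        key.append(min(idx, n_intervals - 1))
    return tuple(key)
```

Two points belong to the same subspace partition exactly when they produce the same key.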


PSSP algorithm



• Defines a set of homogeneous subspace partitions (SP) that is enough to produce the desired number of prototypes.
• Finds the objects contained in each partition, using spatial hashing.
• Extracts the centroid of the set of instances in each partition.
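The three steps can be illustrated with a simplified sketch. It is not the published PSSP procedure: it assumes equal-width intervals, keeps partitions homogeneous by hashing on the pair (label, partition), and returns one centroid for every non-empty bucket instead of choosing a subset of partitions to match a target number of prototypes. All names are illustrative.

```python
from collections import defaultdict


def pssp_sketch(instances, labels, subspace, n_intervals):
    """Simplified sketch: one centroid per non-empty (label, partition) bucket."""
    dims = range(len(instances[0]))
    mins = [min(x[d] for x in instances) for d in dims]
    maxs = [max(x[d] for x in instances) for d in dims]

    def key(x):
        out = []
        for d in subspace:
            span = maxs[d] - mins[d]
            idx = int((x[d] - mins[d]) / span * n_intervals) if span else 0
            out.append(min(idx, n_intervals - 1))
        return tuple(out)

    # Spatial hashing: a single pass places each instance in its bucket.
    buckets = defaultdict(list)
    for x, y in zip(instances, labels):
        buckets[(y, key(x))].append(x)

    # Each bucket is summarized by its centroid (per-dimension mean).
    return [
        (tuple(sum(col) / len(pts) for col in zip(*pts)), label)
        for (label, _), pts in buckets.items()
    ]
```

Because the buckets are filled in a single pass over the data and each instance contributes to exactly one centroid, the work grows linearly with the number of instances.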


Complexity of PSSP
• Each step of the algorithm has, at most, a time complexity that is linear in the number of instances.
  – Therefore, the overall time complexity of PSSP is linear in the number of instances.



Experiments

Experimental Settings
• Algorithms: DROP3, ENN, ICF, LSBo, LSSm, LDIS
• 11 well-known datasets (with numerical dimensions).

[Table: datasets used in the experiments]



Experimental Settings
• Evaluation measures:
  – $accuracy = \frac{Success(Test)}{|Test|}$
  – $reduction = \frac{|T| - |S|}{|T|}$
• Where:
  – $Test$ is the set of instances that are tested in a classification task and $|Test|$ is the cardinality of this set.
  – $Success(Test)$ is the number of instances in $Test$ correctly classified in the classification task.
  – $|T|$ is the cardinality of the set of instances of the whole training dataset and $|S|$ is the cardinality of the set of instances selected by a given instance selection algorithm.
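The two measures translate directly into code. The sketch below is only an illustration of the definitions above; the function names are assumptions.

```python
def accuracy(predicted, expected):
    """Success(Test) / |Test|: fraction of test instances classified correctly."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("predicted and expected must be non-empty and equally long")
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)


def reduction(n_training, n_selected):
    """(|T| - |S|) / |T|: fraction of the training set that was discarded."""
    if n_training <= 0:
        raise ValueError("the training set must not be empty")
    return (n_training - n_selected) / n_training
```

For example, selecting 20 prototypes from a training set of 200 instances gives a reduction of 0.9.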



Experimental Settings
• k = 3 for DROP3, ENN and LDIS.
• D = 1 and p = 0.1 for PSSP.
• Distance function: Euclidean distance.
• The classification accuracy was evaluated using the KNN algorithm, adopting k = 3.
• The 2 measures were evaluated in a 10-fold cross-validation process.
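For reference, a KNN classifier with k = 3 and Euclidean distance can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the code used in the experiments; in the setting above, `train_x` and `train_y` would hold the selected prototypes and their labels.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, query, k=3):
    """Majority label among the k training points nearest to `query`."""
    # math.dist computes the Euclidean distance between two points.
    nearest = sorted(zip(train_x, train_y), key=lambda xy: dist(xy[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Sorting all training points costs O(n log n) per query, which is acceptable here because prototype selection keeps the training set small.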



Average accuracy



Average reduction



Computational efficiency
• Synthetic dataset:
  – 20k 2D random points,
  – uniformly distributed in a square area,
  – with 4 different classes (one class in each quadrant of the square area), where each class has 5k points.
• Measure:
  – The time (in milliseconds) taken by each algorithm to perform the prototype selection task on the synthetic dataset.
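A dataset with these properties, and the timing measure, could be produced along the following lines. The side length of the square, the random seed and the helper names are assumptions of this sketch; the slides only fix the number of points, the dimensionality and the class layout.

```python
import random
import time


def make_quadrant_dataset(points_per_class=5000, half_side=1.0, seed=0):
    """Four classes, one per quadrant of a square centred at the origin."""
    rng = random.Random(seed)
    signs = [(1, 1), (-1, 1), (-1, -1), (1, -1)]  # quadrants I-IV -> classes 0-3
    points, labels = [], []
    for label, (sx, sy) in enumerate(signs):
        for _ in range(points_per_class):
            x = sx * rng.uniform(0, half_side)
            y = sy * rng.uniform(0, half_side)
            points.append((x, y))
            labels.append(label)
    return points, labels


def time_ms(fn, *args):
    """Wall-clock time of a single call to fn(*args), in milliseconds."""
    start = time.perf_counter()
    fn(*args)
    return (time.perf_counter() - start) * 1000.0
```

With the default arguments the generator returns 20,000 points, 5,000 per class. A single timed call is noisy, so repeating the measurement and averaging would give steadier numbers.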


Computational efficiency



Conclusion

• PSSP allows the user to choose the size of the resulting reduced dataset.
  – This is not a typical feature in prototype selection algorithms.
• PSSP has a time complexity that is linear in the number of objects.
• The experiments show that PSSP has:
  – The best reduction rates.
  – A reasonable balance between accuracy and reduction.
  – The lowest running time.
• PSSP can be applied only to numerical datasets.
• PSSP is a promising algorithm for dealing with big volumes of data, in scenarios where the running time is critical.


Future work
• Investigate other ways of applying the notion of subspace partition for prototype selection.



Thank you!

Appendix

Other algorithms

partitioning
• k is the maximum number of prototypes.
• We consider an equal number of intervals for every dimension:

  $intervals^{|SS|} \le k$
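One way to satisfy this constraint is to take the largest number of intervals per dimension whose |SS|-th power does not exceed k. The sketch below shows that choice; whether PSSP picks exactly this value is an assumption, and the function name is illustrative.

```python
def intervals_per_dimension(k, n_dims):
    """Largest integer n >= 1 such that n ** n_dims <= k."""
    if k < 1 or n_dims < 1:
        raise ValueError("k and n_dims must be positive")
    n = 1
    # Integer arithmetic avoids the rounding errors of k ** (1 / n_dims).
    while (n + 1) ** n_dims <= k:
        n += 1
    return n
```

As an illustration, with k = 2000 a one-dimensional subspace allows 2000 intervals, while a two-dimensional subspace allows 44 intervals per dimension (44² = 1936 ≤ 2000 < 45² = 2025).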



extractsPrototype



Effects of the parameters D and p

• We evaluated the effects of the parameters D and p on the performance of PSSP.
  – D defines the number of dimensions of the subspace.
  – p defines the number of prototypes (as a percentage of the total number of objects).
• Datasets:
  – The 20 datasets considered in the first experiment.
• Evaluated measures:
  – Accuracy
  – Reduction
  – Effectiveness
• Values of D:
  – 1, 2, 3 and 4.
• Values of p:
  – 0.1, 0.2 and 0.3.



Average accuracy



Average reduction



10-fold cross-validation process

• Repeated 10 times, with each subsample used once as Test.

[Figure: flow of the 10-fold cross-validation process. The initial dataset is partitioned into 10 equal-sized subsamples; in each round, one subsample is selected as Test and the remaining ones form the initial training set; the IS algorithm is applied to the initial training set to obtain the reduced training set; the reduction is calculated; the KNN algorithm is applied to Test using the reduced training set; the accuracy is calculated; finally, the average accuracy, average reduction and average effectiveness are calculated.]
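The evaluation loop can be sketched as follows. The fold construction (shuffle, then deal indices round-robin) and the `evaluate` callback are assumptions of this sketch: `evaluate` stands for "apply the IS algorithm to the training part, classify the test part with KNN, and return the accuracy and the reduction".

```python
import random


def k_fold_indices(n, folds=10, seed=0):
    """Shuffle range(n) and split it into `folds` near-equal lists of test indices."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::folds] for i in range(folds)]


def cross_validate(n, evaluate, folds=10, seed=0):
    """Average the (accuracy, reduction) pairs returned by evaluate(train_idx, test_idx).

    evaluate is a hypothetical callback supplied by the caller.
    """
    splits = k_fold_indices(n, folds, seed)
    accs, reds = [], []
    for i, test_idx in enumerate(splits):
        # Every subsample except the i-th one forms the initial training set.
        train_idx = [j for m, fold in enumerate(splits) if m != i for j in fold]
        acc, red = evaluate(train_idx, test_idx)
        accs.append(acc)
        reds.append(red)
    return sum(accs) / folds, sum(reds) / folds
```

Each index appears in exactly one test fold, so every instance is used once for testing and nine times for training.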


