Efficient prototype selection supported by subspace partitions

Dr. Joel Luís Carbonera

IBM Research - Rio de Janeiro – Brazil [email protected]

Dr. Mara Abel

UFRGS - Porto Alegre – Brazil [email protected]

Introduction

Prototype selection - Overview
Classification based on machine learning – traditional pipeline
[Diagram: Training dataset (labeled instances) → Learning → Classifier → Classification of unlabeled instances → Classified instances]

© 2017 International Business Machines Corporation


Prototype selection - Overview
Classification based on machine learning – pipeline including prototype selection
[Diagram: Training dataset (labeled instances) → Prototype selection → Selected prototypes → Learning → Classifier → Classification of unlabeled instances → Classified instances]

Prototype selection:
– Selection of a smaller set of prototypes from the total available data that can support a machine learning application while keeping similar performance.
– Prototypes can be synthetic instances.

Trade-off:
– Reduction rate versus classification quality.


Available approaches
• Preserving the boundaries between different classes in the dataset.
– Border instances provide relevant information for distinguishing different classes.
[Figure: Original dataset → Prototype selection → Selected dataset]


Drawbacks of preserving the boundaries
• Identifying the boundaries has quadratic time complexity:
– It is necessary to compare each pair of instances in the dataset.
– The cost can be too high for large datasets.


PSSP algorithm

Main assumptions
• We can extract a synthetic prototype from a given set of instances that are similar to each other.
• The resulting prototype is able to abstract the information represented by the original set of instances.
[Figure: Instances of the original dataset → Prototype]


General strategy
• P ← ∅
• For each label L in the dataset:
– S ← find the sets of objects with label L that are most similar to each other.
– For each set s_i ∈ S:
  • prot_i ← extract the prototype that abstracts the instances in s_i.
  • Include prot_i in P.
• Return P
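The general strategy can be sketched in Python as follows. This is only an illustrative skeleton under stated assumptions: the names `select_prototypes`, `one_group_per_label` and `centroid` are hypothetical, the grouping step is left as a pluggable function, and the placeholder grouping used in the example simply treats every class as a single group.

```python
from collections import defaultdict

def select_prototypes(instances, labels, find_groups, extract_prototype):
    """Group same-label instances, then abstract each group into one prototype."""
    by_label = defaultdict(list)
    for x, y in zip(instances, labels):
        by_label[y].append(x)

    prototypes = []  # P <- empty set
    for label, members in by_label.items():
        for group in find_groups(members):  # S: sets of similar instances
            prototypes.append((extract_prototype(group), label))
    return prototypes

def centroid(group):
    """Coordinate-wise mean; one possible way of abstracting a group."""
    n = len(group)
    return tuple(sum(col) / n for col in zip(*group))

def one_group_per_label(members):
    """Placeholder grouping (hypothetical): all members form a single group."""
    return [members]

X = [(0.0, 0.0), (2.0, 2.0), (10.0, 10.0)]
y = ["a", "a", "b"]
P = select_prototypes(X, y, one_group_per_label, centroid)
```

Keeping the grouping step as a parameter makes it possible to plug in different notions of "similar to each other" without touching the outer loop.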


Subspace partition
• Considering:
– A subspace SS of the whole dataset (where SS ⊆ D), selected by arbitrary criteria.
– A dimension d_i ∈ SS can be split into a set of intervals.
• A subspace partition is a set of intervals, one for each dimension d_i ∈ SS, defining a specific multidimensional region of the subspace.
[Figure: Representation of a dataset with 28 data objects in a 2D space, with 5 subspace partitions.]
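As a rough illustration of the definition, the sketch below identifies the subspace partition containing a point by a tuple of interval indexes, one per dimension of the subspace. It assumes equal-width intervals between the minimum and maximum value of each dimension, which the slides do not state; `partition_key` and its parameters are hypothetical names.

```python
def partition_key(point, subspace, mins, maxs, n_intervals):
    """Return the tuple of interval indexes locating `point` in the subspace."""
    key = []
    for d in subspace:
        width = (maxs[d] - mins[d]) / n_intervals
        if width == 0:
            idx = 0  # constant dimension: everything falls in one interval
        else:
            idx = int((point[d] - mins[d]) / width)
        # The maximum value would land one past the end, so clamp it
        # into the last interval.
        key.append(min(idx, n_intervals - 1))
    return tuple(key)

# 2D subspace over dimensions 0 and 1, each split into 5 intervals of width 2.
key = partition_key((3.0, 10.0), subspace=[0, 1],
                    mins=[0.0, 0.0], maxs=[10.0, 10.0], n_intervals=5)
```

Two points share a subspace partition exactly when their keys are equal.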



PSSP algorithm
• Defines a set of homogeneous subspace partitions (SP) that are enough for producing the desired number of prototypes.
• Finds the objects contained in each partition, using spatial hashing.
• Extracts the centroid of the set of instances in each partition.
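A minimal sketch of the hashing and centroid steps is given below. It is not the authors' implementation: it assumes equal-width intervals, and it obtains homogeneous buckets by including the class label in the hash key, which is one simple way to guarantee that a bucket never mixes classes. The function name `bucket_and_average` is hypothetical.

```python
from collections import defaultdict

def bucket_and_average(instances, labels, dims, mins, maxs, n_intervals):
    """Hash each instance into a (label, interval indexes) bucket, then average."""
    buckets = defaultdict(list)
    for x, y in zip(instances, labels):
        key = [y]  # label in the key keeps every bucket homogeneous
        for d in dims:
            span = maxs[d] - mins[d]
            idx = int((x[d] - mins[d]) / span * n_intervals) if span else 0
            key.append(min(idx, n_intervals - 1))
        buckets[tuple(key)].append(x)  # one dictionary access per instance

    prototypes = []
    for key, members in buckets.items():
        n = len(members)
        centre = tuple(sum(col) / n for col in zip(*members))
        prototypes.append((centre, key[0]))
    return prototypes

X = [(1.0, 5.0), (3.0, 7.0), (9.0, 1.0), (8.0, 3.0)]
y = ["a", "a", "b", "a"]
protos = bucket_and_average(X, y, dims=[0], mins=[0.0, 0.0],
                            maxs=[10.0, 10.0], n_intervals=2)
```

Because each instance is visited once and placed with a constant-time dictionary operation, the work grows in proportion to the number of instances.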

Complexity of PSSP
• Each step of the algorithm has, at most, a time complexity that is linear in the number of instances.
– Therefore, the time complexity of PSSP is linear in the number of instances.


Experiments

Experimental Settings
• Algorithms: DROP3, ENN, ICF, LSBo, LSSm, LDIS
• 11 well-known datasets (with numerical dimensions).


Experimental Settings
• Evaluation measures:
– accuracy = Success(Test) / |Test|
– reduction = (|T| − |S|) / |T|
• Where:
– Test is the set of instances that are tested in a classification task and |Test| is the cardinality of this set.
– Success(Test) is the number of instances in Test correctly classified in the classification task.
– |T| is the cardinality of the set of instances of the whole training dataset and |S| is the cardinality of the set of instances selected by a given instance selection algorithm.
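The two measures translate directly into code. The sketch below is illustrative only; the function names and argument names are hypothetical.

```python
def accuracy(predicted, actual):
    """Success(Test) / |Test|: fraction of test instances classified correctly."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

def reduction(training_size, selected_size):
    """(|T| - |S|) / |T|: fraction of the training set that was discarded."""
    return (training_size - selected_size) / training_size

acc = accuracy(["a", "b", "a", "a"], ["a", "b", "b", "a"])  # 3 of 4 correct
red = reduction(training_size=200, selected_size=20)        # 180 of 200 removed
```

A reduction close to 1 means that almost all training instances were discarded, so the two measures have to be read together.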


Experimental Settings
• k = 3 for DROP3, ENN and LDIS.
• D = 1 and p = 0.1 for PSSP.
• Euclidean distance.
• The classification accuracy was evaluated using the KNN algorithm, adopting k = 3.
• The 2 measures were evaluated in a 10-fold cross-validation process.
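For reference, a plain KNN classifier with Euclidean distance and k = 3 could look like the sketch below. It is a generic textbook version written with the standard library only, not the code used in the experiments; `knn_predict` is a hypothetical name, and ties between distances are broken by label order as a side effect of sorting.

```python
import math
from collections import Counter

def knn_predict(query, prototypes, labels, k=3):
    """Majority vote among the k prototypes closest to `query`."""
    dists = sorted(
        (math.dist(query, p), label) for p, label in zip(prototypes, labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

prototypes = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]
labels = ["a", "a", "a", "b", "b"]
near_origin = knn_predict((0.5, 0.5), prototypes, labels)
near_corner = knn_predict((5.5, 5.0), prototypes, labels)
```

The cost of each prediction grows with the number of stored prototypes, which is why reducing the training set matters for this classifier.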


Average accuracy


Average reduction


Computational efficiency
• Synthetic dataset:
– 20k 2D random points,
– uniformly distributed in a square area,
– with 4 different classes (one class in each quadrant of the square area), where each class has 5k points.
• Measure:
– The time (in milliseconds) taken by each algorithm to perform the prototype selection task on the synthetic dataset.
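A dataset with these properties can be generated along the following lines. This is a sketch under assumptions: the side length of the square, the random seed and the helper names `make_quadrant_dataset` and `time_ms` are all hypothetical, since the slides give only the point counts and the class layout.

```python
import random
import time

def make_quadrant_dataset(points_per_class=5000, half_side=1.0, seed=0):
    """Uniform 2D points in a square centred at the origin, one class per quadrant."""
    rng = random.Random(seed)
    quadrants = [(1, 1), (-1, 1), (-1, -1), (1, -1)]  # sign of x and y per class
    points, labels = [], []
    for label, (sx, sy) in enumerate(quadrants):
        for _ in range(points_per_class):
            x = sx * rng.uniform(0, half_side)
            y = sy * rng.uniform(0, half_side)
            points.append((x, y))
            labels.append(label)
    return points, labels

def time_ms(fn, *args):
    """Run fn(*args) once and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0

points, labels = make_quadrant_dataset()
```

Any prototype selection function can then be passed to `time_ms` together with `points` and `labels` to obtain a comparable timing.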


Computational efficiency


Conclusion
• PSSP allows the user to choose the size of the resulting reduced dataset.
– This is not a typical feature in prototype selection algorithms.
• PSSP has a time complexity that is linear in the number of objects.
• The experiments show that PSSP has:
– The best reduction rates.
– A reasonable balance between accuracy and reduction.
– The lowest running time.
• PSSP can be applied only to numerical datasets.
• PSSP is a promising algorithm for dealing with big volumes of data, in scenarios where the running time is critical.

Future work
• Investigate other ways of applying the notion of subspace partition for prototype selection.


Thank you!

Appendix

Other algorithms

partitioning
• k is the maximum number of prototypes.
• We consider an equal number of intervals for every dimension.
• intervals^|SS| ≤ k
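Reading the constraint as "the number of intervals per dimension, raised to the number of dimensions in the subspace, must not exceed k", the largest admissible number of intervals can be found with a small integer search. The sketch below rests on that reading of the slide; `intervals_per_dimension` is a hypothetical helper and it assumes k ≥ 1.

```python
def intervals_per_dimension(k, n_dims):
    """Largest integer i such that i ** n_dims <= k.

    An integer search is used instead of k ** (1 / n_dims) because the
    floating-point root can fall just below an exact integer.
    """
    i = 1
    while (i + 1) ** n_dims <= k:
        i += 1
    return i

per_dim_2d = intervals_per_dimension(100, 2)   # 10 * 10 = 100 partitions
per_dim_3d = intervals_per_dimension(1000, 3)  # 10 * 10 * 10 = 1000 partitions
per_dim_odd = intervals_per_dimension(50, 2)   # 7 * 7 = 49 <= 50 < 8 * 8
```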


extractsPrototype


Effects of the parameters D and p
• We evaluated the effects of the parameters D and p on the performance of PSSP.
– D defines the number of dimensions of the subspace.
– p defines the number of prototypes (as a percentage of the total number of objects).
• Datasets:
– The 20 datasets considered in the first experiment.
• Evaluated measures:
– Accuracy
– Reduction
– Effectiveness
• Values of D:
– 1, 2, 3 and 4.
• Values of p:
– 0.1, 0.2 and 0.3.


Average accuracy


Average reduction


10-fold cross-validation process
Repeated 10 times, with each subsample used once as Test.
[Diagram steps: selecting 10 equal-sized subsamples from the initial dataset; partitioning into the initial training set and Test; applying the IS algorithm to obtain the reduced training set; calculating the reduction; applying the KNN algorithm; calculating the accuracy; calculating the average accuracy, average reduction and average effectiveness.]
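The fold handling in this process can be sketched as follows. The code is a generic 10-fold loop, not the authors' experimental harness: `k_fold_indices`, `cross_validate` and the `evaluate` callback (which would apply the IS algorithm and KNN, then return one measure) are hypothetical names, and the shuffle seed is arbitrary.

```python
import random

def k_fold_indices(n, n_folds=10, seed=0):
    """Shuffle the indexes 0..n-1 and deal them into n_folds near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

def cross_validate(n, evaluate, n_folds=10):
    """Use each fold once as Test and average the measure returned by evaluate."""
    folds = k_fold_indices(n, n_folds)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / n_folds

folds = k_fold_indices(25)
# Toy callback that just reports the training-set size of each round.
mean_train_size = cross_validate(20, lambda train, test: len(train))
```

Running `cross_validate` once per measure, with a callback that returns accuracy or reduction for the round, gives the corresponding averages.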
