Clustering and Data Mining in R - Workshop Supplement

Clustering and Data Mining in R

Clustering and Data Mining in R Workshop Supplement

Thomas Girke

December 10, 2011

Clustering and Data Mining in R

Introduction Data Preprocessing Data Transformations Distance Methods Cluster Linkage Hierarchical Clustering Approaches Tree Cutting Non-Hierarchical Clustering K-Means Principal Component Analysis Multidimensional Scaling Biclustering Many Additional Techniques

Clustering and Data Mining in R Introduction

Outline Introduction Data Preprocessing Data Transformations Distance Methods Cluster Linkage Hierarchical Clustering Approaches Tree Cutting Non-Hierarchical Clustering K-Means Principal Component Analysis Multidimensional Scaling Biclustering Many Additional Techniques


What is Clustering?

I

Clustering is the classification of data objects into similarity groups (clusters) according to a defined distance measure.

I

It is used in many fields, such as machine learning, data mining, pattern recognition, image analysis, genomics, systems biology, etc.


Why Clustering and Data Mining in R?

I

Efficient data structures and functions for clustering.

I

Efficient environment for algorithm prototyping and benchmarking.

I

Comprehensive set of clustering and machine learning libraries.

I

Standard for data analysis in many areas.

Clustering and Data Mining in R Data Preprocessing


Clustering and Data Mining in R Data Preprocessing Data Transformations

Data Transformations I

Choice depends on data set!

Center & standardize 1. Center: subtract from each vector its mean 2. Standardize: devide by standard deviation

I

⇒ Mean = 0 and STDEV = 1 Center & scale with the scale() fuction 1. Center: subtract from each vector its mean 2. Scale: divide centered vector by their root mean square (rms) v u n u 1 X xi 2 xrms =t n−1 i=1

I I I

⇒ Mean = 0 and STDEV = 1 Log transformation Rank transformation: replace measured values by ranks No transformation

Clustering and Data Mining in R Data Preprocessing Distance Methods

Distance Methods I

List of most common ones!

Euclidean distance for two profiles X and Y v u n uX d(X , Y ) =t (xi − yi )2 i=1

Disadvantages: not scale invariant, not for negative correlations I

Maximum, Manhattan, Canberra, binary, Minowski, ...

I

Correlation-based distance: 1 − r I Pearson correlation coefficient (PCC) Pn Pn Pn n i=1 xi yi − i=1 xi i=1 yi r =q P Pn Pn Pn n ( i=1 xi2 − ( i=1 xi )2 )( i=1 yi2 − ( i=1 yi )2 ) I

Disadvantage: outlier sensitive Spearman correlation coefficient (SCC) Same calculation as PCC but with ranked values!

Clustering and Data Mining in R Data Preprocessing Cluster Linkage

Cluster Linkage Single Linkage

Complete Linkage

Average Linkage

Clustering and Data Mining in R Hierarchical Clustering



Hierarchical Clustering Steps

1. Identify clusters (items) with closest distance 2. Join them to new clusters 3. Compute distance between clusters (items) 4. Return to step 1


Hierarchical Clustering (a)

Agglomerative Approach

0.1

g1

g2

g3

(b)

g4

g5

g4

g5

0.4 0.1

g1

g2

(c)

g3 0.6

0.5 0.4 0.1

g1

g2

g3

g4

g5

Clustering and Data Mining in R Hierarchical Clustering Approaches

Hierarchical Clustering Approaches

1. Agglomerative approach (bottom-up) hclust() and agnes()

2. Divisive approach (top-down) diana()

Clustering and Data Mining in R Hierarchical Clustering Tree Cutting

Tree Cutting to Obtain Discrete Clusters

1. Node height in tree 2. Number of clusters 3. Search tree nodes by distance cutoff

Clustering and Data Mining in R Non-Hierarchical Clustering


Clustering and Data Mining in R Non-Hierarchical Clustering

Non-Hierarchical Clustering

Selected Examples

Clustering and Data Mining in R Non-Hierarchical Clustering K-Means

K-Means Clustering

1. Choose the number of k clusters 2. Randomly assign items to the k clusters 3. Calculate new centroid for each of the k clusters 4. Calculate the distance of all items to the k centroids 5. Assign items to closest centroid 6. Repeat until clusters assignments are stable

Clustering and Data Mining in R Non-Hierarchical Clustering K-Means

K-Means (a)

X

X X (b)

X

X X (c)

X X X

Clustering and Data Mining in R Non-Hierarchical Clustering Principal Component Analysis

Principal Component Analysis (PCA)

Principal components analysis (PCA) is a data reduction technique that allows to simplify multidimensional data sets to 2 or 3 dimensions for plotting purposes and visual variance analysis.


Basic PCA Steps I I

Center (and standardize) data First principal component axis I I

I

Accross centroid of data cloud Distance of each point to that line is minimized, so that it crosses the maximum variation of the data cloud

Second principal component axis I I

Orthogonal to first principal component Along maximum variation in the data

I

1st PCA axis becomes x-axis and 2nd PCA axis y-axis

I

Continue process until the necessary number of principal components is obtained


PCA on Two-Dimensional Data Set

2nd 2nd 1st

1st


Identifies the Amount of Variability between Components

Example Principal Component Proportion of Variance

1st 62%

2nd 34%

3rd 3%

Other rest

1st and 2nd principal components explain 96% of variance.

Clustering and Data Mining in R Non-Hierarchical Clustering Multidimensional Scaling

Multidimensional Scaling (MDS)

I

Alternative dimensionality reduction approach

I

Represents distances in 2D or 3D space

I

Starts from distance matrix (PCA uses data points)

Clustering and Data Mining in R Non-Hierarchical Clustering Biclustering

Biclustering Finds in matrix subgroups of rows and columns which are as similar as possible to each other and as different as possible to the remaining data points.

Unclustered

⇒

Clustered

Clustering and Data Mining in R Non-Hierarchical Clustering Many Additional Techniques

Remember: There Are Many Additional Techniques!

Continue with R manual section: ”Clustering and Data Mining”

Clustering and Data Mining in R - Workshop Supplement

Clustering and Data Mining in R - Workshop Supplement

Suggest Documents

Clustering and Data Mining in R - Introduction

Data Mining - Clustering

Mining Spatial Data via Clustering

Data Mining with R

Association Rule Mining with R - R and Data Mining

Data Mining with R:

Analysis of Clustering Techniques in Data Mining

Temporal Data Mining: clustering methods and algorithms

Data mining, Classification and Clustering with ...

Data Mining and Clustering Techniques

Clustering methods and algorithms in data mining

R and Data Mining - Google Sites

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION ...

FUZZY CLUSTERING ANALYSIS OF DATA MINING ... - ijicic

Chapter 4. Clustering - Data Mining TextBook

Clustering with obstacles for Geographical Data Mining

Chapter 4. Clustering - Data Mining TextBook

Lightweight Clustering Technique for Distributed Data Mining ...

Practical Graph Mining With R (Chapman & HallCRC Data Mining And ...

Mazda6 Workshop Manual Supplement

Lightweight Clustering Technique for Distributed Data Mining

educational data mining using r programming and r ...

Workshop on Ubiquitous Data Mining - Bad Request

Sixth International Workshop on Multimedia Data Mining