K-means Clustering of Proportional Data Using L1 Distance
Hisashi Kashima (IBM Research, Tokyo Research Laboratory)
Jianying Hu, Bonnie Ray, Moninder Singh (IBM Research, T.J. Watson Research Center)

We propose a new clustering method for proportional data under the L1 distance.

Outline:
• Review of K-means clustering
• K-means clustering of proportional data with the L1 distance
• Motivation for L1 proportional-data clustering: workforce management
• An efficient sequential optimization algorithm
• Experimental results on real-world data sets
Review of K-means clustering

K-means clustering partitions N data points x_1, ..., x_N into K groups and obtains K centers c_1, ..., c_K.

Iteration:
1. Assign each data point x_i to its closest cluster: j(i) = argmin_j d(x_i, c_j)
2. Estimate the j-th new centroid from C_j, the set of data points assigned to the j-th cluster: c_j = argmin_c Σ_{i ∈ C_j} d(x_i, c)

where d(·, ·) is a distance measure between two vectors.
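The loop above can be sketched with a pluggable distance and centroid step; the function and variable names below are illustrative, not from the paper:

```python
import numpy as np

def kmeans(X, K, distance, centroid, max_iter=100, seed=0):
    """Generic K-means sketch: `distance(X, c)` gives each point's distance to
    center c, and `centroid(points)` re-estimates one cluster's center."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Step 1: assign each data point to its closest center.
        D = np.stack([distance(X, c) for c in centers], axis=1)  # shape (N, K)
        labels = D.argmin(axis=1)
        # Step 2: re-estimate each centroid from its assigned points.
        new = np.stack([centroid(X[labels == k]) if np.any(labels == k) else centers[k]
                        for k in range(K)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

# For plain L2 K-means, the centroid step is simply the mean:
l2_distance = lambda X, c: np.linalg.norm(X - c, axis=1)
```

Only the two plugged-in functions change between L2 K-means and the constrained L1 variant discussed in these slides.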
K-means clustering of proportional data

K-means clustering of proportional data partitions N proportional data points into K groups and obtains K centers.

A proportional data point is an M-dimensional vector x = (x_1, ..., x_M) where
• each element is non-negative: x_m ≥ 0, and
• the elements sum to one: Σ_m x_m = 1.

A cluster center c must also satisfy the same constraints: c_m ≥ 0 and Σ_m c_m = 1.

[Figure: ten job-role distributions are clustered; the cluster centroids serve as templates.]
Why the L1 distance? Application to skill-allocation-based project clustering

We concentrate on the L1 distance as the distance measure.

Our motivation: workforce management. We want staffing templates for assigning skilled people to projects. A template indicates what percentage of the whole project time is charged to each job role, e.g. the "software installation" template = (consultant=0.1, engineer=0.9, architect=0.0).
• Templates are used for efficient assignment of appropriate people to appropriate projects.
• Templates are used as a basis for skill-demand forecasting.

The L1 distance from a template can be directly translated into a cost difference.
• It also allows different skills to be associated with different costs.
• The L1 distance is known to be robust to noise.
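As a toy illustration of the cost interpretation (all numbers below are hypothetical, not from the paper):

```python
# Hypothetical example: "software installation" template vs. an actual project.
template = {"consultant": 0.1, "engineer": 0.9, "architect": 0.0}
actual = {"consultant": 0.3, "engineer": 0.6, "architect": 0.1}

# L1 distance between the two role distributions.
l1 = sum(abs(template[r] - actual[r]) for r in template)  # 0.2 + 0.3 + 0.1 = 0.6

# Half the L1 distance is the fraction of effort that would have to be
# reassigned to match the template; scaling by project hours gives a
# cost-like quantity (weighting roles differently gives role-specific costs).
project_hours = 1000
misallocated_hours = 0.5 * l1 * project_hours  # 300 hours
```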
Challenge: using the L1 distance in K-means clustering of proportional data

For the L2 distance, the closed-form solution of the centroid step is the mean, regardless of whether the proportional constraints apply (the mean of points on the simplex stays on the simplex).

For the L1 distance:
• the dimension-wise median is the closed-form solution in the unconstrained case, but
• with the proportional constraints, no closed-form solution exists.

There are two fast approximations for constrained L1 K-means:
1. Use the mean (just as in L2 K-means).
2. Take the dimension-wise median, then normalize.

Challenge: can we find an efficient way to compute accurate solutions?
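The two fast approximations can be sketched as follows (a minimal NumPy sketch; the function names are mine):

```python
import numpy as np

def centroid_mean(points):
    # Approximation 1: the sample mean; for proportional rows it already
    # satisfies the constraints (non-negative, sums to one).
    return points.mean(axis=0)

def centroid_normalized_median(points):
    # Approximation 2: dimension-wise median, renormalized onto the simplex.
    m = np.median(points, axis=0)
    s = m.sum()
    return m / s if s > 0 else np.full(points.shape[1], 1.0 / points.shape[1])
```

Both are cheap, but neither minimizes the constrained L1 objective in general, which is what motivates the algorithm below.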
Algorithm: K-means clustering of proportional data

The algorithm partitions N data points into K groups and obtains K centers.

Iteration:
1. Assign each data point x_i to one of the K clusters.
2. For the j-th cluster, estimate a new centroid by
   c_j = argmin_c Σ_{i ∈ C_j} d(x_i, c)   s.t.   Σ_m c_m = 1 and c_m ≥ 0,
   where d is the L1 distance: d(x, c) = Σ_m |x_m − c_m|.
The key point is how to solve Step 2 efficiently

The optimization problem involved in Step 2 is
   min_c Σ_{i ∈ C_j} Σ_m |x_im − c_m|   s.t.   Σ_m c_m = 1 and c_m ≥ 0.

The equivalent linear programming problem has O(#data points × #dimensions) variables, so a generic LP solver is expensive. We want a more efficient method tailored to our problem that exploits the equality constraint explicitly.
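For reference, the LP reformulation mentioned above can be solved directly; a sketch using `scipy.optimize.linprog` (this is the expensive baseline that the tailored method is meant to avoid, and the function name is mine):

```python
import numpy as np
from scipy.optimize import linprog

def centroid_l1_lp(X):
    """Exact constrained L1 centroid via the standard LP reformulation:
    minimize sum_{i,m} t_im  s.t.  -t_im <= x_im - c_m <= t_im,
    sum_m c_m = 1, c_m >= 0.  Decision variables: [c (M), t (N*M)]."""
    N, M = X.shape
    cost = np.concatenate([np.zeros(M), np.ones(N * M)])
    C = np.tile(np.eye(M), (N, 1))      # row i*M+m picks out c_m
    T = np.eye(N * M)
    #  x_im - c_m <= t_im   ->  -c_m - t_im <= -x_im
    #  c_m - x_im <= t_im   ->   c_m - t_im <=  x_im
    A_ub = np.block([[-C, -T], [C, -T]])
    b_ub = np.concatenate([-X.ravel(), X.ravel()])
    A_eq = np.concatenate([np.ones(M), np.zeros(N * M)])[None, :]
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (M + N * M), method="highs")
    return res.x[:M]
```

The M + N·M variables and 2·N·M inequality rows make this impractical for large clusters, which is exactly the scaling problem the sequential method addresses.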
Our approach: sequential optimization with respect to two variables

Key observation: there is only one equality constraint, Σ_m c_m = 1 (with c_m ≥ 0).

We employ sequential optimization, borrowing the idea of the SMO algorithm for SVM training (a QP solver): pick two variables at a time and optimize with respect to only those two.

Iteration:
1. Find the pair of variables c_d and c_d' whose update improves the solution the most.
2. Optimize the objective function with respect to only these two, while keeping the equality constraint satisfied.
How to select the two variables?

Key observation: the objective function decomposes into piecewise-linear convex functions, each with only one parameter:
   Σ_m f_m(c_m),   where f_m(c_m) = Σ_{i ∈ C_j} |x_im − c_m| is piecewise linear in c_m alone.

If we decrease c_d, some other c_d' must be increased to keep Σ_m c_m = 1. The gain from decreasing c_d is the left gradient of f_d, and the cost of increasing c_d' is the right gradient of f_d'. We therefore find the pair (c_d, c_d') with the steepest combined gradient; with an efficient implementation (e.g. priority queues over the gradients), this pair is found in O(log M) time.
How much do we update the two variables?

Update c_d and c_d' while keeping the constraint: if we decrease c_d by Δ, then c_d' increases by Δ. Move the two variables until either of them reaches a corner point of its piecewise-linear objective, i.e. Δ is the minimum distance to the next corner of the two objective functions f_d and f_d'. The total number of updates is bounded by the number of corners.
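A minimal sketch of the two steps above (steepest-pair selection, then a jump to the nearest corner), using a linear scan instead of the O(log M) data structure; all names are mine and this simplifies the paper's algorithm:

```python
import numpy as np

def l1_constrained_centroid(X, max_updates=10000):
    """Pairwise descent for  min_c sum_i |x_i - c|_1  s.t. sum(c) = 1, c >= 0.
    X: (N, M) array of proportional data points.  Sketch, not the paper's code."""
    N, M = X.shape
    c = X.mean(axis=0)  # feasible starting point
    for _ in range(max_updates):
        # Left/right directional derivatives of f_m(c_m) = sum_i |x_im - c_m|.
        gl = (X < c).sum(axis=0) - (X >= c).sum(axis=0)
        gr = (X <= c).sum(axis=0) - (X > c).sum(axis=0)
        # Decreasing c_d gains gl[d]; increasing c_d' costs gr[d'].
        movable = np.where(c > 1e-12)[0]
        d = movable[np.argmax(gl[movable])]
        dp = int(np.argmin(gr))
        if gl[d] - gr[dp] <= 0:
            break  # no improving pair remains
        # Step to the nearest corner (a data value) of f_d or f_d', or to c_d = 0.
        below = X[:, d][X[:, d] < c[d]]
        above = X[:, dp][X[:, dp] > c[dp]]
        step = c[d] - below.max() if below.size else c[d]
        if above.size:
            step = min(step, above.min() - c[dp])
        step = min(step, c[d])
        c[d] -= step
        c[dp] += step
    return c
```

Each update strictly decreases the objective and lands exactly on a corner, so starting from the mean the result is never worse than the "Mean" approximation.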
Experiments

Two real-world datasets representing skill allocations for past projects in two IBM service areas:
• 1,604 projects with 16 skill categories
• 302 projects with 67 skill categories

We compared three algorithms:
• The proposed algorithm
• "Normalized median": computes dimension-wise medians, then normalizes them to sum to one
• "Mean": uses sample means as cluster centroids (the equality constraint is automatically satisfied)
The proposed method achieves low L1 errors

We ran 10-fold cross validation × 10 runs with different initial clusters. Performance is evaluated by the sum of L1 distances to the nearest cluster centers (total L1 distortion). The proposed algorithm consistently outperforms both alternative approaches at all values of K.

[Figure: total L1 distortion vs. number of clusters for the two datasets; the "Proposed" curve lies below "Normalized Median" and "Mean" throughout.]
The proposed method produces moderately sparse clusters

The proposed method leads to more interpretable cluster centroids: it is moderately sparse, whereas "Normalized Median" is too sparse (it produces many cluster centroids with only a single non-zero dimension) and "Mean" is not sparse.

[Figure: cluster centroids (clusters 1–6) over job roles (Application Architect, Application Developer, Architect-Other, Business Development Executive, Business Strategy Consultant, Business Transformation Consultant, Consultant-Other, Data Related, Developer-Other, Knowledge and Learning, Missing, Other, Principal PSIC, Project Management, Unspecified) for the proposed method, Normalized Median, and Mean.]
Conclusion

We proposed a new algorithm for clustering proportional data using the L1 distance measure. The proposed algorithm explicitly exploits the equality constraint.

Other applications include:
• document clustering based on topic distributions
• video analysis based on color distributions