Clustering via optimization

Suely Oliveira & David Stewart, Dept. of Computer Science & Dept. of Mathematics

CSC workshop, Darmstadt

May 19, 2011

Clustering Applications

- Documents and manufacturing (past work): for example, grouping parts and machines in industry for more efficient manufacturing. Matrix data is available.
- Protein-protein interaction clustering, for grouping proteins according to their function. Networks can be used.
- Genomic clustering, for understanding biological processes in various systems through simultaneous gene activation.

Weighted and unweighted networks in the real world, and clustering them

- Weighted networks: edges carry closeness (similarity) weights.
- Unweighted networks: edges only record whether a connection is present.

Structure of a cell

Gene expression

- DNA in the nucleus is transcribed into mRNA (messenger RNA).
- The mRNA is processed to remove "introns".
- The processed mRNA then leaves the nucleus in search of ribosomes.

Ribosomes make protein

- A protein is a chain of amino acids.
- Each amino acid is specified by a "codon" of 3 mRNA bases.

Data and clustering

[Figure: data and clustering from Wen & Fuhrman et al. (1998).]

Clustering is a relative of Graph Partitioning

In graph partitioning the two subgraphs have the same number of nodes. In clustering the number of data points in each cluster can differ, but in both cases you want to minimize the connections (similarities) between the subgraphs (or clusters).

Mathematical Model

Let G = (V, E) be a graph with vertices i ∈ V and edges e_ij ∈ E. We allow either edges or vertices to have positive weights, w_e(e_ij) and w_v(i) respectively. One way to describe a partition is to assign a value of +1 to all the vertices in one set and a value of −1 to all the vertices in the other:

$$
x(i) = \begin{cases} +1, & \text{if } v(i) \text{ is in } P_1, \\ -1, & \text{if } v(i) \text{ is in } P_2, \end{cases}
\qquad\text{so that}\qquad
\frac{1}{4}\,(x(i) - x(j))^2 = \begin{cases} 1, & \text{if } v(i),\ v(j) \text{ are in different partitions,} \\ 0, & \text{otherwise.} \end{cases}
$$

From Discrete to Continuous Model

The discrete optimization problem:

$$
\min_x\ f(x) := \frac{1}{4} \sum_{e_{ij} \in E} (x(i) - x(j))^2
\quad\text{subject to}\quad
\text{(a)}\ \sum_i x(i) \approx 0,
\qquad
\text{(b)}\ x(i) = \pm 1.
$$

Relax to a continuous optimization problem:

$$
\min_x\ f(x) := \frac{1}{4} \sum_{e_{ij} \in E} (x(i) - x(j))^2
\quad\text{subject to}\quad
\text{(a)}\ \sum_i x(i) = 0,
\qquad
\text{(b)}\ \sum_i x(i)^2 = n.
$$
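Since f(x) = (1/4) Σ_{e_ij∈E} (x(i) − x(j))² = (1/4) xᵀLx for the graph Laplacian L, the relaxed problem is solved by the eigenvector of the second-smallest eigenvalue of L (the Fiedler vector), scaled so that Σ_i x(i)² = n. A minimal sketch with NumPy/SciPy, assuming an unweighted symmetric adjacency matrix as input (the function name and example graph are ours); rounding the signs recovers a discrete partition:

```python
import numpy as np
from scipy.linalg import eigh

def spectral_bipartition(A):
    """Split a graph into two clusters via the relaxed model.

    A : symmetric (n, n) adjacency/similarity matrix.
    Returns a +/-1 vector approximating the discrete x.
    """
    n = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A        # graph Laplacian
    # eigh returns ascending eigenvalues; the second eigenvector is
    # orthogonal to the constant vector (constraint (a)).
    vals, vecs = eigh(L)
    fiedler = vecs[:, 1] * np.sqrt(n)     # enforce sum_i x(i)^2 = n
    return np.where(fiedler >= 0, 1, -1)  # round back to +/-1

# Tiny example: two triangles joined by one edge.
A = np.zeros((6, 6))
for i, j in [(0,1),(1,2),(0,2),(3,4),(4,5),(3,5),(2,3)]:
    A[i, j] = A[j, i] = 1
print(spectral_bipartition(A))   # e.g. [ 1  1  1 -1 -1 -1]
```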

Clustering model based on Similarity Matrix

Given a similarity matrix S for the graph G(V, E), construct two clusters A and B. Define

$$
s(A, B) = \sum_{i \in A} \sum_{j \in B} s_{ij},
\qquad
s(A, A) = \sum_{i \in A} \sum_{j \in A} s_{ij}.
$$

Aims:
(P1): Maximize s(A, A) and s(B, B).
(P2): Minimize s(A, B).

Objective functions for Clustering

$$
\begin{aligned}
\text{K-means method:} &\quad \text{Maximize } \sqrt{s(A, A)} + \sqrt{s(B, B)} \\
\text{Ratio Cut:} &\quad \text{Minimize } \frac{s(A, B)}{|A|} + \frac{s(A, B)}{|B|} \\
\text{Normalized Cut:} &\quad \text{Minimize } \frac{s(A, B)}{d_A} + \frac{s(A, B)}{d_B} \\
\text{MinMax Cut:} &\quad \text{Minimize } \frac{s(A, B)}{s(A, A)} + \frac{s(A, B)}{s(B, B)}
\end{aligned}
$$

Here d_A and d_B denote the total (weighted) degrees of the vertices in A and B. A small evaluation helper follows.
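A sketch in NumPy that evaluates the four objectives for a given two-way split (assuming the square-root form of the K-means row as reconstructed above; the function name is ours):

```python
import numpy as np

def two_way_objectives(S, mask):
    """Evaluate the four clustering objectives for a 2-way split.

    S    : symmetric (n, n) similarity matrix.
    mask : boolean array, True for items in cluster A.
    """
    A, B = mask, ~mask
    sAA = S[np.ix_(A, A)].sum()
    sBB = S[np.ix_(B, B)].sum()
    sAB = S[np.ix_(A, B)].sum()
    dA, dB = S[A].sum(), S[B].sum()       # total degrees (volumes)
    return {
        "kmeans (maximize)": np.sqrt(sAA) + np.sqrt(sBB),
        "ratio cut":         sAB / A.sum() + sAB / B.sum(),
        "normalized cut":    sAB / dA + sAB / dB,
        "minmax cut":        sAB / sAA + sAB / sBB,
    }
```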

Identifying Protein Complexes in PPI Networks (SO & S. Seok)

- PPI networks are power-law networks: most nodes have low degree, a few are highly connected, and the diameter of the graph is small compared with the number of nodes. PPI networks have no edge-weight information.
- Our approaches using multilevel algorithms were accurate and fast at finding the clusters.
- Most cellular processes are carried out by groups of proteins, called functional modules or protein complexes. We studied networks for S. cerevisiae, a species of budding yeast; because of its simple structure, it has been considered a unique model system for biological research.
- The databases used were DIP (Database of Interacting Proteins, which provides an experimentally determined network), Krogan, and MIPS for validation.

Related Publications

- S. Oliveira and S.C. Seok, Multilevel approaches for large scale proteomic networks, International Journal of Computer Mathematics, 84(5) (2007), pp. 683-695.
- S. Oliveira and S.C. Seok, A matrix-based multilevel approach to identify functional protein modules, International Journal of Bioinformatics Research and Applications, 4(1) (2008).
- S. Oliveira, J.F.F. Ribeiro and S.C. Seok, A spectral clustering algorithm for manufacturing cell formation, Computers and Industrial Engineering, 2009.

Distance Approach for Clustering

Optimization formulation of M.R. Rao (1965):

$$
\min_z\ \sum_{i,j=1}^{N} d_{ij} \sum_{\ell=1}^{K} z_{i\ell} z_{j\ell}
\quad\text{subject to}\quad
\begin{aligned}
&\sum_{\ell=1}^{K} z_{i\ell} = 1 \quad \text{for all } i, \\
&\sum_{i=1}^{N} z_{i\ell} \geq 1 \quad \text{for all } \ell, \\
&z_{i\ell} \in \{0, 1\} \quad \text{for all } i, \ell.
\end{aligned}
$$

- Objective: the sum of distances d_ij over pairs of items i and j in the same cluster (see the sketch below).
- Constraint 1: each item belongs to exactly one cluster.
- Constraint 2: each cluster has at least one item.
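A minimal brute-force sketch of Rao's formulation for tiny instances (the helper name is ours; beyond a handful of items an exact integer-programming solver is needed):

```python
import itertools
import numpy as np

def rao_bruteforce(D, K):
    """Exhaustively minimize Rao's objective for N items, K clusters.

    D : symmetric (N, N) matrix of pairwise distances d_ij.
    Only practical for very small N (there are K^N assignments).
    """
    N = D.shape[0]
    best_cost, best_labels = np.inf, None
    # Each label vector encodes constraint 1: one cluster per item.
    for labels in itertools.product(range(K), repeat=N):
        labels = np.array(labels)
        if len(set(labels)) < K:            # constraint 2: no empty cluster
            continue
        same = labels[:, None] == labels[None, :]   # sum_l z_il z_jl
        cost = D[same].sum()                # within-cluster distances
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_cost, best_labels
```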

A semidefinite program for clustering

Semidefinite programming (SDP): the "variables" are matrices.

$$
\min_X\ C \bullet X := \sum_{i,j=1}^{n} c_{ij} x_{ij}
\quad\text{subject to}\quad
A_p \bullet X = \alpha_p,\ p = 1, \ldots, q,
\qquad
X \ \text{positive semidefinite}.
$$

- This is a convex optimization problem:
  - The objective function is linear.
  - The constraints are either linear or convex.
- Advantages:
  - Efficient algorithms.
  - No irrelevant local minima.

A sketch of this standard form in a modeling language follows.
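A minimal sketch of the standard form using the cvxpy modeling library; the data C, A_p, α_p below are made-up placeholders, not from the slides:

```python
import cvxpy as cp
import numpy as np

n, q = 4, 2
rng = np.random.default_rng(0)
C = rng.standard_normal((n, n)); C = C + C.T   # made-up symmetric cost matrix
As = [np.eye(n), np.ones((n, n))]              # made-up constraint matrices A_p
alphas = [1.0, 2.0]                            # made-up right-hand sides

X = cp.Variable((n, n), PSD=True)              # X positive semidefinite
constraints = [cp.trace(As[p] @ X) == alphas[p] for p in range(q)]
prob = cp.Problem(cp.Minimize(cp.trace(C @ X)), constraints)
prob.solve()                                   # hands off to an SDP solver
print(X.value)
```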

Clustering as SDP

- In Rao's model, set Y = ZZ^T so that the objective function becomes linear, and relax the equality to "Y − ZZ^T positive semidefinite" (by a Schur-complement argument, this is equivalent to the block condition below).
- SDPs give a convex approximation to combinatorial problems:

$$
\min_{Y, Z}\ D \bullet Y
\quad\text{subject to}\quad
\begin{aligned}
&Z \geq 0 \ \text{componentwise}, \\
&\sum_{\ell=1}^{K} z_{i\ell} = 1 \quad \text{for all } i, \\
&\begin{bmatrix} Y & Z \\ Z^T & I \end{bmatrix} \ \text{positive semidefinite}.
\end{aligned}
$$
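A sketch of this relaxation in cvxpy, assuming a symmetric distance matrix D (the function name is ours); the block PSD constraint is encoded by carving Y and Z out of one PSD variable:

```python
import cvxpy as cp
import numpy as np

def cluster_sdp(D, K):
    """Solve the SDP relaxation of Rao's model; returns relaxed Y, Z."""
    N = D.shape[0]
    M = cp.Variable((N + K, N + K), PSD=True)   # [[Y, Z], [Z^T, I]] >= 0
    Y, Z = M[:N, :N], M[:N, N:]
    constraints = [
        M[N:, N:] == np.eye(K),                 # lower-right block is I
        Z >= 0,                                 # componentwise nonnegativity
        cp.sum(Z, axis=1) == 1,                 # each row of Z sums to one
    ]
    prob = cp.Problem(cp.Minimize(cp.trace(D @ Y)), constraints)
    prob.solve()
    return Y.value, Z.value
```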

Problems with SDP version

- If P is a permutation matrix then Z ↦ ZP gives the same objective value and satisfies the same constraints.
- SDP is a convex optimization problem and typically has a unique solution: such a solution must be invariant under this transformation.
- This implies all entries of Z equal 1/K: no information about the clustering.

A Non-convex Relaxation

- Keep the matrix formulation with a convex feasible set, but a non-convex objective:

$$
\min_Z\ D \bullet ZZ^T
\quad\text{subject to}\quad
Z \geq 0 \ \text{(componentwise)},
\qquad
Z e^{(K)} = e^{(N)},
$$

where e^{(m)} denotes the vector of m ones (so each row of Z sums to one).

- Non-convexity implies many local minima, but we hope most of them are global minima.
- If Z has only zeros and ones as entries and all its row sums are one, then rank Z = number of clusters (see the check below).
- This allows the optimization method to determine the best number of clusters.
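A quick numerical check of the rank claim (the example assignment is ours, purely illustrative):

```python
import numpy as np

# 6 items assigned to 3 of 4 possible clusters (one column stays empty).
Z = np.array([
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
])
assert (Z.sum(axis=1) == 1).all()    # row sums are one
print(np.linalg.matrix_rank(Z))      # 3 = number of nonempty clusters
```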

Number of clusters & min rank

- Z can have zero columns, corresponding to empty "clusters".
- Minimizing rank is usually NP-hard.
- An approximation is to use the nuclear matrix norm:

$$
\|A\|_* = \sum_{i=1}^{n} \sigma_i(A).
$$

The nonconvex optimization problem:

$$
\min_Z\ D \bullet ZZ^T + \alpha \|Z\|_*
\quad\text{subject to}\quad
Z \geq 0 \ \text{(componentwise)},
\qquad
Z e^{(K)} = e^{(N)}.
$$

- We can choose α ≥ 0 to control the number of clusters generated (larger α means fewer clusters).
- If α is not too large, most (if not all) entries of Z are zero or one.
- To get the clusters, we "quantize" Z to a matrix of zeros and ones by setting the maximum entry in each row of Z to one and all others to zero (see the sketch below).
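A sketch of the quantization step (the helper name is ours):

```python
import numpy as np

def quantize(Z):
    """Round a relaxed assignment matrix to 0/1 with one 1 per row."""
    Q = np.zeros_like(Z)
    Q[np.arange(Z.shape[0]), Z.argmax(axis=1)] = 1.0
    return Q

# The number of clusters found is the number of nonzero columns:
#   n_clusters = int((quantize(Z).sum(axis=0) > 0).sum())
```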

Computational results

- Gradient-projection type method:
  - a gradient step for D • ZZ^T,
  - a gradient (subgradient) step for α‖Z‖_*,
  - projection onto the feasible set;
  - repeat until the iterates appear close to a minimum.
- Test data: rat CNS gene expression data (Wen & Fuhrman et al., 1998).

A sketch of the iteration loop follows.
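A minimal sketch of one possible gradient-projection loop. The slide only names the three ingredients; the step size, iteration count, initialization, and the row-wise simplex projection (the feasible set {Z ≥ 0, Ze^(K) = e^(N)} constrains each row of Z to the probability simplex) are our assumptions:

```python
import numpy as np

def project_rows_to_simplex(Z):
    """Project each row of Z onto {z >= 0, sum(z) = 1}, the feasible set."""
    P = np.empty_like(Z)
    for r, z in enumerate(Z):
        u = np.sort(z)[::-1]                       # sort row descending
        css = np.cumsum(u) - 1.0
        rho = np.nonzero(u > css / np.arange(1, len(z) + 1))[0][-1]
        theta = css[rho] / (rho + 1.0)             # simplex threshold
        P[r] = np.maximum(z - theta, 0.0)
    return P

def gradient_projection(D, K, alpha=1.0, step=1e-3, iters=500, seed=0):
    """Sketch: minimize D.ZZ^T + alpha*||Z||_* over the feasible set."""
    N = D.shape[0]
    Z = project_rows_to_simplex(np.random.default_rng(seed).random((N, K)))
    for _ in range(iters):
        # Subgradient of the nuclear norm at Z is U V^T from a thin SVD.
        U, _, Vt = np.linalg.svd(Z, full_matrices=False)
        # Gradient of D.ZZ^T is (D + D^T) Z.
        grad = (D + D.T) @ Z + alpha * (U @ Vt)
        Z = project_rows_to_simplex(Z - step * grad)
    return Z
```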

Computed results

Run   φ(Z*)     φ̂(Z*)      Run   φ(Z*)     φ̂(Z*)
1     1531.84   1532.57     11    1540.73   1542.23
2     1540.65   1542.17     12    1531.84   1532.57
3     1534.43   1535.63     13    1534.49   1536.11
4     1566.70   1567.32     14    1534.64   1534.81
5     1532.14   1532.86     15    1569.72   1572.10
6     1531.84   1532.57     16    1531.84   1532.57
7     1532.14   1532.86     17    1531.84   1532.57
8     1533.89   1535.92     18    1532.14   1532.86
9     1533.26   1533.52     19    1532.14   1532.86
10    1532.14   1532.86     20    1531.84   1532.57

Comparisons

Run   "best"   published      Run   "best"   published
1     0        27             11    6        29
2     7        28             12    0        27
3     4        29             13    3        27
4     13       28             14    8        32
5     2        28             15    19       25
6     0        27             16    0        27
7     2        28             17    0        27
8     1        27             18    2        28
9     3        29             19    2        28
10    2        28             20    0        27
