Clustering and Transductive Inference for Undirected Graphs

Kristiaan Pelckmans
ESAT - SCD/sista, KULeuven, Leuven, Belgium

Stability and Resampling Methods for Clustering, Tübingen, July 2007
Overview of this talk
(0) Clustering generalities
(1) Clustering and Transductive Inference
(2) Plausible Labeling Classes
(3) Stability in Learning and Clustering
(4) Graph Algorithms, MINCUT, and Regularization
Clustering Generalities
Definition (An attempt)
For a set of n observations S and a hypothesis set H:

    max_{f∈H} J_γ(f|S) = Falsifiable(f) + γ · Reproducible_{S'_m ∼ S}(f|S'_m)

- Read (f|S) as 'f projected on the data S'
- Reproducibility ≈ stability, robust evidence, ...
- Falsifiability ≈ surprisingness, −entropy, ...
- Very general ...
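The two ingredients can be mocked up in a few lines. A toy Python sketch: the decomposition mirrors the definition above, but the subsampling scheme and all names are illustrative assumptions, with `falsifiable` and `reproducible` supplied by the user as stand-ins for whatever scores one prefers:

```python
import random

def j_gamma(f, S, falsifiable, reproducible, gamma=1.0, n_sub=20, m=None):
    """Toy J_gamma(f|S): falsifiability of f plus gamma times the average
    reproducibility of f over random subsamples S'_m of the data S (a list)."""
    m = m or len(S) // 2
    rep = sum(reproducible(f, random.sample(S, m)) for _ in range(n_sub)) / n_sub
    return falsifiable(f) + gamma * rep
```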
Clustering Generalities ...thus specify
- VQ and compression
- Generative model
- (Image) Segmentation
- Multiple view prediction (2 views: associative clustering; many views: (Krupka, 2005))
- Organization and retrieval
- ...
→ What is a good taxonomy?
Clustering and Transductive Inference Particular class of clustering
Clustering and Transductive Inference Particular class of clustering
- Clustering as finding apparent structure... or a 'fast learnable hypothesis class'
- Clustering for deriving a plausible hypothesis space
- ... but interpretable: clusterwise constant (independence)
- Clustering as a stage before prediction...
- ... or as developing regularization (a prior)
- Different from VQ (auto-regression)
- No stochastic assumption so far...
Clustering and Transductive Inference Particular class of clustering
Suppose Alice and Bob both have a fair knowledge of the political structure in the world, and Alice knows a specific law in Europe/the USA/...
⇒ It would be easy to communicate this piece of information.

[Figure: Alice → Bob]
Transductive Inference for Weighted Graphs
So what can be learned from results in transductive inference?
Transductive Inference for Weighted Graphs Deterministic Weighted Undirected Graphs

- A fixed number of n ∈ ℕ₀ nodes (objects): V = {v_1, ..., v_n}
- Organized in a deterministic graph G = (V, E) with edges E = {x_ij ≥ 0}_{i≠j} (symmetric: x_ij = x_ji; no loops: x_ii = 0)
- A fixed label y_i ∈ {−1, 1} for every node i = 1, ..., n, but only partially observed: for S ⊂ {1, ..., n}, y_S = {y_i ∈ {−1, 1}}_{i∈S}
- Predict the remaining labels y_¬S = {y_i ∈ {−1, 1}}_{i∉S}
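To fix ideas, a minimal numpy sketch of this setup; sizes, weights, and labels are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
X = rng.random((n, n))
X = (X + X.T) / 2          # symmetric: x_ij = x_ji
np.fill_diagonal(X, 0.0)   # no loops: x_ii = 0

y = np.array([1, 1, 1, 1, -1, -1, -1, -1])  # fixed labels in {-1, +1}
S = rng.choice(n, size=4, replace=False)    # observed index set, uniform without replacement
y_S = y[S]                                  # observed labels; the task is to predict y off S
```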
Transductive Inference for Weighted Graphs (Ct’d) Example
Gene coexpression: [figure: gene coexpression network]
Transductive Inference for Weighted Graphs (Ct’d) Risk
- Hypothesis set: H = { q ∈ {−1, 1}^n }, with |H| = 2^n
- Given a restricted hypothesis set H⁰ ⊂ H with |H⁰| ≪ |H|, and a few observations y_S, where S is sampled uniformly without replacement
- Actual risk:
      R(q|G) = E[ I(y_J q_J < 0) ] = (1/n) Σ_{i=1}^n I(y_i q_i < 0),
  with E over the uniform choice of J (y_J ∈ {y_i}_i)
- Empirical risk:
      R_S(q|G) = (1/m) Σ_{i∈S} I(y_i q_i < 0)
- Transductive risk:
      R_¬S(q|G) = (1/(n − m)) Σ_{i∉S} I(y_i q_i < 0)
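All three risks are simple averages of the 0/1 loss; a small sketch (the function name and layout are mine):

```python
import numpy as np

def risks(q, y, S, n):
    """Empirical risk R_S, transductive risk R_notS, and actual risk R
    of a candidate labeling q against the true labels y."""
    S = np.asarray(S)
    notS = np.setdiff1d(np.arange(n), S)
    err = (np.asarray(y) * np.asarray(q) < 0).astype(float)  # I(y_i q_i < 0)
    return err[S].mean(), err[notS].mean(), err.mean()

# R_S, R_notS, R = risks(q, y, S, n)
```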
Generalization Bound: why Empirical Risk Minimization works...

Theorem (Generalization Bound)
Let S ⊂ {1, ..., n} be uniformly sampled without replacement. Consider a set of hypothetical labelings H⁰ ⊂ H having a cardinality of |H⁰| ∈ ℕ. Then the following inequality holds with probability higher than 1 − δ:

    sup_{q∈H⁰} ( R(q|G) − R_S(q|G) ) ≤ √( (2(n − m + 1)/(mn)) (ln|H⁰| − ln δ) ).   (1)

Transductive risk:

    ∀q ∈ H⁰ : R_¬S(q|G) ≤ R(q|G) + (n/(n − m)) √( (2(n − m + 1)/(mn)) (ln|H⁰| + ln(1/δ)) ).
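A quick way to get a feel for the rate in (1) is to evaluate the right-hand side numerically; a sketch, with numbers that are purely illustrative:

```python
import numpy as np

def erm_bound(n, m, card_H0, delta=0.05):
    """Right-hand side of the uniform bound (1):
    sqrt( 2(n-m+1)/(m n) * (ln|H0| - ln delta) )."""
    return np.sqrt(2 * (n - m + 1) / (m * n) * (np.log(card_H0) - np.log(delta)))

# e.g. n=1000 nodes, m=100 observed labels, |H0| = 2**10 candidate labelings:
# erm_bound(1000, 100, 2**10) ≈ 0.42
```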
Plausible Labeling Classes H⁰

Which labelings H⁰ = {q} are supported by a graph G? This can be formalized, e.g., as:
- Which labelings H⁰ can be reconstructed ('predicted') with a rule?
- Which labelings H⁰ can be compressed with respect to a rule?
- Which labelings H⁰ correlate with the graph structure?
- Which labelings H⁰ are stable with respect to subsampling of the graph?
- Which labelings H⁰ have a large 'between/within cluster' ratio?
Plausible Labeling Classes H⁰ (Ct'd) Measuring the Richness of H⁰?

1. Cardinality: |H⁰| (finite setting!)
2. Covering balls (if many hypotheses q ∈ H⁰ are similar)
3. Kingdom Dimension (VC-dim) of H⁰:
       max_S |S|  s.t.  ∀p ∈ {−1, 1}^{|S|}, ∃q ∈ H⁰ : p = q|_S,
   where q|_S denotes q restricted to S.
4. Compression coefficient
5. Rademacher complexity:
       R(H⁰) = E[ sup_{q∈H⁰} (2/n) | Σ_{i=1}^n σ_i q_i | ]
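For a finite class, item 5 can be estimated by straightforward Monte Carlo; a sketch, assuming H⁰ is given explicitly as a list of ±1 vectors:

```python
import numpy as np

def rademacher(H0, n_mc=2000, seed=0):
    """Monte Carlo estimate of R(H0) = E[ sup_q (2/n) |sum_i sigma_i q_i| ]
    for a finite set H0 of {-1,+1} labelings (rows of a matrix)."""
    H0 = np.asarray(H0)                           # shape (|H0|, n)
    n = H0.shape[1]
    rng = np.random.default_rng(seed)
    sigma = rng.choice([-1, 1], size=(n_mc, n))   # Rademacher draws
    return np.mean(np.max(np.abs(sigma @ H0.T), axis=1)) * 2 / n
```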
Plausible Labeling Classes H⁰ (Ct'd) Clusterwise constant hypothesis

Definition (Clusterwise constant hypothesis)
Assume a clustering C^k = {C_1, ..., C_k} such that C_i ∩ C_j = ∅ for i ≠ j, and define

    H_{C^k} = { q ∈ {−1, 1}^n | q_{C_i} = 1_{|C_i|} or q_{C_i} = −1_{|C_i|} ∀i, and q_j = −1 elsewhere }.

- VCdim(H_{C^k}) = k
- The Rademacher complexity R(H_{C^k}) depends on the sizes of the clusters!
- The Rademacher complexity is related to the normalization factor as in (Lange, 2004)
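Enumerating the sign choices per cluster makes the smallness of this class tangible: one sign per cluster gives 2^k labelings, matching VCdim = k. A sketch, with an illustrative cluster layout:

```python
from itertools import product

def clusterwise_labelings(clusters, n):
    """Enumerate H_{C^k}: one sign per cluster, -1 elsewhere."""
    H = []
    for signs in product([-1, 1], repeat=len(clusters)):
        q = [-1] * n
        for C, s in zip(clusters, signs):
            for j in C:
                q[j] = s
        H.append(tuple(q))
    return H

H = clusterwise_labelings([[0, 1, 2], [3, 4]], n=6)   # k = 2 clusters over n = 6 nodes
print(len(H))   # 4 = 2**k labelings
```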
Plausible Labeling Classes H⁰ (Ct'd) The VC-dim of G?

"For any vertex v of a graph G, the closed neighborhood N(v) of v is the set of all vertices of G adjacent to v. We say that a set S of vertices is shattered if every subset R ⊆ S can be realized as R = S ∩ N(v) for some vertex v of G. The VC-dim of G is defined to be the largest cardinality of a shattered set of vertices." [Haussler and Welzl, 1987]

Let the neighbor-based labeling q_v corresponding to v ∈ V be defined as

    q_{v,j} = 1 if ∃ e(v, v_j), and q_{v,j} = −1 otherwise.

Hypothesis set: H⁰ = { q_v : v ∈ V }. The VC-dim (Kingdom Dimension) of H⁰:

    max_S |S|  s.t.  ∀p ∈ {−1, 1}^{|S|}, ∃v ∈ V : p = q_v|_S,

where q|_S denotes q restricted to S.
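For small graphs, this quantity can be found by brute force (the search is exponential, so toy sizes only); since |H⁰| = n here, the answer is at most log₂ n. A sketch:

```python
from itertools import combinations
import numpy as np

def kingdom_dim(X):
    """Brute-force Kingdom Dimension of H0 = {q_v} for a small adjacency
    matrix X: the largest |S| on which the q_v realize all of {-1,1}^|S|."""
    n = X.shape[0]
    Q = np.where(X > 0, 1, -1)     # row v is q_v: labels node j by adjacency to v
    best = 0
    for size in range(1, n + 1):
        shattered = any(
            len({tuple(row) for row in Q[:, S]}) == 2 ** size
            for S in combinations(range(n), size)
        )
        if not shattered:          # shattering is monotone: stop at the first failure
            break
        best = size
    return best
```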
Plausible Labeling Classes H⁰ (Ct'd) Consistent 1NN Rule

1NN predictor rule: f_q(v_i) = q_(i), where (i) = arg max_{j≠i} x_ij. A consistent predictor rule acts as a restriction mechanism:

    0 < q_i f_q(v_i) = q_i q_(i), ∀i = 1, ..., n.

→ Kruskal's MSP algorithm; VCdim(1NN) = #(disconnected components).
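A sketch of the consistency check behind the 1NN restriction (the helper name is mine):

```python
import numpy as np

def one_nn_consistent(q, X):
    """Check consistency of labeling q under the 1NN rule
    f_q(v_i) = q_(i), with (i) = argmax_{j != i} x_ij."""
    q = np.asarray(q)
    Xm = X.astype(float).copy()
    np.fill_diagonal(Xm, -np.inf)        # exclude j = i
    nn = Xm.argmax(axis=1)               # (i) for every node
    return bool(np.all(q * q[nn] > 0))   # 0 < q_i q_(i) for all i
```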
Plausible Labeling Classes H⁰ (Ct'd)

Consider the hypothesis set (Average Nearest Neighbors)

    H_k = { q ∈ {−1, 1}^n : qᵀ L q ≤ k }.

Theorem (Cardinality of H_k (Pelckmans, Shawe-Taylor, 2007))
Let {σ_i}_{i=1}^n denote the eigenvalues of the graph Laplacian L = D − W, where D = diag(d_1, ..., d_n) ∈ ℝ^{n×n}. The cardinality of the set H_k can then be bounded as

    |H_k| ≤ Σ_{d=0}^{n_σ(k)} (n choose d) ≤ (en / n_σ(k))^{n_σ(k)},   (2)

where n_σ(k) is defined as

    n_σ(k) = |{σ_i : σ_i ≤ k}|.   (3)
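The bound (2) is easy to evaluate from the spectrum of L; a sketch:

```python
import numpy as np
from math import comb

def hk_cardinality_bound(W, k):
    """Evaluate the bound (2): sum_{d <= n_sigma(k)} C(n, d), with
    n_sigma(k) the number of Laplacian eigenvalues <= k."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W     # graph Laplacian L = D - W
    sig = np.linalg.eigvalsh(L)        # eigenvalues, ascending
    ns = int(np.sum(sig <= k))         # n_sigma(k)
    return sum(comb(n, d) for d in range(ns + 1))
```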
Stability in Learning
Lemma (Permutation Stability (R. El-Yaniv, D. Pechyony, 2007))
Let Z ∈ 𝒵 be a random permutation vector. Let f : 𝒵 → ℝ be an (m, n)-symmetric permutation function satisfying |f(Z) − f(Z^{ij})| ≤ β for all (i, j) exchanging entries between S_m = {1, ..., m} and S_n = {m + 1, ..., n}. Then

    P( f(Z) − E[f(Z)] ≥ ε ) ≤ exp( −ε² / (2β² K(m, n)) )

with K(m, n) = (n − m)² (H(n) − H(n − m)) and H(k) = Σ_{i=1}^k 1/i² (and hence 1/K(m, n) ≥ 1/m).
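K(m, n) is cheap to compute, which makes the lemma easy to sanity-check numerically; a sketch:

```python
import numpy as np

def K(m, n):
    """K(m, n) = (n - m)^2 (H(n) - H(n - m)) with H(k) = sum_{i<=k} 1/i^2."""
    H = lambda k: np.sum(1.0 / np.arange(1, k + 1) ** 2)
    return (n - m) ** 2 * (H(n) - H(n - m))

# K grows like min(m, n - m): e.g. K(50, 1000) ≈ 47.5 <= m, so 1/K(m, n) >= 1/m.
```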
Stability in Learning (Ct’d) Rademacher Complexity for Transductive Inference
Definition (Rademacher Complexity for TI)
Given hypothesis class H⁰, one has

    R(H⁰|G) = E[ sup_{q∈H⁰} (2/n) | Σ_{i=1}^n σ_i q_i | ]

with {σ_i}_{i=1}^n Bernoulli random variables with P(σ_i = −1) = P(σ_i = 1) = 1/2. Assume further H⁰ = −H⁰ (dropping | · |).
Stability in Learning (Ct’d) Rademacher Complexity for Transductive Inference
Theorem (Rademacher bound)
With probability exceeding 1 − δ for δ > 0, one has for all q ∈ H⁰ that

    R_¬S(q|G) ≤ R_S(q|G) + ((n − 2m)/(4m(n − m))) R(H⁰|G) + 2√( ( n²/(2m(n − m)) + n/(m(n − m)) ) log(1/δ) ).
Stability in Learning and Clustering
Back to the clustering story...
Stability in Clustering
Definition (Stability for Clustering)
Consider an algorithm A : X → C. If ∃β such that

    d(A(S_m), A(S'_m)) ≤ β, ∀S_m, S'_m ⊂ {1, ..., n},

then for any ε > 0 one has

    P( d(A(S_m), E[A(S_m)]) ≥ ε ) ≤ δ(ε)

for an appropriate δ : ℝ⁺₀ → (0, 1].
Stability in Clustering

Idea: encode d(A(S_m), A(S'_m)) as

    | c(A(S_m)) − c(A(S'_m)) |_1,

for an appropriate encoding function c : C → {0, 1}^{n_c}, and S_m = D(Z|m), or shortly Z|m. Then:

Corollary (Stability for Clustering)
Consider an algorithm A : X → C. If ∃β(A, m) such that

    (1/n_c) | c(A(Z|m)) − c(A(Z'|m)) |_1 ≤ β(A, m), ∀S_m, S'_m ∼ {1, ..., n},

then for any ε > 0 one has

    P( (1/n_c) | c(A(Z|m)) − E_Z[c(A(Z|m))] |_1 ≥ ε ) ≤ 2 exp( −ε² / (2β²(A, m) K(m, n)) )

with K(m, n) = (n − m)² (H(n) − H(n − m)) and H(k) = Σ_{i=1}^k 1/i² (and hence 1/K(m, n) ≥ 1/m).
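The corollary suggests a direct empirical procedure: cluster two random subsamples, encode both clusterings, and average the normalized ℓ1 distance. A sketch using k-means from scikit-learn as an illustrative stand-in for A; a careful implementation would also match cluster labels across the two runs, which this sketch omits:

```python
import numpy as np
from sklearn.cluster import KMeans

def encode(labels, k):
    """c: one-hot cluster-identity encoding in {0,1}^{n_c}, with n_c = n*k."""
    return np.eye(k)[labels].ravel()

def beta_hat(X, m, k=2, n_pairs=50, seed=0):
    """Empirical stability: average normalized l1 distance between encodings of
    clusterings fit on two random subsamples of size m, predicted on all of X."""
    rng = np.random.default_rng(seed)
    dists = []
    for _ in range(n_pairs):
        S1, S2 = (rng.choice(len(X), m, replace=False) for _ in range(2))
        c1 = encode(KMeans(k, n_init=10).fit(X[S1]).predict(X), k)
        c2 = encode(KMeans(k, n_init=10).fit(X[S2]).predict(X), k)
        dists.append(np.abs(c1 - c2).sum() / len(c1))
    return float(np.mean(dists))
```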
Stability in Clustering
Remarks:
- Natural encoding by mapping points onto a canonical 'cluster identity'
- McDiarmid inequality...
- Notions of weak stability
- The norm ‖·‖_1? Better choices with sup_{n_c}?
- Does k come in only via β(A, m)?
Graph Algorithms, MINCUT, and Regularization Application

- Can be recast as a convex graph flow algorithm!
- Incorporating prior knowledge by relaxing as an LP/QP, e.g. Σ_{i=1}^n (1 − q_i) ≥ B for B ≈ n.
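The flow formulation is concrete: with edge weights as capacities and two labeled seeds as source and sink, MINCUT reduces to a single max-flow call. A sketch with networkx, where the graph and seed nodes are illustrative:

```python
import networkx as nx

# Toy weighted graph: edge weights act as capacities for the cut.
G = nx.Graph()
G.add_weighted_edges_from(
    [(0, 1, 3.0), (1, 2, 2.5), (2, 3, 0.2), (3, 4, 3.0), (4, 5, 2.0)],
    weight="capacity",
)

# Two labeled seed nodes play source and sink; the minimum cut separating
# them induces a two-class labeling of the remaining nodes.
cut_value, (part_pos, part_neg) = nx.minimum_cut(G, 0, 5, capacity="capacity")
q = [1 if v in part_pos else -1 for v in G.nodes]
print(cut_value, q)   # cut_value = 0.2: the weak edge (2, 3) is cut
```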
Conclusion
! Learning in finite universes!
! Clustering as setting the stage for prediction (in a particular sense)
! Clusterwise constant hypothesis class
! Stability results in Learning
? Cluster-ability and Learnability
? Falsifiable vs. Reproducible
? Need for a taxonomy?