NetMine: Mining Tools for Large Graphs - users.cs.umn.edu

43 downloads 101 Views 2MB Size Report
NetMine: Mining Tools for. Large Graphs ... They are needed to build/test realistic graph generators ... User → Website Clickstream Graph: 222,704 nodes and ...
NetMine: Mining Tools for Large Graphs Deepayan Chakrabarti Yiping Zhan Daniel Blandford Christos Faloutsos Guy Blelloch

Introduction

Internet Map [lumeta.com]

Food Web [Martinez ’91]

Protein Interactions [genomebiology.com]

► Graphs are ubiquitious Friendship Network [Moody ’01] 2

Graph “Patterns” Given a large graph dataset, what do we focus on? Patterns Aspects of graphs that show up frequently, in datasets from diverse domains. Degree distributions

Power Laws

Count vs Outdegree 3

Graph “Patterns” Given a large graph dataset, what do we focus on? Patterns Aspects of graphs that show up frequently, in datasets from diverse domains. Degree distributions Hop-plots “Scree” plots and others…

Effective Diameter

Hop-plot 4

Graph “Patterns” Why do we like them? They capture interesting properties of graphs. They provide “condensed information” about the graph. They are needed to build/test realistic graph generators ( useful for simulation studies). They help detect abnormalities and outliers.

5

Our Work The NetMine toolkit contains all the patterns mentioned before, and adds: The “min-cut” plot a novel pattern which carries interesting information about the graph.

A-plots a tool to quickly find suspicious subgraphs/nodes. 6

Outline Problem definition “Min-cut” plots ( +experiments) A-plots ( +experiments) Conclusions

7

“Min-cut” plot What is a min-cut?

Minimizes the number of edges cut

Size of mincut = 2 Two partitions of almost equal size 8

“Min-cut” plot Do min-cuts recursively. log (mincut-size / #edges)

Mincut size = sqrt(N) log (# edges)

N nodes 9

“Min-cut” plot Do min-cuts recursively. New min-cut

log (mincut-size / #edges)

log (# edges)

N nodes 10

“Min-cut” plot Do min-cuts recursively. New min-cut

log (mincut-size / #edges)

Slope = -0.5

log (# edges)

N nodes

For a d-dimensional grid, the slope is -1/d 11

“Min-cut” plot Min-cut sizes have important effects on graph properties, such as efficiency of divide-and-conquer algorithms compact graph representation difference of the graph from well-known graph types for example, slope = 0 for a random graph

12

Experiments Datasets: Google Web Graph: 916,428 nodes and 5,105,039 edges Lucent Router Graph: Undirected graph of network routers from www.isi.edu/scan/mercator/maps.html; 112,969 nodes and 181,639 edges User Website Clickstream Graph: 222,704 nodes and 952,580 edges

13

Experiments log(mincut/edges)/ #edges) log (mincut-size

Used the METIS algorithm [Karypis+, 1995] 0

google-averaged

-1

• Google

Slope~ -0.4

-2 -3

• Values along the yaxis are averaged

"Lip"

-4

Web graph

-5 -6

• We observe a “lip” for large edges

-7 -8 -9 0

5

10

15

log log(edges) (# edges)

20

25

• Slope of -0.4, corresponds to a 2.5dimensional grid! 14

Experiments

0

lucent-averaged

-1

Slope~ -0.57

-2 -3 -4 -5 -6 -7 -8 0

2

4

6

8

10

12

14

16

log(edges)

log (# edges)

Lucent Router graph

18

log(mincut/edges) log (mincut-size / #edges)

log (mincut-size / #edges) log(mincut/edges)

Same results for other graphs too… 0

clickstream-averaged

-0.5

Slope~ -0.45

-1 -1.5 -2 -2.5 -3 -3.5 -4 -4.5 0

2

4

6

8

10

12

14

16

18

20

log(edges)

log (# edges)

Clickstream graph 15

Observations Linear slope for some range of values “Lip” for high #edges Far from random graphs (because slope ≠ 0)

16

Outline Problem definition “Min-cut” plots ( +experiments) A-plots ( +experiments) Conclusions

17

A-plots How can we find abnormal nodes or subgraphs? Visualization but most graph visualization techniques do not scale to large graphs!

18

A-plots However, humans are pretty good at “eyeballing” data ☺ Our idea: Sort the adjacency matrix in novel ways and plot the matrix so that patterns become visible to the user

We will demonstrate this on the Lucent Router graph (112,969 nodes and 181,639 edges) 19

A-plots Three types of such plots for undirected graphs… RV-RV (RankValue vs RankValue) Sort nodes based on their “network value” (~first eigenvector) RD-RD (RankDegree vs RankDegree) Sort nodes based on their degree D-RV (Degree vs RankValue) Sort nodes according to “network value”, and show their corresponding degree 20

We can see a “teardrop” shape and also some blank “stripes” and a strong diagonal

Rank of Network Value

RV-RV plot (RankValue vs RankValue) Stripes

(even though there are no self-loops)! Rank of Network Value 21

The “teardrop” structure can be explained by degree-1 and degree-2 nodes

1

2

NV1 = 1/λ * NV2

Rank of Network Value

RV-RV plot (RankValue vs RankValue)

Degree−1 nodes

Degree−2 nodes Rank of Network Value 22

Strong diagonal nodes are more likely to connect to “similar” nodes

Rank of Network Value

RV-RV plot (RankValue vs RankValue)

Degree−1 nodes

Degree−2 nodes Rank of Network Value 23

RD-RD (RankDegree vs RankDegree) dots due to 2-node isolated components

Rank of Degree of node

• Isolated

Rank of Degree of node 24

D-RV (Degree vs RankValue) 2000 1800

Spikes

Degree Degree

1600 1400 1200 1000 800 600 400 200 0 0

50000

100000 150000 200000 250000 300000

Rank of Network Value Rank of Network Value

25

Explanation of “Spikes” and “Stripes” RV-RV plot had stripes; D-RV plot shows spikes. Why? 2−degree“Stripe” nodes nodes (forming Stripe)

“Spike” nodes high degree, but all edges to “Stripe” nodes Spike1

degree-2 nodes connecting only to the “Spike” nodes

Spike2

Stripe

2000 1800

Spikes

1600

Degree

1400 1200 1000 800 600

External connections

400 200 0 0

50000

100000 150000 200000 250000 300000 Rank of Network Value

26

A-plots They helped us detect a buried abnormal subgraph in a large real-world dataset which can then be taken to the domain experts.

27

Outline Problem definition “Min-cut” plots ( +experiments) A-plots ( +experiments) Conclusions

28

Conclusions We presented “Min-cut” plot A novel graph pattern with relevance for many algorithms and applications

A-plots which help us find interesting abnormalities

All the methods are scalable Their usage was demonstrated on large real-world graph datasets

29

We can see a “teardrop” shape and also some blank “stripes” and a strong diagonal.

Rank of Network Value

RV-RV plot (RankValue vs RankValue) Stripes

Rank of Network Value

30

RD-RD (RankDegree vs RankDegree) dots due to 2-node isolated components

Rank of Degree of node

• Isolated

Rank of Degree of node 31