
A VISUAL CANONICAL ADJACENCY MATRIX FOR GRAPHS

By HONGLI LI

ABSTRACT OF A DISSERTATION SUBMITTED TO THE FACULTY OF THE DEPARTMENT OF COMPUTER SCIENCE IN PARTIAL FULFILMENT OF THE REQUIREMENTS

FOR THE DEGREE OF DOCTOR OF SCIENCE IN COMPUTER SCIENCE UNIVERSITY OF MASSACHUSETTS LOWELL 2008

Dissertation Supervisor: Georges G. Grinstein, Ph.D. Professor, Department of Computer Science  

Abstract

Graph data mining algorithms rely on graph canonical forms to compare different graph structures. Most canonical form definitions depend on node and edge labels, so two topologically identical graphs with different node and edge labels may have two different canonical forms. A graph can be visualized as a node-link diagram or as an adjacency matrix. There are many different graph layout algorithms, but none of them can totally avoid the edge crossing and node overlapping problems, especially when a graph is large. Adjacency matrix representations do not have these problems, but how to order the nodes so that the resulting matrix helps a user explore the graph is another issue. There are some node reordering algorithms, but very few were designed for visualization, and the resulting node ordering is not stable: the same graph can generate different results. This research introduces an algorithm that produces a canonical visual matrix which depends only on the graph's topological information (no node or edge label information is required) and for which two structurally identical graphs have exactly the same visual matrix representation. Our procedure is based on the Breadth-First Search spanning tree. We designed several special rules and filters to guarantee uniqueness, and we also arrange similar nodes close to each other when possible. This unique visual matrix representation provides a persistence and stability that can be harnessed in data exploration and visualization.


Acknowledgement

Many people have helped me during my studies. First, I would like to thank Professor Georges G. Grinstein. This thesis would not have been possible without his guidance, support, and patience. Professor Grinstein is always available whenever needed. His enthusiasm and clear vision in the field of information visualization brought me into this area, and they will continue to inspire me. I would like to thank Professor Karen Daniels for her interest, patience, and deep knowledge in the field of graph theory. She provided me with many valuable ideas and also helped me with my thesis writing. I would also like to thank Professor R. Daniel Bergeron and Dr. Alexander G. Gee for their interest and patience. My special acknowledgement goes to everyone in the Institute for Visualization and Perception Research (IVPR), including current and former students and staff. I want to thank them for their friendship, collaboration, and help; IVPR is a fun place to work. I would like to thank Dr. Howie Goodell; it was one of his talks that inspired me to start this thesis topic. I am indebted to my family: my parents, my two sisters, my husband, and my beloved son. I would never have gotten this far in my life without their love and support. I dedicate this thesis to the memory of my dear mother, Zhongzhi Su.


Table of Contents

Abstract
Acknowledgement
List of Illustrations
List of Tables
1 Introduction
  1.1 Problem and Motivation
  1.2 Goals
  1.3 Contributions
  1.4 Overview
2 Background and Literature Review
  2.1 Graph Definition
  2.2 Graph Drawing Problem and Application
  2.3 Matrix Representation and Application
  2.4 Other Graph Visual Representations
  2.5 Graph Data Mining
    2.5.1 Graph Isomorphism Problem
    2.5.2 Graph Mining Applications
  2.6 Bandwidth Reduction Algorithms
    2.6.1 Cuthill-McKee Ordering
    2.6.2 Minimum Degree Ordering
  2.7 Summary and Discussion
3 Graph Canonical Matrix Ordering
  3.1 Properties of the Cuthill-McKee Ordering
  3.2 Determine the Starting Node
  3.3 Node Ordering at Each Level
    3.3.1 Symmetric Graph Structures
  3.4 Dealing With Multiple Candidate Orderings
    3.4.1 Global Penalty
    3.4.2 Pattern Matching
  3.5 Disconnected Graphs
  3.6 Nodes with Self-Loops
  3.7 Dense Graphs
  3.8 Discussion
4 Patterns of the Adjacency Matrix
  4.1 Clique
  4.2 Complete Bipartite Graph
  4.3 Path
  4.4 Highly Connected Node
  4.5 Other Patterns
5 System Implementation
  5.1 Data Structure
  5.2 File I/O
  5.3 User Interface
  5.4 Interactions
    5.4.1 Selection
    5.4.2 Highlight Skeleton Edges
    5.4.3 Reordering
    5.4.4 (Pseudo) Indistinguishable Nodes
    5.4.5 Other Interactions
  5.5 Non-Graphical Output
  5.6 Summary
6 Result
  6.1 Different Graph Models
  6.2 Patterns
  6.3 IMDB Movie Database
  6.4 Conclusion
7 Conclusion
  7.1 Contributions
  7.2 Future Works
Literature Cited
Biographical Sketch of Author

   

List of Illustrations

Figure 1: Structures of world trade of 1981 (Krempel 2003).
Figure 2: Seven Bridges of Königsberg problem (Euler 1736; Chartrand 1985). (a) The real map representation (Giuşcă 2005). (b) The graph representation (Biggs, Lloyd et al. 1986; Gross and Yellen 2004).
Figure 3: Six popular graph layout methods: tree, hierarchical, symmetric, circular, orthogonal, and force-based.
Figure 4: An example of the incremental layout (Corporation 2003).
Figure 5: An undirected graph with its adjacency matrix and adjacency list representations.
Figure 6: MatrixExplorer user interface (Henry and Fekete 2006).
Figure 7: Side-by-side comparison of the matrix representation and the node-link graph (Henry and Fekete 2006).
Figure 8: Connected components visualization (Henry and Fekete 2006).
Figure 9: NodeTrix representation of part of the InfoVis co-authorship network (Henry, Fekete et al. 2007).
Figure 10: (a) Map of the Market, a TreeMap representation of the current stock market (SmartMoney.com 2007). (b) Cushion TreeMap (Wijk and Wetering 1999).
Figure 11: 3D representations of node-link graphs. (a) Visualization of network distribution (Munzner 2000). (b) Horizontal ConeTree (Robertson, Mackinlay et al. 1991).
Figure 12: Two identical graphs that look very different because of different layouts.
Figure 13: An example of the SUBDUE algorithm iteratively finding the best repeated substructure to compress the graph (Cook and Holder 2000).
Figure 14: (a) Apriori algorithm. (b) Subroutine apriori-gen for candidate generation.
Figure 15: Modified adjacency matrix for a graph (Inokuchi, Washio et al. 2000).
Figure 16: Three DFS trees of the same labeled graph (Fan and Han 2002).
Figure 17: gSpan algorithm and its subroutine (Fan and Han 2002).
Figure 18: A canonical form testing function for DFS- and BFS-defined canonical forms (Borgelt 2005).
Figure 19: A visual explanation of the matrix bandwidth reduction effect.
Figure 20: Cuthill-McKee ordering algorithm (Cuthill and McKee 1969; Cormen, Leiserson et al. 1990).
Figure 21: Minimum degree ordering.
Figure 22: An example of updating a graph after the center node has been deleted.
Figure 23: Matrix and node-link representations of the same graph dataset, before and after reverse Cuthill-McKee reordering.
Figure 24: The node ordering is a level-order traversal of the BFS spanning tree.
Figure 25: A graph and one of its BFS spanning trees.
Figure 26: The algorithm for finding a pseudo-peripheral node (Gibbs, Poole et al. 1976; George and Liu 1979).
Figure 27: Two other BFS spanning trees of the graph shown in Figure 25(a).
Figure 28: The procedure that returns a list of candidate starting nodes.
Figure 29: A random graph with three of its BFS spanning trees and the corresponding adjacency matrices.
Figure 30: An example of two pseudo-indistinguishable nodes.
Figure 31: A subgraph in a larger graph whose nodes are not pseudo-indistinguishable but whose order can be interchanged without changing the visual matrix.
Figure 32: Four examples showing how to decide the order between two nodes.
Figure 33: The procedure to order two nodes with the same degree.
Figure 34: A 3-regular graph with two of its spanning trees rooted at the same node.
Figure 35: The procedure for testing whether two substructures are symmetric (isomorphic) to each other.
Figure 36: Two adjacency matrices of two different orderings, with global penalties 41 and 40.
Figure 37: Two different orderings of the same graph with exactly the same visual pattern in their adjacency matrices.
Figure 38: Pattern matching procedure.
Figure 39: The graph partition procedure.
Figure 40: A clique in matrix and node-link representations.
Figure 41: A complete bipartite graph in matrix and node-link representations.
Figure 42: A simple bipartite graph; in the matrix, the edges form a triangular shape.
Figure 43: In the adjacency matrix, a line parallel and next to the main diagonal represents a path.
Figure 44: A high degree node forms a horizontal/vertical bar in the matrix representation.
Figure 45: Data structure for our matrix graph visualization framework.
Figure 46: An example of the GraphViz file format (Gansner, Koutsofios et al. 2006).
Figure 47: A simple graph with its GraphML file description (Brandes, Eiglsperger et al.).
Figure 48: A graph and its corresponding XML file format.
Figure 49: Graph matrix visualization interface.
Figure 50: Linking between the matrix and the node-link representations.
Figure 51: The selected subgraph viewed in a separate window as a node-link graph.
Figure 52: Skeleton edges highlighted in purple in the matrix display.
Figure 53: Skeleton edges marked in the node-link graph display.
Figure 54: The same matrix as in Figure 52 but before the reordering.
Figure 55: Manual reordering in the matrix display.
Figure 56: Labels of indistinguishable and pseudo-indistinguishable nodes highlighted in red.
Figure 57: Zooming in the matrix display, activated by the mouse wheel.
Figure 58: Probing in the matrix display.
Figure 59: A 3-regular graph of size 16 with its node-link representation and three of its vertex orderings.
Figure 60: A synthetic dense graph with 30 nodes and 300 edges, before and after our reordering procedure.
Figure 61: A synthetic graph and its canonical visual matrix, with patterns highlighted in colored boxes.
Figure 62: A modified version of the graph in Figure 61, with more edges connecting nodes in different clusters.
Figure 63: A further modified version of the graph in Figure 61.
Figure 64: Movies released in 2006: 3953 nodes (movies), 12957 edges, and 867 connected subgraphs.
Figure 65: A cluster highlighted in red in the matrix representation, with its corresponding node-link subgraphs.
Figure 66: A clique and several bars next to it in the canonical matrix ordering of subgraph 1.
Figure 67: A node-link visualization of the clique in Figure 65; all these movies are connected because Larry Laverty participated in all 14 of them in 2006.
Figure 68: Horizontal and vertical bars, formed by tree edges of the corresponding BFS tree, wrap all edges.

List of Tables

Table 1: Several computation factors from applying our procedure to several synthetic graphs generated with the Barabasi graph generator (http://www.cs.ucr.edu/~ddreier/barabasi.html) (Barabasi and R. 1999).
Table 2: Several computation factors for the IMDB movie data sets of different years.


1 Introduction

“A picture is worth a thousand words.” There is no doubt that an effective graphic can help us understand information better. Using graphics to represent information dates back at least as far as geographic maps engraved on clay, long before the Common Era (Bertin 1984). We have learned, practiced, and come a long way to reach our current knowledge of how graphics can help us understand the world. We now have many different visualization and interaction techniques; which to use depends on the data's properties and our specific requirements. A graph is a natural way to model relational, structured information, where nodes represent entities and edges represent associations between them. Graphs have been applied in many areas such as social networks, biological pathways, Internet topologies, and many more. Graph data can be very complicated. When displaying a graph, we need to show not only the different entities (nodes) but also the complicated relationships (edges) among them; they are equally important. One common way to display a graph is the node-link representation. If a graph is reasonably small and can be nicely laid out, then the node-link representation is not hard to understand. Figure 1 shows a graph of world trade in 1981, where nodes represent countries and edges represent trading among them (Krempel 2003). As we can see from this display, color, node size, and edge width

   


are all used to convey different aspects of the data. In most cases, we have to use a 2D display to show a graph. When a graph is large (in terms of the number of nodes and edges), it is hard to avoid node overlapping and edge crossing problems. This is one of the main focuses of graph drawing research, and it is also a very hard problem.

  Figure 1: Structures of World Trade of 1981 (Krempel 2003).

General graph mining, on the other hand, focuses on finding algorithms to identify interesting patterns. It tries to answer questions like “which (sub)structure appears most frequently?”, “is this subgraph one of the substructures in that (larger) graph?”, and many more. In graph mining, we need to compare graphs in order to tell whether they are the same. Comparing graphs is not an easy task; some of its related problems have been proven to be NP-complete.

   


1.1 Problem and Motivation

Graph visualization and graph mining are two research areas that have been studied intensively for years, and they could benefit each other. Most network visualization systems use the node-link representation (Henry, Fekete et al. 2007). There are different graph layout algorithms and many aesthetics that help display node-link visualizations reasonably well. But these aesthetics often conflict with each other, and it is very hard to satisfy them all; several graph layout problems have been proven to be NP-complete. A matrix is an alternative way to represent a graph. Recent research shows that a properly vertex-ordered matrix can also reveal special graph structures and aid in the graph exploration process (Mueller, Martin et al. 2007). Graph data mining's main goal is to find interesting patterns algorithmically. Much of this work has focused on finding frequent (sub)graphs, which requires comparing graphs and telling whether they are the same. How to compare two graphs is one of the most difficult problems in this area. It is called the Graph Isomorphism (GI) problem: so far, there is no known polynomial-time algorithm for it, nor has it been proven to be NP-complete (Corneil and Kirkpatrick 1980; Fortin 1996). There are many existing vertex reordering procedures; many of them were designed for matrix computation or to decrease graph storage space. Although very few methods were designed purely for visualization, many have shown potential in graph matrix visualization applications (Mueller, Martin et al. 2007).

   


These include clustering algorithms and matrix bandwidth reduction algorithms. But none of these algorithms produces a stable visual matrix, which means two different matrices might represent exactly the same graph structure. The vertex ordering result can be affected by many factors, for example, the initial vertex order, the starting node, and the traversal strategy used to visit every node in the graph. Our understanding of matrix patterns is still limited. Many users cannot associate a pattern in a graph's adjacency matrix with the (sub)structure it represents. This makes the matrix even harder to accept as an alternative, efficient way to visualize a graph.

1.2 Goals

If we can produce a canonical visual matrix that has a one-to-one relationship with its graph structure, then we can maintain a stable mental map. When we see two different canonical visual matrices, we know right away that the graphs are different. In other words, two graphs have the same canonical visual matrix if and only if they are structurally identical, that is, isomorphic to each other. Node labels, edge labels, and other non-structural factors have no influence on the canonical visual matrix. Our goal is to find a good graph matrix representation that can assist graph visual exploration. We would like our vertex reordering procedure not only to produce a stable visual matrix but also to help identify interesting graph patterns.

   


To take full advantage of the graph's matrix representation, we need to familiarize ourselves with this special visualization. We need a clear understanding of the relationship between patterns in the matrix and their corresponding graph structures. Finally, we have to test the correctness of our algorithm for producing a canonical visual matrix, understand its potential in assisting the graph visual mining process, and build a prototype that allows us to run experiments.

1.3 Contributions

This research provides the following main contributions to the areas of graph visualization and graph mining.

1. A detailed explanation of existing vertex ordering procedures.
2. A procedure that produces a canonical visual matrix for a graph, together with a detailed analysis of the special properties of this canonical visual matrix and a proof of its correctness.
3. Examples of matrix patterns and their corresponding special graph structures.
4. A prototype application with interactions specially designed for graph matrix representations. This gives us a test bed for our algorithm and also allows us to examine its potential in assisting visual graph mining.

   


1.4 Overview

In Chapter 2, we survey existing graph and matrix visualization applications, vertex ordering algorithms, and graph mining algorithms. In Chapter 3, we show how to achieve the canonical graph visual matrix; we then discuss our approach and prove its correctness. Chapter 4 provides examples of patterns in matrices and how they relate to particular graph structures. Chapter 5 discusses our prototype application: the data structure, the interface, and some specially designed interactions for matrix visualization. In Chapter 6 we show the results of some evaluation tests, and we discuss future work in Chapter 7.

   


2 Background and Literature Review

Graph theory begins with the famous Seven Bridges of Königsberg problem (Euler 1736; Chartrand 1985). The problem asks whether it is possible to start from one place, traverse all the bridges exactly once, and return to the originating place. Euler represented the problem as a graph, with lands represented by nodes and bridges by edges (Euler 1736), as shown in Figure 2. The problem then becomes how to draw the graph shown in Figure 2(b) with the constraint of not retracing any line and without picking the pencil up off the paper (Biggs, Lloyd et al. 1976).

Figure 2: Seven Bridges of Königsberg problem (Euler 1736; Chartrand 1985). (a) The real map representation (Giuşcă 2005). (b) The graph representation (Biggs, Lloyd et al. 1986; Gross and Yellen 2004).

   


2.1 Graph Definition

Many problems in the real world can be modeled as graphs. Here is a formal definition of a graph and some important graph concepts we are going to use in this thesis. These definitions are from (Gross and Yellen 2004).

Definition 1: A graph $G = (V, E)$ consists of two sets $V$ and $E$. The elements of $V$ are called vertices (or nodes) and the elements of $E$ are called edges (or links, connections, arcs, etc.). Each edge has a set of one or two vertices associated with it; these vertices are called the endpoints of the edge, and an edge is said to join its endpoints. In this thesis we will use these terms interchangeably. An edge $(u, v)$ is directed if there is a specified ordering to its endpoints ($u$ comes before $v$). An edge can also be bidirectional, which is normally represented by two edges of opposite direction.

Definition 2: A directed graph is a graph where each of its edges is directed.

Definition 3: A walk in a graph $G$ is an alternating sequence of vertices and edges, $v_0, e_1, v_1, e_2, v_2, \ldots, e_n, v_n$, such that for $i = 1, \ldots, n$ the vertices $v_{i-1}$ and $v_i$ are the endpoints of the edge $e_i$. If, moreover, each edge $e_i$ is directed from $v_{i-1}$ to $v_i$, then the walk is a directed walk. In a simple graph (Definition 7), a walk may be represented simply by listing a sequence of vertices $v_0, v_1, \ldots, v_n$ such that for $i = 1, \ldots, n$ the vertices $v_{i-1}$ and $v_i$ are adjacent (joined by an edge).

Definition 4: A graph is connected if between every pair of vertices there is a walk.

Definition 5: A trail in a graph is a walk such that no edge occurs more than once.

Definition 6: A path in a graph is a trail such that no internal vertex is repeated.

Definition 7: A simple graph is a graph that has no self-loops or multi-edges.

Since many problems regarding general graphs can be reduced to problems about simple graphs, we will concern ourselves only with simple, unweighted, undirected, and connected graphs. The only exception is self-loops: we will show in Chapter 3 that our algorithm extends to these special cases if they are treated specially.

Definition 8: A graph $H = (V_H, E_H)$ is called a subgraph of a graph $G = (V, E)$ if $V_H \subseteq V$, $E_H \subseteq E$, and $V_H$ contains all the endpoints of all the edges in $E_H$.

Definition 9: In a graph $G$, the induced subgraph on a set of vertices $W = \{w_1, \ldots, w_k\}$, denoted $G(W)$, has $W$ as its vertex set, and it contains every edge of $G$ whose endpoints are in $W$. That is, $E(G(W)) = \{e \in E \mid \text{the endpoints of } e \text{ are in } W\}$.

Definition 10: A tree is a connected graph with no cycles.

Definition 11: A forest is a (not necessarily connected) graph with no cycles.
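To make Definitions 3 through 6 concrete, here is a small illustrative sketch (ours, not part of the original text; written in Python, with the graph assumed to be a dictionary mapping each vertex to the set of its neighbors). It tests whether a vertex sequence is a walk, a trail, or a path in a simple graph.

    # Illustrates Definitions 3-6; the format {vertex: set_of_neighbors} is assumed.
    def is_walk(graph, seq):
        # Definition 3: consecutive vertices must be adjacent.
        return all(v in graph[u] for u, v in zip(seq, seq[1:]))

    def is_trail(graph, seq):
        # Definition 5: a walk in which no edge occurs more than once.
        edges = [frozenset(p) for p in zip(seq, seq[1:])]
        return is_walk(graph, seq) and len(edges) == len(set(edges))

    def is_path(graph, seq):
        # Definition 6: a trail with no repeated vertex (for open paths).
        return is_trail(graph, seq) and len(seq) == len(set(seq))

    g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
    print(is_walk(g, [0, 1, 2, 0, 2]))   # True: a walk may reuse the edge {0, 2}
    print(is_trail(g, [0, 1, 2, 0, 2]))  # False: the edge {0, 2} occurs twice
    print(is_path(g, [0, 1, 2, 3]))      # True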

Definition 12: For a given tree $T$ in a graph $G$, the edges and vertices of $T$ are called tree edges and tree vertices, and the edges and vertices of $G$ that are not in $T$ are called non-tree edges and non-tree vertices.

Definition 13: A spanning tree of a graph $G$ is a spanning subgraph of $G$ that is a tree.

Definition 14: For two vertices $u$ and $v$ in a graph $G$, the distance $d(u, v)$ from $u$ to $v$ is the length (number of edges) of a shortest $u$-$v$ path in $G$.

Definition 15: For a vertex $v$ in a connected graph $G$, the eccentricity of $v$ is the distance from $v$ to a vertex farthest from $v$. That is, $ecc(v) = \max\{d(v, u) \mid u \in V\}$.

Definition 16: The minimum eccentricity among the vertices of a connected graph $G$ is the radius of $G$, denoted $rad(G)$, and the maximum eccentricity is its diameter, $diam(G)$.

Definition 17: A vertex $v$ in a connected graph $G$ is a peripheral vertex if $ecc(v) = diam(G)$, while a vertex $v$ is a central vertex if $ecc(v) = rad(G)$.

Definition 18: The subgraph induced by the central vertices of a connected graph $G$ is the center of $G$, denoted $C(G)$, and the subgraph of $G$ induced by its peripheral vertices is its periphery, $Per(G)$.

from a vertex is a tree, rooted at , that

contains all the vertices that are reachable from . The path in

from to any vertex

is

a shortest path in . Any breadth –first tree is a shortest path tree. Definition 20: For a given tree T in graph G, the edges and vertices of T are called tree edges and tree vertices, and the edges and vertices of G that are not in T are called non-tree edges and non-tree vertices. Definition 21: An adjacency matrix representation for a simple graph or directed ,

graph

is a | | ,

vertex to vertex ;

| | matrix , where

,

1 if there is an edge from

0 otherwise.

Definition 22: An adjacency list representation for a graph or directed graph , pointer

is an array

of | | lists, one for each vertex in . For each vertex  , there is a

to a linked list containing all vertices  adjacent to . Definition 23: Two simple graphs 

that for any two vertices , ,

in

and  

,

are

, if there is a one-to-one, onto mapping   :

isomorphic, denoted 

edge

,

, there is an edge

,

in

, such

if and only if there is an

. Such an adjacency-preserving bijection

is called an

isomorphism. Definition 24: A labeled graph is a graph whose vertices and/or edges are labeled, possibly with repetitions, using symbols from a finite alphabet. If some vertices and/or

   

 12   

edges have no labels, then they can be regarded as having a special label different from the rest. Thus, we may always assume that all vertices and all edges are labeled. Definition 25: Two labeled graphs  

,

and  

isomorphic as labeled graphs if there is an isomorphism   : each  

, the vertices

and

,

,

are

, such that for

have the same label.

Definition 26 Given a graph , a permutation if

,

of

, for all ,

is an automorphism of .

The node-link diagram (also called graph drawing) and the adjacency matrix are two common ways to represent a graph. In a node-link diagram, entities are represented by nodes and connections between them are denoted by edges. When a graph’s size (number of nodes) becomes large, the nodes might overlap and edges intersect with each other. How to lay out nodes and arrange edges to avoid these problems are major concerns among the graph drawing community.

2.2 Graph Drawing Problem and Application How to draw a node-link graph so it can be easily understood is one of the big questions in the graph drawing community. Based on theories and also experiences, people have developed a set of general rules that could help to draw a graph nicely. Drawing conventions are a set of basic regulations that a drawing should obey to be acceptable. Depending on the specific application, these conventions could be very

   

 13   

complex. Generally these conventions involve how to draw a line and place a vertex so the whole graph is relatively easy to read. They might include drawing each edge as a polygonal chain (Polyline Drawing), a straight line segment (Straight-line Drawing), or a polygonal chain of alternating horizontal and vertical segments (Orthogonal Drawing), and avoid edges crossing (Planar Drawing); draw vertices and edge bends in a place that has integer coordinates (Grid Drawing). For directed acyclic graphs the edges are drawn as curves strictly upward or downward (Upward/Downward drawing) (Battista, Eades et al. 1998). Drawing aesthetics are things we would like to follow to make a graph drawing readable. These rules include minimizing edge crossing; minimizing the total drawing area (smallest convex hull that covers the drawing); minimizing the total length of the edges; maximizing the edge length, etc.(Battista, Eades et al. 1998). We can easily tell that some of these aesthetics are in conflict with each other, and even in some special case where there is no conflict, it is still hard algorithmically to satisfy all of them at the same time. Tradeoffs are needed in most of the cases and how to prioritize them varies with different requirements. There are several common approaches to draw a graph, such as the divide and conquer approach, the hierarchical approach, the force-directed approach and etc. The most popular graph layout applications based on these approaches includes tree layout, hierarchical layout, symmetric layout, circular layout, orthogonal layout, and force layout. Examples of these layout results are showing in Figure 3.

   

 14   

 

 

(a)

(b)

 

 

(c)

(d)

 

  (e)

(f)

Figure 3. Six popular graph layout methods. (a) Tree layout. (b) Hierarchical layout. (c) Symmetric layout. (d) Circular layout. (e) Orthogonal layout. (f) Force-based layout.

   

 15   

Tree layout, as we can guess from the name, consists of layout algorithms especially for trees or graph structures that are very similar to a tree. They often display a graph as a tree with the root on the top or upper left of the display and clearly show the structure of the tree (Garg and Tamassia 2002). Figure 3 (a) is an example of a tree layout. Note, there is a cycle in this structure, edges in this cycle are colored in red. Hierarchical layouts are designed mostly for directed graphs. Given a graph structure the algorithm breaks those nodes into hierarchical layers and arranges each layer’s nodes based on certain criteria with the goal to minimize edge crossing and bending (North and Woodhull 2001). Figure 3 (b) is an example of a hierarchical layout. With some modification, hierarchical layouts can also be applied to undirected graphs. Symmetric layout algorithms try to find “symmetric groups” within the graph structure and then lay out the graph. We can see many of its applications in the bioinformatics area (Hong and Eades 2003). Figure 3 (c) is an example of a symmetric layout. Circular layout or radial layout algorithms constrain the vertices along the perimeter of a circle. In circular layouts, the crossing minimization problem is NP-hard (Masuda, Kashiwabara et al. 1987) and researchers seek heuristics that simplify the problem (Mäkinen 1988; DoÊrusöz, Madden et al. 1996; Six and Tollis 1999) Figure 3 (d) is an example of a circular layout.

   

 16   

With orthogonal layouts, the edges run horizontally or vertically, and the goal is to minimize the line crossing and area covering (Eiglsperger, Fekete et al. 2001). Orthogonal layouts are used extensively in the areas of VLSI and PCB design. Figure 3 (e) is an example of an orthogonal layout. Force layout has drawn a great deal of attention because it is data driven and sometimes could bring out interesting patterns. This algorithm treats a graph as a virtual physical system. The core is an energy model which assigns forces to nodes and edges. The most common method is to simulate a spring system, with edges as springs and nodes as electrically charged particles. Usually, the algorithm starts with vertices at random positions, and then goes into iterations with the goal to minimize the system’s total energy. The procedure ends when whole system’s energy becomes less than a predefined threshold. One of the advantages of this layout is that it does not require any knowledge about graph theory and it often clusters nodes according to their connectivity. The theory behind this algorithm is easy to understand but its running time is generally high (Garg and Tamassia 1994). No matter what layout algorithm is selected, when dealing with dynamic data, the layout of the overall graph might change dramatically as new data is coming in or existing data is being deleted. The constantly changing image, most often discrete or jumping, is disturbing to the user because it is hard to keep a consistent mental map. Misue et al. proposed the incremental layout (North 1996) to ease this problem; it tries to preserve the mental map or keeps dynamic stability when the data is changing. In other

   

 17   

words, the goal is to minimize the overall layout changes. Of course an incremental layout might have to compromise some of the layout aesthetics aspects if needed, but if we could keep a relatively stable mental map, user might be able to understand the changing more easily. Frishman and Tal (Frishman and Tal 2004) applied this idea to a clustered graph. In their algorithm, they insert some dummy nodes called spacers to each cluster. When a new vertex is being added, it will be first placed into one of these spacers, and if the user wants to delete some nodes, these nodes will become new spacers. Animation can also help to smoothly show the changes while data is changing.

  (a)

(b)

Figure 4: An example of the incremental layout (Corporation 2003). (a) The original graph. (b) Four new edges are added, and the graph layout changed as minimally as possible in order to keep the original look while at the same time try to satisfy some of the graph drawing aesthetics.

2.3 Matrix Representation and Application In computer systems or mathematics, a graph can be modeled by an adjacency matrix, or a collection of adjacency lists (Cormen, Leiserson et al. 1990). There are several ways to reduce the storage space or increase efficiency of the graph structure. For

   

 18   

exxample, the adjacency matrix m of an undirected u simple graphh is always syymmetric aloong thhe main diag gonal, so we can cut abouut half of thee storage spaace by just saving the poortion abbove the diaagonal. The bandwidth b reeduction algorithms thatt we are goinng to introduuce in seection 2.4 co ould move noon-empty elements (edgges) in the grraph matrix closer c to the m diagonaal, which cann also be useed to reduce the graph stoorage space by efficientlly main coompressing the graph daata.

 

  (a a)

(b)

(cc)

Figure 5: An un ndirected grap ph and its twoo other representations. (a).. An undirecteed graph with 4 veertices and 5 edges. e (b) Thee adjacency maatrix represen ntation. (c) Th he adjacency liist representattion

Bertin n introduced visual matriices to repreesent networkks (Bertin 19984). Recenttly, m more and morre researchers have beguun looking for fo alternativve ways that could avoid prroblems in graph g drawinng. Compareed to traditioonal methodss, the graph’s matrix reepresentation n does not have the probblem of nodee overlappingg or edge croossing, and its i laayout is relattively straighht forward. To T evaluate readability, r Ghoniem et al. (Ghoniem m, Fekete et al. 2004) 2 perforrmed an experiment to compare node-link diagraams with thee m matrix-based representatiion. They gaave users 7 different d taskks, which weere “nodeCouunt”, “eedgeCount”,, “mostConnnected”, “finndingNode”, “findingLinnk”, “findinggNeighbor”, and

   

 19   

“findPath”. They let users solve these problems with either a node-link graph or a matrixbased representation. The results were quite unexpected: the matrix-based representation outperformed the node-link graph representation in all tasks except for “findPath”. Ghoniem also pointed out that most users who went through this experiment were actually more familiar with node-link representation. Henry et al. (Henry and Fekete 2006) proposed the MatrixExplorer which uses both matrix and node-link representations to show a graph. This tool was designed for sociologists who are actually very familiar with matrix-based graph representations. The node-link diagram view was added into this tool for other users’ convenience and also to “publish or communicate exploration results.” Major designs of MatrixExplorer come from a list of requirements collected from social science researchers. MatrixExplorer also provides a set of interaction and analytic tools such as filtering, clustering and matrix reordering mechanisms to help a user explore graph data. Figure 6 shows the user interface of the MatrixExplorer.


  Figure 6: MatrixExplorer User Interface. On the left is the matrix representation; on the right is the node-link diagram of the same graph (Henry and Fekete 2006).

MatrixExplorer is the first tool that applies matrix reordering to facilitate graph visualization. The reordering algorithms applied here are similar to those used to rearrange microarray data. The tool provides two reordering algorithms; the basic idea is to put similar columns (rows) close to each other. The resulting visualization shows blocks in the matrix representation. Interestingly, these clusters correspond to cliques and hubs (high-degree nodes) in the node-link representation. Figure 7 is a side-by-side comparison of the clustered blocks in the reordered matrix and the corresponding node-link diagram.
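The idea of similarity-driven reordering can be sketched very simply. The following Python fragment is my own greedy stand-in, not MatrixExplorer's actual algorithm (which derives from microarray reordering methods): it chains rows so that each row is followed by the most similar remaining row.

    def reorder_rows(matrix):
        """Greedy similarity chaining for a square 0/1 matrix."""
        n = len(matrix)

        def similarity(a, b):
            # count positions where the two rows agree
            return sum(x == y for x, y in zip(matrix[a], matrix[b]))

        order, remaining = [0], set(range(1, n))
        while remaining:
            nxt = max(remaining, key=lambda r: similarity(order[-1], r))
            order.append(nxt)
            remaining.remove(nxt)
        return order  # apply the same permutation to rows and columns

    example = [[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]]
    print(reorder_rows(example))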

   


Figure 7: Side-by-side comparison of the matrix representation and the node-link graph. We can see that after reordering, blocks in the matrix representation correspond very well with clusters in the node-link graph. A is a highly connected node and B a clique (Henry and Fekete 2006).

Figure 8: Connected components visualization (Henry and Fekete 2006).

Besides matrix reordering, MatrixExplorer also includes a connected-components visualization as a way to represent the macro-structure of a graph, as well as a filtering tool that the user can control. This tool is designed purely for the needs of social scientists. Figure 8 shows the connected-components visualization: the size and color of each square are both mapped to the number of vertices in the component, and the squares are arranged from top left to bottom right by decreasing size.

   


Figure 9: NodeTrix representation of part of the infovis co-authorship network, where each "node" is a matrix that represents a substructure (Henry, Fekete et al. 2007).

NodeTrix (Henry, Fekete et al. 2007) goes one step further in integrating the matrix and node-link representations. Each "node" in NodeTrix is actually a matrix that represents a substructure; when a substructure contains only one node, the matrix becomes a standard node. Moreover, a user can interactively expand a matrix node into a standard node-link graph, or select any subgraph and compress it into a matrix. NodeTrix thus shows the overall structure while at the same time showing detailed information through the matrices.

2.4 Other Graph Visual Representations

TreeMap is a visual representation for hierarchical data. As the name indicates, it is especially targeted at tree structures (Bederson, Shneiderman et al. 2002). Figure 10 shows examples of TreeMaps. A TreeMap uses nested rectangles to represent the hierarchical structure, with levels of nesting representing levels of the hierarchy. Figure 10(a) shows one application of the TreeMap, the Map of the Market (SmartMoney.com 2007), a system that monitors the stock market live. TreeMaps use most of the display space and also show detailed leaf information that can help a user easily identify outliers or other interesting patterns. On the other hand, they also "hide" the hierarchical structure of the tree: especially for users who are not familiar with the concept of TreeMaps, it is hard to tell the overall hierarchical structure. Several extensions of TreeMaps have been proposed to overcome the disadvantages of the original design. The Cushion TreeMap (Wijk and Wetering 1999), shown in Figure 10(b), is one of them. It uses shaded, cushion-like 3D mounds to show the depth of the nesting. The display is better than the original, but the hierarchical structure of the tree is still not easy to identify.

       (a)

  (b)

Figure 10: (a) Map of the Market, a TreeMap representation of the current stock market. The color represents the performance of a stock, with red representing down and green up (SmartMoney.com 2007). (b) Cushion TreeMap (Wijk and Wetering 1999).

We live in a 3D world, and 3D displays can also be used to visualize graphs. Compared to 2D visualizations, the added dimension allows more information to be rendered. 3D visualization is commonly employed where the information itself has 3D properties or can be mapped to a 3D space that helps express the data. Figure 11(a) is a 3D visualization of the Internet's worldwide multicast backbone topology (Munzner 2000). The data itself has a 3D spatial interpretation, supported by the image of the globe and the map, so a user can easily understand the basic meaning of the graph. Figure 11(b) is a 3D horizontal cone tree (Robertson, Mackinlay et al. 1991). Here the data does not have any 3D characteristics, but 3D is used to help lay out the nodes and links. 3D visualizations are usually accompanied by navigation strategies so that a user can explore the hidden data.

(a) (b)

Figure 11: 3D representations of node-link graphs. (a) Visualization of network distribution (Munzner 2000). (b) Horizontal ConeTree (Robertson, Mackinlay et al. 1991).

2.5 Graph Data Mining

The goal of graph data mining is to find frequent (sub)graph structures. Graph data mining requires comparing graphs, which brings up the famous graph isomorphism problem.

   


2.5.1 Graph Isomorphism Problem

In Figure 12, a graph is drawn in two different layouts, and the two drawings look very different. Comparing two graphs is not an easy task because of the complexity of graph structure. The graph isomorphism problem has been a hot research topic for decades (Read and Corneil 1977; Fortin 1996). One interesting fact is that it has neither been proved NP-complete nor been given a polynomial-time algorithm (Corneil and Kirkpatric 1980; Fortin 1996). It is believed to belong to a category between P and NP, and a new complexity class, GI, has even been named for it (Skiena 1990). Everyone agrees that GI is a challenging problem. It is believed that GI can be solved in "moderately exponential time", meaning the running time is within $O(c^{\,n^{b}})$ for real constants $c > 1$ and $b < 1$ (Babai 1981; Fortin 1996). Some GI-related problems have been proved NP-complete; these include the subgraph isomorphism problem (Ullman 1976; Garey and Johnson 1979) and the graph symmetry detection problem (Manning 1990; Fraysseix 1999).

   


 

  (a)

(b)

Figure 12: Two identical graphs which look very different because of different layouts.

In practice, people try different ways of testing whether two graphs are isomorphic. One very practical technique is to use vertex invariants; for example, one can test the number of nodes and the degree of each node. There are many other invariants, some very simple and some very complicated. No polynomially computable invariant has been found that can guarantee that two graphs passing the invariant test are isomorphic (Fortin 1996). A common way to use invariants is to define multiple different vertex invariants and use them to help prune the search space. Nauty (No automorphisms, yes?) is currently the fastest application for graph isomorphism testing. It computes the automorphism groups of a graph and uses them to prune the search space efficiently, and it also uses a set of vertex invariants to help the computation (McKay 1981; McKay 2007). Nauty can also generate a canonical labeling for the input graph. In the worst case, the running time of Nauty is still exponential (Miyazaki 1997).
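As a small illustration of invariant-based pruning (my own helper functions, not Nauty's code), the sketch below compares a few cheap invariants; a disagreement proves non-isomorphism, while agreement only means a full, expensive search is still required.

    def invariants(adj):
        """adj: dict mapping each vertex to a set of neighbors."""
        n = len(adj)
        m = sum(len(nbrs) for nbrs in adj.values()) // 2
        degree_sequence = sorted(len(nbrs) for nbrs in adj.values())
        return (n, m, degree_sequence)

    def maybe_isomorphic(adj1, adj2):
        # False rules out isomorphism; True only means "not yet ruled out".
        return invariants(adj1) == invariants(adj2)

    g1 = {0: {1, 2}, 1: {0}, 2: {0}}                 # a path on three vertices
    g2 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
    print(maybe_isomorphic(g1, g2))                  # True: a full test is still needed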

   


2.5.2 Graph Mining Applications

Much work has been done on graph data mining and graph visualization. Here are some of the commonly used approaches, techniques, and algorithms.

SUBDUE  

SUBDUE (Holder, Cook et al. 1994; Coble, Rathi et al. 2005; Ketkar, Holder et al. 2005) is considered to be one of the first systems for graph mining. It iteratively identifies sub-structures and then uses them to compress the original graph data. SUBDUE can do both unsupervised and supervised learning, as well as clustering and graph grammar learning. It uses the Minimum Description Length (MDL) principle (Grunwald 2005) as the metric to evaluate how good a candidate sub-structure is. MDL states that any regularity in the data can be used to compress the data; a score is calculated for each of these regularities to evaluate how well it describes the original data, and the one that gives the minimum description length is the best. SUBDUE tries to find this best regularity (repeated sub-structure) and uses it to compress the graph recursively. Figure 13 shows how SUBDUE works; as we can see, it builds a hierarchical clustering of the input graph. For supervised learning, the dataset has clearly indicated "positive" and "negative" graphs (positive and negative indicate two different classes), and SUBDUE finds a subgraph that appears often in the positive graphs but not in the negative ones (Cook and Holder 2000).

   


Figure 13: One example of the SUBDUE algorithm iteratively finding the best repeated sub-structure to compress the graph. The algorithm allows a certain degree of inexact matching in the sub-structure, as we can see in (Cook and Holder 2000).

Apriori-based Frequent Graph Mining

Many frequent graph mining algorithms are inspired by the frequent itemset mining algorithm Apriori (Agrawal and Srikant 1994), the first efficient frequent itemset mining algorithm. Frequent itemset mining is also called market basket analysis; it aims to find regularities in customers' shopping behavior, for example, that milk and cereal are often bought together. Here is a formal definition of a frequent itemset derived from (Agrawal and Srikant 1994).

Definition 27: Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items. Let $D$ be a set of transactions, where each transaction $T$ is a set of items such that $T \subseteq I$. We say that a transaction $T$ contains $X$, a set of items in $I$, if $X \subseteq T$. An itemset $X$ has support $s$ in the transaction set $D$ if $s\%$ of the transactions in $D$ contain $X$. An itemset is called frequent (or large) if its support is greater than the user-specified minimum support (called minsup).

   


A naïve frequent itemset mining algorithm would examine all possible combinations of the items, compute the support for each of them, and then return those with support larger than the minimum support $s$ as the result. For a dataset with $m$ items, the total number of possible combinations is $2^m$, and for a large $m$ the computation is very expensive. In order to get better performance, the algorithm needs to prune the search space intelligently by eliminating unnecessary tests while at the same time guaranteeing the completeness of the result. Agrawal et al. point out an important rule (Agrawal and Srikant 1994): any subset of a frequent itemset must be frequent. Using this rule, the candidate itemsets of size $k+1$ can be generated by joining two frequent itemsets of size $k$ and then eliminating those $(k+1)$-sized itemsets that have a $k$-subset which is not frequent. This dramatically reduces the search space and makes mining frequent itemsets in large databases more feasible. To avoid further unnecessary tests, the algorithm defines a lexicographic order: items in a frequent itemset are lexicographically ordered, so to generate a candidate itemset of size $k+1$, the algorithm only needs to merge two frequent itemsets of size $k$ that differ only in their last item. Figure 14 shows the Apriori algorithm and its subroutine apriori-gen (Agrawal and Srikant 1994).

   

function apriori(D: transaction set, minsup: real)
begin
    L1 := {large 1-itemsets in D}
    k := 2
    while L(k-1) is not empty do begin
        Ck := apriori-gen(L(k-1))                  // generate candidate itemsets of size k
        for each transaction t in D do begin
            Ct := {c in Ck : c is contained in t}  // candidates contained in the transaction
            for each candidate c in Ct do c.count := c.count + 1 end
        end
        Lk := {c in Ck : c.count / |D| >= minsup}  // keep candidates meeting minimum support
        k := k + 1
    end
    return the union of all Lk                     // return all the large itemsets
end

(a)

function apriori-gen(Lk: large itemsets of size k)
begin
    C := empty set
    for each p in Lk do
        for each q in Lk do
            if p and q agree on all items except the last, and p.item_k < q.item_k, then begin
                c := p merged with q.item_k                 // merge the two itemsets
                if every k-sized subset of c is in Lk then  // otherwise prune the candidate
                    C := C + {c}
                end
            end
        end
    end
    return C
end

(b)

Figure 14: (a) The Apriori algorithm. (b) The subroutine apriori-gen for candidate generation; p and q denote two different frequent itemsets, p.item_k is the last item of p, and q.item_k is the last item of q.
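The join-and-prune step of apriori-gen can be written compactly in Python. The sketch below is a simplified illustration under the assumptions that itemsets are stored as sorted tuples and that the lexicographic order is the natural tuple order.

    from itertools import combinations

    def apriori_gen(frequent_k):
        """frequent_k: set of size-k itemsets (sorted tuples).
        Returns the set of size-(k+1) candidate itemsets."""
        candidates = set()
        for p in frequent_k:
            for q in frequent_k:
                # join: all items equal except the last, in lexicographic order
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    c = p + (q[-1],)
                    # prune: every k-subset of the candidate must itself be frequent
                    if all(s in frequent_k for s in combinations(c, len(p))):
                        candidates.add(c)
        return candidates

    L2 = {("beer", "milk"), ("beer", "cereal"), ("cereal", "milk")}
    print(apriori_gen(L2))  # {('beer', 'cereal', 'milk')}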

   


Frequent graph/subgraph mining shares similar properties with frequent itemset mining. The basic rule also holds for graph mining: any subgraph of a frequent graph/subgraph must also be frequent. There are several graph mining algorithms derived from Apriori. Inokuchi et al. (Inokuchi, Washio et al. 2000) proposed the Apriori-based Graph Mining (AGM) algorithm. AGM is essentially the Apriori algorithm applied to graph data with little modification. To do that, each graph in the graph database is treated as a transaction and is represented by a code derived from a modified adjacency matrix. The difference between the standard adjacency matrix and this modified one is that in a standard adjacency matrix the entries are either 0 or 1, depending on whether there is an edge, while in the modified adjacency matrix an entry can take the value 0, $num(lb(e_h))$, or $num(lb(v_i))$, which are defined in Figure 15.

$$X_k = \begin{pmatrix} x_{1,1} & x_{1,2} & x_{1,3} & \cdots & x_{1,k} \\ x_{2,1} & x_{2,2} & x_{2,3} & \cdots & x_{2,k} \\ x_{3,1} & x_{3,2} & x_{3,3} & \cdots & x_{3,k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{k,1} & x_{k,2} & x_{k,3} & \cdots & x_{k,k} \end{pmatrix}$$

where

$$x_{i,j} = \begin{cases} num(lb(e_h)) & e_h = (v_i, v_j) \in E, \text{ where } lb \text{ is the label of } e_h \text{ and } num(lb(e_h)) \text{ is the integer arbitrarily assigned to the label value } lb(e_h); \\ num(lb(v_i)) & i = j,\ v_i \in V, \text{ and } num(lb(v_i)) \text{ is the number assigned to the label value } lb(v_i); \\ 0 & (v_i, v_j) \notin E \text{ and } i \neq j. \end{cases}$$

Figure 15: Modified adjacency matrix for a graph (Inokuchi, Washio et al. 2000).

   


For example, for the modified adjacency matrix in Figure 15, if the graph is undirected, the corresponding code is read from the upper triangular part of the matrix:

$$code(X_k) = x_{1,1}\, x_{1,2}\, x_{2,2}\, x_{1,3}\, x_{2,3}\, x_{3,3}\, x_{1,4} \cdots x_{k-1,k}\, x_{k,k}.$$

If it is a directed graph, the code also includes the diagonally symmetric elements $x_{j,i}$, each added right after the corresponding $x_{i,j}$ when $i \neq j$:

$$code(X_k) = x_{1,1}\, x_{1,2}\, x_{2,1}\, x_{2,2}\, x_{1,3}\, x_{3,1}\, x_{2,3}\, x_{3,2} \cdots x_{k-1,k}\, x_{k,k-1}\, x_{k,k}.$$

As we can see, the same graph can have several different codes if its vertices are ordered differently, and given two codes it is hard to tell whether they represent the same graph. To solve this problem, Inokuchi et al. (Inokuchi, Washio et al. 2000) proposed using the idea of a canonical form. They defined a lexicographic order for both edges and nodes so that codes can be compared: the canonical form of a graph is defined as the minimum of all possible codes of its normalized adjacency matrix. A modified adjacency matrix is called normalized if it was generated by following the special rules defined by the algorithm. As in the Apriori algorithm, candidate (sub)graphs of size $k+1$ are generated by merging two frequent subgraphs of size $k$, but here the merging is more complicated because it actually merges two matrices, and special rules need to be applied in order to preserve the graph structure for both directed and undirected graphs. After merging, the algorithm needs to check whether each $k$-sized subgraph of the newly generated candidate is frequent. A $k$-sized subgraph can be generated by removing a vertex $v_i$, where $1 \leq i \leq k+1$, together with all edges incident to it. After $v_i$ is removed, the modified matrix that is left needs to be transformed to its normal form so that it can be compared with the matrices of the known frequent subgraphs. The transformation reconstructs the matrix structure in a bottom-up manner: it first sets up the $1 \times 1$ adjacency matrix for each vertex, then merges them, in the same way two frequent subgraphs' matrices are merged, until the size-$k$ matrix is reconstructed. This process is called normalization.

AGM uses a level-wise approach as in Apriori, and candidate graphs of size $k$ are generated from frequent graphs of size $k-1$. Because of the graph isomorphism problem, comparing two graphs and then pruning false positives is much more complicated than the corresponding steps in frequent itemset mining. The idea of using a canonical form to represent graphs has been adopted by several graph mining algorithms, such as gSpan (Fan and Han 2002), CloseGraph (Yan and Han 2003), FSG (Kuramochi and Karypis 2001), and FFSM (Huan, Wang et al. 2003). How to define this canonical form has become one of the key decisions that determine the performance of the various algorithms. As we will see in the next section, the canonical form is closely tied to the search strategy.

   


Graph-based Substructure Pattern Mining (gSpan)

(a) (b) (c)

Figure 16: Three DFS trees of the same labeled graph. Each node label's index indicates its discovery order in the corresponding DFS tree. Dashed lines represent non-DFS-tree edges. Nodes colored in blue are on the right-most paths (Fan and Han 2002).

The graph-based substructure pattern mining algorithm (gSpan) (Fan and Han 2002) is also based on the fact that every subgraph of a frequent graph must itself be frequent. It uses a different coding method, called the Depth First Search (DFS) code, to represent the graph structure. Figure 16 shows three DFS trees of the same undirected graph. The edges of the graph are classified into two categories: forward edges and backward edges. The forward edges are the DFS tree edges, and all other edges are backward edges (indicated by dashed lines in Figure 16). A node's index represents its discovery order in the DFS tree. The DFS codes corresponding to Figure 16 are: (0,1,X,a,X)(1,2,X,a,Z)(2,0,Z,b,X)(1,3,X,b,Y); (0,1,X,a,X)(1,2,X,b,Y)(1,3,X,a,Z)(3,0,Z,b,X); (0,1,Y,b,X)(1,2,X,b,X)(2,3,X,b,Z)(3,1,Z,a,X).

   

function graphset_projection(D: graph set, S: frequent graph set)
begin
    remove all infrequent edges and vertices from D     // frequency below minsup
    S1 := all frequent one-edge graphs in D
    sort S1 in DFS lexicographic order
    S := S + S1
    for each edge e in S1, in order, do begin
        s := the one-edge graph defined by e
        s.D := the graphs in D that contain e
        subgraph_mining(D, S, s)        // grow all frequent structures derived from s
        remove e from D                 // prune the search space
        if |D| < minsup then break end  // no frequent structure is left
    end
end

(a)

function subgraph_mining(D: graph set, S: frequent graph set, s: frequent graph)
begin
    if s is not equal to min(s) then return end  // proceed only if the DFS code of s is canonical
    S := S + {s}                                 // s is frequent
    C := all right-most extensions of s that occur in D
    for each child c in C do
        if support(c) >= minsup then             // check its support
            subgraph_mining(D, S, c)
        end
    end
end

(b)

Figure 17: The gSpan algorithm and its subroutine (Fan and Han 2002).

     

   

function isCanonical(w: array of int, g: graph): boolean
begin
    forall nodes n of g do n.index := -1 end    // traverse all nodes and clear their indices
    forall edges e of g do e.mark := -1 end     // traverse all edges and clear their markers
    forall nodes r of g do begin                // traverse the potential root nodes
        if r is acceptable as a root node then begin
            r.index := 0                        // number and record the root node
            if not rec(w, 1, [r], 1, 0) then    // check the code word recursively and
                return false                    // abort if a smaller code word is found
            end
            r.index := -1                       // clear the node index again
        end
    end
    return true                                 // the code word is canonical
end

(a)

function rec(w: array of int, k: int, nodes: array of node, n: int, i: int): boolean
// Recursively extends the partial numbering. The sub-procedure traverses the unvisited
// edges incident to the current source node in sorted order and compares, in turn, the
// edge attribute, the destination node attribute, and the destination node index against
// the next entries of the code word w. If an edge would yield a smaller code word it
// returns false; if it would yield a larger one, that branch is abandoned. Otherwise the
// edge is marked, the destination node is numbered, the function recurses, and the edge
// and node are unmarked again. It returns true if no code word smaller than w was found.

(b)

Figure 18: A canonical form testing function. (a) A testing function that can be applied to both DFS- and BFS-defined canonical forms; the difference lies in the sub-procedure "rec". (b) The sub-procedure designed for a BFS-defined canonical form, summarized here from its in-line comments (Borgelt 2005).

   


Each code is formed by concatenating all edges' information in DFS order, with each backward edge placed in front of the corresponding forward edges (if there are any). For example, in Figure 16(a), the backward edge (2,0,Z,b,X) is placed after the edge (1,2,X,a,Z) but before the edge (1,3,X,b,Y). This is because node $v_2$ does not have any forward edges, so (2,0,Z,b,X) is placed just after (1,2,X,a,Z), where $v_2$ is the destination node. Each edge's information includes the source and target node indices, followed by the source node attribute/label, the edge attribute/label, and the target node attribute/label. Again, a graph can have different DFS codes. In order to get a unique representation for a graph, we need to define a canonical form. Similar to AGM, gSpan defines a lexicographic ordering for these codes and then selects the minimum code as the canonical form to represent the graph. For the three DFS trees shown in Figure 16, the code derived from Figure 16(a) is the minimum.

Given an initial frequent one-edge subgraph, gSpan finds larger frequent subgraphs (supersets of the initial one) by adding new elements to the initial structure. There are many ways to extend a graph, and random extension would make the search expensive. gSpan uses the Right-Most Extension method to restrict the extension space: a new edge can only be added between vertices on the right-most path (a backward edge) or from a vertex on the right-most path to a new vertex (a forward edge). These extended subgraphs are called children of the original graph. In Figure 16, vertices on the right-most path are colored in blue; node $v_3$ is the right-most vertex in each of these DFS trees. It is also proved in the paper that the search following right-most extension is complete (Fan and Han 2002). Figure 17 shows the gSpan algorithm. The subgraph_mining procedure is recursive; it grows the given graph and finds all of its frequent descendants. It stops when the support of a graph is less than minsup or the code of a graph is not canonical.

Besides DFS, there are also graph mining algorithms based on a canonical form defined by a Breadth First Search (BFS) spanning tree. As Borgelt notes in (Borgelt 2005), algorithms based on either a DFS- or a BFS-defined canonical form are actually similar in how they work. Figure 18(a) shows a canonical form testing function; this function does not specify which kind of canonical form it is testing, and it can be applied to both DFS- and BFS-defined canonical forms by using a different sub-procedure "rec". Figure 18(b) shows one example, designed for a BFS canonical form. Of course, different algorithms need different candidate generation (graph extension) methods to reduce the search space. For a BFS-based algorithm, only nodes having an index no less than the maximum source index of an edge already in the (sub)graph may be extended (Borgelt 2005).
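The canonical-form idea can be demonstrated on a toy scale. The Python sketch below is a deliberately simplified, brute-force illustration rather than gSpan itself: it omits edge labels, enumerates DFS codes under every vertex priority, and takes the lexicographic minimum as the canonical code. Real miners avoid this enumeration through right-most extension and pruning.

    from itertools import permutations

    def dfs_codes(adj, labels):
        """adj: {v: set of neighbors}; labels: {v: label}.
        Yields one DFS code (tuple of edge 4-tuples) per vertex priority."""
        vertices = list(adj)
        for perm in permutations(vertices):        # brute force: tiny graphs only
            index, code, seen = {}, [], set()

            def visit(v):
                index[v] = len(index)
                for u in sorted(adj[v], key=perm.index):
                    e = frozenset((u, v))
                    if e in seen:
                        continue
                    seen.add(e)
                    if u not in index:             # forward (tree) edge
                        code.append((index[v], len(index), labels[v], labels[u]))
                        visit(u)
                    else:                          # backward edge
                        code.append((index[v], index[u], labels[v], labels[u]))

            visit(perm[0])
            yield tuple(code)

    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}        # a labeled triangle
    labels = {0: "X", 1: "X", 2: "Z"}
    print(min(dfs_codes(adj, labels)))             # the canonical (minimum) code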

2.6 Bandwidth Reduction Algorithms

Interesting patterns should not be limited to repeated patterns. Highly connected substructures, high-degree nodes, and paths can also help us understand a graph, and graph visualization is another way to help identify such patterns. As we mentioned in section 2.3, Henry and Fekete used a clustering method to order graph nodes; the idea is to put similar nodes close to each other so that the corresponding matrix shows interesting patterns. They showed in their paper that cliques and highly connected nodes then become very easy to identify.

Besides clustering methods, there are other matrix reordering algorithms that were originally designed for different purposes. One example is the sparse matrix bandwidth reduction algorithm (see Definition 30). These algorithms are widely used in linear system computations. For example, the general Gaussian elimination algorithm has complexity $O(n^3)$, which is expensive; bandwidth reduction can help reduce it to $O(nb^2)$, where $b$ is the bandwidth. When $b \ll n$, this is a big gain. Figure 19 shows the visual effect of the bandwidth reduction algorithm. Figure 19(a) is the original graph, with each number indicating the node's order in the matrix representation. The non-empty elements (the 1s) are spread everywhere in the matrix in Figure 19(b). After renumbering the nodes (equivalent to reordering them in the matrix), the non-empty elements are moved closer to the main diagonal (Figure 19(d)), which shows a clustering effect.

   


(a) (b) (c) (d)

Figure 19: A visual explanation of the matrix bandwidth reduction effect. (a) A sparse graph. Each number is a graph node, and the number itself represents its initial order in the matrix. (b) The matrix corresponding to the graph shown in (a); the bandwidth of this matrix is 8. (c) The graph after node renumbering. (d) The new matrix representation after the node renumbering; the bandwidth of this matrix is 3, and the non-empty elements are closer to the main diagonal than in the matrix in (b).

Matrix bandwidth and its related concepts are defined as follows (Gross and Yellen 2004).

Definition 28: A proper numbering of $G$ is a bijection $f: V \to \{1, 2, \ldots, |V|\}$.

Definition 29: Let $f$ be a proper numbering of a graph $G$. The bandwidth of $f$, denoted $B_f(G)$, is given by $B_f(G) = \max\{\,|f(u) - f(v)| : (u, v) \in E\,\}$.

Definition 30: The bandwidth of $G$ is $B(G) = \min\{\,B_f(G) : f \text{ is a proper numbering of } G\,\}$. A bandwidth numbering of $G$ is a proper numbering $f$ such that $B_f(G) = B(G)$ (i.e., a proper numbering that achieves $B(G)$).

In other words, the bandwidth of a matrix $A = \{a_{i,j}\}$ is the maximum absolute difference between $i$ and $j$ for which $a_{i,j} \neq 0$ (Martía, Lagunab et al. 2001).

Definition 31: The bandwidth decision problem is the problem that accepts as input an arbitrary graph $G$ and an arbitrary integer $k$ and returns "YES" if $B(G) \leq k$ and "NO" otherwise.

The bandwidth decision problem is NP-complete (Gross and Yellen 2004). Bandwidth reduction techniques attempt to find approximately optimal solutions by moving the non-zero elements as close as possible to the main diagonal. Bandwidth reduction is also widely applied to reduce graph storage space. Most of these algorithms are designed for sparse symmetric matrices. Sparse matrices and graphs are defined as follows:

Definition 32: A sparse matrix is a matrix that has a large number of zero elements (Gilbert, Moler et al. 1992; Press, Flannery et al. 1992).

Definition 33: A subgraph on $n$ vertices is sparse if it has $O(n)$ edges (Gross and Yellen 2004).

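Definitions 29 and 30 translate directly into code. The brute-force helper below (my own, for illustration only; it is exponential and thus usable only on tiny graphs, consistent with the NP-completeness result above) computes the bandwidth of a numbering and of a graph.

    from itertools import permutations

    def numbering_bandwidth(edges, f):
        """f: {vertex: number}. Bandwidth of the numbering f (Definition 29)."""
        return max(abs(f[u] - f[v]) for u, v in edges)

    def graph_bandwidth(vertices, edges):
        """Minimum bandwidth over all proper numberings (Definition 30)."""
        best = None
        for perm in permutations(vertices):
            f = {v: i + 1 for i, v in enumerate(perm)}  # a bijection V -> {1..|V|}
            b = numbering_bandwidth(edges, f)
            best = b if best is None or b < best else best
        return best

    edges = [(0, 1), (1, 2), (2, 3), (3, 0)]            # a 4-cycle
    print(graph_bandwidth(range(4), edges))             # 2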

In the graph theory literature it is hard to find a definition that draws a strict distinction between sparse and dense graphs. Definition 33 is stated for a subgraph, but we can derive from it that a sparse graph is a graph with $O(n)$ edges, where $n$ is the number of nodes. Whether a graph is sparse directly affects the choice of the underlying data structure: generally, if a graph is sparse we use an adjacency list, and otherwise we use an adjacency matrix. Sparsity also affects the performance of some graph algorithms. For example, building a BFS tree takes $O(|V| + |E|)$ time; if a graph is sparse, the running time is close to $O(|V|)$, while if it is dense, the running time can be close to $O(|V|^2)$.

There are several different bandwidth reduction algorithms. Cuthill-McKee Ordering and Minimum Degree Ordering are two commonly used methods.

2.6.1 Cuthill-McKee Ordering

The Cuthill-McKee algorithm was introduced in 1969 (Cuthill and McKee 1969). The algorithm tries to find a strategy for re-labeling nodes in order to obtain a narrower bandwidth. The pseudocode is shown in Figure 20. The Cuthill-McKee ordering is actually a modified breadth first search (BFS); the only differences between this algorithm and an ordinary BFS are the sorting in step 17 and the choice of the starting node in step 11. As we mentioned before, finding the optimal minimum bandwidth of a sparse symmetric matrix is an NP-complete problem; Cuthill-McKee ordering tries to find an approximate solution. The result of this algorithm is highly dependent on the selection of the starting vertex. Cuthill and McKee pointed out that choosing a vertex with minimum or nearly minimum degree as the starting node is a good strategy, but there are also many cases where this strategy fails. To increase the chance of getting a better solution, they proposed to select several starting nodes instead of just one, run the procedure on each of them, and select the best result. They suggested selecting those nodes whose degree falls in the range $[d_{min},\ d_{min} + \alpha(d_{max} - d_{min})]$, where $d_{min}$ and $d_{max}$ are the minimum and maximum degree, respectively; after some experiments, they pointed out that $\alpha = 1/2$ generally brings good results. The running time of this algorithm is similar to BFS, namely $O(c(|V| + |E|))$, where $c$ is the number of candidate starting points, plus a sorting procedure; we can use quicksort here for its average running time of $O(|V| \log |V|)$.

function cuthill-mckee(G: graph)
1.  var u, v : node
2.      Q : node queue
3.      i : int                               // order counter
4.      adj : array of node
5.  begin
6.      for each v in V do begin              // set all nodes to their initial state
7.          v.color := white                  // white indicates an unvisited node
8.          v.order := infinity               // the initial order
9.      end
10.     i := 1
11.     select a starting node s              // e.g., a node of (nearly) minimum degree
12.     s.color := gray                       // gray marks the node currently being visited
13.     s.order := i; i := i + 1              // set its order to 1
14.     enqueue(Q, s)                         // put s into the queue
15.     while Q is not empty do begin
16.         u := head(Q)                      // get the head of the queue
17.         sort adj[u] in non-decreasing order by degree
18.         for each v in adj[u], in order, do begin
19.             if v.color = white then begin // the node has not been visited yet
20.                 v.color := gray           // mark it as being visited
21.                 v.order := i; i := i + 1  // set its order and increase the counter
22.                 enqueue(Q, v)             // put it into the queue
23.             end
24.         end
25.         dequeue(Q)                        // remove node u
26.         u.color := black                  // u has been fully explored
27.     end
28. end

Figure 20: Cuthill-McKee ordering algorithm (Cuthill and McKee 1969; Cormen, Leiserson et al. 1990).
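For reference, here is a runnable Python version of Figure 20. It is a sketch: the starting-node heuristics discussed above are reduced to picking one minimum-degree vertex.

    from collections import deque

    def cuthill_mckee(adj):
        """adj: {v: set of neighbors}. Returns the visitation order as a list."""
        start = min(adj, key=lambda v: len(adj[v]))      # minimum-degree starting node
        order, visited = [], {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            order.append(u)
            # visit unvisited neighbors in non-decreasing degree order (step 17)
            for v in sorted(adj[u] - visited, key=lambda v: len(adj[v])):
                visited.add(v)
                queue.append(v)
        return order  # reversing this list gives the Reverse Cuthill-McKee ordering

    adj = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {0, 3}}
    print(cuthill_mckee(adj))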

2.6.2 Minimum Degree Ordering

Minimum degree ordering is another commonly used matrix bandwidth reduction algorithm. It repeatedly finds the node with the minimum degree, removes it from the graph, and then updates the degrees of the remaining nodes. The actual algorithm is shown in Figure 21.

   


function minimum_degree_ordering(G: graph)
1.  var v : node
2.      V : set of nodes                      // nodes of graph G
3.      i : int                               // order counter
4.  begin
5.      i := 1                                // set the initial state
6.      while V is not empty do begin
7.          v := the node with minimum degree in V
8.          v.order := i; i := i + 1          // set its order and increase the counter
9.          V := V - {v}                      // remove node v from the graph
10.         update the graph that is left     // add edges so the nodes originally
11.     end                                   // adjacent to v now become a clique
12. end

Figure 21: Minimum degree ordering.

 

(a) (b)

Figure 22: An example of updating the graph after the center node has been deleted. (a) The original graph; short edge stubs represent links to other nodes in the graph which are not shown. (b) The graph after the center node has been deleted; new edges are added so that the nodes originally adjacent to the center one become a clique. The cost of this update depends on the degree of the node being deleted, which can be $O(|V|)$; since there is at most one edge between each pair of these nodes, the total number of edges added is $O(|V|^2)$, which is the cost of this update.

 

In step 10, the algorithm removes the specified node and all edges incident to it, and at the same time adds new edges so that the nodes adjacent to the deleted one become pairwise adjacent (a clique). This is an expensive step: depending on the structure of the graph, it can cost up to $O(|V|^2)$ (illustrated in Figure 22), and there are $|V|$ such steps in total. Step 10 also includes updating each node's degree after adding the new edges. In the worst case, the algorithm requires $O(|V|^2 |E|)$ running time (Heggernes, Eisenstatz et al. 2001; Ingram 2006). Some authors improve this with strategies that eliminate several nodes at once in step 7 (mass elimination) or update only part of the graph in step 10 (Liu 1985). Note that the minimum degree ordering procedure alters the graph's topological structure (adding new edges and removing nodes) in order to achieve the final node ordering.
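A compact Python rendering of Figure 21 follows; it is a sketch of the basic procedure only, without the mass-elimination or partial-update refinements just mentioned.

    def minimum_degree_ordering(adj):
        """adj: {v: set of neighbors}. Returns the elimination (node) order."""
        adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
        order = []
        while adj:
            v = min(adj, key=lambda u: len(adj[u]))      # step 7: minimum-degree node
            nbrs = adj.pop(v)
            order.append(v)
            for u in nbrs:                               # step 10: remove v and make its
                adj[u].discard(v)                        # former neighbors into a clique
                adj[u] |= nbrs - {u}
        return order

    adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
    print(minimum_degree_ordering(adj))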

2.7 Summary and Discussion

Graph drawing is a broad area that involves several other complicated fields, such as aesthetics, perception, and psychology. As we have discussed, because of the different fields, different requirements, and different properties of the datasets, we need to select different strategies to show the data; there is no universally optimal layout algorithm. When a graph dataset is large, especially beyond the limits of a regular display, it is hard to avoid edge crossing and node overlapping problems. Graph drawing algorithms are normally complicated and computationally expensive, and several graph layout problems are NP-hard. The matrix representation does not have the edge crossing or node overlapping problems, but it is not a popular graph visualization technique, probably because most people are not used to this display. Research shows that the matrix visualization outperforms traditional graph drawing for most general graph readability tasks.

   


Graph data mining tries to find interesting patterns algorithmically. Finding the most frequently repeated substructures is one of the most common tasks. Many frequent graph mining algorithms are based on the Apriori algorithm, which was designed for finding repeated unstructured subsets. Frequent graph mining and frequent itemset mining are based on similar concepts, but graph data mining algorithms are much more complicated because they must consider the graph structure. One of the hard problems is how to compare two graphs: the graph isomorphism problem has no known polynomial-time solution. A common approach is to encode the structure into a unique representation, called the canonical form, so that two graphs are the same if and only if they have the same code. There are different ways to define a graph's canonical form; the most commonly used ones are based on a DFS (or BFS) spanning tree. The choice of canonical form determines how candidates are generated and how the search space is pruned. Many algorithms require node/edge labels in order to construct their canonical form. This implies that two structurally identical graphs with different node/edge labels will be considered two different graphs, which is a potential problem. Graph mining algorithms are generally complicated and computationally expensive, and some of them cannot even guarantee to find all frequent subgraphs. To most users they are black boxes with very limited or no interaction.

Henry et al. proposed a visualization method to explore graph data. It uses the matrix and the traditional graph drawing as the main displays for a graph, and it provides several interaction mechanisms. One of them is matrix reordering: it uses a clustering method to rearrange the matrix so that similar rows/columns are placed close to each other. After matrix reordering, blocks appear in the final matrix, and very interestingly these clusters correspond to interesting patterns in the graph drawing.

Graph comparison is a difficult problem, and it is the problem we would like to embark upon. The clustering effect in MatrixExplorer shows a potential way to use the matrix to explore graph data, but it cannot be used to compare graphs because the reordering result is not unique: the output depends on the input graph and also on its initial node order. Matrix bandwidth reduction algorithms were designed to make linear system computations more efficient by bringing the non-empty elements of the matrix closer to the main diagonal. Figure 23 shows an example of the Reverse Cuthill-McKee algorithm applied to a graph dataset. In Figure 23(b), the non-empty elements are clustered around the main diagonal; the highlighted areas indicate the relationship between the matrix and node-link representations. The Cuthill-McKee algorithm is relatively simple and efficient but, again, it cannot guarantee the uniqueness of the reordering result.

   


(a) (b)

Figure 23: Matrix and node-link representations of the same graph dataset. (a) The columns and rows of the matrix are in a random order. (b) The matrix reordered by applying the Reverse Cuthill-McKee algorithm.

In the next chapter, we introduce a node reordering procedure that is guaranteed to generate a unique visual matrix for a given graph.

   


3 Graph Canonical Matrix Ordering

Our goal is to design a procedure that generates a canonical visual adjacency matrix for a graph with the following properties:

1. The visual appearance of the adjacency matrix is unique: two graphs with the same topological structure have exactly one canonical visual matrix;
2. The canonical visual matrix is generated using only the graph's topological information (the nodes and the connections among them);
3. The canonical visual matrix has a small bandwidth;
4. The canonical matrix shows a clustering effect similar to that of the Cuthill-McKee algorithm.

Before we start, we want to introduce several important terms that we use intensively in this thesis.

Definition 34: A visual adjacency matrix is an adjacency matrix without node and edge labels.

Definition 35: For a given tree $T$ in a graph $G$ and nodes $p, v \in G$, if $p$ and $v$ are on two consecutive levels of $T$ with $p$ one level higher than $v$, and edge $(p, v)$ is a tree edge, then node $p$ is called the direct parent of node $v$, and node $v$ is called a direct child of node $p$.

Our algorithm is based on the Cuthill-McKee algorithm (see section 2.6.1). Similar to it, the output node ordering of our algorithm is a level-order traversal (Hubbard 2000) of the BFS spanning tree: starting with the root node (ordered 0), it visits each level successively from top to bottom, and on each level it orders nodes from left to right (see Figure 24). This chapter begins with a detailed analysis of the properties of the Cuthill-McKee algorithm and of why that algorithm cannot guarantee to produce a canonical visual matrix. We then give our solutions and the details of how they help us achieve a unique visual matrix representation.

Figure 24: The node ordering is a level-order traversal of the BFS spanning tree. The root's order is set to 0, and the bottom-right node (at the deepest level) is set to $|V| - 1$, where $|V|$ is the total number of nodes in the graph.

   


3.1 Properties of the Cuthill-McKee Ordering

Any BFS spanning tree of an unweighted, undirected graph is a shortest path tree (Definition 19). If we follow the Cuthill-McKee procedure, we observe that the connections between nodes have the following properties:

1. The root node can only connect to its direct children, and they are linked by tree edges;
2. If a node connects to other nodes on its direct parent's level, those nodes must be visited after its direct parent; similarly, if a node connects with nodes on its direct children's level, then those nodes must be visited before its direct children;
3. A node can connect to any other nodes on its own level.

Properties 1 and 3 are obvious. Here we prove property 2.

Proof of property 2: Suppose a node $v$ connects with another node $w$ that is at the same level as $v$'s direct parent $p$, and suppose node $w$ comes before node $p$ in this BFS spanning tree. Then, by the Cuthill-McKee procedure, node $v$ would be a direct child of node $w$ instead of being a direct child of its current parent $p$. So if there are any connections between node $v$ and nodes on the level above (other than with its direct parent), those nodes must appear after node $v$'s direct parent $p$. The proof is similar for the connections between a node and nodes at its direct children's level. ■


Also, since any BFS tree is a shortest path tree (see Definition 19 in chapter 2), if the root node is fixed, then each node's level in the BFS tree is also fixed. The sorting mechanism in step 17 of the Cuthill-McKee algorithm (see chapter 2, Figure 20) can only affect a node's order within its level of the tree. Which edges are tree edges also depends on the node order at each level. Furthermore, the bandwidth of the corresponding adjacency matrix, which is derived from this BFS spanning tree, is determined by these tree edges.

(a)

(b)

Figure 25: A graph and one of its BFS spanning trees. (a) A random graph. (b) One of its BFS spanning trees. The number next to each node indicates its order. Solid black lines represent tree edges and dashed blue lines correspond to non-tree edges.

 

Theorem 1: If a node ordering is the visitation order of a BFS tree, then the bandwidth of this ordering is determined by one of the tree edges of this BFS tree.

Proof: We will use the BFS spanning tree in Figure 25 as an example to help explain the different scenarios. Non-tree edges can be put into three categories:

1. Edges between nodes at the same level;
2. Edges between a node and other nodes at its direct parent's level;
3. Edges between a node and nodes at the level below.

These three cases cover every scenario; there are no edges that jump over one or more tree levels, because that would violate the fact that any BFS tree is a shortest path tree (see Definition 19). We will show that in each of these three cases there exists a tree edge whose endpoints' absolute order difference is larger than that of any non-tree edge in the category. Again, the order here refers to the visitation order of the corresponding BFS tree.

Case 1: The non-tree edge is between two nodes at the same level, as with the non-tree edge on level 1 in Figure 25(b). For any level $l$ except the root level, the largest possible order difference between two connected nodes on that level is between the first node $v_{l.first}$ and the last node $v_{l.last}$. Let $p$ denote the direct parent of $v_{l.last}$. Node $p$'s order must be less than node $v_{l.first}$'s order, because node $p$ is on the level above. This means

$$order(v_{l.last}) - order(p) > order(v_{l.last}) - order(v_{l.first}),$$

and $(p, v_{l.last})$ is an existing tree edge.

Case 2: The non-tree edge is between a node and another node at its direct parent's level, as with such a non-tree edge in Figure 25(b). As we proved before, if a node $v$ is connected with another node $w$ on its parent's level, then node $w$ must appear after node $v$'s direct parent $p$. This means $order(w) > order(p)$, so

$$order(v) - order(w) < order(v) - order(p),$$

and $(p, v)$ is an existing tree edge.

Case 3: The non-tree edge is between a node and another node at its direct children's level. This is similar to case 2. If a node $u$ is connected to another node $w$ on its children's level, where $w$ is not node $u$'s direct child, then $w$'s direct parent $p_w$ must appear before node $u$, since otherwise node $w$ would be a direct child of node $u$. So we have $order(p_w) < order(u)$, which gives

$$order(w) - order(u) < order(w) - order(p_w),$$

and $(p_w, w)$ is an existing tree edge. ■

, and

,

is an existing tree edge. ■ If we call the absolute order difference between two endpoints of an edge the bandwidth of that edge, then Theorem 1 means that for any non-tree edge, there exists a tree edge who shares one endpoint with this non-tree edge and whose bandwidth is larger than that of the non-tree edge. From now on, the order or a node always refers to the node’s visitation order in a BFS spanning tree. Definition 36: The number of nodes that connected with node  and also ordered before node

   

is the pre-degree of node u.

 56   

3.2 Determine the Starting Node

As we mentioned in section 2.6.1, the result of the Cuthill-McKee ordering is heavily dependent on the starting node. Cuthill and McKee suggested choosing a special set of nodes, rather than only one, to start with: one example is to choose the nodes whose degree falls into the range between the minimum and maximum degree described in section 2.6.1, run the procedure on each of them, and then choose the minimum-bandwidth result as the final output (Cuthill and McKee 1969). Later, Gibbs, Poole, and Stockmeyer (GPS) (Gibbs, Poole et al. 1976) proposed a method that can find a slightly smaller bandwidth ordering than the Cuthill-McKee procedure. Most importantly, GPS is more efficient: on average it runs 8 times faster than the Cuthill-McKee procedure (Gibbs, Poole et al. 1976; Martía, Lagunab et al. 2001). One of the major differences between GPS and the Cuthill-McKee procedure is the selection of the starting node. GPS suggested that "choosing nodes that are at maximum or nearly maximum distance apart" often brings a good result. Nodes with this property are called peripheral or pseudo-peripheral nodes (see section 2.1 for the concept of a peripheral node), and finding them efficiently is one of the main contributions of GPS. Figure 26 shows the procedure of the pseudo-peripheral node finder. This method is based on the observation that $\varepsilon(u) \approx \varepsilon(v)$, where $\varepsilon(u)$ is the eccentricity of node $u$ and $v$ is a node at the deepest level of the BFS spanning tree rooted at $u$.

function pseudoperipheral_node_finder(G: graph)
1.  var  r, v : node
2.       T, T' : tree structure              // BFS spanning trees
3.       S : set of nodes                    // nodes at the last level of T
4.       l, l' : int                         // tree heights
5.  begin
6.      // start from a node of minimum degree
7.      r := a node with the minimum degree in V
8.      T := BFS(r)                          // build a BFS spanning tree rooted at r
9.      l := height(T)                       // get the height of tree T
10.     S := {v : v is at the last level of T}
11.     foreach v in S, in non-decreasing degree order, do begin
12.         T' := BFS(v)                     // build a BFS spanning tree rooted at v
13.         l' := height(T')                 // get the height of the new tree
14.         if l' > l then begin             // T' is higher than T
15.             r := v; T := T'; l := l'     // make v the new root candidate
16.             goto step 10                 // recompute the last level of the new tree
17.         end
18.     end
19.     return r                             // r is one of the pseudo-peripheral nodes
20. end

Figure 26: The algorithm for finding a pseudo-peripheral node (Gibbs, Poole et al. 1976; George and Liu 1979).
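A runnable Python rendering of Figure 26 follows. It is a sketch under the assumption that the level structure is returned as a node-to-level map; the width-based early termination discussed next is omitted.

    from collections import deque

    def bfs_levels(adj, root):
        """Returns {v: level}, the level structure of the BFS tree rooted at root."""
        level = {root: 0}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in level:
                    level[v] = level[u] + 1
                    queue.append(v)
        return level

    def pseudo_peripheral_node(adj):
        r = min(adj, key=lambda v: len(adj[v]))     # start from a minimum-degree node
        levels = bfs_levels(adj, r)
        height = max(levels.values())
        improved = True
        while improved:
            improved = False
            last = [v for v, l in levels.items() if l == height]
            for v in sorted(last, key=lambda v: len(adj[v])):  # non-decreasing degree
                lv = bfs_levels(adj, v)
                if max(lv.values()) > height:       # found a deeper tree: restart
                    r, levels, height = v, lv, max(lv.values())
                    improved = True
                    break
        return r

    adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
    print(pseudo_peripheral_node(adj))              # an endpoint of the path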

The BFS spanning tree constructs a level structure of the given graph. The width of a level structure refers to the maximum number of nodes the tree has on any level. Gibbs et al. (Gibbs, Poole et al. 1976; George and Liu 1979) made the pseudo-peripheral node finder even more efficient by taking the width of the level structure into account: it discards wide level structures as soon as they are detected. This is based on another observation, namely that a tall BFS spanning tree is often skinny. Figure 27 shows two other BFS spanning trees of the graph shown in Figure 25(a). The tree in Figure 25(b) is rooted at a random node; the height of that tree is 3 and its width is also 3. The trees in Figure 27 are rooted at a pair of peripheral nodes $u$ and $v$, respectively. Note that a tree rooted at a peripheral node has the maximum height, but not necessarily the minimum width: as we can see from Figure 27, these two trees have the same height but different widths.

(a)

(b)

Figure 27: Two other BFS spanning trees of the graph shown in Figure 25(a). Black lines represent tree edges and blue dashed lines non-tree edges. The number next to a node indicates its order. Nodes $u$ and $v$ are a pair of peripheral nodes. (a) The BFS spanning tree rooted at node $u$; this tree's height is 4 and its width is also 4. (b) The BFS spanning tree rooted at node $v$; the height of this tree is still 4, but the width is 3.

As the example in Figure 27 indicates, of two spanning trees with the same height, the one with the smaller width may produce a smaller-bandwidth node ordering. This is because the bandwidth is the largest order difference between the two endpoints of one of the tree edges (see Theorem 1), and this tree edge is actually one of the tree edges connecting nodes on the widest level with their direct parent or children; the wider the tree, the larger the order difference. For example, edge (e, b) in Figure 27(a) determines the bandwidth of the ordering derived from that spanning tree, and node b is on the widest level of that tree. Since tall trees are more likely to be skinny, peripheral nodes are good candidates for the starting node.

   


Our goal is to find a unique visual matrix to represent a graph. Although we do not aim at finding the absolute minimum bandwidth ordering, we still want a small-bandwidth ordering. The procedure in Figure 26 can help us get a good starting node, but the node returned by this algorithm is not unique. As we can see, at step 7 the procedure selects a node with the minimum degree, but there can be more than one node with the minimum degree, and step 7 just chooses any one of them. From step 11 to step 17, the procedure builds a BFS spanning tree rooted at each node on the deepest level of the tree, and as soon as it discovers a higher BFS spanning tree the algorithm goes back to step 10. Again, at the deepest level of the tree there can be nodes with the same degree, and a different choice can change the node returned by this procedure.

Finding a pseudo-peripheral node is an efficient procedure (Gibbs, Poole et al. 1976; H. L. Crane, Gibbs et al. 1976; Smyth and Arany 1976; George and Liu 1978; George and Liu 1979). But if we allow this approximation, there are more qualified starting nodes; in order to get a unique one, we would have to test each of them and filter them out later, which is expensive. It is also hard to define a clear boundary for how much approximation we could allow. Based on the above analysis, we need to set a clear rule that limits the number of possible starting nodes. Since peripheral nodes are very likely to generate a small-bandwidth node ordering, we select our candidates only from these peripheral nodes. A graph has at least two peripheral nodes, but the trees generated from them do not necessarily have the same width. We therefore test every level structure rooted at a peripheral node and select those with the smallest width; these root nodes are our starting nodes.

function candidate_starting_nodes_finder(G: graph)
var R : set of nodes                       // candidate starting node set, initially empty
    T : tree structure
    maxHeight, minWidth : int              // max height and min width of a tree seen so far
begin
    maxHeight := -1                        // initialization
    minWidth := largest positive int
    R := empty set
    foreach v in V do begin                // for all the nodes in the input graph
        T := BFS(v)                        // build a BFS spanning tree rooted at node v
        if height(T) > maxHeight then begin    // the new tree is higher than any seen before
            maxHeight := height(T)             // update the maximum height
            R := {v}                           // reset the result set
            minWidth := width(T)               // update the minimum width
        end
        else if height(T) = maxHeight then begin
            if width(T) < minWidth then begin  // the new tree has a smaller width
                minWidth := width(T)           // reset the minimum width
                R := {v}                       // reset the result set
            end
            else if width(T) = minWidth then
                R := R + {v}                   // add node v to the result set
            end
        end
    end
    return R
end

Figure 28: The procedure that returns a list of candidate starting nodes.

Figure 28 shows the candidate generating procedure. Creating a BFS spanning tree runs in $O(|V| + |E|)$ time, and we can get the height and width of the tree while constructing it. Since we need to create a BFS spanning tree for each node, the whole method runs in $O(|V|(|V| + |E|))$ time. Building the BFS spanning tree for each starting node is an independent procedure, so it can easily be parallelized to obtain better performance.
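The procedure of Figure 28 is short enough to give in full as a Python sketch (my own rendering; bfs_levels is repeated here so the fragment is self-contained).

    from collections import Counter, deque

    def bfs_levels(adj, root):
        level = {root: 0}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in level:
                    level[v] = level[u] + 1
                    queue.append(v)
        return level

    def candidate_starting_nodes(adj):
        best_height, best_width, result = -1, float("inf"), []
        for v in adj:                                      # build a BFS tree from every node
            levels = bfs_levels(adj, v)
            height = max(levels.values())
            width = max(Counter(levels.values()).values()) # widest level of this tree
            if height > best_height:                       # taller tree: reset everything
                best_height, best_width, result = height, width, [v]
            elif height == best_height:
                if width < best_width:                     # same height, smaller width
                    best_width, result = width, [v]
                elif width == best_width:
                    result.append(v)
        return result

    adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
    print(candidate_starting_nodes(adj))                   # [0, 3] for this path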

3.3 Node Ordering at Each Level

Another important issue is how to order nodes at the same level of a BFS spanning tree. The BFS procedure adds a natural ordering to groups of nodes on each level: for all levels except the root (level 0), a node's order depends partially on its parent's order on the level above, so children of an earlier-ordered parent appear before children of a later-ordered parent (see level 2 of the BFS tree in Figure 27 (b)). In this thesis, we say that nodes are "in the same group" if they share the same direct parent in that tree.

The Cuthill-McKee algorithm orders nodes within a group only by their degree, but there is a good chance that multiple nodes share the same degree. Figure 29 shows a graph with 7 nodes and 12 edges. Figure 29 (c), (e), (g) are three BFS spanning trees of the same graph (Figure 29 (a)) rooted at the same peripheral node. The only difference between these three trees is the order of one pair of nodes on the second level (level 1) and one pair on the third level (level 2); the nodes in each pair share the same parent and also have the same degree, so according to the Cuthill-McKee algorithm either node of a pair may appear before the other. Figure 29 (d), (f), and (h) are three different visual adjacency matrices derived from the BFS spanning trees on the left. The three orderings in Figure 29 (c), (e), (g) have the same bandwidth, and the result of the Cuthill-McKee algorithm could be any one of them.

Before we go further into the details of how to order nodes with the same degree, we would like to introduce an important concept, indistinguishable nodes, and one of its interesting properties. Indistinguishable nodes are two nodes that share the same neighbors and are also adjacent to each other; here is a formal definition (George and Liu 1981).

Definition 37: For two nodes u, v in V, node u is indistinguishable from v if Adj(u) ∪ {u} = Adj(v) ∪ {v}.

Indistinguishable nodes have a very interesting property: their order in the corresponding BFS spanning tree is interchangeable, and the visual appearance of the corresponding adjacency matrix will not change. Neither does exchanging them change any other node's order; it is as if we just exchanged their labels. If we ignore the label difference, the tree's structure and connections are exactly the same. Figure 29 contains such a pair of indistinguishable nodes. In a clique, every pair of nodes is indistinguishable.

   


Figure 29: A random graph with three of its BFS spanning trees. The matrices on the right correspond to the graph or BFS trees on the left; the number next to a node indicates the node's order. (a) Nodes in a random order. (b) The adjacency matrix corresponding to the graph in (a). (c), (e), (g) Three BFS spanning trees rooted at the same peripheral node, with different node orderings. (d), (f), (h) The three adjacency matrices corresponding to the trees on the left. Area 1 in the upper left corner of these three matrices represents the clique formed by four of the nodes.

   

Figure 30: An example of two pseudo-indistinguishable nodes: exchanging their order does not change the corresponding adjacency matrix visualization. (a) A node-link representation of the graph; the two pseudo-indistinguishable nodes are highlighted with a thick blue outline. (b) Two adjacency matrices for the two different orderings; the only difference between them is the order of these two nodes at the beginning.

   


Definition 38: For two nodes u, v in V, if Adj(u) = Adj(v) (which implies u and v are not adjacent), node u is pseudo-indistinguishable from v.

Inspired by indistinguishable nodes, we discovered that if any two nodes u and v satisfy Adj(u) = Adj(v) and are not adjacent to each other, then their order is also interchangeable, just like that of indistinguishable nodes: the order of every other node in the BFS spanning tree, and the visualization of the corresponding matrix, stay the same. Figure 30 shows two pseudo-indistinguishable nodes.

Theorem 2: (a) For two indistinguishable nodes u and v in a graph G, if neither of them is the root node, then they must be at the same level of the BFS spanning tree generated by the Cuthill-McKee procedure. (b) Moreover, whether or not these two nodes are on the same level, exchanging the order between them won't change the visual appearance of the corresponding adjacency matrix. (a) and (b) are also true for pseudo-indistinguishable nodes.

Proof for part (a): Since two indistinguishable nodes share the same neighbors, if neither of them is the root node, they must be discovered at the same time through one of their common neighbors. So, following the Cuthill-McKee process, these two nodes must be in the same group, and hence on the same level of the BFS spanning tree.

Proof for part (b): Assume node u appears before node v in the spanning tree; then node v won't have any direct child, and it can only have connections with node u's direct children. If we exchange their order and rebuild the BFS spanning tree by following the same procedure, then, since nodes u and v have the same neighbors, node u's former children in the tree become node v's children and node u becomes childless. The former non-tree edges that connected node v with node u's direct children now become connections between node u and node v's direct children. The order among these children stays the same, and so does every other node's order.

Since node u and node v share the same neighbors, their corresponding rows/columns in the adjacency matrix are the same except for the cell that represents the connection between them. Exchanging these two nodes' order won't change the visual appearance of the matrix; it is as if we just swapped the labels of the corresponding two columns/rows. The proof for pseudo-indistinguishable nodes is almost the same, except that the columns/rows representing node u and node v are then exactly the same; the same argument shows that exchanging the two nodes' order won't affect the final visual matrix. ■

Other than indistinguishable nodes and pseudo-indistinguishable nodes, there are other cases where two nodes' order can be interchanged while the final visual representation of the adjacency matrix stays the same. Figure 31 shows a subgraph of a larger graph. The two highlighted nodes are neither pseudo-indistinguishable nor indistinguishable, but their order can be exchanged all the same, because they share the same parent and are symmetric to each other. In other words, if we ignore the labels, the two subgraphs hanging from them have exactly the same size and the same topology. We call such nodes a pair of similar nodes. Indistinguishable nodes and pseudo-indistinguishable nodes belong to this category.
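As an illustration, the two tests can be sketched in a few lines of Python on an adjacency-list graph (the helper names are ours):

def indistinguishable(adj, u, v):
    # Adjacent nodes whose closed neighborhoods coincide: Adj(u)+{u} = Adj(v)+{v}.
    return u != v and v in adj[u] and set(adj[u]) | {u} == set(adj[v]) | {v}

def pseudo_indistinguishable(adj, u, v):
    # Non-adjacent nodes with identical open neighborhoods: Adj(u) = Adj(v).
    return u != v and v not in adj[u] and set(adj[u]) == set(adj[v])

By Theorem 2, any pair passing either test can be swapped without changing the visual matrix.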

Figure 31: A subgraph in a larger graph. The two marked nodes are not pseudo-indistinguishable, but their order can be interchanged and the corresponding visual adjacency matrix stays the same.

Exchanging similar nodes won't change the visual representation of the corresponding adjacency matrix. But it is hard to detect whether two nodes are similar, because there are many different situations; detecting graph symmetry is a genuinely difficult task (see section 3.3.1). One way to solve this problem is to use labels to break ties, similar to the graph mining algorithms introduced in chapter 2: generate all possible node arrangements and choose the lexicographically smallest one as the canonical form. But labels have nothing to do with the graph structure. Two graphs could have exactly the same structure but different labels; in that case we would get two different canonical forms and two different visual adjacency matrices, which is confusing. We want to use visualization to help graph mining, and since we are interested in the graph structure more than anything else, a unique visual representation is more important.

   


Figure 32: Four examples showing how to decide the order between two same-degree nodes u and v. The graphs shown here are all subgraphs of larger graph structures. As in previous examples, thick black lines represent tree edges, dashed blue lines represent non-tree edges, and the blue triangles represent arbitrary substructures. Nodes before u and v, or on levels above, have already been ordered. (a) Nodes u and v are pseudo-indistinguishable. (b) The order between u and v can be determined by the order between the lowest-ordered nodes a and b connected only to u and only to v, respectively. (c) The order between u and v is determined by the degree sums of the unordered nodes connected to u and to v separately. (d) One node should be put before the other because its parent appears before the other's parent.

The Cuthill-McKee procedure guarantees that when we are trying to decide two nodes' order within a group, the orders of the nodes appearing before this group are already known; moreover, the orders of the nodes that come after this group are partially derivable. If the two nodes have different degrees, we simply put the one with the smaller degree before the other; if they have the same degree, there are several different cases to consider.

   


1. If the two nodes are (pseudo-)indistinguishable, we just pick one and put it before the other. In Figure 32 (a), nodes u and v are two pseudo-indistinguishable nodes.

2. If node u or v has connections with other nodes that come before them (either nodes on their direct parent's level or nodes on their own level), then

   a. If node u has a higher pre-degree (see Definition 36) than node v does, we put node u before node v, and vice versa. In this way, we manage to put more-connected nodes closer to each other, so in the resulting adjacency matrix more non-empty elements will be closer to each other too.

   b. If nodes u and v have the same pre-degree, we compare the two lowest-ordered nodes connected to only one of them, respectively. If these two nodes exist, we can use their order to break the tie. For example, in Figure 32 (b), node a is the lowest-ordered node connected only to node u and node b is the lowest-ordered node connected only to node v; if a and b are two different nodes and a comes before b, we put node u before node v, and vice versa. The reason we choose this strategy is to reduce the larger of the bandwidths of edge (a, u) and edge (b, v).

   c. If nodes u and v connect to the same set of nodes before them, then, since they are not indistinguishable or pseudo-indistinguishable to each other, they must connect to different nodes that come after them.

      i. Let S_u be the set of all nodes that are connected to u but not to v and have not yet been ordered, and let S_v be the corresponding set for node v. The sizes of S_u and S_v must be the same. We add up the degrees of all nodes in S_u, compare the sum to the sum of the node degrees in S_v, and, if they are different, put the node with the smaller sum before the other (see Figure 32 (c), where node u should be put before node v).

      ii. If the tie still exists, we generate A_u, the union of the adjacencies of all nodes in S_u with node u removed, and a similar set A_v for node v. It is possible that there are already-ordered nodes in A_u or A_v; if these ordered nodes differ, as in Figure 32 (d), we can use their order to determine the order of u and v.

      iii. If the tie still exists, then nodes u and v might be similar to each other. We use the similarity test to break the tie.

      iv. If we cannot break the tie after all these steps, we generate all possible orderings and break the tie with the filters in section 3.4.

We selected these heuristics with the goal of breaking ties while putting (locally) more connected nodes closer to each other; the rules do not guarantee the minimum global bandwidth. We apply the rules in the order stated above. A later study (Chapter 6) shows that if the graph is not a regular graph (see Definition 42 in the next section), the first rule differentiates two nodes' order in most cases, and the second rule settles most of the remaining ones. Among all the tests we have done, there are very few unsettled ties for which we actually need to test all arrangements. Tests 1 through 2c (ii) are fairly straightforward, although among them test 2c (ii) is the most expensive, since it looks at more nodes. Test 2c (iii) is more complicated, as shown in Figure 35. The last test is the most expensive: in the worst case, it might lead to exponential computation time.

   

function compare_nodes(u : node, v : node, order_list : list, level : int)
 1.  var  adj_u : set        -- nodes adjacent to u's adjacency nodes
 2.       adj_v : set        -- nodes adjacent to v's adjacency nodes
 3.       min_u : integer    -- the minimum-ordered node in a set
 4.       min_v : integer    -- the minimum-ordered node in a set
 5.       common : set       -- intersection of two sets
 6.       result : integer   -- return value: 0, positive, or negative
 7.  begin
 8.      if u.indistinguishable contains v then   -- u and v are known (pseudo-)indistinguishable
 9.          return 0
10.      P_u <- ordered nodes adjacent to u       -- P_u contains ordered nodes adjacent to u
11.      P_v <- ordered nodes adjacent to v       -- P_v contains ordered nodes adjacent to v
12.      common <- P_u ∩ P_v                      -- get the common nodes
13.      P_u <- P_u - common                      -- remove common nodes
14.      P_v <- P_v - common
15.      if P_u ≠ ∅ or P_v ≠ ∅ then begin         -- at least one of them is not empty
16.          result <- |P_u| - |P_v|              -- get the pre-degree difference
17.          if result ≠ 0 then return result     -- if they differ, the tie is broken
18.          else begin                           -- otherwise they must contain different nodes
19.              min_u <- min order of nodes in P_u   -- get the minimum order for nodes in P_u
20.              min_v <- min order of nodes in P_v   -- get the minimum order for nodes in P_v
21.              return min_v - min_u                 -- return the difference
22.      end
23.      S_u <- unordered nodes adjacent to u but not to v   -- S_u contains only unordered nodes
24.      S_v <- unordered nodes adjacent to v but not to u   -- S_v contains only unordered nodes
25.      if Σ degree(S_u) ≠ Σ degree(S_v) then    -- the two sets' degree sums are different
26.          return Σ degree(S_v) - Σ degree(S_u) -- the tie is broken
27.      end
28.      adj_u <- ∅; adj_v <- ∅                   -- initialize
29.      foreach node w in S_u do                 -- for all nodes in S_u
30.          adj_u <- adj_u ∪ Adj(w)              -- add their adjacencies to adj_u
31.      foreach node w in S_v do                 -- for all nodes in S_v
32.          adj_v <- adj_v ∪ Adj(w)              -- add their adjacencies to adj_v
33.      remove u, v and common nodes from adj_u and adj_v
34.      if adj_u or adj_v contains ordered nodes -- there are ordered nodes in adj_u or adj_v
35.         and adj_u ≠ adj_v then begin          -- and adj_u and adj_v are different
36.          -- use the ordered nodes to break the tie:
37.          min_u <- min order of ordered nodes only in adj_u
38.          min_v <- min order of ordered nodes only in adj_v
39.          return min_v - min_u                 -- return the difference
40.      end
41.      return isSimilar(u, v, ...)              -- test if these two nodes are similar
42.  end

Figure 33: The procedure to order two nodes of the same degree.

   


Figure 33 shows the procedure for determining the order between two nodes of the same degree (for nodes with different degrees, the order is obvious). The procedure follows the rules just introduced. It first tests whether the two nodes are known to be (pseudo-)indistinguishable from each other and returns 0 (meaning they are "equal") if so. If not, it goes on to test the pre-degrees (lines 15-17) and the minimum-ordered node in each set (lines 19-21). If we still cannot decide their order, we collect the unordered nodes adjacent to u or to v but not to both, calling these sets S_u and S_v (lines 23, 24). These two sets must have the same size, because u and v have the same degree and connect to the same ordered nodes; and because they are not (pseudo-)indistinguishable, they must connect to different nodes that come after them. We first compute the sum of the node degrees in each of the two sets; if the sums differ, the tie is broken (lines 25-27). If not, we look into the nodes adjacent to S_u and S_v. As discussed in section 3.1, a node in a BFS spanning tree can have connections with nodes one level above, on its own level, or one level below, so it is possible that some nodes in these adjacency sets have already been ordered; if the ordered nodes in the two sets differ, we can use them to decide the tie (lines 37-39). If the tie still exists, we test whether the two nodes are similar (line 41; see section 3.3.1). The similarity test is really a test for dissimilarity: if there is any detectable dissimilarity, the tie can be broken. Otherwise, we have to generate all possible orderings between these two nodes and, because of this, test all possible arrangements. In sections 3.4.1 and 3.4.2 we design two strategies, the global penalty and the pattern matching test, that guarantee one and only one ordering will be picked.

   


3.3.1 Symmetric Graph Structures

Symmetric graph structure is an interesting topic in graph theory; it starts with Foster's tabulation of symmetric cubic graphs (Foster 1932; Harary 1999). Here are formal definitions for the related concepts.

Definition 39: Two nodes u and v of a graph G are similar if α(u) = v for some automorphism α of G. A graph is point symmetric if every pair of points is similar (Harary 1999).

Definition 40: Two edges e1 = (u1, v1) and e2 = (u2, v2) are similar if there is an automorphism α of G with α({u1, v1}) = {u2, v2}. A graph is line symmetric if every pair of lines is similar (Harary 1999).

The "point" here refers to a node and the "line" here refers to an edge.

Definition 41: A symmetric graph is an undirected graph whose group of automorphisms is transitive on its vertices as well as on its edges (Chao 1971).

Definition 42: A graph is regular if every vertex is of the same degree. It is k-regular if every vertex is of degree k (Gross and Yellen 2004).

Symmetry is one of the important aesthetic criteria that have been studied intensively in recent years. A symmetric graph can be decomposed into a number of isomorphic subgraphs that help to represent or display the graph (Manning and Atallah 1988; Manning 1990; Manning, Atallah et al. 1995; Chen, Lu et al. 2000). But "testing symmetry of a general graph is a NP-complete problem" (Fraysseix 1999; Munson 2004; Bhowmick and Hovland 2006). Several heuristic algorithms for detecting symmetries are presented in (Fraysseix 1999; Abelson, Hong et al. 2001; Buchheim and Junger 2001).

Hard-to-detect symmetric structures or nearly symmetric structures can bring out the worst-case scenario for our algorithm. Except for indistinguishable or pseudo-indistinguishable nodes, testing the similarity of two nodes amounts to testing whether the two subgraphs derived from them are isomorphic, and, as mentioned in chapter 2, the general graph isomorphism problem is a hard problem with no known polynomial-time algorithm. A symmetric graph, however, won't force us into |V|! work, because there are always clues that limit the number of possible orderings.

Figure 34 shows a 3-regular graph in which every node has degree 3. We can see from the display that there are several pairs of similar nodes. For this particular example, every node is a peripheral node, so all of them can be used as candidate starting nodes. Figure 34 (b) and (c) are two spanning trees rooted at the same node. On the second level there are three nodes with the same degree. It is not hard to see that one of them should be put before the other two, because it connects to two unordered nodes on a level higher than the unordered neighbors of the other two. The order between the remaining two nodes is hard to decide, and we need to find more clues; but once we choose an order for them, the rest of the nodes' order is fixed. One further pair of nodes is pseudo-indistinguishable, and the order difference between them won't change the corresponding visual representation of the adjacency matrix. So in this example, for a fixed starting node, there is only one undecidable case. With 8 starting nodes, there are 16 possible orderings in total. Compared to 8!, 16 is a small number.

Figure 34: A 3-regular graph with two of its spanning trees rooted at the same node; every node is of degree 3. (a) The graph; (b), (c) two spanning trees.

Assume nodes u and v have the same degree, have passed the initial tests in Figure 33, and their order is still unknown. We need to find more information from the things we already know. As mentioned before, for a fixed node there is only one level structure, so testing inequality between the two subtrees rooted at nodes u and v becomes our next step.

We first construct two BFS spanning subtrees rooted at each of these two nodes; call them T_u and T_v. T_u and T_v contain only unordered nodes. While we build these two trees simultaneously in a top-down manner, at each level we compare several properties: the total number of nodes on the current working level, the node degree distribution, and the node level distribution (a node's level refers to its level in the fixed level structure). If the two nodes pass all tests on the current working level, we continue to construct T_u and T_v and go on to the next level, and we keep testing until we reach the leaf level. If at any point there is an inequality, the tie can be broken.

If at any point the corresponding two levels in T_u and T_v share the same set of nodes, we stop the procedure and return 0, because from then on all the nodes below in T_u and T_v will be the same. If we still have a tie at the leaf level, we also return 0. Figure 35 gives the pseudocode for this similarity check procedure.

When the similarity test returns 0, the calling procedure will generate all possible orderings for each possibility and try to break the tie with the global penalty and pattern matching methods of sections 3.4.1 and 3.4.2. If the similarity test returns a value other than 0, the tie is broken: if the value is larger than 0, we put node u before node v; if it is less than 0, we put node u after node v.

At step 21, we check the maximum level of the nodes in the current construction of tree T_u. If the newly added nodes' maximum level is less than the existing nodes' maximum level, we stop the construction of T_u, because we only want to check unordered nodes below node u's current level.

   


This procedure needs to build two level structures; the building cost is O(|V| + |E|). At each level we also need to test several properties, including the node degree distribution and the node level distribution. For these distribution tests we have to sort the nodes first and then compare, which can add O(|V| log |V| + |E|) to the total cost.

function isSimilar(u : root node, v : root node, ordered : node set, LS : level structure)
 1.  var  L_u, L_v : node collections   -- nodes on the current level of each subtree
 2.       N_u, N_v : node set           -- unordered nodes on the next level
 3.       T_u, T_v : node set           -- nodes that have been visited
 4.       result : integer              -- return value
 5.       max_u, max_v : integer        -- the maximum level of the nodes in T_u or T_v
 6.  begin
 7.      L_u <- {u}; L_v <- {v}         -- start from node u and node v separately
 8.      T_u <- {u}; T_v <- {v}         -- initialize the trees rooted at u and v
 9.      max_u <- level(u)              -- initialize the max level value for T_u
10.      max_v <- level(v)              -- initialize the max level value for T_v
11.      while true do begin
12.          N_u <- unordered nodes on the next level of L_u
13.          N_v <- unordered nodes on the next level of L_v
14.          if N_u = N_v then          -- both sets contain the same nodes
15.              return 0
16.          end
17.          if |N_u| ≠ |N_v| then      -- the two sets differ in size
18.              return |N_v| - |N_u|   -- the tie is broken
19.          end
20.          new_max_u <- max level of the nodes in N_u   -- get the maximum node level in N_u
21.          if new_max_u < max_u then begin              -- stopping point for tree T_u
22.              new_max_v <- max level of the nodes in N_v
23.              if new_max_v < max_v then                -- stopping point for tree T_v too
24.                  return 0                             -- undecidable case
25.              else
26.                  return 1                             -- put node u before v
27.              end
28.          end
29.          new_max_v <- max level of the nodes in N_v
30.          if new_max_v < max_v then                    -- stopping point for tree T_v only
31.              return -1                                -- put node u after v
32.          end
33.          max_u <- new_max_u                           -- reset the maximum values
34.          max_v <- new_max_v
35.          T_u <- T_u ∪ N_u                             -- add the new sets to the corresponding trees
36.          T_v <- T_v ∪ N_v
37.          for each a in N_u and b in N_v, in order, do -- test the degree distribution
38.              if degree(a) ≠ degree(b) then
39.                  return degree(b) - degree(a)         -- the tie is broken
40.              end
41.          end
42.          for each a in N_u and b in N_v, in order, do -- test the level distribution
43.              if level(a) ≠ level(b) then
44.                  return level(b) - level(a)           -- the tie is broken
45.              end
46.          end
47.          L_u <- N_u; L_v <- N_v                       -- move down one level
48.      end
49.  end

Figure 35: The procedure for testing whether the two substructures rooted at u and v are symmetric (isomorphic) to each other.


3.4 Dealing With Multiple Candidate Orderings

Section 3.2 described how to select the starting node, and section 3.3 explained how to determine two nodes' order in a tie situation. After all of this, it is still possible to have more than one ordering: so far we cannot guarantee a unique ordering, either because of multiple starting nodes or because of some unsolvable ties. Since smaller bandwidths are more likely to generate a nice clustering effect in the resulting adjacency matrix, selecting the ordering with the minimum bandwidth is always our first criterion (the bandwidth filter). But there might be more than one ordering with the same bandwidth. In order to choose one and only one ordering, we propose the global penalty and the pattern matching strategy.

3.4.1 Global Penalty (GP)

The Global Penalty (GP) looks at all graph edges as a whole: it is the summation of the bandwidths of all edges in the graph. For example, for the edge (u, v) in Figure 36, if node u's order is 0 and node v's order is 1, the bandwidth of edge (u, v) is |1 - 0| = 1. A smaller global penalty means the non-empty elements are, as a whole, closer to the main diagonal, so they appear more visually clustered in the corresponding adjacency matrix.

   

Figure 36: Two adjacency matrices of two different orderings. The visualizations of the two matrices appear very similar; the only difference is one edge, shown in areas 1 and 2. The global penalty for the matrix in (a) is 41, while the global penalty for the matrix in (b) is 40.

Figure 36 shows an example of how the global penalty can help us select one ordering over another. The two patterns in Figure 36 appear very similar although they are actually quite different. Under the global penalty, the ordering in Figure 36 (b) has the smaller value because of the differing edge: in Figure 36 (b) its two endpoints are closer to each other than in the ordering of Figure 36 (a). Whenever we need to compare two orderings that have the same bandwidth, we calculate their global penalties and choose the smaller one. The calculation loops through all the edges and gets the orders of their two endpoints, so the total time is on the order of O(|E|), where |E| is the number of edges. We assume that getting the order information of one node takes expected O(1) time; this is not hard to implement if we use a hash table to store that information.

1 . It is not hard


It is still possible that more than one ordering has exactly the same global penalty; in such a case, we continue with the pattern matching strategy.

3.4.2 Pattern Matching

Figure 37: Two different orderings of the same graph with exactly the same visual pattern in their adjacency matrices. On the left are two BFS spanning trees; their corresponding adjacency matrices are shown on the right.

After all the rules, filters, and the penalty we have introduced, if there is still a tie, we compare the two corresponding adjacency matrices row by row and break the tie if there is a mismatch. If the two orderings still tie after this row-by-row matching, then the two matrices' visual patterns are exactly the same: we simply choose either one of them, and we also mark the first two nodes that make the two orderings differ as similar to each other. This can be an interesting characteristic for the user to know, and it also helps us during the computation. In the worst case, where we need to match every cell of the two matrices, it takes O(|V|^2) time, where |V| is the number of nodes in the graph. Figure 37 shows an example where two different orderings generate exactly the same matrix pattern. Note that the example in Figure 37 is only for the purpose of explaining how pattern matching works; it is not a real case, since the root node used there is not a peripheral node.

An undirected graph's adjacency matrix is symmetric with respect to the main diagonal, so when comparing two matrices we only need to consider the upper triangle (or lower triangle). Starting from the first row, beginning at the diagonal position, we match all cells one by one from left to right. The only possible mismatch is when one matrix has a non-empty element while the other has an empty one at the same position. When this first mismatch is detected, we stop the procedure and select the pattern that has the non-empty element at that position.

This pattern matching strategy may end up matching two rows that represent two different nodes. For example, the two orderings in Figure 37 (a) and (b) first differ at the third position, so the third row in Figure 37 (a) corresponds to a different node than the third row in Figure 37 (b). We simply ignore the fact that these are two different nodes and focus only on the two patterns: if the two rows share the same pattern, we call it a match and go on to the next row. Figure 38 shows the pattern matching procedure; we use a bitmap structure to compare two rows.

function pattern_matching(P1, P2)        -- P1 and P2 are two node-ordering arrays
  var bitmap1, bitmap2, diff : int array -- bitmaps of 0s and 1s
      w : node
      n : int                            -- the length of P1 and P2
      col : int                          -- the first column where the two rows differ
      order : int                        -- a node's order in P1 or P2
begin
  foreach i in [0, n) do begin           -- for each row, in order
      reset bitmap1                      -- set all bits to 0
      foreach w adjacent to node P1[i] do begin
          order <- index of w in P1      -- get the order of w in P1
          bitmap1[order] <- 1            -- turn on the flag
      end
      reset bitmap2
      foreach w adjacent to node P2[i] do begin
          order <- index of w in P2      -- get the order of w in P2
          bitmap2[order] <- 1            -- turn on the flag
      end
      diff <- bitmap1 XOR bitmap2        -- compare the two bitmaps
      col <- first position after the main diagonal where diff is 1
      if col exists then                 -- the rows disagree at column col
          if bitmap1[col] = 1 then       -- P1 has the non-empty element there
              return 1                   -- select P1 and discard the other one
          else
              return 2                   -- otherwise, select P2
          end
      end
  end
  return 3                               -- the two patterns are the same
end

Figure 38: Pattern matching procedure.
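A compact Python sketch of the same idea (names ours): each row of the upper triangle is encoded as the set of columns holding non-empty cells, the symmetric difference plays the role of the XOR of the two bitmaps, and the ordering owning the non-empty cell at the first mismatch is kept.

def pattern_matching(adj, p1, p2):
    # Returns 1 to keep ordering p1, 2 to keep p2, 3 if the patterns are equal.
    pos1 = {v: i for i, v in enumerate(p1)}
    pos2 = {v: i for i, v in enumerate(p2)}
    for i in range(len(p1)):
        # Columns after the diagonal with non-empty cells in row i of each matrix.
        row1 = {pos1[w] for w in adj[p1[i]] if pos1[w] > i}
        row2 = {pos2[w] for w in adj[p2[i]] if pos2[w] > i}
        diff = row1 ^ row2              # symmetric difference = bitmap XOR
        if diff:
            col = min(diff)             # first column where the rows disagree
            return 1 if col in row1 else 2
    return 3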

 

 


3.5 Disconnected Graphs

So far we have assumed that we are dealing with a connected graph, but much real-life graph data contains disconnected graphs. In that case, we first go through the data and partition it into connected graphs. The partition algorithm is very simple: it starts at a random node and builds a level structure rooted at this node; all nodes belonging to this structure form a connected graph. We then remove these nodes and repeat the procedure on the rest of the graph. The output is a list of connected subgraphs, and we apply our strategies to each of these graphs to get its unique ordering.

function graph_partition(G : graph)      -- each returned sub-graph is a connected graph
  var V : a set of nodes
      L : a set of sub-graphs
      S : a sub-graph
begin
  V <- nodes of G
  L <- empty set                         -- initialize the return set
  while V ≠ ∅ do begin                   -- as long as V is not empty
      S <- empty                         -- reset the sub-graph
      r <- the first element in V        -- pick the first element in V
      S <- BFS spanning tree rooted at node r
      V <- V - nodes of S                -- remove all the nodes in S from V
      L <- L ∪ { S }                     -- add S to the returned sub-graph list
  end
  return L
end

Figure 39: The graph partition procedure.
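A minimal Python sketch of this partitioning (function name ours), using the same BFS idea; we also sort the components by size in descending order, which is the display order chapter 5 expects.

from collections import deque

def graph_partition(adj):
    # Split an adjacency-list graph into its connected components.
    remaining = set(adj)
    components = []
    while remaining:
        root = next(iter(remaining))
        seen = {root}                   # BFS level structure rooted at root
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        remaining -= seen               # remove this component's nodes
        components.append(seen)
    return sorted(components, key=len, reverse=True)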

3.6 Nodes with Self-Loops

Although our algorithm is designed for sparse undirected graphs, it is not sensitive to self-loops, because self-loops don't change the level structure for a fixed starting node. The only place that needs extra attention is when comparing two nodes by their degrees: we remove a self-loop's influence by not counting it toward a node's degree. A self-loop can instead be treated as a special property of a node.

3.7 Dense Graphs

Our procedure can also be applied to a dense graph, because none of our strategies is limited to sparse matrices. But the density of the input graph does affect the performance of our procedure. As mentioned in chapter 2, if a graph is dense, identifying its candidate starting nodes takes more time; besides that, more edges can produce more ties and more permutations, which also slow down our algorithm.

3.8 Discussion

Our goal is to get a unique visual adjacency matrix representation for every input graph, one that depends only on the graph structure and stays the same no matter what the initial input ordering is.

Theorem 3: Two graphs are isomorphic if and only if their unique visual adjacency matrix representations are the same.

Proof: We only need to prove that for an input graph, our procedure produces exactly one visual adjacency matrix. There are three steps in the procedure; we examine each of them and show that its result depends only on the topology of the input graph.

   


The first step is to select candidate starting nodes. We do a brute-force search and select all qualified nodes; this step is a complete search.

The second step is to order the nodes on each level of the level structure. For a fixed starting node, the nodes on each level are also fixed. Except for the root and the first level, nodes on each level are first ordered by their parents (into what we call groups); these groups reduce the number of possible undecidable cases. Within each group, the nodes are ordered by their degree and a set of heuristics. Each heuristic is well defined; there is no chance that a test could produce more than one result. For each unsolvable tie, we check every possible permutation, which is expensive but complete.

The last step is to select a unique ordering from all these possible orderings. We designed two filters, the bandwidth and the global penalty, to rule out some orderings. These two filters are based only on properties of the matrix, but they still cannot guarantee that only one ordering is left. So we propose the pattern matching strategy, which compares every two matrices cell by cell and selects only one of them. This pattern matching guarantees that exactly one matrix is selected, and no matter in which order we compare, the selected visual matrix representation is always the same. ■

Fortin stated in his report that "it is almost always trivial to check two random graphs for isomorphism" (Fortin 1996). This also applies to our procedure for generating the canonical visual adjacency matrix: general random graphs are not likely to have many undecidable cases. As our analysis in section 3.3.1 showed, the undecidable cases are related to hard-to-detect symmetric or approximately symmetric graph structures. In the worst case, where there are many ties we cannot break, we need to generate every possible ordering for each possibility, which can result in exponential running time. This worst-case running time is comparable to that of graph isomorphism testing algorithms (Read and Corneil 1977; Babai, Erdos et al. 1980; Babai 1981; Fortin 1996).

Finding the minimum bandwidth is not our goal. The above strategies might not lead to the minimum bandwidth, but they are not likely to produce a large one. Compared to the GPS procedure (section 2.6.1), we use only peripheral nodes as starting nodes and select those that produce the smallest-width trees as candidates for further tests, so our result's bandwidth should be comparable to that of the GPS procedure.

   


 

4 Patterns of the Adjacency Matrix

This chapter gives several example patterns that can appear in a visual adjacency matrix.

4.1 Clique

Definition 43: A clique is a subset of vertices in a graph that are mutually adjacent to one another (Gross and Yellen 2004).

Figure 40: A clique with two of its representations. The clique forms a square that is symmetric to the main diagonal. The main diagonal is empty because there is no self-loop in this graph. On the right is the corresponding node-link representation.

   


Figure 40 shows a clique in two of its representations. In the matrix representation on the left, the clique forms a completely filled square (except for the main diagonal) whose own diagonal lies on the matrix's main diagonal.

Lemma 1: A completely filled square (except for its main diagonal) is a clique if its main diagonal overlaps with the matrix's main diagonal.

We skip the proof of Lemma 1 since it is straightforward. Note that the converse of Lemma 1 is not necessarily true. For example, if a clique is a subgraph of a larger connected graph, it won't necessarily appear as a completely filled square; instead it can appear in many different forms, all depending on how the graph nodes are ordered. If the nodes in the clique are arranged next to each other with no non-clique nodes in between, we see a filled square as shown in Figure 40. On the other hand, if the whole graph is a clique, the completely filled square is the only possible pattern.
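To illustrate Lemma 1, here is a small Python check (names ours): given a 0/1 adjacency matrix M whose rows and columns follow the current node ordering, a contiguous index range lo..hi is a clique exactly when every off-diagonal cell of that square is filled.

def is_clique_block(M, lo, hi):
    # True if rows/columns lo..hi form a completely filled square
    # (except the main diagonal), i.e., those nodes are mutually adjacent.
    for i in range(lo, hi + 1):
        for j in range(lo, hi + 1):
            if i != j and not M[i][j]:
                return False
    return True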

4.2 Complete Bipartite Graph

Definition 44: A graph G is bipartite if its vertex set V can be partitioned into two sets V1 and V2 such that every edge joins a vertex in V1 with a vertex in V2 (Gross and Yellen 2004).

Definition 45: A complete bipartite graph is a simple bipartite graph in which each vertex in one partite set is adjacent to all the vertices in the other partite set (Gross and Yellen 2004).

   


Figure 41: A complete bipartite graph in both matrix and node-link representations. On the left is the adjacency matrix representation; the complete bipartite graph shows up as two completely filled squares in the matrix.

The complete bipartite graph is another interesting pattern. In the matrix representation (see Figure 41), a complete bipartite graph may appear as a completely filled square; unlike the clique, this square can be anywhere in the matrix, and the rows and columns of the square represent the two disjoint vertex sets. Note that the pattern in Figure 41 is not the only possible pattern for a complete bipartite graph: depending on the node ordering, it can take many different forms. A general bipartite graph can also produce different patterns; Figure 42 shows one of them.

   


Figure 42: A simple bipartite graph. In the matrix, the edges form a triangular shape. Note that this triangle can appear anywhere in the matrix; wherever it appears, it represents a bipartite graph.

 

4.3 Path

Finding a path (see section 2.1 for related concepts) is difficult with just an adjacency matrix representation (Ghoniem, Fekete et al. 2004). If rows and columns are arranged properly, with the vertices in the order of their appearance in the path, then a path shows up as a line parallel and adjacent to the main diagonal (Figure 43).

   


Figure 43: In the adjacency matrix, a line (colored green) parallel and adjacent to the main diagonal represents a path. In the right window, the corresponding path is also colored green.

 

4.4 Highly Connected Node

A highly connected node (a node with relatively high degree) is very easy to identify if all or most of the nodes (rows and columns) connected to it are next to each other in the adjacency matrix representation.

   


Figure 44: A high degree node forms a horizontal/vertical bar (colored in red) in the matrix representation on the left. The corresponding nodes and links are highlighted in blue on the right.

 

4.5 Other Patterns

Sections 4.1-4.4 introduced four interesting matrix patterns. These are just a few examples; there are many others. For a large and complicated graph, any kind of pattern can emerge, and whether it is interesting depends on the actual application and the user's interest.

   


5 System Implementation

In this chapter, we introduce the prototype of our matrix/graph visualization.

5.1 Data Structure

 

Figure 45: Data structure for our matrix graph visualization framework.

Our data structure is designed to assist both the reordering computation and the user interaction. We use a data structure concept similar to that of the Tom Sawyer graph editor toolkit (TSE) (Corporation 2003). On the top level is an object called the "Graph Manager." It manages a list of subgraphs; each of these subgraphs is a connected graph, and there are no connections between them. In TSE, a graph object represents a general graph, which could be either connected or disconnected; our graph object represents only a connected graph. The graph manager keeps some basic information about the overall graph(s), such as the total number of subgraphs, the number of nodes, and the number of edges.

The subgraph list is sorted in descending order by subgraph size (the number of nodes in a graph). When we display these subgraphs in the matrix visualization, we follow the same order as this subgraph list: the subgraph with the largest number of nodes is put before the others and displayed at the upper left of the matrix, while the graph with the fewest nodes is shown at the lower right.

A graph object contains an ordered node list, an edge list, and a unique ID to differentiate it from other graph objects. The initial order of the node list depends on the graph loader; it could be in any sequence. The graph object also contains two lists for selected node IDs and selected edge IDs; these lists are empty in the beginning and are updated when there is a selection action.

A node object contains basic information about itself: an adjacency list indicating all its neighbors, a list of the nodes that are (pseudo-)indistinguishable from it, a list of the nodes that are similar to it, its unique ID, a label, and a flag indicating whether it has been selected. Similar to the node object, an edge object contains its unique ID, a label, a selection flag, and two references to its end nodes.
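A condensed Python sketch of this hierarchy (the field names are ours, chosen to follow the description above):

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    id: int
    label: str = ""
    selected: bool = False
    adjacency: List[int] = field(default_factory=list)          # neighbor node ids
    indistinguishable: List[int] = field(default_factory=list)  # (pseudo-)indistinguishable partners
    similar: List[int] = field(default_factory=list)            # similar nodes

@dataclass
class Edge:
    id: int
    source: int     # id of one endpoint
    target: int     # id of the other endpoint
    label: str = ""
    selected: bool = False

@dataclass
class Graph:        # always a connected graph in our design
    id: int
    nodes: List[Node] = field(default_factory=list)   # kept in display order
    edges: List[Edge] = field(default_factory=list)
    selected_node_ids: List[int] = field(default_factory=list)
    selected_edge_ids: List[int] = field(default_factory=list)

@dataclass
class GraphManager:
    subgraphs: List[Graph] = field(default_factory=list)  # sorted by size, descending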

   


5.2 File I/O

There are many ways to store graph information, and different applications usually use different formats for various reasons. GraphViz (Gansner, Koutsofios et al. 1993; Gansner and North 2000) uses the DOT format, which describes not only the topological information of a graph but also detailed information about how each object (edge/node) should be drawn on the screen (see Figure 46).

Figure 46: An example of the GraphViz file format (Gansner, Koutsofios et al. 2006). (a) The DOT file; line 2 indicates the graph's size in inches, and each of the remaining lines represents a node or an edge along with its drawing properties (such as size, shape, and color). (b) The node-link representation of the graph.

Various formats make it hard to exchange graph information between different groups. The GraphML (Graph Markup Language) project was launched by the graph drawing steering committee in 2000. The goal is to form a language that "allows to define extension modules for additional data" (Brandes, Eiglsperger et al. 2001), so that different applications can add or remove these extensions easily according to actual needs while the original graph structure is not affected (Brandes, Eiglsperger et al. 2001; GraphMLTeam 2007). GraphML is based on XML. Figure 47 shows a simple graph with a GraphML file that describes only the topology of the graph.

 

Figure 47: A simple graph with its GraphML file description (Brandes, Eiglsperger et al.).

Figure 48: A graph and its corresponding XML file format.

   


Our graph file format is also based on XML. Since our application focuses on computation that uses only the topological information of a graph, we store just the basic graph structure plus some flags that support the selection interaction. The graph element consists of two kinds of child elements: node elements and edge elements. Each node element contains the node's ID, whether it has been selected, and its label. The edge element is similar; it has a unique ID, a label, a selection flag, and source and target fields that keep the IDs of its two endpoints. The selection flag indicates whether the object should be written out when a user chooses the "save selected …" option, which we introduce later. Although we didn't follow the strict guidelines of the GraphML format, our format is quite similar to it and can be converted easily.
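A hypothetical instance of this format (the element and attribute names are illustrative, reconstructed from the description above rather than taken from an actual file):

<graph>
  <node id="0" selected="false" label="a"/>
  <node id="1" selected="false" label="b"/>
  <edge id="0" source="0" target="1" selected="false" label="e0"/>
</graph>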

   


5.3 User Interface

  Figure 49: Graph matrix visualization interface. On the left is the matrix visualization and on the right is the corresponding graph’s node-link representation.

The user interface of our graph matrix visualization is very simple; it is similar to that of the MatrixExplorer (Henry and Fekete 2006). Figure 49 shows our graph matrix explorer interface. On the left is the matrix visualization, and on the right is the graph drawing visualization. These two windows are put side by side so a user can compare different views of the same graph. The frequently used interactions associated with each display are put at the bottom of the corresponding window. The split bar at the center is movable, so a user can easily adjust the view sizes when needed.

Non-empty elements in the matrix representation are colored yellow, and node labels are displayed at the top and left of the matrix. When the number of nodes is not large, the matrix has a gray grid background to help the user identify the position of each non-empty element. When the graph has many nodes, displaying this background would interfere with the display of the non-empty elements, and it also loses its usefulness for positioning a cell; our visualization then drops the background automatically and uses a pure black one. Our experience shows that a black background makes an image appear more dominant.

The graph matrix in Figure 52 has roughly 1.8K nodes and 4.2K edges. When displaying a graph of this size in the matrix visualization, each edge (non-empty element) gets only one pixel on a regular screen. Of course we could give each edge more pixels, but then it would be even harder to show the whole image on a regular display. After dropping the gray grid background, the image is more pleasing to the eye. In this display the node labels overlap each other and are very hard to read; hiding labels is an option a user can choose interactively on either large or small graph matrix displays. Both windows also have scroll bars to help display a large matrix/graph.

We use the TSE to draw the node-link graph on the right. TSE provides a powerful library for graph display and interaction. The example in Figure 49 shows a graph in the circular layout (see section 2.2); TSE also provides other frequently used layouts. Most interactions associated with this window are implemented with the TSE library.

   


5.4 Interactions

Our visualization system supports several basic interactions for both the matrix and node-link visualizations.

5.4.1 Selection

Linking is one of the key features of an interactive visualization system. Linking refers to the mechanism by which several visualizations are connected in such a way that a record selection made in one window has the same selected records highlighted in the other visualizations (Becker and Cleveland 1987; Gee 2004). Extra attention is needed when linking the matrix visualization with the node-link representation, because the two visualizations have different focuses: matrix visualizations emphasize edges, while node-link representations focus on both nodes and edges.

   


 

 

Figure 50: Linking between the matrix and the node-link representations. (a) User can select both nodes and edges in the matrix display and selected elements will be highlighted in the other window. (b) Selection in the node-link display might contain both nodes and edges. Again selected elements are highlighted in both displays.

   


Our matrix graph visualization system allows a user to make selections in both windows, and the two displays are updated accordingly. The matrix visualization window can be divided into two areas: one inside the matrix, where edges are displayed, and one along the two sides (top and left) of the matrix, where node labels (if any) are shown. Selections inside the matrix highlight only the selected edges in both displays, and selecting labels (nodes) highlights only nodes in the two windows. The node selection and the edge selection are two independent selections; one won't dismiss the other. Both support multiple selection: when the "shift" key is pressed, a new selection is added to the previous selections; otherwise the new one clears the old one. In the node-link display, a selection can contain both nodes and edges; when that happens, we highlight the selected records in both displays. Figure 50 (b) shows an example of a selection in the matrix or node-link display: the current selection area is indicated by the green rectangle, selected edges are highlighted in red, and selected nodes are highlighted in blue.

   


  Figure 51: The selected subgraph can be viewed in a separate window as a node-link graph.

Selection is one of the basic and most important interactions for helping a user explore graph data. Although a selection highlights the selected entities in both views, when the graph is large it can still be hard to see the subgraph clearly in the node-link display because of overlapping. In this case, a user can choose to view the selected subgraph in a separate window (see Figure 51). The size of this subgraph view window can be changed freely, and the user can also change the layout of its node-link representation. The subgraph view window also supports probing, so a user can get detailed node/edge label information when necessary. The selected subgraph can be saved to an XML file by choosing the "Save selected…" option in the file menu.

   


5.4.2 Highlight Skeleton Edges

We proved in section 3.1 that the bandwidth of a non-tree edge is always smaller than that of one of the tree edges sharing an endpoint with it. We call the tree edges the skeleton edges of a graph. If we highlight them in the canonical visual adjacency matrix, they show up as an outline around the non-tree edges. The highlight-skeleton option is only available after the matrix has been reordered; before the reordering the skeleton is meaningless, because it is associated with a particular BFS spanning tree. A user can also choose to see these skeleton edges in the node-link graph display (see Figure 52 and Figure 53), where we additionally provide the option to display or hide non-skeleton edges.

   


Figure 52: Skeleton edges are highlighted in purple in the matrix display; this option is only available after the matrix has been reordered. This graph has ~1.8K nodes and ~4.2K edges. The line along the main diagonal of the matrix indicates self-loop edges. Annotations in the figure mark high-degree nodes, the skeleton, a clique, and two clusters.

Figure 53: Skeleton edges marked in the node-link graph display. This is the same graph as the one shown in Figure 52.

5.4.3 Reordering

We provide two reordering strategies for the matrix visualization: automatic reordering and manual reordering. When a user chooses the automatic reordering option, the nodes are rearranged following the procedure introduced in chapter 3. The resulting adjacency matrix is the canonical visual adjacency matrix, and some interesting patterns may emerge. Figure 54 shows a matrix before reordering, while Figure 52 shows the same matrix after reordering: in Figure 52 the non-empty elements are closer to the main diagonal, and several small clusters (including a clique) can be easily identified.

   


Figure 54: The same matrix as shown in Figure 52, but before the reordering. Non-empty elements are spread all over the matrix. The line along the main diagonal indicates self-loops.

Figure 55: Manual reordering in the matrix display. (a) A node is being moved from its original position (outlined in green) to a new position (outlined in red). During the move, the column keeps its original look and the whole image is transparent, so a user has a feeling of the physical move and some hints of what the matrix will look like afterwards. (b) The matrix after the node has been moved to its new position.

 110   

The manual reordering option allows a user drag a row/column and put it in a desired position anywhere inside the matrix. We find this interaction is very useful to let a user learn how a pattern is formed and what a pattern means when comparing it with the node-link display (see Figure 55).

5.4.4 (Pseudo) Indistinguishable Nodes In section 3.3, we introduced two special kinds of node, the indistinguishable nodes and the pseudo-indistinguishable nodes. As we discussed, a pair of indistinguishable (or pseudo-indistinguishable) nodes’ order can be interchanged and the visual effect of the adjacency matrix will stay the same. These nodes might be of interest to the user and they are highlighted by changing the node’s label to red (see Figure 56).

  Figure 56: Indistinguishable nodes and pseudo-indistinguishable nodes’ labels are highlighted in red.

   

 111   

5.4.5 Other Interactions

Cluster 1 

  Figure 57: Zooming in the matrix display, it is activated by the mouse wheel. User can interactively change the zoom level in the matrix display. This is the same graph as showed in Figure 52. The “Cluster 1” indicated here is the same “Cluster 1” in Figure 52.

 

   


  Figure 58: Probing in the matrix display. Green lines associated with the mouse position can help a user locate those two nodes that are connected by the current probing edge. The two endpoints’ labels are also highlighted in green.

 

Zooming and probing are two other commonly used interactions; they are particularly useful when exploring large graphs. Zooming in the matrix display is activated by the mouse wheel: a user can zoom in or out by turning the wheel in the two opposite directions, and can control the amount of zoom by how far the wheel is turned. Probing shows detailed information about an edge. When the mouse cursor is inside the matrix display, probing is activated: two green lines appear, one horizontal and one vertical, both starting from the current cursor position, and the corresponding node labels turn from white to green so the user can see which nodes are the endpoints of the current edge.

5.5 Non-Graphical Output

During the computation of the canonical visual adjacency matrix, we need to calculate the bandwidth of every candidate matrix, the height and width of the spanning tree in use, the global penalties, and more. These parameters could be useful, but so far we do not use them for anything except the matrix computation. Our current application writes these items to the command line.

5.6 Summary

The data structures and file I/O are designed to support both the computation and the graph matrix visualization. The user interface and the interactions associated with the visualization system are very basic and reflect our general ideas of how graph matrix visualization can help graph exploration. The system is designed for general use, and new functionality can easily be added for special requirements. For example, the indistinguishable nodes, the pseudo-indistinguishable nodes, the similar nodes, and the skeleton edges are very interesting features that reveal special properties of the input graph and its canonical adjacency matrix. Depending on the actual application, the user might want to compare two skeletons to get a sense of how two graphs are related. A user can also compare the two canonical visual adjacency matrices of two input graphs; this can be implemented by calling the existing pattern matching procedure.
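For instance, comparing two graphs through their canonical visual matrices can be sketched as follows; `canonical_matrix` stands in for our reordering procedure and is not a library call:

    import numpy as np

    def same_structure(A, B, canonical_matrix):
        # Two graphs are structurally identical iff their canonical
        # visual matrices are equal (the claim proved in chapter 3).
        return (A.shape == B.shape and
                np.array_equal(canonical_matrix(A), canonical_matrix(B)))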

   


6 Results

We tested our algorithm for its sensitivity to the initial layout and for its behavior on several different graph datasets, including several symmetric graphs. We then applied it to the IMDB movie dataset for the year 2006 (see section 6.3) to test its potential for assisting graph exploration.

6.1 Different Graph Models

The influence of the initial vertex ordering is tested by randomizing the initial vertex order every time before starting the vertex reordering process. Just as we expected, the visual representation of the results stays the same, although the actual node orderings of the resulting adjacency matrices might differ because of indistinguishable nodes, pseudo-indistinguishable nodes, and similar nodes.

We tested several different graphs, including symmetric graphs and cliques (a special case of the symmetric graph). Symmetric graphs generally have several different vertex orderings that generate the same canonical visual matrix, which means there are more unbreakable ties while computing the result. Generating different permutations is expensive. Moreover, when a symmetric graph is large, the level structure (BFS tree structure) gets deeper, and breaking the various tie situations becomes more costly.
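To make the stability test above concrete, here is a minimal sketch; `canonicalize` is a stand-in for the full reordering procedure of chapter 3, not an existing routine:

    import random
    import numpy as np

    def stable_under_shuffling(A, canonicalize, trials=20):
        # Shuffle the initial vertex order repeatedly and check that the
        # resulting *visual* matrix never changes.
        n = len(A)
        reference = None
        for _ in range(trials):
            perm = list(range(n))
            random.shuffle(perm)
            shuffled = A[np.ix_(perm, perm)]          # random initial order
            order = canonicalize(shuffled)            # canonical ordering
            visual = shuffled[np.ix_(order, order)]   # visual matrix
            if reference is None:
                reference = visual
            elif not np.array_equal(visual, reference):
                return False
        return True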

   


Although a clique is also a symmetric graph, its special properties make it one of the easiest graph structures for our procedure. As we discussed in chapter 3, this is because every pair of nodes is indistinguishable from each other; we do not need to check different combinations, since we already know they will not change the visual representation of the matrix. As Figure 18 in chapter 4 showed, a clique's adjacency matrix is always a fully filled square except for the diagonal.

Figure 59 shows a 3-regular graph (Definition 42) of size 16 and three of its vertex orderings. This is a symmetric graph, and every node in this graph is a qualified starting node. The result verifies this and also shows that although the actual orderings differ, their visual representations stay the same (see Figure 59 (b)-(d)). For each starting node there are 6 undecidable cases, and each of these 6 cases has 2 options, so we have to test 12 different permutations per starting node. There are 16 qualified starting nodes, so the total number of permutations for the whole graph's computation is 192. Compared to 16!, 192 is a very small number.

   


 


Figure 59: A 3-regular graph of size 16 with its node-link representation and three of its vertex orderings. For this special graph, every node is a peripheral node, and all of them are qualified starting nodes. For each starting node there are 6 unbreakable ties, so 12 different permutations in total. (b)-(d) are three different vertex orderings; the visual representations stay the same.

While computing the canonical visual matrices, we also record some important parameters that can be used to assess our procedure. Very interestingly, the parameters generated for the synthetic graphs and for the IMDB database suggest that for random graphs, most tie situations can be broken by differences between two nodes' pre-total degree and post-total degree. Here, the pre-total degree refers to the degree summation of the nodes that appear before the two same-degree nodes; a tie that can be broken by this criterion means the two nodes are connected to at least one pair of different nodes that appear before them. The post-total degree is the degree summation of the vertices that appear after them (see section 3.3 for details). Compared with a symmetric graph, a random graph has far fewer unbreakable ties. Moreover, for random graphs, the qualified starting nodes are also very limited. Table 1 shows some parameters recorded while computing the canonical visual matrix.

Table 1: Several computation factors of applying our procedure to several synthetic graphs. If there is more than one qualified starting node, we show the average over the multiple cases. In the "running time" column, numbers in parentheses represent the time taken to find the candidate starting nodes. The synthetic graph data were generated with the Barabasi Graph Generator (http://www.cs.ucr.edu/~ddreier/barabasi.html) (Barabási and Albert 1999).

Graph Type | # nodes | # edges | indistinguishable node pairs | starting nodes | unbreakable ties | broken by pre-degree | broken by ancestor ordering | broken by post-degree | permutations | running time (ms) | bandwidth
Barabasi   | 30      | 200     | 3 | 2 | 0 | 8  | 0  | 0 | 0 | 16 (16)   | 25
Barabasi   | 106     | 600     | 0 | 1 | 0 | 24 | 10 | 6 | 0 | 78 (78)   | 72
Barabasi   | 206     | 1200    | 0 | 1 | 0 | 94 | 49 | 4 | 0 | 562 (547) | 160
Random     | 30      | 200     | 0 | 7 | 0 | 5  | 2  | 9 | 0 | 16 (16)   | 21
Random     | 110     | 500     | 0 | 2 | 0 | 31 | 7  | 2 | 0 | 78 (78)   | 74
Random     | 210     | 2000    | 0 | 1 | 0 | 58 | 7  | 3 | 0 | 890 (875) | 161
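To make the degree-based tie-breaking concrete, here is one possible reading of the pre-/post-total-degree tests as code; the exact rules (including which node wins a comparison) are given in section 3.3, so the direction chosen below is an assumption:

    def break_tie(u, v, placed, degree, adj):
        # `placed` is the set of nodes already ordered; u and v have equal degree.
        def pre_total(w):   # degree sum of neighbours already placed
            return sum(degree[x] for x in adj[w] if x in placed)
        def post_total(w):  # degree sum of neighbours not yet placed
            return sum(degree[x] for x in adj[w] if x not in placed)
        if pre_total(u) != pre_total(v):
            return u if pre_total(u) < pre_total(v) else v   # assumed direction
        if post_total(u) != post_total(v):
            return u if post_total(u) < post_total(v) else v # assumed direction
        return None  # tie survives; fall through to the remaining tests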

   

6.2 Patterns

Figure 60 shows a graph's adjacency matrix before and after the reordering procedure. This is a dense graph, and we can clearly see clusters in the resulting canonical visual matrix.

   



Figure 60: A synthetic dense graph with 30 nodes and 300 edges. (a) Vertices are in random order. (b) The canonical visual matrix of (a) generated by our procedure. The bandwidth of this matrix is 25, the global penalty is 2268. We can clearly see the clustering effect.

 

 


Figure 61: A synthetic graph and its canonical visual matrix. Patterns are highlighted in colored boxes in both displays; each color represents a correspondence. The blue box represents a high-degree node (node 2); the purple box is a highly connected subgraph; nodes in the orange box are low-degree nodes. The last 4 nodes (18, 19, 20, 21) form a path pattern in this matrix. (a) This graph has three clear patterns: two high-degree nodes (node 2 and node 6), a highly connected subgraph (in the middle), and several low-degree nodes (at the bottom). (b) The canonical visual matrix of this graph. There are several patterns in this matrix, and they correspond to the patterns in the node-link representation.

 

   


We want to test how well the canonical visual matrix can reveal interesting graph structures, so we created a graph with the following known patterns: several highly connected nodes, a highly connected subgraph, and some low-degree nodes (see Figure 61 (a)). We then computed the canonical visual matrix of this graph. The resulting matrix also shows clear patterns and, very interestingly, these patterns correspond to the patterns in the node-link display. Figure 61 shows the node-link graph and its canonical visual matrix; patterns are highlighted in both displays using colored boxes, and their correspondences are indicated by both colors and arrows.

The patterns in Figure 61 are very clean, in the sense that each cluster has only one or at most two nodes with connections to nodes in another cluster. We then randomly added several edges between nodes in different clusters to test whether the canonical visual matrix of this newly edited graph could still reveal those patterns. The result is shown in Figure 62. We can still easily identify the highly connected subgraph and the high-degree node (node 2).

   



Figure 62: A modified version of the graph in Figure 61, with more edges connecting nodes in different clusters; the graph in Figure 61 is a subgraph of this one. Patterns in both displays are not as clear as the patterns in Figure 61. (a) The node-link display in its original layout as in Figure 61. We keep the nodes in their original positions so the two graphs can be compared easily; the trade-off is that there are more edge/node crossings. (b) The canonical visual matrix of the new graph in (a). We can still identify the highly connected subgraph (highlighted in the purple box) and the high-degree node (node 2, in the blue box).

 

We tested even further by adding more cross-cluster edges (edges that connect nodes in different "clusters"). The graph changes accordingly: if we reapply the same layout algorithm we used for the original graph, the node-link representation changes and it is hard to identify the original clusters. Figure 63 shows the graph after all modifications. The patterns in the canonical visual matrix also change. Similar to the node-link representation, it is hard to identify the original patterns, as they are immersed in new patterns.

   


                          

 

Figure 63: A further modified version of the structure in Figure 61. The node-link representation has been rearranged using the same circular layout as the graph in Figure 61; the display changes dramatically. The updated canonical visual matrix also changes, but we can still see the high-degree node (node 2) and the clique formed by nodes 10, 11, 12, and 13.

6.3 IMDB Movie Database

The Internet Movie Database (IMDB) is an online database that contains "a huge collection of movie/TV information". Each movie item records much information about itself, including who worked on it, where the film was shot, and when it was released (IMDB 2008). Figure 64 shows all movies released in 2006. There are 3953 nodes (movies) and 12957 edges in total; two movies are linked together if they share some crew members. There are 867 disconnected subgroups (subgraphs); each of them is a connected graph, and there are no connections between different groups. The largest subgraph contains 2964 movies (nodes) and 12396 connections (edges), while the majority of the remaining subgraphs contain only one node. As shown in Figure 64, the largest subgraph takes up most of the matrix display area. The matrices in Figure 64 are in their canonical forms: the largest one shows a "leaf" shape while the remaining 866 subgraphs form the "petiole." There are some obvious clusters (some of them cliques) along the diagonal. The line overlapping the main diagonal indicates self-loops; self-loops are formed because someone has taken more than one role in the same movie, for example, as both director and writer.

For the movie data of year 2006, finding the canonical ordering for the first subgraph (the largest one) takes most of the computation time. Other years' movie datasets also showed similar patterns. Table 2 shows some computation parameters for the largest component of each year's graph. If there is more than one qualified starting node, the number shown is the average over all cases. The running time relates to the size of the graph (both the number of nodes and the number of edges) and also to the structure of the graph. As we can see, the graph for 2006 is denser than the graph for 2004, and the running time increased dramatically. Another interesting fact is that finding the candidate starting nodes takes most of the computation time. The times shown here reflect single-threaded computation; the step of identifying the candidate starting nodes can be implemented with multithreading, which should speed up the whole computation (a minimal sketch follows).
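A minimal sketch of the parallel starting-node step; `evaluate` is a stand-in for the single-threaded scoring of one candidate (building its BFS level structure and returning its figures of merit), not an existing routine:

    from concurrent.futures import ProcessPoolExecutor

    def find_starting_nodes_parallel(candidates, evaluate, workers=4):
        # Score the candidate starting nodes in parallel, keep the best ones.
        with ProcessPoolExecutor(max_workers=workers) as pool:
            scores = list(pool.map(evaluate, candidates))
        best = min(scores)
        return [c for c, s in zip(candidates, scores) if s == best]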

   


 

   

Table 2: Some computation factors for the IMDB movie dataset for different years. The "running time" column shows both the total running time and, in parentheses, the time used to find the candidate starting nodes.

Year | # nodes | # edges | indistinguishable nodes (pairs) | starting nodes | broken by pre-degree | broken by ancestor ordering | broken by post-total degree | broken by post-post degree | permutations | running time (sec.) | bandwidth
2000 | 1208    | 4007    | 4  | 1 | 213 | 46  | 39  | 0  | 142  | 61 (58)     | 516
2001 | 1340    | 4068    | 4  | 1 | 174 | 70  | 69  | 0  | 192  | 79 (75)     | 540
2002 | 1543    | 4429    | 0  | 1 | 232 | 68  | 104 | 0  | 171  | 111 (108)   | 607
2003 | 1626    | 3915    | 3  | 1 | 294 | 100 | 155 | 4  | 469  | 115 (110)   | 520
2004 | 2306    | 6480    | 11 | 2 | 378 | 131 | 316 | 28 | 1103 | 422 (373)   | 865
2006 | 2964    | 12396   | 0  | 1 | 463 | 183 | 74  | 0  | 295  | 1196 (1191) | 1388



Figure 64: Movies released in 2006. The whole graph contains 3953 nodes (movies) and 12957 edges. There are 867 connected subgraphs in total; each subgraph is a connected graph and there are no connections between different subgraphs. The largest subgraph's matrix is shown in the upper left portion of the matrix visualization area. It takes up most of the display space and its canonical matrix looks like a "leaf"; each of the remaining subgraphs contains fewer than 10 nodes on average, and together they form a "line" along the main diagonal. In the first subgraph, the upper left area has slightly more connections than the lower right part.

   



Figure 65: A cluster is highlighted in red in the matrix representation in (a); its corresponding node-link subgraphs are shown in (b). The 4 clusters are labeled 1-4. Node labels are movie names; the labels are very small here to fit the node boundary. If a user wants to read a node's label, he/she can probe that node and the label text will be shown in a tooltip.

   


Figure 65 shows a zoomed-in screenshot of part of the matrix in Figure 64. Clusters along the main diagonal can be identified easily; they represent relatively highly connected subgraphs. Figure 65 (b) shows the subgraph view of one cluster (colored in red) in Figure 65 (a). There are 4 sub-clusters in this cluster; they are either cliques (sub-clusters 3 and 4) or structures close to a clique. If we look at this data closely, it is not hard to find that the movies in each cluster are connected because of one person. For example, most of the connections in cluster 1 are because of an actor called Anglin Chriss; those in cluster 2 are because of Jed Rowen; those in cluster 3 are because of Teresa Berkin; and those in cluster 4 are because of Tyson Richard.

Besides the above clusters, Figure 65 also shows a clique that lies along the diagonal and comes after the cluster we just studied. When we examine it more closely, as we can see in Figure 66, several "bars" appear next to this clique. Figure 67 is a node-link representation of this clique. Similar to the above clusters, these 14 movies (nodes) are connected to each other because of Larry Laverty.

   


Figure 66: A clique and several bars next to the clique appear in the canonical matrix ordering of subgraph 1.

   

   


Figure 67: A node-link visualization of the clique in Figure 65. All these movies are connected because of Larry Laverty, who participated in all 14 of these movies in the year 2006.

The outline of all edges in the final canonical visual matrix is formed by horizontal and vertical bars (see Figure 68). These bars consist of all the tree edges in the corresponding BFS spanning tree. This is consistent with Theorem 1 (proved in section 3.1): "if an ordering is derived from a BFS tree, then the bandwidth of this ordering is determined by one of the tree edges in this BFS tree."
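Theorem 1 is easy to check on any concrete ordering; the bandwidth computation itself is a few lines (our illustration):

    def bandwidth(order, edges):
        # Largest index distance spanned by any edge under `order`;
        # for a BFS-derived ordering this maximum is attained by a
        # tree edge of the BFS spanning tree (Theorem 1).
        pos = {v: i for i, v in enumerate(order)}
        return max(abs(pos[u] - pos[v]) for (u, v) in edges)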

   


Figure 68: Horizontal and vertical bars form a wrap around all edges. These bars are formed by the tree edges of the corresponding BFS tree.

 

6.4 Conclusion

The results of our testing support our analysis in chapter 3. Our canonical visual matrix depends only on the input graph's topological information. The canonical matrix visual representation also tends to have a small bandwidth, with some clusters along the diagonal. These clusters represent highly connected subgraph structures, as shown in the year 2006 movie dataset.

The synthetic datasets as well as the 2006 movie dataset show that three of our designed criteria can break most of the tie situations, which suggests that in most cases we do not need to go through the more complicated tests to find our canonical visual matrix. It also suggests that for a random graph, getting a canonical visual representation is not very expensive.

The canonical visual matrix for the 2006 movie dataset shows a clear outline formed by horizontal and vertical bars; similar patterns also showed up in the other years' datasets. These are formed by the tree edges of the corresponding BFS spanning tree.

   


7 Conclusion

Graph visualization and exploration is an interesting task that has been studied intensively in recent years. Graph drawing research has produced a set of wonderful graph layout strategies, but it is still very difficult to totally avoid edge crossings and node overlaps, especially when a graph is large. Graph matrix visualization does not have these problems by nature, but how to order the vertices to generate a meaningful visual matrix is still an unsolved problem. Many vertex reordering algorithms were not designed for matrix visualization; although some of them could be used for visualization, we cannot take them just as they are. Even if we use a vertex reordering algorithm that was designed for matrix visualization (such as a clustering method), we still need to be aware that none of the existing vertex reordering algorithms generates a stable matrix; that is, for the same input graph the output is not necessarily the same. Many factors can affect the output matrix, such as the starting node, the initial node ordering, and whether the algorithm has strategies for breaking ties.

The idea of using a graph canonical form for comparing graphs has been adopted widely in the graph mining community. There are several different ways to generate a graph's canonical form. One of the limitations is that they can only be applied to attributed graphs; that is, they need the node and edge labels to construct their graph canonical forms. This can be confusing, since two structurally identical graphs with different node and edge labels can have different canonical forms.

This research introduces a new method that can produce a canonical visual adjacency matrix. This method is based on the Cuthill-McKee algorithm (section 2.6.1); it uses the properties of a graph and the Cuthill-McKee ordering (or, more generally, the BFS ordering) to select the candidate starting nodes and break ties. The resulting canonical visual adjacency matrix depends only on a graph's structure, nothing else.

7.1 Contributions

None of the existing vertex reordering algorithms can generate a stable visual matrix. We carefully examined current vertex reordering strategies, identified their advantages and disadvantages, and designed a procedure that achieves a canonical visual matrix.

Our procedure is based on the Cuthill-McKee algorithm, which was designed to reduce the bandwidth of a matrix. We chose this algorithm because we believe a smaller bandwidth brings edges closer to the main diagonal, which can help form interesting clusters that are useful for visual graph exploration.

Our procedure uses special properties of the BFS spanning tree ordering. We know that there are only three possibilities for how a node can be connected with other nodes in the same BFS spanning tree: with nodes on its own level, or with nodes on the level above or below. For nodes that are connected across different levels, there are special rules to apply. We formally analyzed these properties and designed a set of tests that break tie situations as much as possible. Meanwhile, whenever there is a tie situation, we always use the simplest yet efficient test first, and we keep all useful information on the fly to save computation time.

We proved that two graphs are topologically identical if and only if they share the same visual adjacency matrix produced by our procedure. Our canonical visual matrix depends only upon a graph's structure; factors such as the initial layout and graph labels have no effect on the final canonical visual matrix. We also point out that the final vertex orderings are not always the same, because of the existence of indistinguishable nodes, pseudo-indistinguishable nodes, and other similar nodes.

We studied the properties of our procedure and pointed out that its potential challenges come from symmetric graphs, since they have more unbreakable ties and we have to try every possibility that could occur in order to guarantee a unique visual matrix. We examined several small symmetric graphs; they do have unbreakable ties, but our procedure efficiently reduced the number of possibilities.

Some existing algorithms, such as the minimum degree ordering for bandwidth reduction, need to alter the graph structure (adding and removing edges) in their procedure to get the result. Our algorithm only uses graph properties and the consequences of the BFS ordering to order the nodes in a graph.
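The level property used above can be stated in a few lines of code; this is an illustration, not the prototype's implementation:

    from collections import deque

    def bfs_levels(adj, root):
        # Assign BFS levels from `root`; in an undirected graph every
        # edge then joins nodes on the same level or on adjacent levels,
        # which is exactly the three-case property our rules rely on.
        level = {root: 0}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in level:
                    level[w] = level[u] + 1
                    queue.append(w)
        return level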

   

 134   

Chapter 4 gives several sample patterns that can appear in the matrix representations. We want to show that patterns in the matrix correspond to interesting graph structures.

We also built a prototype application to test the correctness of our theory and used it to test our canonical visual matrix's potential utility in visual graph exploration. The results of our tests confirm our theory in all aspects. We also collected several important computational parameters to show the performance of our algorithm. They suggest that for random graphs, the three simplest tests in our procedure break most of the ties. The result for the year 2006 IMDB dataset shows the potential of our canonical visual matrix for assisting visual graph exploration. Several clusters emerged along the main diagonal; we showed that they are relatively highly connected subgraph structures. Very interestingly, each cluster is caused by a similar reason: the nodes (movies) are connected because one person worked on most of the movies in that cluster. A clear outline in the canonical visual matrix is formed by many horizontal and vertical bars; the edges represented by this outline are tree edges of the corresponding BFS spanning tree. Our prototype application also provides tools to highlight this information.

7.2 Future Work

Although symmetric graph structures can bring the worst-case scenario to our procedure, they can also be useful in another way. Our current vertex reordering strategy does not take advantage of symmetric structure. General symmetry testing is a hard problem, but some simple cases might not be difficult to identify. This information could help reduce the overall computational cost by eliminating unnecessary permutations.

We would like to further evaluate the utility of our application, especially in supporting graph mining procedures. We would also like to study its performance on more graph models and more real graph datasets. User studies could also help us design a better application.

Several properties of our reordering result might help compare the similarities between different graphs; these include the skeleton, the bandwidth, and the clusters. We would like to study their potential and their meaning for a graph. The BFS spanning tree we use to reorder nodes implies a hierarchical level on all nodes in a given graph; how it can be used beyond node ordering also needs more study.

Our current reordering procedure runs in a single thread. The results in chapter 6 show that the first step of our procedure (identifying the candidate starting nodes) takes a large portion of the total running time. This step can be implemented using multiple threads. Also, as we mentioned in chapter 3, if there are multiple qualified starting nodes, or whenever we need to try different permutations, the procedure can be separated into multiple processes, which could boost the performance of the whole application.

   


Our current application uses XML as its input and output file format. The graph drawing community recommends using GraphML instead. We would like to change our file loader and writer functions to read and write standard GraphML files so that our application and results can be easily shared with others (a minimal sketch follows at the end of this section).

Our algorithm is based on the Cuthill-McKee algorithm and its improvement, the GPS algorithm. Both of these algorithms try to find a node ordering that has a small bandwidth. Although finding a smaller bandwidth is not our goal, we believe our algorithm might still produce a matrix with a small bandwidth. How the bandwidth of our matrix compares to that of other methods is an interesting question.

Besides BFS-based vertex ordering algorithms, there are other vertex reordering algorithms. As Mueller (Mueller, Martin et al. 2007) mentioned, the DFS spanning tree-based vertex ordering might also be good for graph visualization and exploration. We would like to explore its potential in graph visualization as well.
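Returning to the GraphML point above, one possible migration path is to lean on an existing reader such as the networkx library (our suggestion, not a committed design):

    import networkx as nx

    def load_adjacency_from_graphml(path):
        # Read a standard GraphML file and hand back an adjacency
        # matrix ready for the reordering procedure.
        G = nx.read_graphml(path)
        return nx.to_numpy_array(G)

    def save_graphml(G, path):
        nx.write_graphml(G, path)  # share results in a standard format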

   


 

8 Literature Cited

Abelson, D., S.-H. Hong, et al. (2001). A Group Theoretic Method for Drawing Graphs Symmetrically. The 10th International Symposium on Graph Drawing, LNCS 2528: 86-97.
Agrawal, R. and R. Srikant (1994). Fast Algorithms for Mining Association Rules. 20th International Conference on Very Large Data Bases. J. B. Bocca, M. Jarke and C. Zaniolo, Morgan Kaufmann Publishers Inc.: 487-499.
Babai, L. (1981). Moderately Exponential Bound for Graph Isomorphism. Proceedings of the Fundamentals of Computing Science, LNCS 117: 34-50.
Babai, L., P. Erdos, et al. (1980). "Random Graph Isomorphism." SIAM Journal on Computing 9(3): 628-635.
Barabási, A.-L. and R. Albert (1999). "Emergence of scaling in random networks." Science 286(5439): 509-512.
Battista, G. D., P. Eades, et al. (1998). Graph Drawing: Algorithms for the Visualization of Graphs. Upper Saddle River, NJ, USA, Prentice Hall.
Becker, R. A. and W. S. Cleveland (1987). "Brushing scatterplots." Technometrics 29(2): 127-142.
Bederson, B. B., B. Shneiderman, et al. (2002). "Ordered and quantum treemaps: Making effective use of 2D space to display hierarchies." ACM Trans. Graph. 21(4): 833-854.
Bertin, J. (1984). Semiology of Graphics: Diagrams, Networks, Maps, The University of Wisconsin Press.
Bhowmick, S. and P. Hovland (2006). A Polynomial Time Algorithm for the Detection of Axial Symmetry in Directed Acyclic Graphs. ANL/MCS-P1314-0106, Argonne National Laboratory, Illinois.
Biggs, N. L., E. K. Lloyd, et al. (1976). Graph Theory 1736-1936. Oxford, England, Oxford University Press.
Biggs, N. L., E. K. Lloyd, et al. (1986). Graph Theory 1736-1936, Oxford University Press, USA.

   

Borgelt, C. (2005). On Canonical Forms for Frequent Graph Mining. Workshop on Mining Graphs, Trees, and Sequences, Porto, Portugal: 1-12.
Brandes, U., M. Eiglsperger, et al. (2001). GraphML Progress Report: Structural Layer Proposal. Proceedings 9th International Symposium on Graph Drawing, Springer-Verlag.
Brandes, U., M. Eiglsperger, et al. "GraphML Primer." From http://graphml.graphdrawing.org/primer/graphml-primer.html.
Buchheim, C. and M. Junger (2001). Detecting Symmetries by Branch & Cut. Graph Drawing: 178-188.
Chao, C.-Y. (1971). "On the Classification of Symmetric Graphs with a Prime Number of Vertices." Transactions of the American Mathematical Society 158(1): 247-256.
Chartrand, G. (1985). The Königsberg Bridge Problem: An Introduction to Eulerian Graphs. Introductory Graph Theory. New York, Dover Publications.
Chen, H.-L., H.-I. Lu, et al. (2000). On Maximum Symmetric Subgraphs. The 8th International Symposium on Graph Drawing, Springer-Verlag.
Coble, J., R. Rathi, et al. (2005). "Iterative Structure Discovery in Graph-Based Data." International Journal on Artificial Intelligence Tools 14(1-2): 101-124.
Cook, D. J. and L. B. Holder (2000). "Graph-Based Data Mining." IEEE Intelligent Systems 15(2): 32-41.
Cormen, T. H., C. E. Leiserson, et al. (1990). Introduction to Algorithms, The MIT Press.
Corneil, D. G. and D. G. Kirkpatrick (1980). "A Theoretical Analysis of Various Heuristics for the Graph Isomorphism Problem." SIAM Journal on Computing 9(2): 281-297.
Corporation, T. S. S. (2003). Graph Editor Toolkit for Java Developer's Guide, Tom Sawyer Software Corporation.
Cuthill, E. and J. McKee (1969). Reducing the Bandwidth of Sparse Symmetric Matrices. Proceedings of the 1969 24th National Conference, ACM: 157-172.
Doğrusöz, U., B. Madden, et al. (1996). Circular Layout in the Graph Layout Toolkit, Springer.
Eiglsperger, M., S. P. Fekete, et al. (2001). Orthogonal Graph Drawing. Drawing Graphs: Methods and Models. M. Kaufmann, J. Hartmanis, G. Goos, J. V. Leeuwen and D. Wagner, Springer-Verlag New York, LLC. 2025: 121-171.

   

Euler, L. (1736). "Solutio problematis ad geometriam situs pertinentis." Comment. Acad. Sci. U. Petrop 8: 128-140.
Yan, X. and J. Han (2002). gSpan: Graph-Based Substructure Pattern Mining. IEEE Intl. Conf. on Data Mining (ICDM), Maebashi City, Japan: 721-723.
Fortin, S. (1996). The Graph Isomorphism Problem. Technical Report TR 96-20. Edmonton, Alberta, Canada, University of Alberta.
Foster, R. M. (1932). "Geometrical Circuits of Electrical Networks." Transactions of the American Institute of Electrical Engineers 51: 309-317.
Fraysseix, H. D. (1999). An heuristic for graph symmetry detection. The 7th International Symposium on Graph Drawing, LNCS 1731: 276-285.
Frishman, Y. and A. Tal (2004). Dynamic Drawing of Clustered Graphs. IEEE Symposium on Information Visualization, Austin, TX, IEEE Computer Society.
Gansner, E., E. Koutsofios, et al. (2006). Drawing graphs with dot: 40.
Gansner, E. R., E. Koutsofios, et al. (1993). "A Technique for Drawing Directed Graphs." IEEE Transactions on Software Engineering 19(3): 214-230.
Gansner, E. R. and S. C. North (2000). "An open graph visualization system and its applications to software engineering." Software—Practice & Experience 30(11): 1203-1233.
Garey, M. R. and D. S. Johnson (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman.
Garg, A. and R. Tamassia (1994). "On the Computational Complexity of Upward and Rectilinear Planarity Testing." SIAM Journal on Computing 31(2): 601-625.
Garg, A. and R. Tamassia (2002). "On the Computational Complexity of Upward and Rectilinear Planarity Testing." SIAM Journal on Computing 31(2): 601-625.
Gee, A. G. (2004). A Universal Visualization Platform. Dept. of Computer Science, Lowell, University of Massachusetts Lowell. Sc.D.: 201.
George, A. and J. W.-H. Liu (1981). Computer Solution of Large Sparse Positive Definite Systems, Prentice Hall Professional Technical Reference.
George, A. and J. W. H. Liu (1978). "Algorithms for matrix partitioning and the numerical solution of finite element systems." SIAM Journal on Numerical Analysis 15(2): 297-327.

   

George, A. and J. W. H. Liu (1978). "An automatic nested dissection algorithm for irregular finite element problems." SIAM Journal on Numerical Analysis 15(5): 1053-1069.
George, A. and J. W. H. Liu (1979). "An Implementation of a Pseudoperipheral Node Finder." ACM Transactions on Mathematical Software 5(3): 284-295.
Ghoniem, M., J. Fekete, et al. (2004). A Comparison of the Readability of Graphs Using Node-Link and Matrix Representations. IEEE Symposium on Information Visualization 2004, Austin, TX.
Gibbs, N. E., W. G. Poole, et al. (1976). "An algorithm for reducing the bandwidth and profile of a sparse matrix." SIAM Journal on Numerical Analysis 13(2): 236-250.
Gilbert, J. R., C. Moler, et al. (1992). "Sparse Matrices in MATLAB: Design and Implementation." SIAM Journal on Matrix Analysis and Applications 13(1): 333-356.
Giuşcă, B. (2005). Seven Bridges of Königsberg. Konigsberg_bridges.png. 302 x 238.
GraphML Team. (2007). "The GraphML File Format." From http://graphml.graphdrawing.org/.
Gross, J. L. and J. Yellen, Eds. (2004). Handbook of Graph Theory, CRC Press.
Grunwald, P. (2005). A tutorial introduction to the minimum description length principle. Advances in Minimum Description Length: Theory and Applications. P. Grunwald, I. J. Myung and M. Pitt. Cambridge, Massachusetts, MIT Press.
Crane, H. L., Jr., N. E. Gibbs, et al. (1976). "Algorithm 508: Matrix Bandwidth and Profile Reduction [F1]." ACM Transactions on Mathematical Software (TOMS) 2(4): 375-377.
Harary, F. (1999). Graph Theory, Westview Press.
Heggernes, P., S. C. Eisenstat, et al. (2001). The computational complexity of the Minimum Degree algorithm. 14th Norwegian Computer Science Conference, University of Tromsø, Norway.
Henry, N. and J.-D. Fekete (2006). "MatrixExplorer: a Dual-Representation System to Explore Social Networks." IEEE Transactions on Visualization and Computer Graphics 12(5): 677-684.
Henry, N., J.-D. Fekete, et al. (2007). "NodeTrix: a Hybrid Visualization of Social Networks." IEEE Transactions on Visualization and Computer Graphics 13(6): 1302-1309.
Holder, L. B., D. J. Cook, et al. (1994). Substructure Discovery in the SUBDUE System. The Workshop on Knowledge Discovery in Databases.
Hong, S.-H. and P. Eades (2003). Symmetric Layout of Disconnected Graphs. Algorithms and Computation, Springer Berlin / Heidelberg. 2906/2003: 405-414.

   

Huan, J., W. Wang, et al. (2003). Efficient Mining of Frequent Subgraphs in the Presence of Isomorphism. The 3rd IEEE International Conference on Data Mining, Melbourne, Florida, USA, IEEE Computer Society: 549.
Hubbard, J. R. (2000). Schaum's Outline of Data Structures with Java, McGraw-Hill.
IMDB (2008). The Internet Movie Database (IMDB), IMDB.com, Inc.
Ingram, S. (2006). Minimum Degree Reordering Algorithms: A Tutorial.
Inokuchi, A., T. Washio, et al. (2000). An Apriori-based Algorithm for Mining Frequent Substructures from Graph Data. The 4th European Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD), Lyon, France: 13-23.
Ketkar, N. S., L. B. Holder, et al. (2005). Subdue: Compression-Based Frequent Pattern Discovery in Graph Data. The 1st International Workshop on Open Source Data Mining: Frequent Pattern Mining Implementations, Chicago, Illinois, ACM Press: 71-76.
Krempel, L. (2003). "Structures of World Trade." From http://www.mpi-fg-koeln.mpg.de/~lk/netvis/trade/WorldTrade.html.
Kuramochi, M. and G. Karypis (2001). Frequent Subgraph Discovery. 1st IEEE Conference on Data Mining, San Jose, California, USA: 313-320.
Liu, J. W. H. (1985). "Modification of the minimum-degree algorithm by multiple elimination." ACM Transactions on Mathematical Software (TOMS) 11(2): 141-153.
Mäkinen, E. (1988). "On circular layouts." International Journal of Computer Mathematics: 29-37.
Manning, J. (1990). Geometric Symmetry in Graphs. Department of Computer Science, Purdue University. Ph.D.
Manning, J. and M. Atallah (1988). "Fast Detection and Display of Symmetry in Trees." Congressus Numerantium 64: 159-169.
Manning, J., M. Atallah, et al. (1995). "A System for Drawing Graphs with Geometric Symmetry." International Symposium on Graph Drawing, LNCS 894: 262-265.
Martí, R., M. Laguna, et al. (2001). "Reducing the Bandwidth of a Sparse Matrix with Tabu Search." European Journal of Operational Research 135(2): 450-459.

   

Masuda, S., T. Kashiwabara, et al. (1987). On the NP-completeness of a Computer Network Layout Problem. 1987 IEEE Intl. Symp. on Circuits and Systems, Los Alamitos.
McKay, B. D. (1981). "Practical Graph Isomorphism." Congressus Numerantium 30: 45-87.
McKay, B. D. (2007). Nauty User's Guide (version 2.4), Computer Science Department, Australian National University.
Miyazaki, T. (1997). "The Complexity of McKay's Canonical Labeling Algorithm." DIMACS Series in Discrete Mathematics and Theoretical Computer Science 28: 239-256.
Mueller, C., B. Martin, et al. (2007). "A comparison of vertex ordering algorithms for large graph visualization." 2007 6th International Asia-Pacific Symposium on Visualization (APVIS '07): 141-148.
Munson, T. S. (2004). Mesh shape-quality optimization using the inverse mean-ratio metric. ANL/MCS-P1136-0304, Argonne National Laboratory, Illinois.
Munzner, T. (2000). Interactive Visualization of Large Graphs and Networks. Computer Science, Stanford University. Ph.D.
North, S. C. (1996). Incremental layout in DynaDAG. Graph Drawing, Springer Berlin / Heidelberg. 1027/1996.
North, S. C. and G. Woodhull (2001). Online Hierarchical Graph Drawing. Vienna, Austria, Springer-Verlag, Berlin Heidelberg.
Press, W. H., B. P. Flannery, et al. (1992). Sparse Linear Systems. Numerical Recipes in FORTRAN: The Art of Scientific Computing, Cambridge University Press: 63-82.
Read, R. and D. Corneil (1977). "The Graph Isomorphism Disease." Journal of Graph Theory 1: 339-363.
Robertson, M., J. D. Mackinlay, et al. (1991). Cone Trees: Animated 3D Visualizations of Hierarchical Information. Conference on Human Factors in Computing Systems (CHI '91), ACM Press: 189-194.
Six, J. M. and I. G. Tollis (1999). Circular Drawings of Biconnected Graphs, Springer Berlin / Heidelberg.
Skiena, S. (1990). Graph Isomorphism. Implementing Discrete Mathematics: Combinatorics and Graph Theory with Mathematica. Reading, MA, Addison-Wesley: 181-187.

   

SmartMoney.com. (2007). "Map of the Market." Retrieved Nov. 26, 2007, from http://www.smartmoney.com/marketmap/.
Smyth, W. F. and I. Arany (1976). Another algorithm for reducing bandwidth and profile of a sparse matrix. AFIPS 1976 NCC. Montvale, N.J., AFIPS Press: 987-994.
Ullman, J. D. (1976). "An algorithm for subgraph isomorphism." Journal of the ACM 23(1): 31-42.
Wijk, J. J. v. and H. v. d. Wetering (1999). Cushion Treemaps: Visualization of Hierarchical Information. IEEE Symposium on Information Visualization (INFOVIS '99), San Francisco.
Yan, X. and J. Han (2003). CloseGraph: mining closed frequent graph patterns. The 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, D.C., ACM Press: 286-295.

   


 

9 Biographical Sketch of Author

Hongli Li received her Bachelor of Engineering degree in Mechanical Engineering with high honors from Xi'an University of Technology, China, in 1995. She received a Master of Engineering degree in Biomedical Engineering from Xi'an Jiaotong University, China, in 1999. Hongli came to Massachusetts in 2000 and became a member of the Institute for Visualization and Perception Research, where she has been under the direction of Professor Georges G. Grinstein since 2001. During her studies she participated in a variety of projects, which gave her opportunities to work on different problems in different areas. She worked as a part-time data miner at AnVil, Inc. from 2001 to 2002, where she worked on several projects in the area of bioinformatics. She also worked as a summer intern at Merck & Co., Inc. (West Point, PA) in 2005, where she helped build a knowledge database that allows scientists to compare their studies with published literature and existing studies. Her research interests include information visualization, data mining, graph theory, databases, user interfaces, and bioinformatics.

 

   
