Multivariate Graph Drawing using Parallel Coordinate Visualisations
Ross Shannon, Thomas Holland and Aaron Quigley
University College Dublin
Technical Report UCD-CSI-2008-06, September 2008
Abstract. Graph drawing is increasingly concerned with the embedding and drawing of multivariate or highly attributed graphs. The direct application of classical layout methods is difficult because space and encoding options are limited, and the difficulty grows with the number of attributes (dimensions) attached to each node. Data from domains including bioinformatics (metabolic networks, protein-protein interaction) and social science (social networks, phone-call networks, disease transmission networks) consist of relational data whose items also possess a large number of individual attributes. Here we present a visualisation method that couples a graph drawing with an adapted parallel coordinates visualisation. This technique makes the relations between multivariate data cases explicit, while preserving the expressiveness of existing techniques. These layout methods are implemented in an interactive Java-based visualisation tool. We demonstrate the technique by applying it to interactive visual data analysis of a social network data set.
1
Introduction
Ongoing research has shown there remain classes of graphs (such as social and biological networks) for which even the fastest layout algorithms leave room for improvement due to interaction, topological and hidden attribute issues [9,12]. In parallel to this, the visualisation of large and complex multivariate data sets is becoming increasingly necessary to give people insight into large numbers of data cases with many dimensions of interest [2]. Interactive visualisation tools help the viewer perform visual data analysis tasks: exploring trends and identifying outliers in the data set, highlighting and filtering sections of the view, and finding interesting relations and patterns.
In contrast to the well-studied problems of large and dynamic graph layout, here we focus on the problems of the layout and drawing of multivariate graphs, where each node is associated with several attributes of information [18]. With simple graph models the nodes and edges can be drawn with node-link, matrix or even hybrid approaches [10]. Attributes of each node, such as age, gender, height, weight or number of friends in a social network, can be encoded as visual attributes within the graph drawing. However, as the number of dimensions increases, the number of visual attributes which can be reliably used to encode each dimension falls off.
Parallel Coordinates, as shown in Figure 1, are a statistical visual analytics technique which facilitates the plotting and investigation of large data sets with high dimensionality [13]. The technique is useful for finding correlations among cases with an arbitrary number of attributes, as many data dimensions can be encoded on successive parallel axes, and dimensions can easily be added to or removed from an existing diagram. Traits and non-obvious patterns of similarity between data cases can be clearly seen [4], and interactions such as selection, brushing and range queries allow a user to control which aspects of the data are displayed.
Fig. 1. A traditional Parallel Coordinates drawing, showing 191 cases over 8 data dimensions (1528 plotted values) from a server error log.
Parallel Coordinate Visualisations (PCVs) visually represent the distribution of values for the attributes of a set of cases (e.g., characteristics of a group of people, including their height, weight, age, salary). PCVs excel at visually clustering cases, as entities which share similar attribute values across a number of ordinal or quantitative dimensions can be identified through the distribution of case lines within the visualisation. The user can see the full range of the data’s many dimensions and is not required to choose a subset of representative dimensions or to project down into a smaller number of dimensions using methods such as principal components analysis, multi-dimensional scaling or locally linear embedding. However, a clear limitation of PCVs is that any relationships between individual data cases are not represented in the final view. For example, any familial relationships between the group of people being visualised would be lost in this view.
In contrast, graph drawings not only display the data cases (nodes) but also the relationships (edges) between entities in a network such as proteins in a protein-protein interaction network [6]. Nodes within graphs can represent
a variety of entities, from routers to software components [14] to people, and the edges can represent a range of relationships between them. Graphs facilitate analyses of the neighbourhoods of individual nodes, and graph-theoretic measures allow viewers to analyse degree distributions and patterns of connections throughout the network. Properties of the inter-node relationships can be shown through drawing techniques such as stroke width and line style. However, encoding properties of the nodes themselves (by using size, colour, shape and so on) is more difficult due to limited available space; visualisation designers quickly find that they run out of visual channels with which to encode information. Some of the information must therefore be excluded from the representation. Due to this shortage of encoding options, attempting to add extra data dimensions to traditional node-link graphs is a significant challenge.
Here we present a method of pairing Parallel Coordinates with graph drawings, a hybrid approach for multivariate graph drawing which is capable of representing relationships between cases in data sets. We present to the viewer an adapted PCV tightly coupled with a graph layout, which can simultaneously show pairwise relationships between nodes while retaining the multivariate expression of a traditional PCV. Additional visual cues have been added to the PCV to convey the relational information, and interactions carried out on one view are reflected in both.
Section 2 describes related work on combining multiple visualisation techniques into a single layout. In Section 3 we present the specifics of the technique itself and detail the implementation. Section 4 contains some applications of the technique, and Section 5 discusses some of the tasks that this approach makes easier to accomplish. In Section 6 we present some conclusions based on this work.
2
Background
The interactive, visual representation of the nodes and edges from abstract relational data is the key research challenge for much of graph drawing [11]. However, the data arising from various domains including bioinformatics (metabolic networks [3], protein-protein interaction [6]), social science (social networks, phone-call networks, airline routes, disease transmission networks) and ICT (computer networks, software calls [14], neural networks) brings with it many attributes of interest for each node. We refer to such graphs as multivariate graphs and to the problem of representing such graphs as multivariate graph drawing.
Classical node-link graph drawings typically use a range of visually distinguishable features to encode information about the nodes within. These features include colour (hue), size, shape, texture, orientation, curvature and so forth. However, there is clearly a limit after which no more dimensions of the entity can be visually represented in this manner. Whatever information is shown in the view will furthermore typically require the viewer to refer to a key to decode what each of the encoded properties means. This makes the visualisation less immediate and hampers understanding and knowledge acquisition. It is for this reason that we have paired the graph, which shows relationships among the entities, with an adapted Parallel Coordinates view, which can easily represent many dimensions for each entity, with interaction guiding the coupling of the views.
Parallel Coordinates are a standard tool which gives users a global view of trends in the data while allowing clusters and subsets to be viewed when necessary. They are a two-dimensional presentation method for multidimensional data. A set of n-dimensional tuples is drawn as a set of polylines bridging the gaps between parallel vertical axes. Each axis encodes values for one of the quantitative or ordinal properties that each tuple can have. As all the polylines are drawn in the same area, the technique has been shown to scale well to large data sets up to a certain point, presenting a compact view of the entire data set.
As Parallel Coordinates have a tendency to become crowded as the size of the data set grows, techniques have been designed to cluster or elide subsets of the data to allow the dominant patterns to be seen [1]. Hierarchical clustering [7] uses colour to visually divide cases that share a certain range of values into a number of sets, increasing the readability of the diagram. The technique does not, however, use hierarchical relationships in the data itself. Further techniques, such as polyline averaging [16], are less computationally demanding while offering similar benefits. Polyline averaging dynamically summarises a set of polylines, encouraging experimentation to discover additional information. By showing fewer lines, some of the data’s detail is lost, but this can be partly mitigated by displaying the data’s standard deviation atop the plot.
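The mapping from tuples to polylines, and the idea of summarising a group of polylines by an average line with a standard deviation, can be illustrated with a minimal Python sketch. This is not the implementation used in our tool, nor the exact method of [16]; the function names `to_polylines` and `summarise`, the unit-square drawing area and the per-axis min-max normalisation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def to_polylines(cases, width=1.0, height=1.0):
    """Map each n-dimensional tuple to a list of (x, y) vertices, one per axis.

    Assumes at least two dimensions and purely numeric values.
    """
    n_axes = len(cases[0])
    # Per-axis minimum and maximum, used to normalise values to [0, 1].
    lows = [min(c[i] for c in cases) for i in range(n_axes)]
    highs = [max(c[i] for c in cases) for i in range(n_axes)]
    # Axes are evenly spaced vertical lines across the drawing width.
    xs = [width * i / (n_axes - 1) for i in range(n_axes)]
    polylines = []
    for case in cases:
        vertices = []
        for i, value in enumerate(case):
            span = highs[i] - lows[i]
            # A constant dimension is drawn at mid-height to avoid dividing by zero.
            t = 0.5 if span == 0 else (value - lows[i]) / span
            vertices.append((xs[i], height * t))
        polylines.append(vertices)
    return polylines


def summarise(polylines):
    """Collapse a set of polylines into (x, mean y, standard deviation) per axis."""
    n_axes = len(polylines[0])
    summary = []
    for i in range(n_axes):
        ys = [p[i][1] for p in polylines]
        summary.append((polylines[0][i][0], mean(ys), pstdev(ys)))
    return summary
```

Calling `summarise` on the polylines of a selected subset yields one representative line plus a spread per axis, which is the information a standard deviation overlay would need.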
Interaction methods within graph drawing are well studied, with methods based on zooming, semantic zooming, focus+context, overview and detail, clustering, visual exploration, animation or node re-ordering employed [10]. PCV plots offer less freedom in some respects than classical interactive graph drawing, as the layout of attribute values within a dimension affects all the other values within that dimension [16]. However, any change to this (such as flipping the order of the axis) affects at most the display of the two neighbouring dimensions to the immediate left and right. In addition, PCV affords the opportunity to re-order the dimensions displayed to reduce edge crossings, support a particular user’s tasks or help explore or emphasise a particular pattern to be studied within the data [1].
Within a PCV the polylines may be tightly packed and could represent thousands of cases, so selecting individual cases is difficult. Performing a selection is generally accomplished by “brushing” a line across a swathe of the display, which then highlights all cases that the line intersects. More fine-grained access to the cases is generally not available, as the focus has historically been on statistical aggregation. The newly attached graph view affords a clear selection mechanism which allows individual and clustered selection of cases. We will return to this topic in Section 3.1, where we discuss the expanded interactivity options available in this presentation.
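To make the effect of axis flipping and re-ordering on line crossings concrete, the following Python sketch counts crossings between adjacent axes. It is an illustration only: the function names are hypothetical, values are assumed to be normalised to [0, 1], and the exhaustive search in `best_order` is only practical for a handful of dimensions. Two case lines cross between neighbouring axes exactly when their vertical order on the left axis is the reverse of their order on the right axis.

```python
from itertools import combinations, permutations


def crossings_between(cases, left, right):
    """Count pairs of case lines that cross between two adjacent axes."""
    count = 0
    for a, b in combinations(cases, 2):
        # Opposite vertical order on the two axes means the segments cross.
        if (a[left] - b[left]) * (a[right] - b[right]) < 0:
            count += 1
    return count


def total_crossings(cases, order):
    """Sum crossings over every pair of neighbouring axes in the given order."""
    return sum(
        crossings_between(cases, order[i], order[i + 1])
        for i in range(len(order) - 1)
    )


def flip_axis(cases, axis):
    """Return a copy of the cases with one axis inverted (values in [0, 1])."""
    return [
        tuple(1.0 - v if i == axis else v for i, v in enumerate(case))
        for case in cases
    ]


def best_order(cases):
    """Brute-force the axis order with the fewest crossings (small n only)."""
    n_axes = len(cases[0])
    return min(permutations(range(n_axes)), key=lambda o: total_crossings(cases, o))
```

Because `crossings_between` only reads the two axes it is given, flipping one axis can change the crossing count only in the gaps to its immediate left and right, which matches the locality property described above.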
Fig. 2. A view of the Paired Parallel Coordinates visualisation tool, consisting of a PCV enhanced with a corresponding node-link graph drawing, allowing selections in one view to be represented in the other. This is a view of a social network from Facebook, with 89 nodes, 434 edges and 10 data dimensions.
Existing research has demonstrated an increase in a user’s understanding of a data set by pairing two distinct and often conceptually different visualisation techniques in multiple coordinated views of data [10,8]. Siirtola described Parallel Coordinates combined with a reorderable matrix view of the data, and tested the limits of combining different techniques to improve comprehension [17]. The related user study reports an improvement in participants’ performance on knowledge acquisition tasks after an initial period of learning.
3
Multivariate Graph Drawing
To explore the design considerations for our layout method we began with a traditional Parallel Coordinates view and added a node-link drawing of a graph adjacent to it on the right. This arrangement can be seen in Figure 2. In our tool we use a standard force-directed graph layout algorithm [5]. When a new data set is loaded, the graph iteratively lays itself out according to the physical model, and any clusters present in the data are gradually revealed. Our visualisation tool is built using Processing [15], a Java-based visualisation framework which supports rapid prototyping of visualisation techniques.
Fig. 3. Here we can see a selection in the graph drawing being represented in the PCV.
Moving beyond this simple coupled view, we explored painting the polylines translucently in the PCV view. This makes the strongly correlated groupings more prominent, as the opacity of these clusters builds up where they group together on the display. Labels for each of the attributes are displayed below the axes, and values at various points along an axis can be exposed by hovering along its length.
Next, we explored the visual display of the relationships amongst the data by drawing additional lines, and then Bézier curves, between the polylines to attach them visually to each other. This obscured some of the underlying cases, and the amount of visual clutter and crossings introduced in the view made the adapted PCV more difficult to follow and interpret. We later designed an improved visual metaphor where edges in the graph drawing take the form of translucent silver shading between case lines in the PCV view using alpha-blending, as shown in Figure 3. It is important to note that these translucent areas accumulate, so that a cluster of nodes that share similar properties will appear brighter than related nodes that do not share many similarities. We will return to this point later when discussing our examples in Section 4. If the edges in the graph are weighted with a value between zero and one, we can apply this number to the opacity of the area shading, so that nodes that are strongly connected to each other in the graph are also strongly visually connected in the second view.
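The way overlapping translucent regions build up can be sketched numerically. The Python fragment below assumes standard source-over alpha blending, under which stacking layers with opacities a_1, ..., a_n gives a combined opacity of 1 - (1 - a_1)...(1 - a_n). The base opacity of 0.2 and the function names are assumptions for illustration and are not taken from our tool.

```python
def edge_alpha(weight, base_alpha=0.2):
    """Scale a (hypothetical) base shading opacity by an edge weight in [0, 1]."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("edge weight must lie between zero and one")
    return base_alpha * weight


def accumulated_alpha(alphas):
    """Combined opacity where several translucent regions overlap.

    Under source-over blending each layer lets through (1 - alpha) of what
    lies beneath it, so the transparencies multiply.
    """
    transparency = 1.0
    for alpha in alphas:
        transparency *= 1.0 - alpha
    return 1.0 - transparency
```

Under these assumptions, three overlapping regions at opacity 0.2 combine to roughly 0.49, while a single region stays at 0.2, which is why densely related clusters with similar attribute values stand out from sparsely related ones.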
3.1
Interaction between the views
Tightly coupled views have been shown to reduce the user’s cognitive load in understanding multiple facets of a complex data set [10,8]. Visually demarcating a selection in two or more views at once helps to focus the viewer’s attention on subsets of the data. Particularly when the new techniques offer additional views of aspects of the data that the original technique either ignored or was not ideal for showing, there is an opportunity for new insights not available from the original methods alone. By default we do not show any edges in the PCV view unless directed to by the viewer.
When the views are presented in parallel, selections in either view can be expressed in both simultaneously. Individual nodes can be clicked and highlighted in the graph drawing, which also selects their immediate neighbours. Further nodes can be selected, all of which will appear highlighted in the PCV, facilitating comparisons of these specific cases. This provides a more natural interaction with the PCV when accuracy is required, as the interaction is more precise than screen scrubbing or brushing the case lines in the diagram. As the user adds more nodes to their selection, case lines in the PCV that are not involved with the selected clusters of nodes progressively fade from view. This allows the user to focus more intently on the nodes in which they have expressed an interest through direct manipulation. Conversely, selections performed on the PCV affect the graph drawing, by fading out nodes that do not share properties with the selected cases. In this way, the user can take a task-oriented approach by leveraging the unique capabilities of the views in tandem, to perform more effective data exploration.
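The coupling from the graph view to the PCV described above can be sketched as follows. This Python fragment is an illustrative assumption rather than the code of our tool: clicking a node selects it together with its immediate neighbours, and every additional click fades the unselected case lines further. The fade factor of 0.5 and the minimum opacity of 0.05 are hypothetical values.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (node, node) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def select_with_neighbours(adjacency, clicked):
    """Each clicked node brings its immediate neighbours into the selection."""
    selected = set()
    for node in clicked:
        selected.add(node)
        selected |= adjacency.get(node, set())
    return selected


def case_line_alphas(nodes, selected, clicks, fade=0.5, floor=0.05):
    """Opacity per case line in the PCV.

    Selected cases stay fully opaque; all others fade by a constant factor
    for every clicked node, but never below a minimum so context remains.
    With no clicks, every line is fully opaque.
    """
    faded = max(floor, fade ** clicks)
    return {node: 1.0 if node in selected else faded for node in nodes}
```

The reverse direction, from a PCV selection to faded graph nodes, could reuse `case_line_alphas` with the roles of the two views exchanged.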
4
Examples
Consider the social graph of people affected by an infectious epidemic (both those infected and those with contact but asymptomatic). The graph may be composed of nodes representing individuals; the edges between nodes representing contact between individuals. Edges may be weighted, based on contact time and contact proximity; and directed (for parents and their children). Nodes may be annotated with the individual’s current condition, time passed since contraction, along with personal details (like their age, weight, etc.). There are many other potential dimensions of an individual which would be beneficial in a visualisation to aid in the identification of a pattern of infection and potential immunity of some individuals: places frequently visited, places recently visited, meals consumed, previous and existing medical conditions, current medication, vaccinations and so forth. The following case study involves an analysis of social networking data. Using the publicly available Facebook API, we mined information regarding one of the authors from the social networking site. This was composed of undirected edges indicating friendships between the available set of friends, giving us a graph,
which we anonymised by applying random two-character node labels. We refer to these as friends and friendships, but these relationships encompass a variety of relationships other than purely friendship (family members, partners, work colleagues, etc.); the level of friendship is also not represented. The author is also excluded from the graph since they are linked to all of the nodes.
The ten data dimensions for each node are extracted from the profile of each friend on the site. Of these 10 attributes, 4 are purely statistical in nature.
Numerical attributes: number of wall posts made on their profile page; number of groups which the user has joined; number of photos in which the user has been tagged; total number of friends.
We can use these as indications of a user’s engagement with the social networking website. Statistical measures were chosen because of the low correlation between entries in fields (counts of identical items in fields such as books, activities, interests, movies and TV shows were very low in number). Groups joined were represented through a count for similar reasons (lack of correlation and a large number of possible values). The other 6 attributes were discretely valued.
Discrete attributes: network (optional); home country (optional); political views (optional); relationship status (optional); gender (can be unspecified); timezone (relative to UTC, optional).
Networks on Facebook are structured in nature, being based on location or institution attended and chosen from a list of possible values. The nature of the data extracted also means there is a greater correlation within the sample set, since friendships are likely the result of a shared environment with an individual at some point. The use of numeric counts gives measures of usage rather than correlation of values between specific users. 
These counts also depend on the length of time since the user joined the site; low counts alone do not accurately indicate low levels of interaction. However, we expected to see patterns in the data dimensions which correlated with the node-link diagram. Those users with a large number of edges (in this set) would most likely be more active in the social network (both in terms of the site and in social gatherings in the real world), attracting more wall posts and being tagged in more photos than those users with a smaller number of edges.
In Figures 4, 5 and 6 the node-link diagram shows five distinct clusters. The largest of these clusters represents people from two universities where the author studied; the two groups are connected by a single node (a colleague of the author who also studied at both universities). The second largest cluster represents people from a secondary (high) school. The cluster with five fully connected nodes represents members of staff from a former workplace; one of the clusters of three nodes represents members of staff from a different former workplace. The final cluster of three nodes represents people from a former shared dwelling.
Figure 4 shows a number of individuals (in red) in three separate clusters who are also members of the same Facebook network (network names have been anonymised). Figure 5 highlights an interesting example. Person “CV” had the highest total number of friends (which includes friendships not represented here), as well as being featured in a large number of photos on the site. However, in the visualised social network, the individual was only connected to a single node (apart from the author), indicating a peripheral relationship to this network of people. Figure 6 represents an examination of some of the outliers from the two main clusters. The cumulative effect of partially opaque shading is apparent; the selection of three nodes (in red) and the progressive opacity reduction of case lines outside of the highlighted range gives greater focus to the relevant individuals.
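Observations of this kind, distinct clusters in the node-link view and individuals such as “CV” who are popular on the site yet peripheral in the visualised graph, can also be checked programmatically. The Python sketch below is a hedged illustration and not part of our tool: the function names are hypothetical, clusters are approximated by connected components, and `total_friends` stands for the "total number of friends" attribute.

```python
from collections import deque


def connected_components(nodes, edges):
    """Group nodes into connected components, largest first."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search from every node not yet assigned to a component.
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


def peripheral_nodes(nodes, edges, total_friends, max_degree=1):
    """Nodes with few edges in this graph, ordered by overall friend count."""
    degree = {node: 0 for node in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    low = [node for node in nodes if degree[node] <= max_degree]
    return sorted(low, key=lambda node: total_friends[node], reverse=True)
```

A node that appears at the head of the `peripheral_nodes` result has many friends overall but few connections within the visualised network, which is the pattern Figure 5 highlights.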
Fig. 4. Unconnected individuals in the main clusters who are part of the same Facebook network.
Fig. 5. Individual “CV”, who had the highest tagged photo and total friends counts, but is peripheral to this social network.
5
Discussion
Here we present three facets of our technique that are improved over a tool presenting a graph drawing or Parallel Coordinates plot in isolation: selection, correlation and filtering.
5.1
Selection
The ability to select individual nodes or groups of nodes by using the graph drawing as a selection interface is a significant boost in capability over plain PCV. Natural groupings of related nodes, which may be most likely to have correlated attributes, depending on the data, can be easily selected together. This also allows interesting subsets of the data to be viewed. For example, we can show only cases that have some relationship with another case within the data set, unburdened by the rest of the unrelated data. Brushing is possible in both views of the tool, with traditional swipes across ranges in the PCV coupled with the ability to select edges (and thus the nodes they connect) by drawing a line that intersects them.
5.2
Correlation
In our case study of Facebook data, we can see correlations in the interests of groups of friends by analysing the translucent overlay. An interesting result occurs when two nodes are selected which come from disconnected areas of the social graph. Occasionally a strong correlation between the attributes of these cases can be observed, suggesting, for example, similar personality traits and the potential for these people to get along well.
Fig. 6. Three outliers highlighted from the main clusters.
5.3
Filtering
We support filtering by allowing the view of the data represented in the PCV to be tailored according to the structure of the underlying graph. Those tuples that do not relate to any others in the data can be removed from the view, and the selection can be iteratively narrowed from large trees to individual parent or sibling nodes.
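Graph-driven filtering of this kind can be sketched in a few lines of Python. The fragment is illustrative only and makes several assumptions: cases are held in a dictionary keyed by node label, edges are undirected pairs, and narrowing is expressed as a shrinking hop radius around a focus node rather than as explicit parent or sibling relations.

```python
def related_cases(cases, edges):
    """Keep only the cases that take part in at least one edge."""
    connected = {node for edge in edges for node in edge}
    return {label: attrs for label, attrs in cases.items() if label in connected}


def within_hops(edges, focus, hops):
    """Labels reachable from the focus node in at most `hops` steps."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    reached = {focus}
    frontier = {focus}
    for _ in range(hops):
        # Expand one step outwards, ignoring nodes that were already reached.
        frontier = {n for node in frontier for n in adjacency.get(node, ())} - reached
        reached |= frontier
    return reached
```

Restricting the PCV to `related_cases` removes unrelated tuples, and calling `within_hops` with a decreasing radius narrows the displayed cases step by step towards a single node of interest.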
6
Conclusions
We have presented a visualisation technique which combines the benefits of a Parallel Coordinates view with those of a graph drawing, and which encodes the relationships between cases within the traditional Parallel Coordinates view. By combining the two, we can express the relationships between nodes in a data set while also making large amounts of data regarding each node’s attributes available in parallel, exploiting the strengths of both techniques. The coupling of these two distinct but complementary views allows a wider range of interactions than was possible before. These interactions can be used to drill down into the data to access rich details about individual items, and to aggregate information over natural subsets for enhanced visual analytics opportunities.
Acknowledgements: This work is partially supported by an EMBARK Scholarship from the Irish Research Council in Science, Engineering and Technology and Science Foundation Ireland under grant number 03/CE2/I303-1, “LERO: the Irish Software Engineering Research Centre.”
References
1. A.O. Artero, M.C.F. de Oliveira, and H. Levkowitz. Uncovering clusters in crowded parallel coordinates visualizations. Information Visualization, 2004. IEEE Symposium on, pages 81–88, 2004.
2. Philip Ball. Data visualization: Picture this. Nature, 418(6893):11–13, 2002.
3. Ulrik Brandes, Tim Dwyer, and Falk Schreiber. Visualizing related metabolic pathways in two and a half dimensions. In Graph Drawing, pages 111–122, 2003.
4. S.K. Card, J.D. Mackinlay, and B. Shneiderman. Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann, 1999.
5. P. Eades. A heuristic for graph drawing. Congressus Numerantium, 42(149160):194–202, 1984.
6. Robert D. Finn, Mhairi Marshall, and Alex Bateman. iPfam: visualization of protein-protein interactions in PDB at domain and amino acid resolutions. Bioinformatics, 21(3):410–412, 2005.
7. Ying-Huey Fua, Matthew O. Ward, and Elke A. Rundensteiner. Hierarchical parallel coordinates for exploration of large datasets. In VIS ’99: Proceedings of the conference on Visualization ’99, pages 43–50, Los Alamitos, CA, USA, 1999. IEEE Computer Society Press.
8. Benoit Gaudin and Aaron Quigley. Interactive structural clustering of graphs based on multi-representations. In 12th International Conference on Information Visualisation IV08, July 2008.
9. Stefan Hachul and Michael Jünger. An experimental comparison of fast algorithms for drawing general large graphs. In Graph Drawing, pages 235–250, 2005.
10. Nathalie Henry and Jean-Daniel Fekete. NodeTrix: a hybrid visualization of social networks. IEEE Transactions on Visualization and Computer Graphics, 13(6):1302–1309, 2007.
11. Ivan Herman, Guy Melançon, and M. Scott Marshall. Graph visualization and navigation in information visualization: A survey. IEEE Transactions on Visualization and Computer Graphics, 6(1):24–43, 2000.
12. Yifan Hu. Efficient and high quality force-directed graph drawing. The Mathematica Journal, 10:37–71, 2005.
13. Alfred Inselberg and Bernard Dimsdale. Parallel coordinates: a tool for visualizing multi-dimensional geometry. In VIS ’90: Proceedings of the 1st conference on Visualization ’90, pages 361–378, Los Alamitos, CA, USA, 1990. IEEE Computer Society Press.
14. Aaron J. Quigley. Experience with FADE for the visualization and abstraction of software views. In IWPC ’02: Proceedings of the 10th International Workshop on Program Comprehension, page 11, Washington, DC, USA, June 2002. IEEE Computer Society.
15. Casey Reas and Benjamin Fry. Processing: a learning environment for creating interactive web graphics. In SIGGRAPH ’03: ACM SIGGRAPH 2003 Sketches & Applications, pages 1–1, New York, NY, USA, 2003. ACM.
16. H. Siirtola. Direct manipulation of parallel coordinates. Information Visualization, 2000. Proceedings. IEEE International Conference on, pages 373–378, 2000.
17. H. Siirtola. Combining parallel coordinates with the reorderable matrix. Coordinated and Multiple Views in Exploratory Visualization, 2003. Proceedings. International Conference on, pages 63–74, 2003.
18. Martin Wattenberg. Visual exploration of multivariate graphs. In CHI ’06: Proceedings of the SIGCHI conference on Human Factors in computing systems, pages 811–819, New York, NY, USA, 2006. ACM.