Quality based guidance for exploratory dimensionality reduction

Sara Johansson Fernstad1,2, Jane Shaw2 and Jimmy Johansson1

Abstract High dimensional data sets containing hundreds of variables are difficult to explore, since traditional visualization methods are often unable to represent such data effectively. This is commonly addressed by employing dimensionality reduction prior to visualization. Numerous dimensionality reduction methods are available. However, few reduction approaches take the importance of several structures into account, and few provide an overview of the structures existing in the full high dimensional data set. For exploratory analysis, as well as for many other tasks, several structures may be of interest. Exploration of the full high dimensional data set without reduction may also be desirable. This paper presents flexible methods for exploratory analysis and interactive dimensionality reduction. Automated methods are employed to analyse the variables, using a range of quality metrics, providing one or more measures of ‘interestingness’ for individual variables. Through ranking, a single value of interestingness, based on several quality metrics, is obtained that can be used as a threshold for selecting the most interesting variables. An interactive environment is presented in which the user is provided with many possibilities to explore and gain understanding of the high dimensional data set. Guided by this, the analyst can explore the high dimensional data set and interactively select a subset of the potentially most interesting variables, employing various methods for dimensionality reduction. The system is demonstrated through a use-case analysing data from a DNA sequence-based study of bacterial populations.

Keywords High dimensional data, dimensionality reduction, quality metrics, visual exploration, interactive visual analysis.

1 C-Research, Linköping University, Sweden
2 Unilever Discover Port Sunlight, United Kingdom

Corresponding author: Sara Johansson Fernstad, Unilever Discover Port Sunlight, Quarry Road East, Wirral CH63 3JW, UK
Email: [email protected]
Telephone: +44 151 641 1187

1 Introduction

Data sets including hundreds of variables are increasingly common in various application areas, such as bioinformatics, simulations and social sciences. Analysis and exploration of this data is facilitated by the use of interactive systems and visual representations, but the number of variables that can be effectively displayed using standard visual representations, such as parallel coordinates1, 2 and scatter plot matrices,3 is limited. A common way of addressing this problem is to reduce the number of variables prior to visualization. Numerous dimensionality reduction methods are available, most of which preserve one or a few specific structures when reducing the data set. For many tasks, and especially for exploratory analysis, several structures may be of interest and, hence, extraction of a subset of variables that are interesting due to their influence on more than one structure is desirable. This is addressed in Johansson and Johansson,4 where an overall measure of variable interestingness is utilized, taking the form of a weighted sum of several quality metrics. In addition to this, most dimensionality reduction systems fail to provide overview of and insight into the structures of the full high dimensional data set. For a small number of structures and a moderate number of variables, an overview can be presented rather easily, for instance using a heat map as suggested by Guo.5 However, as the number of interesting structures and variables increases, this becomes a more complex task. Johansson Fernstad et al.6 address this in presenting a system for exploratory analysis of high dimensional microbial data. Their system was designed in collaboration with microbiologists and based on the concept of combining quality metrics. By utilizing a ranking algorithm7 they provided an overall measure of interestingness based on a set of quality metrics relevant for the microbiology domain. The system incorporated interactive and visual features into the analysis process, which provided user influence, guidance and aid in gaining insights into the characteristics of the microbial data set. Although originally designed for the microbiology domain, many of the features presented in that paper may be just as useful in a broader context, if a set of generic quality metrics and visual representations is utilized. That is the focus of this paper, which presents a generic and highly customizable system for exploratory dimensionality reduction. The system is based upon the ideas of previous papers,4, 6 but includes an extended set of visual and exploratory features for analysis of high dimensional data sets. The concept of interestingness is in this paper defined by a set of metrics relating to general statistical properties of data, which can easily be extended with pre-calculated metrics defined by the context of the problem domain. The user is also provided with functionality for interactively modifying the influence of individual metrics on the overall ranking. Many dimensionality reduction systems automatically retain only the most interesting variables according to some metric, or automatically replace groups of variables fulfilling some criteria. However, a high dimensional data set may quite possibly include variables of particular interest, due to structures or task context, that are not identifiable by an algorithm. The foundation of the techniques presented in this paper is to provide extensive user control combined with quality metric guidance.
This enables an interactive and exploratory dimensionality reduction, which is guided by variable ranks and quality metrics but not restricted to selecting only variables that fulfil a certain criterion. In an interactive system where dimensionality reduction is fully controlled by the user, it is important for the analyst to understand relationships within the high dimensional data set, to be able to make informed decisions. The system in this paper includes an interactive environment where variable ranks and quality metrics are used to provide an easily interpreted overview of the structures within the data set. The overview concurrently acts as a visual guidance and control feature for dimensionality reduction.

Figure 1: The primary window of the interactive environment displaying a bacterial population data set including 184 variables. The rank and quality metric profiles of the variables are displayed in the Ranking and Quality view (top left view) and in the glyph view (top right view). The high dimensional data set is displayed in the bottom view.

Figure 1 displays the primary window of the interactive environment. In the top half of the window, variable structures and relationships are displayed using parallel coordinates and a two-dimensional glyph plot. In the parallel coordinates, axes represent ranks and quality metrics, while polylines represent the variables of the high dimensional data set. In the glyph plot, the quality metric profile of each variable is represented by a glyph, laid out using Principal Components Analysis (PCA).8 The quality metrics are used as input vectors to the PCA, which facilitates the understanding of variable similarities and dissimilarities in terms of quality metric profiles. To summarize, the main contribution of this paper is a generic system for exploratory dimensionality reduction that includes:

• Several methods for interactive and quality guided dimensionality reduction, including:
  – variable filtering based on quality metrics and ranking;
  – creation of representative variables replacing groups of similar variables;
  – manual and quality guided selection of individual variables.
• Customizable variable ranking using a combination of quality metrics, which provides an overall measure of variable interestingness based on several structures.
• An interactive environment that provides:
  – an overview of the structures within the high dimensional data set, through this enabling analysis of variable relationships in terms of quality metric profiles;
  – the possibility of identifying variables with specific, desirable or unusual characteristics;
  – visual exploration of selected variable subsets using common visualization methods.

The remainder of this paper is organized as follows. The next section presents the background and related work; this is followed by a section that provides a brief overview of the system. In the section Algorithmic analysis the quality metrics and ranking are described in detail, and the section Interactive exploration and dimensionality reduction presents the interactive environment and dimensionality reduction features. The Use-case section demonstrates the usefulness of the presented techniques, and the final section concludes and discusses the presented work.

2 Background and related work

Visualization of high dimensional data sets, sometimes including hundreds of variables, is a major challenge in information visualization. It has, in the context of scalability, been identified as one of the top ten challenges within information visualization.9 This section presents previous research related to high dimensional data visualization and positions the contribution of this paper in the context of previous work.

2.1 Visualization

Common visualization methods, such as parallel coordinates,1, 2 scatter plot matrices3 and table lenses,10 are often useful for analysis of multivariate data sets of moderate sizes. However, as described by Eick and Carr,11 both human visual capabilities and hardware and interactivity strongly affect the efficiency of visual representations. The efficiency of most visual representations is rapidly reduced as the number of variables increases. However, some visual representations designed specifically for visualization of data sets with a large number of variables are available. Examples are the pixel based methods presented by Keim,12 where each pixel in the display represents a data value and where different pixel layouts are used to facilitate identification of different structures. Although able to display more variables than many traditional visualization methods, the pixel based methods are also limited by display size and resolution. They are furthermore perceptually constrained to viewing overall trends, as the limited size of a pixel makes it almost impossible to perceive details. A visual representation related to the glyph view presented in this paper is the Value and Relation (VaR) display, presented by Yang et al.13 The VaR display is a two dimensional plot where each variable is represented by a glyph, designed as a pixel plot representing the values of the data items. The glyph layout within the plot is based on relationships between variables. Related to this is also the work by Turkay et al.,14 where high dimensional data sets are represented in both item space and dimension space. In the dimension space, various statistical properties of variables are represented. It is, thus, similar in spirit to the quality metric profile representations used in both this and the preceding paper by Johansson Fernstad et al.6 In addition to visual representations, some systems for exploration of high dimensional data are also available. For instance, Barlowe et al.15 describe a system where partial derivatives are visually represented to the user, aiming to detect correlation among variables. To reduce clutter they utilize a step-wise visual exploration. Another example is the ClusterSculptor system, presented by Nam et al.,16 which combines interactive features and clustering algorithms for clustering of data sets including hundreds of variables.

Both of these systems focus on a specific task or structure, and their utility is hence more task dependent than that of the system presented here.

2.2 Dimensionality reduction

A common approach to visualization of high dimensional data is to employ dimensionality reduction prior to visualization. Through this, a lower dimensional projection is extracted which represents the most important structures within the data and which may be more straightforward and efficient to analyse. By using automated methods such as Multi-dimensional Scaling (MDS),8 Principal Components Analysis (PCA)8 and Self-organizing Maps,17 data sets including hundreds of variables can be efficiently projected onto a low dimensional space where the new variables are often linear combinations of the original variables. Other examples are the projection pursuit algorithm,18 which aims at identifying linear projections based on a measure of usefulness, and the family of linear transformations presented by Koren and Carmel,19 where several different properties of the data are made use of, such as similarities and cluster structures. Their dimensionality reduction approach was later utilized by Engel et al.,20 who combined it with a hierarchical clustering method and a visual representation based on star coordinates,21 to create a structural decomposition tree for high dimensional data. A slightly different approach to dimensionality reduction is the grand tour,22 where the analyst moves through a sequence of two-dimensional projections of a multivariate data set. Although these methods are efficient, the user often has very little influence over their result, and thus the user’s knowledge is not incorporated. An attempt to overcome this was presented by Williams and Munzner,23 where the user can guide and influence the MDS process through an interactive system. A disadvantage of many of the aforementioned methods is that the relationship between the original and the new set of variables is not always straightforward to the user. A more intuitive relationship can, for instance, be obtained by selecting a representative subset of the original variables, using some quality metric, or by separating the original variables into groups, using a representative variable for each group. In Ivosev et al.24 Principal Components Variable Grouping was presented, which is an extension of PCA where the principal components are used to group the original variables. Various interactive systems for dimensionality reduction are available, focusing on different structures within the data. In Jeong et al.25 a system aiming to assist the user in understanding the result of PCA is presented, using an interactive multiple coordinated views approach. Guo5 presents a system where variable pairs are ranked through a ‘goodness of clustering’ metric, enabling identification of potentially interesting subspaces using an entropy matrix to represent the variable pair ranks. A similar approach is available in the rank-by-feature framework, presented by Seo and Shneiderman,26 where variable pairs are ranked based on a selected ranking criterion. Artero et al.27 present an approach performing dimensionality reduction based on a similarity metric. In Sips et al.28 different projections of high dimensional data sets are ranked using class consistency as a quality metric. Similar approaches are presented by Tatu et al.29 and Albuquerque et al.,30 using various metrics to rank projections of different visual representations. Ferdosi and Roerdink31 presented methods for variable ordering in parallel coordinates and scatter plot matrices, focusing on multivariate structures and aiming to find relevant subspaces for clustering.
Although these interactive systems use different metrics and focus on different structures, they all consider only one metric at a time. The Visual Hierarchical Dimension Reduction (VHDR) system32 creates hierarchical structures based on the similarities between pairs of variables. The hierarchy can be interactively modified, and through it subsets of interesting variables can be selected. In 2003, Yang et al.33 presented the Dimension Ordering, Spacing and Filtering Approach (DOSFA), which is derived from VHDR. DOSFA performs dimensionality reduction based on a combination of similarity and a second importance value, such as variance. The reduction is carried out by removing all but one of a group of highly similar variables and by removing variables with low importance, hence partially introducing the idea of combining several quality metrics for dimensionality reduction.

Considering the problem of how a subset of variables might best be selected for analysis, with selection guided by a set of metrics of interest, Johansson and Johansson4 demonstrate how an overall measure of variable interestingness can be used to indicate to the user which variables should be displayed, and why they might be of interest. The measure takes the form of a weighted sum of metrics, with the user defining the weights pre-analysis. The sum provides a threshold for selecting a subset of interesting variables. The user decides on the number of variables to retain while investigating the trade-off between the number of variables and the loss of quality metrics in an interactive display. The most interesting variables, based on the weighted sum, are automatically retained. Another system, focusing on information loss, is presented by Schreck et al.34 They use a projection precision measure for comparison of the original and reduced data and visually incorporate it into the visualization. In terms of incorporating user guidance into the dimensionality reduction process, the DimStiller system, presented by Ingram et al.,35 is related to the system presented in this paper. In DimStiller, data analysis and dimensionality reduction are carried out as a chain of step-wise data transformations. The transformations are controlled by the user, who is guided at a global level by workflows aiding to find useful chains of transformations, as well as at a local level through visual feedback, facilitating parameter tuning and identification of the most informative settings for a single transformation. Johansson Fernstad et al.6 continued developing the concept of user guidance in presenting a system for exploration of high dimensional microbial data. That system used visual representations as guidance in an interactive dimensionality reduction process, and combined quality metrics into a single measure of interestingness using a non-dominated ranking approach.7 Non-dominated ranking is more commonly used in multiple objective optimization algorithms.7 Related work is found in the area of decision-making in optimization problems where, rather than metrics of interest, several objectives are considered and handled simultaneously. Examples using non-dominated ranking algorithms include Srinivas and Deb,36 drawing on Goldberg’s approach, and Fonseca and Fleming,37 who demonstrate an extensive framework for interactive decision-making around several objectives and preference exploration in this context. Subsequently, refinements to ranking approaches for considering multiple objectives have been developed and applied to many problem domains.

2.3 Summary and contributions

As described above, various dimensionality reduction methods and systems are available. Although useful for many different tasks, these methods may not always provide a suitable environment for exploratory dimensionality reduction and analysis of high dimensional data sets. The majority of the systems focus on only one or a few structures, and few provide any overview of the structures within the whole data set. Moreover, many are based on automated methods and do not provide the analyst with full control over the reduction. Using a weighted sum of quality metrics enables extraction of variable subsets which are important for preserving more than one structure. It has the powerful advantage of providing a single value as a threshold for the most interesting variables. Feedback from industrial data analysts has indicated that users may find this simplicity for decision-making attractive. However, by solely utilizing a single value of interestingness, information is lost regarding the values of the original metrics, and the user is unable to explicitly explore the contribution of each metric to the single measure. The user may also be interested in exploring the effect of individual metrics upon the overall view. Additionally, users may find it difficult to estimate an appropriate set of weights, given that they must consider the relative importance of a set of statistical concepts. This is addressed both here and in the previous paper6 by extending user interaction with both individual and summarized metrics.

[Figure 2 workflow diagram: Algorithmic Analysis (Input, Metric Setup, Quality Metric Analysis, Variable Ranking, PCA) followed by Visual Analysis (Explore HDD, Reduce Dimensionality, Explore Reduced Data), producing Quality Metric Output and Subset Output.]

Figure 2: The workflow of the system includes an algorithmic analysis where variables are analysed and ranked using a range of quality metrics, and an interactive environment which provides a visual overview of the structures in the data set and enables interactive selection of subsets of variables.

The previous system was tailored to specific tasks required by microbiologists, and incorporated biologically-motivated quality metrics along with visual features specific to the microbiology domain. The system presented in this paper is a generic system based on the previous ideas but designed to be useful within any domain. It is extended with metrics relating to general statistical properties and with additional exploratory and guiding features, facilitating flexible high dimensional data analysis and interactive dimensionality reduction.

3 System overview

The workflow of the proposed system can be separated into two main parts, as displayed in Figure 2. The initial part includes various algorithmic analyses, which are described in detail in the next section. The algorithmic analysis takes the high dimensional data set as input, along with a set of selected quality metrics. Within this part the data set is analysed, based on the set of metrics, and for each variable in the data set a value is extracted for each quality metric. This is followed by computation of variable ranks and PCA, both based on the extracted quality values. The second part of the system includes an interactive environment for visual analysis and exploratory dimensionality reduction, as described in detail in the section Interactive exploration and dimensionality reduction. Here the ranks and quality metric profiles of the variables, as well as the original data set, are displayed to the user using linked visual representations. Within the interactive environment the user is provided with various possibilities to explore and gain insights from the structures within the high dimensional data set. Guided by this, the user is provided with several methods for interactively selecting a subset of potentially interesting variables. In parallel with the reduction, the selected subset of variables is displayed using a common visual representation, providing basic functionality for an initial visual analysis of the reduced data set. The reduced data set, as well as quality values and variable ranks, can be exported and analysed using other appropriate methods and systems. The quality metric analysis is the only computationally demanding part of the workflow. All interactions and computations performed subsequently are based on the results of the analysis. Because of this, features such as variable ranking and dimensionality reduction can be performed instantly once the quality metric analysis has been carried out. The time complexity of the quality metric analysis depends entirely on the quality metrics used. Table 1 provides examples of computation times in milliseconds for the quality metrics described in the Quality metric analysis section. Further comments on computation times are provided along with the description of the corresponding metric. The performance is measured for four data sets of varying sizes. These exemplify the performance of the metrics both for sparse high dimensional data sets, where the number of dimensions is several times higher than the number of items, and for data sets where the number of items far exceeds the number of dimensions.

Metric             Synthetic       Bacterial      Ozone          Crime
                   (100 x 1320)    (184 x 50)     (73 x 2534)    (124 x 1994)
QPearson           264             39             239            527
QSpearman          508             300            500            1170
Qvariance          20              2              29             41
Qskewness          21              2              31             41
Qlinear            4669            548            4521           9955
Qcluster (fast)    1089            69             750            748
Qcluster           1104            5011           7020           7020
QoutlierLR         9               16             43             195
QoutlierED         3100            160            3849           6395

Table 1: Computation times in milliseconds of the quality metrics for the four data sets. The number of variables and items is shown in parentheses (variables x items).

The data sets used are: a synthetic data set, including 100 variables and 1320 items; a bacterial population data set described in the Use-case section, including 184 variables and 50 items;38 and two data sets from the UCI Machine Learning Repository.39 The UCI data sets are the eight hour peak Ozone Level Detection data set, including 73 variables and 2534 items, and the Communities and Crime data set, including 127 variables and 1994 items. The tests are run on an HP EliteBook 8540p laptop with an Intel i5 2.53GHz CPU, 4GB RAM and an Nvidia NVS 5100M graphics card.

4 Algorithmic analysis

All dimensionality reduction and exploration within the system are based on an algorithmic analysis of the full high dimensional data set, which is performed prior to visual analysis. The initial step of the algorithmic analysis is the quality metric analysis which is described in detail in the next section. This is followed by a variable ranking, described in the Variable ranking section, and application of PCA to the variables and quality metrics. The visualization of PCA in this context is described in more detail in the section Glyph view.

4.1 Quality metric analysis

The goal of the quality metric analysis is to identify structures in the high dimensional data set, and to assign individual quality values to each of the variables. The quality values represent the strength of a variable’s involvement in a specific structure. For each variable, one quality value is computed for each of the quality metrics. The output of the quality metric analysis is, hence, a set of vectors where each vector represents the quality values of one variable. In this paper a set of eleven quality metrics is used to demonstrate the presented techniques, three of which are identical to the quality metrics presented in Johansson and Johansson.4 The metrics have been selected through a review of quality metrics used and suggested in previous research in the area and in the literature on statistical data exploration, as well as through consultation with industrial data analysts. In addition to the eleven metrics included in the system, pre-calculated metrics can be loaded into the system. This provides flexibility in terms of letting the user focus the analysis on structures relevant for a specific domain. It also enables utilization of the computational strengths of existing software packages. It is worth emphasizing that the ranking, variable selection and exploration features presented in this paper are generic and do not rely on the specific quality metrics described in this section.

The metrics are primarily used as a proof of concept of how several quality metrics can be used concurrently to reduce and analyse a high dimensional data set. Any other quality metric from which it is possible to extract an individual quality value for each variable could be used just as well. It should also be noted that although the metrics described in this paper are all designed for numerical data, the basic concepts of the approach could be used for other types of data as well, provided the quality metrics used are properly designed for the corresponding data type. When computing the quality values of a variable there is sometimes a risk that involvement in a large number of insignificant structures adds up to what appears to be involvement in significant structures. To avoid this, various thresholds are used to define whether a structure is significant enough to be included. Throughout the remainder of this section the following notation will be used: a data set $X$ includes $M$ variables and $N$ items; $\vec{x}_i$ is an item, where $i = 1, \ldots, N$; $\vec{x}_j$ and $\vec{x}_k$ are variables, where $j, k = 1, \ldots, M$; and $x_{i,j}$ is the data value of item $\vec{x}_i$ in variable $\vec{x}_j$.

4.1.1 Clusters

A density-based approach is taken to the analysis of cluster structures, defining a cluster as a region with higher density than its surrounding regions. To identify multi-dimensional clusters the MAFIA cluster algorithm,40 which has evolved from the CLIQUE cluster algorithm,41 is used. The algorithm initially identifies one-dimensional dense units (clusters). It then iteratively extracts higher dimensional clusters by combining lower dimensional clusters, retaining only clusters with density above a given threshold. This is an approach similar to Apriori reasoning.42 Additional details on the clustering algorithm and the computation of cluster quality values can be found in Johansson and Johansson.4 The cluster metric is designed such that high cluster quality values are assigned to variables that are included in subspaces with high quality clusters. A cluster is considered to be of high quality if it has high density, high coverage and exists in a subspace which includes a large number of variables. As seen in Table 1, the cluster quality metric is one of the three most computationally demanding metrics, along with the linearity metric and the outlier metric based on Euclidean distance. The computation time of the clustering algorithm is highly dependent on the number of sub-clusters and on their dimensionality.40, 41 Hence, complex cluster structures may result in long computation times even if the total number of variables is not very high. To speed up the cluster analysis, the maximum dimensionality of a cluster is limited by a threshold that is defined by the user prior to analysis. Furthermore, a default option for fast cluster analysis is provided, where only cluster structures within individual variables are analysed. For data sets with a large number of one-dimensional cluster structures, indicating a long computation time, a warning is presented to the user, suggesting that the faster one-dimensional analysis should be used instead. For comparison, the computation times for both the fast cluster analysis of individual variables and a multi-dimensional cluster analysis are presented in Table 1.
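For illustration, a minimal sketch of the fast one-dimensional option is given below. It bins a single variable, marks bins that are denser than a uniform distribution would suggest, and merges adjacent dense bins into clusters scored by coverage and relative density. This is a simplified stand-in for the MAFIA/CLIQUE procedure rather than the actual implementation; the bin count and density factor are illustrative assumptions.

```python
import numpy as np

def q_cluster_1d(x, n_bins=50, density_factor=1.5):
    """Simplified 1-D cluster quality (a sketch of the 'fast' option).

    A bin is a dense unit if its count exceeds density_factor times
    the count expected under a uniform distribution; adjacent dense
    units are merged into clusters, each scored by its coverage
    (fraction of items) times its density relative to uniform.
    """
    counts, _ = np.histogram(x, bins=n_bins)
    expected = len(x) / n_bins
    quality, run = 0.0, []
    for c in counts:
        if c > density_factor * expected:
            run.append(c)                   # extend the current cluster
        elif run:
            quality += (sum(run) / len(x)) * (np.mean(run) / expected)
            run = []
    if run:                                  # cluster reaching the last bin
        quality += (sum(run) / len(x)) * (np.mean(run) / expected)
    return quality
```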

4.1.2 Correlation

Two quality metrics are available for analysis of correlation: the Pearson correlation coefficient,43 $r$, and the Spearman rank correlation coefficient,44 $\rho$. For both metrics the correlation coefficients of all variable pairs are calculated initially. Individual correlation quality values are then calculated for all variables, where the Pearson quality value for variable $\vec{x}_j$ is defined as $Q_{\mathrm{Pearson}}(\vec{x}_j) = \sum_{k=1}^{M} |r(\vec{x}_j, \vec{x}_k)|$ for $k \neq j$ and $|r(\vec{x}_j, \vec{x}_k)| \geq \epsilon$, where $\epsilon$ typically lies in the range 0.05 to 0.8. The Spearman correlation quality value, $Q_{\mathrm{Spearman}}(\vec{x}_j)$, is computed in a corresponding way using $|\rho(\vec{x}_j, \vec{x}_k)|$. Using this method, high quality values are assigned to variables which are strongly correlated with a large number of variables. If a high $\epsilon$ threshold is used, only the strongest correlations are taken into consideration. In addition to correlation analysis where positive and negative correlations are considered equally important, as above, quality values focusing on either positive or negative correlations are available as well. These are extracted by calculating $Q_{\mathrm{Pearson}}(\vec{x}_j) = \sum_{k} r(\vec{x}_j, \vec{x}_k)$ with $r(\vec{x}_j, \vec{x}_k) \geq \epsilon$ and with $r(\vec{x}_j, \vec{x}_k) \leq -\epsilon$, focusing on positive and negative correlation respectively, and similarly for $Q_{\mathrm{Spearman}}$. The computation time for the correlation metrics, as presented in Table 1, includes the total computation time for the absolute, positive and negative correlation metrics, since they are all based on the same analysis of pairwise correlations.
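The absolute-correlation variant can be sketched directly from this definition, assuming X is an N × M array with items as rows and eps playing the role of the $\epsilon$ threshold; the positive and negative variants follow the same pattern with signed sums.

```python
import numpy as np

def q_pearson(X, eps=0.3):
    """Q_Pearson(x_j): sum of |r(x_j, x_k)| over k != j with |r| >= eps.

    X is an (N items x M variables) array; one quality value is
    returned per variable.
    """
    R = np.corrcoef(X, rowvar=False)     # M x M matrix of pairwise r
    np.fill_diagonal(R, 0.0)             # exclude the k == j terms
    A = np.abs(R)
    A[A < eps] = 0.0                     # keep only correlations above eps
    return A.sum(axis=1)
```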

4.1.3 Distribution

Data distribution is analysed through two quality metrics: variance and skewness. The quality values, $Q_{\mathrm{variance}}(\vec{x}_j)$ and $Q_{\mathrm{skewness}}(\vec{x}_j)$, of a variable $\vec{x}_j$ are extracted by analysing the distributions within $\vec{x}_j$, using standard methods as described in Wackerly et al.45 and Kendall et al.46
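A short sketch using standard sample estimators is given below; taking the absolute skewness, so that strong skew in either direction scores highly, is an assumption, since the paper only cites standard methods.

```python
import numpy as np
from scipy.stats import skew

def q_distribution(X):
    """Per-variable distribution metrics: sample variance and absolute
    sample skewness. X is an (N items x M variables) array."""
    q_variance = np.var(X, axis=0, ddof=1)
    q_skewness = np.abs(skew(X, axis=0))
    return q_variance, q_skewness
```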

4.1.4 Linearity

The linear relationships within the data set are analysed based on linearity between pairs of variables. For each variable pair a line of best fit is identified, using linear regression methods as described in Draper and Smith.47 To determine whether the relationship between two variables is linear, a runs test45, 47 is used. The runs test examines whether the deviations of the residuals are random, indicating a linear relationship, or systematic, indicating a possibly non-linear relationship. A test statistic, $z$, is computed to test the null hypothesis that the residuals are randomly distributed. The rejection region for the null hypothesis in a two-tailed test is $|z| \geq 1.96$, using a 95% confidence interval.45 Hence, $|z| \geq 1.96$ indicates a non-linear relationship for the variable pair, whereas a small $|z|$ value indicates a high probability of a linear relationship. Within the presented system, high quality values are assigned to variables that are part of many variable pairs where the probability of linearity is high. The linearity quality value for variable $\vec{x}_j$ is defined as $Q_{\mathrm{linear}}(\vec{x}_j) = \sum_{k=1}^{M} (\zeta - |z_{j,k}|)$, where $k \neq j$ and $|z_{j,k}| \leq \zeta$, and where $\zeta$ typically has a value of 1.96, based on the confidence interval. As previously mentioned, and as seen in Table 1, linearity is one of the three most computationally demanding metrics.
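The runs test and the resulting quality value might be sketched as below. The sketch assumes ordinary least-squares fits and the Wald-Wolfowitz formulas for the expected number and variance of sign runs; it is an illustration, not the paper's implementation.

```python
import numpy as np

def runs_test_z(residuals):
    """Wald-Wolfowitz runs test on residual signs: small |z| indicates
    randomly scattered residuals, i.e. a plausible linear relationship."""
    signs = residuals >= 0
    n1 = int(signs.sum())            # positive residuals
    n2 = len(signs) - n1             # negative residuals
    if n1 == 0 or n2 == 0:           # one-sided residuals: treat as systematic
        return np.inf
    runs = 1 + int(np.sum(signs[1:] != signs[:-1]))
    mu = 2.0 * n1 * n2 / (n1 + n2) + 1.0
    var = (2.0 * n1 * n2 * (2.0 * n1 * n2 - n1 - n2)
           / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    return (runs - mu) / np.sqrt(var)

def q_linear(X, zeta=1.96):
    """Q_linear(x_j): sum of (zeta - |z_jk|) over pairs with |z_jk| <= zeta."""
    n, m = X.shape
    q = np.zeros(m)
    for j in range(m):
        for k in range(m):
            if k == j:
                continue
            slope, intercept = np.polyfit(X[:, j], X[:, k], 1)
            z = runs_test_z(X[:, k] - (slope * X[:, j] + intercept))
            if abs(z) <= zeta:
                q[j] += zeta - abs(z)
    return q
```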

4.1.5 Outliers

Two methods are provided for detection of outliers. One is based on the outlier definition used within linear regression analysis,47 where an outlier is defined as a residual whose absolute value lies a number of standard deviations from the residual mean. To identify two-dimensional outliers, the residuals of all variable pairs are analysed during the linear regression analysis described in the previous section. An item, $\vec{x}_i$, is defined as an outlier if its distance, $\delta_i$, from the residual mean is greater than $\tau$ standard deviations. Higher order outliers are defined as items which are outliers for a set of variable pairs. High significance is assigned to outliers of high dimensionality and with large $\delta_i$. An outlier value is computed for each higher order outlier, defined as $o_i = \sum_{l=1}^{K} \delta_i(l)$, where the sum runs over the $K$ variable pairs for which the item is a two-dimensional outlier.

The second method is a density and grid based approach using the Euclidean distance, described in detail in Johansson and Johansson.4 Using this method an item, $\vec{x}_i$, is defined as an outlier if the number of neighbouring items, $\psi_i$, within a given radius, $\phi$, around $\vec{x}_i$ does not exceed a threshold, $\varsigma$. Similar to the outliers based on residual distances, an outlier value, $o_i$, is computed for each outlier, with high significance being assigned to outliers of high dimensionality and with few neighbouring items. The quality values of the outlier metrics, $Q_{\mathrm{outlierLR}}(\vec{x}_j)$ and $Q_{\mathrm{outlierED}}(\vec{x}_j)$, for variable $\vec{x}_j$ are computed by summing the corresponding $o_i$ for all outliers belonging to $\vec{x}_j$, where $o_i \geq \varpi$ and $\varpi$ typically has a value of 1. In terms of computation times, as displayed in Table 1, the outlier analysis based on Euclidean distance is one of the three most computationally demanding metrics. For outlier analysis based on linearity, on the other hand, some of the required computations are performed during the linearity analysis. Hence, the presented computation time of $Q_{\mathrm{outlierLR}}$ corresponds only to the outlier detection part, and does not include identification of the line of best fit.
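The residual-based variant could be sketched as follows, reusing regression residuals from the linearity analysis. Measuring $\delta_i$ in standard-deviation units and attributing an outlier to both variables of the pair are assumptions made for illustration.

```python
import numpy as np

def q_outlier_lr(residuals, n_items, n_vars, tau=2.0, min_o=1.0):
    """Sketch of Q_outlierLR. residuals maps a variable pair (j, k) to
    the N regression residuals of x_k on x_j (reused from the linearity
    analysis). An item is a 2-D outlier for a pair if its residual lies
    more than tau standard deviations from the residual mean; its
    outlier value o_i accumulates these deviations over all such pairs.
    """
    o = np.zeros(n_items)                         # outlier value per item
    involved = {j: set() for j in range(n_vars)}  # outliers per variable
    for (j, k), res in residuals.items():
        delta = np.abs(res - res.mean()) / res.std(ddof=1)
        for i in np.where(delta > tau)[0]:
            o[i] += delta[i]
            involved[j].add(i)                    # the outlier belongs to
            involved[k].add(i)                    # both variables of the pair
    # A variable's quality sums the values of its significant outliers.
    return np.array([sum(o[i] for i in involved[j] if o[i] >= min_o)
                     for j in range(n_vars)])
```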

4.2 Variable ranking

The aim of utilizing variable ranking is to extract a single value of interestingness for each variable, based on the values obtained during quality metric analysis. Through this, the importance of several structures is taken into consideration, with the benefit of not requiring any prior knowledge from the analyst regarding the structures existing in the data set or regarding which structures are of most interest. The ranking algorithm used here was proposed by Goldberg;7 however, others are available and may be suitable for similar use. The rank provides a related filter which ensures that all variables with the same rank achieve a level of equivalence in their metric profiles, by use of the non-domination principle. Vectors containing the quality values of the variables are used as input vectors for the ranking algorithm. To obtain a variable ranking that is based on a subset of the available metrics, different combinations of quality metric vectors can be used. The ranking algorithm is formally defined as:

$(\vec{q}_j <_p \vec{q}_k) \iff (\forall m)(q_{m,j} \leq q_{m,k}) \wedge (\exists m)(q_{m,j} < q_{m,k})$   (1)

where $\vec{q}_j$ and $\vec{q}_k$ are vectors containing the quality values of variables $\vec{x}_j$ and $\vec{x}_k$.7 It is said that $\vec{q}_j$ is partially less than $\vec{q}_k$, denoted $(\vec{q}_j <_p \vec{q}_k)$, if the condition stated in eq. (1) is fulfilled, where $q_{m,j}$ is the quality value of variable $\vec{x}_j$ for metric $m$, with $m = 1, \ldots, Q$ and $Q$ the total number of quality metrics to rank by. Thus, $\vec{q}_j$ is defined as partially less than $\vec{q}_k$ if $\vec{q}_j$ is less than or equal to $\vec{q}_k$ for all metrics, and if there exists at least one metric for which $\vec{q}_j$ is less than $\vec{q}_k$. If $\vec{q}_j$ is partially less than $\vec{q}_k$, it is said that $\vec{q}_j$ dominates $\vec{q}_k$. A variable that is not dominated by any other is said to be non-dominated. For illustration, six variables ($\vec{x}_a$–$\vec{x}_f$) are assigned ranks based on two metrics ($m_1$, $m_2$), as displayed in Figure 3. The analyst is interested in variables with high values for both metrics, and hence wants a high rank value to be assigned to those. The metric values for each variable $\vec{x}_a$ to $\vec{x}_f$ are calculated and compared with each other. Variable $\vec{x}_f$ has the lowest values for both $m_1$ and $m_2$, and receives a rank of 1. Among the remaining unranked points, $\vec{x}_e$ has the lowest value of $m_2$ and $\vec{x}_d$ the lowest value of $m_1$, but neither has smaller values than the other in both metrics. $\vec{x}_e$ and $\vec{x}_d$ are thus non-dominated by each other and both receive rank 2. Similarly, points $\vec{x}_b$ and $\vec{x}_c$ receive rank 3. Point $\vec{x}_a$ has higher values in both metrics than all other points and receives a rank of 4. All variables now have a rank, and the process is complete.

Figure 3: An illustration of the non-dominated ranking process, based on Goldberg,7 for six variables ($\vec{x}_a$–$\vec{x}_f$) and two quality metrics ($m_1$ and $m_2$).

Due to the system structure, the variable ranking is based on the quality values that were extracted during quality metric analysis. The time complexity of the variable ranking thus depends on the number of quality metrics used and the number of variables in the data set. Using eleven quality metrics, the computation time for variable ranking in the four data sets described earlier was only a few milliseconds.
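A compact sketch of this peeling procedure is given below, assuming one row of quality values per variable, with higher values being better. It follows eq. (1) and the walk-through above: at each step the non-dominated variables among those remaining receive the lowest unassigned rank, so $\vec{x}_f$ receives rank 1 and $\vec{x}_a$ the highest rank.

```python
import numpy as np

def partially_less(qa, qb):
    """qa <_p qb per eq. (1): qa <= qb for every metric and qa < qb for
    at least one; qa is then said to dominate qb."""
    return np.all(qa <= qb) and np.any(qa < qb)

def non_dominated_ranks(Q):
    """Goldberg-style non-dominated ranking of an (M variables x K
    metrics) array. Repeatedly peel off the variables not dominated by
    any remaining variable, assigning ranks 1, 2, ... in turn."""
    M = Q.shape[0]
    ranks = np.zeros(M, dtype=int)
    remaining = set(range(M))
    rank = 1
    while remaining:
        front = {j for j in remaining
                 if not any(partially_less(Q[k], Q[j])
                            for k in remaining if k != j)}
        ranks[list(front)] = rank
        remaining -= front
        rank += 1
    return ranks
```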

5 Interactive exploration and dimensionality reduction

Within the interactive environment of the system, various functionalities are provided for selecting subsets of potentially interesting variables. The selection is guided by visual representations of the algorithmic analysis results. The visual representations both provide possibilities to explore the data and to gain insight into the overall characteristics of the high dimensional data set. They also facilitate identification of variables with specific, desirable or unusual characteristics. Prior to quality metric analysis the user is presented with the metric selection window, as displayed in Figure 4. The left half of the window contains an interface for selection of the quality metrics to use. The left list box displays metrics selected by the user and the right list box displays metrics that are available but not yet selected. The interface also provides possibilities to save and load a metric setup, and to load pre-calculated metrics. The right part of the metric selection window displays a two dimensional PCA-plot of the data set. This plot is included to provide a fast initial overview of the data, which may act as guidance for the selection of quality metrics and for the focus of subsequent analysis. The interactive environment includes several linked windows and views, described in detail in the following sections. In the data view at the bottom of the primary window, displayed in Figure 1, the high dimensional data set is displayed using a visual representation selected by the user. The current implementation of the system includes three common visual representations for the data view: parallel coordinates, scatter plot matrix and table lens. Due to their ability to display multivariate patterns, such as clusters, parallel coordinates are used as the default representation. The visual representation in the data view is interactively linked to the dimensionality reduction, which is described in the sections Variable merging and Variable selection, and only displays the currently selected subset of variables when the dimensionality is reduced. The interactive linking is designed such that the visual representation is reduced or expanded, in an accordion-like way, as variables are removed from or added to the selected subset.

Figure 4: The metric selection window for selection of quality metrics to use. A PCA-plot to the right provides an initial overview of structures which guides the user in selection of metrics. Here two clusters are clearly visible.

This provides an initial visual analysis of the selected variable subset, enabling fast confirmation as to whether the current subset is of interest for further analysis, or whether another subset should be selected instead. It also provides visual feedback for deciding on the appropriate number of variables to display, since the analyst will be able to perceive when patterns become visible while reducing the data. The variables in the data view can be interactively re-ordered through drag and drop features. However, since the focus of this work is primarily on interactive and quality guided dimensionality reduction, the current implementation of the system does not include any automated variable ordering algorithms. Nevertheless, recognizing the influence of variable order on our ability to perceive patterns in visual representations, the inclusion of automated variable ordering is considered for future implementations. If available, classification information can be loaded into the system along with the data set. The classification may then be used for colouring in the data view and as additional information displayed in mouse-over tool-tips. In addition to the visual representations of the primary view, a two dimensional PCA-plot of the full high dimensional data set is available in a separate window. This window is linked to the data view in the primary window in terms of colouring, selection and filtering. This plot uses the same PCA computation as the metric selection window.

5.1 Ranking and Quality view

The Ranking and Quality (RaQ) view is displayed in the top left part of the primary window (Figure 1). The aim of this view is twofold: firstly, it is meant to provide an overview of structures within the full high dimensional data set, in terms of the quality metric profiles of the variables; and secondly, it is meant to act as a control panel for interactive dimensionality reduction. Quality metric profiles may be thought of as multivariate patterns of variables, and interactive dimensionality reduction can be thought of as filtering of variables. Based on this, parallel coordinates were selected as the visual representation to use in the RaQ view, since they enable both analysis of multivariate patterns and interactive subset selection through filtering along axes. In the parallel coordinates of the RaQ view, the variables of the high dimensional data set are represented by polylines, whereas the axes of the plot represent variable rank and quality metrics. Hence, a polyline in the RaQ view corresponds to an axis in the data view. To emphasize the relationship between the views, the polyline colour in the RaQ view matches the axis colour in the data view.

Through the RaQ view the analyst can obtain an understanding of the overall structures within the data set. Through highlighting of individual polylines it also provides possibilities for examining the details of particular variable profiles. Furthermore, variables differing from the overall structures, or variables with specific desirable or undesirable properties, can easily be identified. By using the filter sliders of the axes it is possible to further explore the relationships between structures in the high dimensional data set. While examining the structures in the RaQ view, some quality metrics may be identified as a poor basis for extracting a subset of the most interesting variables in the current data set. By deselecting such quality metrics, using the check box below the corresponding axis in the RaQ view, new variable ranks are instantly re-computed, excluding the deselected quality metrics from the ranking procedure. Similarly, to increase the flexibility of the ranking, individual quality metrics can be inverted using the button below the corresponding axis in the RaQ view. For an inverted metric, a high rank will be assigned to variables with low quality values.

5.2 Glyph view

To the right of the RaQ view in the primary window is the glyph view, which is inspired by the VaR display.13 In this view each variable in the high dimensional data set is represented by a glyph. The glyphs are laid out in a two dimensional PCA-plot where the quality metrics are the input vectors of the PCA (equivalent to the second through seventh axes in the RaQ view in Figure 1). The glyphs are made up of a number of squares, each corresponding to one quality metric, where the opacity of the square represents the quality value. To emphasize the connection between the glyph view and the RaQ view, and to facilitate the interpretation of glyphs, the base colours of the glyph squares are the same as the axis colour of the corresponding quality metric in the RaQ view. Equivalently, the colour of the glyph borders corresponds to the colour of polylines and axes in the RaQ and data views. Glyphs that are not part of the currently selected subset of variables are displayed without a border. The glyph view complements the RaQ view through its focus on representing relationships between variables. Due to the use of PCA for the glyph layout, groups of variables with similar quality metric profiles can be identified. An example is the small cluster of glyphs positioned in the top right part of the glyph view in Figure 1. The spatial proximity of glyphs also helps in identifying variables that are not selected but have similar profiles to a selected variable, indicating that they may be of interest. It furthermore enables identification of variables that are outliers in terms of quality metric profiles, as they are separated from the majority of glyphs in the plot. This too may indicate properties worth investigating further. To simplify the identification of individual variables, mouse-over tool-tips displaying the corresponding variable name are displayed while hovering over glyphs. A major issue when using glyphs in scatter plot displays is the problem of glyph overlap, meaning that glyphs are positioned so closely that they cover each other. The glyph view of this system includes a zooming functionality for focusing on a subset of glyphs, which to some extent overcomes the overlap issue. Future options for overcoming this may be to provide possibilities of replacing glyphs with smaller points, or to spread a group of selected glyphs using a re-positioning algorithm. The former would reduce the overlap and provide better detail regarding which points are actually laid out closest together, but it would not provide any additional details regarding the quality metric profiles of the variables. The second approach, on the other hand, would preserve details on quality metric profiles, while to some extent distorting the representation of glyph relationships in the display. Both alternatives may be useful as selectable features for the user. The primary purpose of the glyph view is, however, to provide an overview of relationships between variables, for instance in terms of clusters and outliers, rather than to provide detailed information regarding the quality metric profiles of individual variables.
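The layout computation amounts to a PCA of the matrix of quality values; a minimal sketch using the SVD is shown below. Details such as scaling of the metrics are not specified in the paper and are omitted here.

```python
import numpy as np

def glyph_layout(Q):
    """Project each variable's quality metric profile onto the first
    two principal components, so variables with similar profiles land
    close together. Q is an (M variables x K metrics) array."""
    Qc = Q - Q.mean(axis=0)                  # centre each metric
    _, _, Vt = np.linalg.svd(Qc, full_matrices=False)
    return Qc @ Vt[:2].T                     # (M x 2) glyph positions
```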

Figure 5: The variable merging window, including a list of suggestions of variable groups to merge to the left and two visual representations to the right displaying the selected variable group together with its representative variable. The left view displays a group of three strongly correlated variables. The right view displays a group of six variables where the first and sixth (represented by black diagonal cells) are deselected and will not be included if the variables are merged.


5.3 Variable merging

Within a high dimensional data set there may be groups of variables which are very similar and, hence, may be more interesting as a group rather than as individual variables. To remove redundant information and provide more space for other variables, such groups may preferably be replaced by a representative variable. This is addressed in the presented system through the variable merging window (Figure 5). In this window, groups of highly similar variables are automatically extracted, based on the previously computed Pearson correlation. A correlation threshold for extracting these groups is interactively set by the user through a slider. The variable groups, as displayed in the left part of the window, are suggestions of variables that may be meaningful to merge, and can be selected for further examination. A selected group of variables is displayed in the right part of the window, along with a representative variable for the group. Here a combination of parallel coordinates and a scatter plot matrix is used for examination of correlation patterns. In the scatter plot matrix, blue coloured cells are used as an aid for distinguishing strong correlations from weaker correlations. The representative variable, positioned as the rightmost variable, represents the average of the variable group. Through the visual representations, the user gains an understanding of the relationships within the group, which guides the decision of whether or not to merge. Any variable within a group can be excluded from the merging, resulting in an instant re-computation of the representative variable. The right part of Figure 5 displays an example of this, where the first and sixth variables are excluded. Facilitated by the combination of automated extraction of strongly correlated variables and guidance through visual representations, an analyst is able to quickly make informed decisions as to whether groups of variables are better represented by a single variable.

While groups of strongly correlated variables may be hard to extract manually from a high dimensional data set, a fully automated method would not be able to distinguish variables that should not be merged. Reasons for not merging variables may, for instance, be a difference in meaning, or a correlation that is the result of a strongly skewed distribution. When a group of variables is merged it is replaced by the representative variable within all views of the primary window. Quality values and ranks are computed for the new variable, employing the algorithms described in the Algorithmic analysis section and using the average values of any pre-calculated metrics. To make the representative variable distinguishable from the original variables, it is represented by red colour in all views.
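The extraction and merging steps could be sketched as follows. The representative variable is the group average, as in the system; the greedy seeding of groups by absolute Pearson correlation against a threshold is an assumption, since the paper does not specify the exact grouping rule.

```python
import numpy as np

def merge_suggestions(X, threshold=0.9):
    """Suggest groups of highly similar variables: each unassigned
    variable seeds a group of all variables whose absolute Pearson
    correlation with it meets the user-set threshold."""
    R = np.abs(np.corrcoef(X, rowvar=False))
    m, assigned, groups = R.shape[0], set(), []
    for j in range(m):
        if j in assigned:
            continue
        group = [j] + [k for k in range(m)
                       if k != j and k not in assigned and R[j, k] >= threshold]
        if len(group) > 1:            # only multi-variable groups are suggested
            groups.append(group)
            assigned.update(group)
    return groups

def representative(X, group):
    """The representative variable is the average of the group."""
    return X[:, group].mean(axis=1)
```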

5.4 Variable selection

The primary dimensionality reduction within the system is controlled through the RaQ view. The filter sliders of the axes are used to select a subset of variables, enabling an interactive dimensionality reduction which is guided by quality metrics and ranking. The variable selection is instantly performed and reflected in the bottom view, where only the selected subset of variables is displayed. This provides instant visual feedback regarding the number of variables that can be effectively displayed in the currently used visual representation. It also provides a clear indication of when interesting structures become visible. Many dimensionality reduction systems automatically retain a subset of the most interesting variables, not making full use of the knowledge of the user. Selection through filtering provides user control and guidance in identifying potentially interesting variable subsets. However, a data set may still include variables of interest which may not have been assigned high ranks. To enable a fully user-controlled reduction, this system allows for manual selection of variables. Any manually selected variable will be unaffected by filtering and retained in the displayed variable subset until deselected. The manual selection includes selection through picking of polylines in the RaQ view, selection of axis headers in the data view and selection of glyphs in the glyph view, as well as selection from a list including variable names and ranks. Polylines, glyph borders and axes representing manually selected variables are highlighted in blue. Likewise, a variable assigned a high rank may be less interesting than its rank indicates, and can hence be manually removed by the user in a similar way. Through the ranking, an overall measure of interestingness based on several quality metrics is provided. Hence, when filtering along the rank axis (leftmost axis in the RaQ view in Figure 1), a subset of the most interesting variables based on several quality metrics is selected, which aids the user in quickly selecting a potentially interesting subset. However, an issue that arises is that the number of unique ranks tends to decrease as the number of quality metrics increases, concentrating many variables into few ranks and sometimes making the rank filtering too broad a tool for reduction. Due to this, a subdivision of ranks may be desirable. This is provided through the possibility of spreading the polylines within a rank according to one of the quality metrics. Furthermore, variables with similar quality metric profiles may sometimes be separated into different ranks. This is addressed through the possibility of filtering on individual quality metrics, as well as through the possibility of identifying and selecting variables with similar profiles within the glyph view.
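The combination of filtering and manual overrides could be expressed as below; the parameter names (rank_min, metric_mins, pinned, removed) are hypothetical, chosen to mirror the rank slider, per-metric sliders and manual selection described above.

```python
def select_variables(ranks, Q, rank_min=0, metric_mins=None,
                     pinned=(), removed=()):
    """Return indices of the displayed subset: manually removed
    variables are always excluded, manually pinned ones always kept,
    and the rest must pass the rank and per-metric filter sliders."""
    metric_mins = metric_mins or {}
    keep = []
    for j, r in enumerate(ranks):
        if j in removed:
            continue
        if j in pinned or (r >= rank_min and
                           all(Q[j, m] >= t for m, t in metric_mins.items())):
            keep.append(j)
    return keep
```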

6 Use-case

This section demonstrates some of the presented techniques and describes how they may be used to explore a high dimensional data set for identification of interesting structures and formation of hypotheses. The use-case demonstrates how an analyst may use methods in the presented system to analyse a bacterial population data set, but is not intended to provide an exhaustive account of how a full analysis might be completed. The data set analysed is from a 16S ribosomal DNA sequence-based study of bacterial populations. Such studies can generate high dimensional data sets which require both exploratory and confirmatory analysis. In this study, data represent levels of 184 OTUs (operational taxonomic units), indicative of different bacterial species in the human mouth. It comprises 95% of the cumulative total population found in fifty samples, taken from ten healthy panellists at five sites of the mouth. The goal of such a study is to define the bacterial ecology of the mouth and thereby develop innovative products for oral health.38 The analyst would like to explore the profile of the samples across bacterial populations in the context of the full data set, and thereby identify possible differences between sample sites in terms of bacterial counts and explore which bacteria may be more or less commonly occurring within different sites or subjects. When analysing this data an analyst would expect to apply a range of statistical and multivariate techniques, including familiar dimensionality reduction methods such as PCA or MDS,8 as well as data visualization approaches. The techniques presented in this paper can accompany such analysis by enabling visual exploration and guidance. In the context of this use-case, data will be referred to using the following terminology: samples correspond to data items, and variables are referred to as OTUs. Thus, OTUs are represented by polylines and glyphs in the RaQ and glyph views. In the data view, at the bottom of the primary window, OTUs are represented as axes and samples as polylines. The OTU value of a sample relates to the bacterial count of the corresponding OTU.

6.1 Initial overview

The data set is loaded into the system. The initial PCA-plot, displayed in Figure 4, presents the analyst with a first overview of structures. It can be seen that the majority of samples belong to one of two clearly separated clusters. The analyst wants to examine this pattern further, to identify which structures and combinations of OTUs may drive the separation, and uses the identified pattern as a basis for her continued analysis and quality metric selection. Firstly, the three Pearson correlation metrics are selected, since correlations may indicate co-occurrence of combinations of OTUs within samples, which is an important aspect of understanding bacterial populations. Secondly, the cluster metric is selected together with the variance and skewness metrics, since their analysis of sample distribution and grouping within OTUs may help in examining sample group separation.

When the automated quality metric analysis and ranking have been performed, the high dimensional data set, as well as the quality metrics and OTU ranks, are displayed in the primary view of the system, as shown in Figure 1. It is apparent that the lower view, displaying the full high dimensional data set, is too crowded for exploration to be conducted without some reduction. However, in the RaQ view, where the quality metric profiles of the variables are represented, the analyst notices some interesting patterns. For instance, only a small number of OTUs include strong cluster structures and only two are strongly skewed, as can be seen from the few polylines representing OTUs with high values for the skewness and cluster quality metrics (first and third axes from the right in the upper left view in Figure 1).

Figure 6: Highlighting quality metric profiles of three OTU clusters selected in the glyph view. OTUs are represented by polylines in the left view and by glyphs in the right view.

Moreover, the RaQ view reveals a negative correlation between the variance and skewness metrics (the two rightmost axes). In the glyph view, some additional relationships between OTUs are identified. Some smaller glyph clusters, separated from the majority of glyphs, are identified through their spatial proximity and similarity in colour. These represent highly similar OTUs that differ from the majority of OTUs in terms of quality metric profiles, and may hence include patterns that are important for understanding the bacterial population. By selecting the clusters in the glyph view, the OTUs are highlighted in blue in all other views and their individual quality metric profiles may be examined in the RaQ view, as displayed in Figure 6. The top and middle views both include OTUs strongly involved in cluster structures, but differing in their involvement in correlation patterns. The bottom view, on the other hand, includes OTUs with weak cluster structures that are strongly involved in the negative and overall correlation of the data set. As demonstrated, groups of OTUs involved in similar structures can quickly be identified in the glyph view and may, together with the structures found in the RaQ view, guide the analyst in subsequent exploration.
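The exact metric definitions used by the system are not repeated here; purely as an illustration of the kind of per-variable statistics involved, the sketch below derives rescaled variance, skewness magnitude and an overall correlation involvement for each OTU column of a samples-by-OTUs matrix. The name metric_profiles and the rescaling of each metric to [0, 1] are assumptions made for the example, not the system's actual formulas.

import numpy as np
from scipy.stats import skew

def metric_profiles(data: np.ndarray) -> dict:
    """Per-variable statistics; data is (n_samples, n_variables),
    one column per OTU, with non-constant columns assumed."""
    def rescale(v):
        rng = v.max() - v.min()
        return (v - v.min()) / rng if rng > 0 else np.zeros_like(v)

    corr = np.corrcoef(data, rowvar=False)   # pairwise Pearson correlations
    np.fill_diagonal(corr, 0.0)              # ignore self-correlation
    return {
        "variance": rescale(data.var(axis=0)),
        "skewness": rescale(np.abs(skew(data, axis=0))),      # asymmetry magnitude
        "overall_correlation": rescale(np.abs(corr).mean(axis=0)),
    }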

6.2 Filtering using rank

As all selected metrics represent structures that may be involved in driving differences between samples and separating them into groups, the analyst initially has no strong preference for one metric over another. At this point the rank axis (leftmost axis in the RaQ view in Figure 1) provides a useful tool, as it offers an overall measure of interestingness taking all quality metrics into consideration. As such, it enables a straightforward method for selecting a possibly interesting OTU subset, to give a clearer view of the sample profiles in the bottom view. Initially, five ranks are available and, as visible in Figure 1, a relatively large number of variables are assigned to each rank. Thus, the rank axis alone may be too broad a tool for reduction. This may be approached by subdividing the ranks based on one of the metrics. However, in this particular case the analyst prefers to utilize the option of modifying the quality metrics' influence on the rank, which often yields a different ranking. Due to the negative correlation between variance and skewness, found in the RaQ view, these metrics may counteract each other in the ranking. The analyst speculates that variance may be of more interest than skewness, since the initial PCA-plot, whose axes represent the directions explaining the largest and next largest amounts of variation in the data, clearly displayed a separation between samples.

She further examines the skewness patterns in the data set by filtering along the skewness axis in the RaQ view, selecting different subsets of OTUs to display in the bottom view based on their skewness, as shown in Figure 7. Through this she identifies that a majority of the samples, represented by polylines in the bottom view, have very low counts for OTUs with high skewness, as displayed in the top window in Figure 7. This indicates that the selected subset of OTUs represents bacteria present in only a small number of samples. The bacterial counts of the samples in OTUs with low skewness, as displayed in the bottom window in Figure 7, seem to be higher. Hence, OTUs with low skewness may be more interesting to examine, and the analyst therefore inverts the skewness metric using the button below the axis. This instantly results in a re-computation of rank, now assigning high ranks to OTUs with low skewness. This relationship would not have been found as easily using an automated method to assign variable interestingness. Due to the visual representation of algorithmic analysis results, relationships among quality metrics that may influence the ranking can be identified, examined and dealt with as the analyst considers most appropriate.

Moreover, the analyst is concerned that correlation may be given too much weight in the ranking, since three correlation metrics are used. Hence, she decides to exclude the positive and negative correlation metrics from the ranking algorithm using the check boxes below the axes. The ranking is recomputed within a few milliseconds, providing a new measure of OTU interestingness. As a result of the modification and re-computation, the number of unique ranks increases. The analyst uses the filter sliders of the leftmost axis in the RaQ view to retain the nineteen OTUs in the four highest ranks, as displayed in Figure 8. As visible from the 'X-like' structures between OTU pairs in the lower view, some of these OTUs appear to be strongly negatively correlated.
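Under the same assumptions as the earlier ranking sketch, inverting or excluding a metric is a small transformation of the metric matrix before ranks are recomputed. The function rerank below is again a hypothetical name, reusing pareto_ranks from that sketch; it is one plausible reading of the toggles described here, not the system's implementation.

def rerank(metrics, inverted=(), excluded=()):
    """Recompute ranks after toggling metrics (given as column indices).

    Inverting a column makes low values count as interesting, e.g. low
    skewness; excluded columns are dropped before re-ranking."""
    m = metrics.copy()
    for j in inverted:
        m[:, j] = m[:, j].max() - m[:, j]   # flip the metric's direction
    keep = [j for j in range(m.shape[1]) if j not in excluded]
    return pareto_ranks(m[:, keep])         # from the earlier sketch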

6.3 Examination of correlation patterns

For a bacterial population, correlations between OTUs may reveal patterns of symbiosis: groups of OTUs that commonly co-habit, represented by positive correlation, as well as OTUs that rarely co-habit, represented by negative correlation. The analyst has several options for exploring correlation between OTUs and chooses to start by examining groups of positively correlated OTUs using the variable merging window. In this particular case the analyst is not interested in merging the OTUs, as the individual OTUs and their involvement in driving processes are the main concern of the exploration. Nonetheless, the variable merging window extracts groups of OTUs that may be likely to co-habit and is hence an efficient tool for a first examination of symbiosis patterns. In other analysis situations it would enable fast extraction of potentially redundant variables. Using a correlation threshold of 0.9, two groups are extracted, as displayed in Figure 9. The first group includes OTUs A12 and A39 (left view) and the second includes OTUs A132 and A180 (right view). The analyst notes that for the second group most samples appear to have a count of zero, whereas the samples are more evenly distributed for the first group. This may indicate that the correlation of OTUs A132 and A180 is mainly due to a small number of atypical samples, rather than to co-habitation of OTUs.
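The merging criterion can be read as connected components of the graph whose edges connect OTU pairs with pairwise Pearson correlation at or above the chosen threshold. The union-find sketch below, with the hypothetical name correlated_groups, illustrates that reading; it is not claimed to match the system's exact merging logic.

import numpy as np

def correlated_groups(data: np.ndarray, threshold: float = 0.9):
    """Group variables linked by pairwise correlation >= threshold.

    Returns groups with more than one member, such as the {A12, A39}
    and {A132, A180} pairs found in the use-case."""
    corr = np.corrcoef(data, rowvar=False)
    n = corr.shape[0]
    parent = list(range(n))                 # union-find forest over variables

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if corr[i, j] >= threshold:
                parent[find(i)] = find(j)   # merge the two components

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) > 1]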

Figure 7: Filtering to select a subset of OTUs using the skewness metric (rightmost axis in the RaQ view). OTUs with high skewness (selected in the top window) appear to have low counts for most samples, as visible in the data view, whilst the bacterial counts within OTUs with low skewness (selected in the lower window) are generally higher.

Figure 8: A subset of the highest ranked OTUs, represented by polylines in the top parallel coordinates and by axes in the bottom parallel coordinates, after inverting the skewness metric and excluding positive and negative correlation from rank computation.

In the context of this use-case, not only positive but also negative correlations are of interest, as they may indicate OTUs that rarely co-habit. The occurrence of negatively correlated OTUs is already indicated by the previously identified 'X-like' structures. The analyst decides to explore these patterns further using a scatter plot matrix, as displayed in Figure 10, as it provides a clear quantitative measure and direction of the associations in the data. In the scatter plot matrix, the Pearson correlation coefficients of OTU pairs are represented by coloured cells in the top left part of the matrix, red representing negative correlation and blue representing positive correlation. In Figure 10 two groups of OTUs are visible, with strong positive correlation within the groups and strong negative correlation between the groups. The smaller group includes four OTUs (A6, A20, A26 and A32), marked with black dots in the corresponding diagonal cells in Figure 10. The analyst checks the IDs of the OTUs through tool-tips and makes a note to examine them further using various statistical methods. As the groups seem to be part of multivariate correlation patterns, the analyst returns the lower plot to parallel coordinates to examine this further.
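One plausible realization of the red and blue cell colouring is a simple diverging colour map over the Pearson coefficient; the function correlation_colour below is an assumption for illustration, not the system's actual palette.

def correlation_colour(r: float) -> tuple:
    """Map Pearson r in [-1, 1] to RGB: red for negative correlation,
    blue for positive, fading to white around zero."""
    r = max(-1.0, min(1.0, r))
    if r < 0:
        return (1.0, 1.0 + r, 1.0 + r)   # towards red as r -> -1
    return (1.0 - r, 1.0 - r, 1.0)       # towards blue as r -> +1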

6.4 Examination of sample groups

In the parallel coordinates, displayed in the bottom view in Figure 8, the analyst notices that there appears to be some separation between two groups of samples across the axes. This is investigated further by manually selecting, in the lower parallel coordinates, polylines representing the group of samples linking high values of A32 to low values of A35, as this group is clearly concentrated and easy to select by hand. The selected items are highlighted in black in the lower view, as displayed in the top part of Figure 11. Once all are selected, it is visible that two different profiles emerge across several axes. For example, the highlighted sample group is generally high in the previously identified group of A6, A20, A26 and A32, whereas the remaining samples are not.
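The manual selection described here amounts to a value-range brush over two OTU axes. A minimal sketch, where the quantile threshold q and the function name brush are assumptions made for the example:

import numpy as np

def brush(data, high_var, low_var, q=0.75):
    """Select samples with high values on one OTU and low on another.

    data is (n_samples, n_variables); returns a boolean sample mask."""
    hi = data[:, high_var] >= np.quantile(data[:, high_var], q)
    lo = data[:, low_var] <= np.quantile(data[:, low_var], 1 - q)
    return hi & lo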

Figure 9: The variable merging window displaying two groups of OTUs with correlation above 0.9. The left window displays a group including OTUs A12 and A39 and the right window displays a group including OTUs A132 and A180.

Figure 10: The nineteen highest ranked OTUs displayed using a scatter plot matrix where positive and negative correlations are represented by blue and red cells respectively.

Figure 11: Through visible correlations a sample group is selected and highlighted in black to explore consistent sample profiles across OTUs. As OTUs are re-introduced, some profile differences remain visible.

The analyst is interested in exploring this difference further, and starts to lower the filter on rank to re-introduce more OTUs. As visible in the middle and bottom parts of Figure 11, some profile differences remain even as many additional OTUs are re-introduced. The labels of the selected items are checked through a mouse-over tool-tip. It is found that the highlighted set of rows includes only two of the five sampled sites of the mouth, sites A and C. Domain specialists confirmed that these two sites are different in nature from the other three sampled. The parallel coordinates in Figure 11 indicate that this site difference may also be reflected in different population profiles. Analysis conducted independently of this process, using statistical and multivariate techniques, supports the hypothesis that differences may exist between the populations in these sites.38 Through the various features presented in this paper, the analyst then continues to select different OTU subsets, examining relationships between the sample groups and establishing groups of OTUs that may be involved in separating the sample groups and hence driving differences between the two sample sites.

6.5 Summary of use-case

The presented techniques have given the analyst methods for interactively exploring the sample profiles across groups of OTUs, something that is normally not as easily and quickly done. The analyst's workflow for analysing this data would commonly include statistical tests, multivariate approaches such as PCA, and network visualization tools such as Cytoscape.48 Furthermore, the QIIME pipeline,49 which is used to process the sample data, provides a range of unlinked visual outputs such as pie-charts, heat maps, trees and networks. These methods provide a wide range of analyses, but do not enable interactive exploration and examination of various OTU subsets, where the analysis route itself is driven by insights gained during analysis, to the same extent as the system presented in this paper.

As an initial analysis, the exploration described in this use-case has acted as a means of identifying interesting structures and suggesting a focus for subsequent analysis. Moreover, it provides an illustration of differences between sampled sites for further study, communication and discussion with the analyst's team. In the course of the exploration, ideas around other possible structures in the data have been noted for further investigation. In general, the exploration could continue in a similar fashion, combining different quality metrics to obtain new OTU ranks, and selecting and deselecting OTUs that appear interesting or uninteresting. Subsets of potentially interesting OTUs can also be extracted and explored using other visualization methods, and interesting relationships and patterns within the high dimensional data set can be identified for further exploration, experimentation and discussion with the analyst's team.

7 Discussion and Conclusions

This paper presents a generic system enabling exploration of high dimensional data sets and interactive dimensionality reduction. The reduction is guided by visual representations combined with quality metrics and a quality-based variable ranking. A main advantage of the techniques presented lies in their ability to provide an easily interpreted overview of the structures and relationships between variables in the high dimensional data set. This overview facilitates the understanding and identification of possibly interesting structures, as well as providing guidance when exploring the data. Furthermore, the combined visual and algorithmic techniques provide various features for interactive and exploratory dimensionality reduction, controlled by the user and guided by variable ranks and visual overview.

The techniques have been demonstrated in a use-case, as an example of the potential and usefulness of this kind of system for guided exploration of high dimensional data sets. The use-case also provided examples of some of the analysis routes that might be used in practice. It has been illustrated how exploratory work might generate hypotheses around bacterial population differences that can then be supported by confirmatory analysis techniques.38 The presented techniques, and analysis routes similar to the ones described, can also be useful within other domains dealing with high dimensional data sets. In the use-case example, the benefit of the system lies in visually showing differences in bacterial populations within various data subsets, enabling interactive exploration by analysts and assisting the identification of potentially interesting structures that can guide subsequent analysis. It has also been demonstrated how the analyst, aided by the combination of visual and algorithmic guidance, is able to make informed decisions within an efficient dimensionality reduction process. Furthermore, the analysis carried out enabled the analyst to explore multivariate patterns in the data, which is normally not as easily done, and to gain insights through it. It also provided the analyst's team with new ways of thinking about the data and, through this, generated ideas about other possible structures and metrics of interest.

A potential issue with the presented system is its flexibility and complexity in relation to its ease of use. For straightforward analysis tasks, an intuitive interface with a limited set of analysis options and functionality may often be most useful. However, as tasks grow more complex, systems are required to provide more complex analysis options, often with the trade-off of less intuitive interfaces. The interface of the presented system is not fully intuitive, in the sense that some introduction is required for a new user to understand how to interpret it and make use of its functionality. There is also a potential risk of confusion regarding which aspects of the data set are displayed in which view, especially when parallel coordinates are used in the bottom view, since that is also the representation used in the RaQ view. This has, however, not been an issue with the end users introduced to the techniques so far, who include a small group of microbiologists and informaticians. A longitudinal user study with domain experts will be the subject of future work and will provide further information on the utility and usability of the techniques.

Another area of future work is the application of the techniques to different domains, to further establish their generality and usefulness. The presented techniques are in many ways customizable and, hence, suitable for a range of domains. Firstly, the overall approach of using quality metrics to extract the potentially most interesting variables, and of combining quality metric analysis and interactive visualization to provide flexible and user-controlled dimensionality reduction, could be applied to almost any type of data, provided the quality metrics used are appropriate for the data in question. Secondly, the system as currently implemented is flexible in allowing pre-calculated metrics to be included along with the metrics available in the system. Through the use of pre-calculated metrics, the analysis can be adapted to focus on domain-specific issues. Thirdly, the analyst is able to design and select relevant metrics, to modify their influence on the overall measure of interestingness, and to make informed decisions based on identified data structures. More options for selecting and interacting with metrics may be desirable and are the subject of future work.

Acknowledgement

This work was supported by Unilever Discover Port Sunlight; the Visualization Programme coordinated by the Swedish Knowledge Foundation; and the Swedish Research Council in the Linnaeus Centre CADICS.

References

[1] Inselberg A. The plane with parallel coordinates. The Visual Computer 1985; 1(4): 69–91.

[2] Wegman EJ. Hyperdimensional data analysis using parallel coordinates. Journal of the American Statistical Association 1990; 85(411): 664–675.

[3] Becker RA and Cleveland WS. Brushing scatterplots. Technometrics 1987; 29(2): 127–142.

[4] Johansson S and Johansson J. Interactive dimensionality reduction through user-defined combinations of quality metrics. IEEE Transactions on Visualization and Computer Graphics 2009; 15(6): 993–1000.

[5] Guo D. Coordinating computational and visual approaches for interactive feature selection and multivariate clustering. Information Visualization 2003; 2(4): 232–246.

[6] Johansson Fernstad S, Johansson J, Adams S, Shaw J and Taylor D. Visual exploration of microbial populations. In: Gehlenborg N, Machiraju R, Möller T, editors. Proceedings of the 1st IEEE Symposium on Biological Data Visualization (BioVis); 2011 October 23-24; Providence, RI, USA; 2011.

[7] Goldberg DE. Genetic algorithms in search, optimization and machine learning. 1st ed. Boston: Addison-Wesley Longman Publishing Co, 1989.

[8] Cox T. Introduction to multivariate analysis. 1st ed. Hodder Arnold, 2005.

[9] Chen C. Top 10 unsolved information visualization problems. IEEE Computer Graphics and Applications 2005; 25(4): 12–16.

[10] Rao R and Card SK. The table lens: merging graphical and symbolic representations in an interactive focus + context visualization for tabular information. In: Adelson B, Dumais S, Olson J, editors. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: Celebrating Interdependence; 1994 April 24-28; Boston, MA, USA; 1994; 318–322.

[11] Eick SG and Karr AF. Visual scalability. Journal of Computational and Graphical Statistics 2002; 11(1): 22–43.

[12] Keim DA. Designing pixel-oriented visualization techniques: theory and applications. IEEE Transactions on Visualization and Computer Graphics 2000; 6(1): 59–78.

[13] Yang J, Patro A, Huang S, Mehta N, Ward MO and Rundensteiner EA. Value and relation display for interactive exploration of high dimensional datasets. In: Ward M, Munzner T, editors. Proceedings of the 10th IEEE Symposium on Information Visualization; 2004 October 10-12; Austin, TX, USA; 2004; 73–80.

[14] Turkay C, Filzmoser P and Hauser H. Brushing dimensions – a dual visual analysis model for high-dimensional data. IEEE Transactions on Visualization and Computer Graphics 2011; 17(12): 2591–2599.

[15] Barlowe S, Zhang T, Liu Y, Yang J and Jacobs D. Multivariate visual explanation for high dimensional datasets. In: Ebert D, Ertl T, editors. Proceedings of the IEEE Symposium on Visual Analytics Science and Technology; 2008 October 21-23; Columbus, OH, USA; 2008; 147–154.

[16] Nam EJ, Han Y, Mueller K, Zelenyuk A and Imre D. ClusterSculptor: a visual analytics tool for high-dimensional data. In: Ribarsky W, Dill J, editors. Proceedings of the IEEE Symposium on Visual Analytics Science and Technology; 2007 October 30 - November 1; Sacramento, CA, USA; 2007; 75–82.

[17] Kohonen T. The self-organizing map. Neurocomputing 1998; 21(1–3): 1–6.

[18] Friedman JH and Tukey JW. A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers 1974; 23(9): 881–890.

[19] Koren Y and Carmel L. Robust linear dimensionality reduction. IEEE Transactions on Visualization and Computer Graphics 2004; 10(4): 459–470.

[20] Engel D, Rosenbaum R, Hamann B and Hagen H. Structural decomposition trees. Computer Graphics Forum 2011; 30(3): 921–930.

[21] Kandogan E. Visualizing multi-dimensional clusters, trends, and outliers using star coordinates. In: Lee D, Schkolnick M, Provost M, Srikant R, editors. Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2001 August 26-29; San Francisco, CA, USA; 2001; 107–116.

[22] Asimov D. The grand tour: a tool for viewing multidimensional data. SIAM Journal on Scientific and Statistical Computing 1985; 6(1): 128–143.

[23] Williams M and Munzner T. Steerable, progressive multidimensional scaling. In: Ward M, Munzner T, editors. Proceedings of the 10th IEEE Symposium on Information Visualization; 2004 October 10-12; Austin, TX, USA; 2004; 57–64.

[24] Ivosev G, Burton L and Bonner R. Dimensionality reduction and visualization in principal component analysis. Analytical Chemistry 2008; 80(13): 4933–4944.

[25] Jeong DH, Ziemkiewicz C, Fisher B, Ribarsky W and Chang R. iPCA: an interactive system for PCA-based visual analytics. Computer Graphics Forum 2009; 28(3): 767–774.

[26] Seo J and Shneiderman B. A rank-by-feature framework for unsupervised multidimensional data exploration using low dimensional projections. In: Ward M, Munzner T, editors. Proceedings of the 10th IEEE Symposium on Information Visualization; 2004 October 10-12; Austin, TX, USA; 2004; 65–72.

[27] Artero AO, de Oliveira MCF and Levkowitz H. Enhanced high dimensional data visualization through dimension reduction and attribute arrangement. In: Banissi E, Burkhard RA, Ursyn A, et al., editors. Proceedings of the 10th International Conference on Information Visualization; 2006 July 5-7; London, UK; 2006; 707–712.

[28] Sips M, Neubert B, Lewis JP and Hanrahan P. Selecting good views of high-dimensional data using class consistency. Computer Graphics Forum (Proc. EuroVis 2009) 2009; 28(3): 831–838.

[29] Tatu A, Albuquerque G, Eisemann M, Bak P, Theisel H, Magnor M and Keim D. Automated analytical methods to support visual exploration of high-dimensional data. IEEE Transactions on Visualization and Computer Graphics 2011; 17(5): 584–597.

[30] Albuquerque G, Eisemann M, Lehmann DJ, Theisel H and Magnor M. Improving the visual analysis of high-dimensional datasets using quality measures. In: MacEachren A, Miksch S, editors. Proceedings of the IEEE Symposium on Visual Analytics Science and Technology; 2010 October 25-26; Salt Lake City, UT, USA; 2010; 19–26.

[31] Ferdosi BJ and Roerdink JB. Visualizing high-dimensional structures by dimension ordering and filtering using subspace analysis. Computer Graphics Forum 2011; 30(3): 1121–1130.

[32] Yang J, Ward MO and Huang S. Visual hierarchical dimension reduction for exploration of high dimensional datasets. In: Bonneau GP, Hahmann S, Hansen CD, editors. Proceedings of the Eurographics/IEEE TCVG Symposium on Visualization; 2003 May 26-28; Grenoble, France; 2003; 19–28.

[33] Yang J, Peng W, Ward MO and Rundensteiner EA. Interactive hierarchical dimension ordering, spacing and filtering for exploration of high dimensional datasets. In: Munzner T, North S, editors. Proceedings of the IEEE Symposium on Information Visualization; 2003 October 19-21; Seattle, WA, USA; 2003; 105–112.

[34] Schreck T, von Landesberger T and Bremm S. Techniques for precision-based visual analysis of projected data. In: Park J, Hao MC, Wong PC, Chen C, editors. Proceedings of the IS&T/SPIE Conference on Visualization and Data Analysis; 2010 January 18-21; San Jose, CA, USA; 2010.

[35] Ingram S, Munzner T, Irvine V, Tory M, Bergner S and Möller T. DimStiller: workflows for dimensional analysis and reduction. In: MacEachren A, Miksch S, editors. Proceedings of the IEEE Symposium on Visual Analytics Science and Technology; 2010 October 25-26; Salt Lake City, UT, USA; 2010; 3–10.

[36] Srinivas N and Deb K. Multiobjective optimization using nondominated sorting in genetic algorithms. Evolutionary Computation 1994; 2(3): 221–248.

[37] Fonseca CM and Fleming PJ. Genetic algorithms for multiobjective optimization: formulation, discussion and generalisation. In: Forrest S, editor. Proceedings of the 5th International Conference on Genetic Algorithms; 1993 July 17-21; Urbana-Champaign, IL, USA; 1993; 416–423.

[38] Adams SE, Lloyd AM, Brading MG, Cox TF, Taylor D and Quince C. Measurement of bacterial diversity using 454-sequencing and Oral Microarray (HOMIM). Presented at the International Association for Dental Research General Session; 2010 July; Barcelona, Spain; 2010.

[39] Asuncion A and Newman DJ. UCI Machine Learning Repository. http://www.ics.uci.edu/~mlearn/MLRepository.html (2007, accessed February 2010).

[40] Nagesh H, Goil S and Choudhary A. Adaptive grids for clustering massive data sets. In: Grossman R, Kumar V, editors. Proceedings of the First SIAM International Conference on Data Mining; 2001 April 5-7; Chicago, IL, USA; 2001.

[41] Agrawal R, Gehrke J, Gunopulos D and Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. In: Tiwary A, Franklin M, editors. Proceedings of the ACM SIGMOD International Conference on Management of Data; 1998 June 1-4; Seattle, WA, USA; 1998; 94–105.

[42] Agrawal R and Srikant R. Fast algorithms for mining association rules. In: Bocca JB, Jarke M, Zaniolo C, editors. Proceedings of the 20th International Conference on Very Large Data Bases; 1994 September 12-15; Santiago de Chile, Chile; 1994; 487–499.

[43] Rodgers JL and Nicewander WA. Thirteen ways to look at the correlation coefficient. The American Statistician 1988; 42(1): 59–66.

[44] Myers JL and Well AD. Research design and statistical analysis. 3rd ed. New York: HarperCollins Publishers Inc, 1991.

[45] Wackerly DD, Mendenhall W and Scheaffer RL. Mathematical statistics with applications. 7th ed. Southbank: Thomson Learning, Inc, 2008.

[46] Kendall M, Stuart A and Ord JK. Kendall's advanced theory of statistics, vol. 1, distribution theory. 5th ed. London: Charles Griffin & Company Limited, 1987.

[47] Draper NR and Smith H. Applied regression analysis. 2nd ed. New York: John Wiley & Sons, Inc, 1981.

[48] Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B and Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research 2003; 13(11): 2498–2504.

[49] Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Peña AG, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D, Koenig JE, Ley RE, Lozupone CA, McDonald D, Muegge BD, Pirrung M, Reeder J, Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J and Knight R. QIIME allows analysis of high-throughput community sequencing data. Nature Methods 2010; 7(5): 335–336.