
A Multi-Dimensional Data Visualization Tool for Knowledge Discovery in Databases

Hing-Yan Lee, Hwee-Leng Ong, Eng-Whatt Toh, and Sieu-Kong Chan
Information Technology Institute, National Computer Board
71 Science Park Drive, Singapore 0511, Republic of Singapore
email: {hingyan, hweeleng, engwhatt, sieukong}@iti.gov.sg

Abstract

This paper describes a multi-dimensional visualization (MDV) technique, a refined and improved form of a computational geometry method known as parallel coordinates, that has been realized in WinViz. It shows how MDV differs significantly from parallel coordinates as originally conceived by Alfred Inselberg. The paper also discusses WinViz's underlying design considerations. A primary principle is known as "seeing your data in a single picture." The novel visualization interface of WinViz supports both the display of multi-dimensional data and the interactive, visual formulation of queries. In particular, we compare WinViz's query operators with those in SQL to highlight the salient features realized through the MDV interface. Applications of WinViz include exploratory data analysis and knowledge discovery in databases for a diverse range of domains.

Keywords: AI tool, data visualization, data mining, knowledge discovery in databases, intelligent user interfaces

1. Introduction

With the pervasive spread of computers in every facet of business and industry, data is being routinely and almost mindlessly captured. Once it falls outside the current window of interest (e.g., producing this month's bank statement), the collected data is archived, often never to be used again. Knowledge discovery in databases (KDD), also known as data mining, has emerged as a popular solution to the data glut problem that plagues organizations that routinely capture data but have little time or inadequate tools to understand it sufficiently.

The process of KDD can be viewed as a succession of stages, as depicted in Figure 1. After the data has been prepared in some appropriate file format and rendered consistent (e.g., by removing noisy data, filling in missing values, and deleting hopelessly corrupted records), trends and patterns can be generated, from which hypotheses can be elicited for interpretation and analysis.

Figure 1: Knowledge Discovery in Databases Process (stages, in order: Data Preparation, Data Cleaning, Generate Hypotheses, Interpretation & Analysis)

Research in KDD to date has focused largely on the adaptation of machine learning techniques, with some emphasis on statistical analysis. The use of visualization technology has largely been dismissed as gimmicky in KDD work. This paper describes a multi-dimensional visualization (MDV) technique that allows a user to visually examine a tabular database and to formulate queries interactively and visually. In particular, the MDV technique has been used in exploratory data analysis and KDD.

MDV has been realized in a tool named WinViz, with which a user can interactively explore a database and identify patterns and trends hidden in the data. The user can uncover new properties of the data and detect any deviations. We get a picture of 'what the data is trying to tell us'; the observations can then be confirmed using conventional statistical analysis.

2. Multi-Dimensional Data Visualization

WinViz has been developed to exhibit the power of applied business visualization techniques on today's voluminous data repositories. Its primary contribution lies in improving the multi-dimensional visualization of data using parallel coordinates [4, 5]. Numerous methods of "graphing" data in two, three, and even four dimensions exist, but the literature reveals little information on a general method that permits visualization of multivariate relationships (as opposed to merely relationships between small subsets of variables) in N-dimensional space. The technique used supports visualizing p-dimensional linear objects in N-dimensional space using a 2-dimensional display [2, 3].

Parallel coordinates, as conceived by Alfred Inselberg, can be described as follows: "In parallel coordinates, the principal coordinate axes are parallel and equidistant to each other. That is, for an N-dimensional data set, N vertical axes are placed on a plane, so that every two successive axes are one unit apart from each other. An N-dimensional data entry is represented by a broken (polygonal) line whose vertices lie on the parallel axes and whose height (y-position) is determined by the entry's attributes, i.e., the value of the first attribute determines the height of the vertex placed on the first vertical axis, etc. A data set, then, is represented by a collection of these polygonal lines" [1].
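The mapping described in the quotation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not WinViz's implementation: the function names are hypothetical, the sample records are made up, and scaling each axis to the range [0, 1] is an assumption, since the quotation only says that the height is determined by the attribute value.

```python
from typing import Sequence


def axis_ranges(records: Sequence[Sequence[float]]):
    """Return the per-attribute minimum and maximum over all records."""
    columns = list(zip(*records))
    return [min(c) for c in columns], [max(c) for c in columns]


def polyline(record, lows, highs):
    """Vertices of the polygonal line for one N-dimensional record.

    Axis i is placed at x = i (successive axes one unit apart); the
    height on each axis is the attribute value scaled to [0, 1]
    (the scaling is an assumption made for this sketch).
    """
    vertices = []
    for i, (value, lo, hi) in enumerate(zip(record, lows, highs)):
        y = 0.5 if hi == lo else (value - lo) / (hi - lo)
        vertices.append((float(i), y))
    return vertices


# Made-up three-attribute records, purely for illustration.
records = [(1960, 10.0, 30.0), (1970, 20.0, 20.0), (1980, 30.0, 40.0)]
lows, highs = axis_ranges(records)
lines = [polyline(r, lows, highs) for r in records]
print(lines[0])  # [(0.0, 0.0), (1.0, 0.0), (2.0, 0.5)]
```

Each record becomes one polygonal line with N vertices, so a data set of M records is drawn as M such lines over the same N axes.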

Figure 2: Inselberg's parallel coordinates for USA employment for females in the 16-19 age group between 1960-1981 (taken from [1]); axis labels: January, June, ID (Year)

Figure 2 shows that a dependency exists between the ID axis and the other two axes. Although some negative correlations exist in the data, the number of employed women generally rises over time.

The MDV version of parallel coordinates differs from Inselberg's in several respects:
a) While the polygonal lines exist, they no longer play a significant role. WinViz supports a polygonal line display toggle whereby the user can choose to have a polygonal line represent either a single tuple (or database record) or several tuples satisfying the attribute (or dimension) values specified by the user.
b) Group bars appear in place of attribute values on each vertical axis. The group bars reduce the clutter of lines when the dataset gets too big.
c) The concept of class is introduced, where a class is a subset of the data. The user can group subjects of interest into classes, and see how they are represented in the dataset and how they correlate with other attributes in the data.
d) Horizontal histograms are provided on the right-hand portion of group bars when the data is divided into classes. This allows the user to compare classes across attributes.

3. The WinViz Interface

The MDV technique accepts conventional tabular data as input. Each column is treated as an attribute, which can be either discrete or continuous. For purposes of illustration, a dataset has been downloaded from the Machine Learning Repository maintained by the University of California at Irvine. Each record in this credit screening dataset captures information on a past credit application (Table 1).

Attribute                                 Name          Type/Domain Value
application status                        Granted       {Yes, No}
job status                                Jobless       {Yes, No}
item purchased                            Bought        {Stereo, PC, Bike, MediInstru, Jewel, Furniture, Car}
applicant's gender                        Sex           {Male, Female}
applicant's age                           Age           continuous
marital status                            Married       {Yes, No}
housing area                              Housing_Area  {Good, Bad}
savings                                   Savings       continuous
monthly loan repayment                    Loan          continuous
no. of months to repay                    Repay_Period  continuous
no. of years employed at current company  Months        continuous

Table 1: Credit screening dataset

3.1 MDV Display

An MDV display is divided into three main regions: the workspace region, the total population region, and the status region (see Figure 3). The attribute labels correspond to the attribute names. The initial order in which the attributes appear depends only on their order in the dataset, i.e., the leftmost attribute in the dataset is the leftmost attribute in the display.

The workspace region displays the data graphically. The values of each database attribute are represented by rectangles, known as group bars, on a vertical axis in the workspace region. For example, SEX is a discrete attribute with two values, Male and Female, and hence has two group bars. AGE is a continuous attribute that can be discretized into different age group bars. The width of each group bar corresponds to the number of records with the value of the group bar. The height of the group bar has no significance.

Figure 3: MDV display regions (attribute labels, workspace region, total population region, status region)

Figure 4: Display of credit screening dataset

When the credit screening dataset (or a subset) is loaded into WinViz, the MDV display appears as in Figure 4. We can see that there are more successful applicants (indicated by the width of the group bar GRANTED=Yes) than rejected ones (GRANTED=No). The display also allows us to deduce that most credit applicants are in the younger and middle age groups (indicated by the widths of the group bars for AGE, where the age value increases up the vertical axis). Immediately upon initial data loading, we already have a visual overview of the record distribution. This information is difficult to obtain using other techniques or tools.

The total population region displays information with respect to the entire dataset. Initially, it is empty. Its values change according to the queries or classes created.

The status region provides statistical information on the group bar that the mouse cursor is pointing at. For example, when the mouse cursor (represented by the arrow) is placed over the group bar SEX=Male, the status region reports that 65 of the 125 applicants (52%) are male (Figure 4).
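The relation between group bars and the status-region statistic can be illustrated with a short Python sketch. It is a hedged illustration only: `group_bars` and `status` are hypothetical names, and the records are stand-ins constructed to reproduce the 65-of-125 figure quoted above (SEX has only two values, so the remaining 60 records are Female).

```python
from collections import Counter


def group_bars(records, attribute):
    """Count records per value of a discrete attribute (one group bar each).

    In the unnormalized display, a bar's width is proportional to this count.
    """
    return Counter(r[attribute] for r in records)


def status(records, attribute, value):
    """Text of the kind the status region reports for one group bar."""
    count = group_bars(records, attribute)[value]
    total = len(records)
    return f"{count}/{total} ({100 * count / total:.0f}%)"


# Stand-in records matching the counts quoted in the text.
records = [{"SEX": "Male"}] * 65 + [{"SEX": "Female"}] * 60
print(status(records, "SEX", "Male"))  # 65/125 (52%)
```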

3.2 MDV Display Operations

To facilitate data analysis via visualization, several other operations are provided; they include the following:
a) It is unnecessary to load all the attributes of a dataset into WinViz. After all, the human mind can only handle a limited number of attributes and their inter-relationships at any one time. Studies such as [6] have shown that the magic number is 7, plus or minus 2. WinViz will, however, display all attributes (regardless of number) within the physical screen size, if so desired.
b) Data mining and exploratory data analysis are very much dialectic processes. As the user moves through the dataset, he takes new attributes into consideration or discards others. WinViz supports the interactive and selective addition and dropping of attributes as the user includes them in, or excludes them from, his analysis.
c) Where a detailed or focused view of a certain attribute value range is desired, the zoom-in and unzoom operations can be used.
d) The order of attribute display may be changed by clicking on an attribute label and dragging it to its intended position on the MDV display.
e) Attribute values may be sorted in ascending, descending, or a user-specified order to bring out trends with respect to a particular ordering of attribute values.
f) A continuous attribute such as AGE can be partitioned into several group ranges such as lower, middle, or upper age group. This allows the user to identify patterns that may be particular to a specific group range.
g) By default, the workspace is initially displayed in unnormalized mode, in which the width of each group bar represents the actual number of records in the group. In normalized mode, all the group bars for an attribute have the same size. Normalized mode is useful when the user is interested in comparing percentages between groups for an attribute.
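Operations f) and g) lend themselves to a small sketch. The Python below is an illustration under assumptions rather than WinViz's code: the function names, the sample ages, and the cut points 30 and 50 are all hypothetical, chosen only to show how a continuous attribute might be partitioned into group ranges and how bar widths differ between the two display modes.

```python
from bisect import bisect_right


def partition(value, cut_points, labels):
    """Map a continuous value to a group-range label.

    cut_points must be ascending; len(labels) == len(cut_points) + 1.
    """
    return labels[bisect_right(cut_points, value)]


def bar_widths(counts, normalized=False):
    """Relative widths of one attribute's group bars.

    Unnormalized: proportional to the number of records in each group.
    Normalized: every group bar gets the same width.
    """
    if normalized:
        return {label: 1.0 for label in counts}
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical ages and boundaries, for illustration only.
ages = [19, 23, 31, 38, 44, 52, 67, 29]
labels = ["lower", "middle", "upper"]
cuts = [30, 50]

counts = {}
for age in ages:
    group = partition(age, cuts, labels)
    counts[group] = counts.get(group, 0) + 1

print(counts)                               # {'lower': 3, 'middle': 3, 'upper': 2}
print(bar_widths(counts))                   # widths follow the record counts
print(bar_widths(counts, normalized=True))  # all widths equal
```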

3.3 Design Principles

Several principles underlie the design of the MDV interface in WinViz; they include:

Seeing your data in a single picture. This does not merely refer to a visual and graphical display of the data, as the picture metaphor would typically conjure. It also imposes the requirement of displaying all the specified attributes within the confines of the physical screen area. Thus the user continues to have a complete view in which to detect and identify trends, patterns, and relationships. A corollary of this principle is obviating the use of panning and scrolling over an MDV display.

Same interface for data display and for query formulation. This means that the data display interface has to be harnessed for query formulation. Further discussion is given in Section 4.

Assumes minimal knowledge of statistics. Only an appreciation of the notion of percentage is needed to understand and use MDV. Knowledge of intricate and complex statistical concepts is useful but unnecessary.

Requires no knowledge of a query language. To manipulate and interact with data, the user should not have to remember query language syntax and commands such as those in SQL.

Several shortcomings result from the above design decisions and principles:
a) When the number of attributes is huge, WinViz will display the vertical axes and associated group bars corresponding to all the selected attributes even though they may appear very small. In such cases, zooming in on and unzooming from an attribute axis can be used to obtain a detailed view.
b) The existence of horizontal histograms (bar charts) on the right side of group bars arises from using the vertical axes of parallel coordinates. For many users, comparison using horizontal histograms is less intuitive than using vertical histograms (column charts).

4. Visual Query Capability

Parsaye and Chignell (1993) categorize query interfaces into several basic classes: command oriented (such as dBase, SQL), table or form based (such as QBE, Paradox), graph oriented (such as Metaphor, Quest), icon based (such as Iconix Query), and hypertext oriented (such as Iconix Query). The paradigm that is embedded in WinViz does not fit into any of these categories. In [7], the authors indicate that a visual query system should be interactive; be progressive instead of nested; clearly distinguish between information selection and manipulation; and represent a query as an analog of a record rather than a table. We now compare and contrast the capabilities of WinViz with these characteristics.

4.1 Interactive query

WinViz enables the identification of relationships among the different database attributes through visual query. The MDV display interface for multi-dimensional data is also the database query interface. WinViz is in query mode when the mouse cursor changes from an arrow to a question mark (see Figure 5) as the cursor is moved near any of the attribute axes. A user can formulate a query interactively using a point-and-click metaphor. There is no need to learn any command-based query language. In SQL, the result of a query is a relation. In WinViz, query results can be visualized by observing the shaded group bars or the polygonal lines.

A visual query in WinViz is made by clicking the group bars corresponding to the conditions used in the query. For example, clicking the mouse cursor over the group bar where JOBLESS=No selects all records relating to non-jobless applicants (Figure 5). A hatched patch appears around the group bar, indicating the query condition chosen. Such a query causes the group bars for all the other attributes to become partially shaded, indicating the number of records satisfying the current query. In normalized mode, where the sizes of all the group bars are the same and the shading in each group bar corresponds to the percentage of records within the group that satisfy the query, the relationship between the attributes can be identified. In this case, we observe that 97.65% (third box of the status region of Figure 5) of successful applicants have jobs, indicating that employment is an important requirement for credit approval.

Figure 5: Query on applicants who are employed, normalized mode

Specifying primary key. Since the key attribute values are unique, a key can be specified by clicking the mouse when the cursor is over that value on the vertical axis representing the key attribute.

Specifying conditions. Attribute values are selected by clicking the mouse cursor on the group bar labeled with the value. To specify a range of values, the mouse cursor is dragged over the desired value range on the vertical axis representing the database attribute in question.

Compound queries. When the user clicks on several group bars, a compound query is formulated. The effects of OR and AND conditions can be achieved, as illustrated by the following examples. AND conditions are permitted across different attributes. Figure 6 shows the query (JOBLESS=Yes) AND (SEX=Female) AND (GRANTED=Yes). In this case, the attributes are JOBLESS, SEX, and GRANTED. OR conditions are allowed across the values of an attribute, e.g., (BOUGHT=Stereo) OR (BOUGHT=PC).

A query involving a combination of AND and OR has to be handled using the class concept. For example, the query ((JOBLESS=Yes) AND (BOUGHT=Bike)) OR ((JOBLESS=No) AND (BOUGHT=Car)), when formulated directly by selecting the group bars, will also cover the cases ((JOBLESS=Yes) AND (BOUGHT=Car)) and ((JOBLESS=No) AND (BOUGHT=Bike)), resulting in a bigger and incorrect answer. The right formulation is to create two classes: one for ((JOBLESS=Yes) AND (BOUGHT=Bike)) and another for ((JOBLESS=No) AND (BOUGHT=Car)). The user then uses a merge function to join the two classes together. Further research is being conducted to improve the formulation of compound queries.
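The compound-query semantics just described (OR across the values of one attribute, AND across attributes, and class merging for mixed queries) can be sketched as follows. This is a minimal model under assumptions, not WinViz's internal logic: `run_query` and `merge` are hypothetical names, classes are modeled simply as lists of record indices, and the five records are made up to exercise each case.

```python
def run_query(records, selection):
    """Return the indices of records matching a selection of group bars.

    selection maps attribute -> set of clicked values. Values of one
    attribute are OR-ed; different attributes are AND-ed.
    """
    return [
        i for i, record in enumerate(records)
        if all(record[attr] in values for attr, values in selection.items())
    ]


def merge(*classes):
    """Merge classes (lists of record indices) into one, without duplicates."""
    return sorted(set().union(*classes))


# Made-up records covering each JOBLESS/BOUGHT combination of interest.
records = [
    {"JOBLESS": "Yes", "BOUGHT": "Bike"},
    {"JOBLESS": "Yes", "BOUGHT": "Car"},
    {"JOBLESS": "No", "BOUGHT": "Bike"},
    {"JOBLESS": "No", "BOUGHT": "Car"},
    {"JOBLESS": "No", "BOUGHT": "PC"},
]

# Clicking all four group bars directly also admits the two cross cases.
direct = run_query(records, {"JOBLESS": {"Yes", "No"}, "BOUGHT": {"Bike", "Car"}})

# The class-based formulation: one class per AND term, then a merge.
class_a = run_query(records, {"JOBLESS": {"Yes"}, "BOUGHT": {"Bike"}})
class_b = run_query(records, {"JOBLESS": {"No"}, "BOUGHT": {"Car"}})
intended = merge(class_a, class_b)

print(direct)    # [0, 1, 2, 3]
print(intended)  # [0, 3]
```

The direct selection returns four records where only two were intended, which is the "bigger and incorrect answer" the text warns about.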

4.2 Progressive query formulation

SQL queries can be nested using the where clause. Conceptually, nesting can be perplexing. An alternative approach that achieves the same result is to formulate a succession of queries, with the output of one being the input of the next. Such an approach is adopted in WinViz. As the user drills down into the database through successive queries, the records of interest are identified by choosing an option to display the selected records. Polygonal lines are drawn to show the records satisfying the current query.

Figure 6 shows the query result for jobless females who are granted credit, as indicated by the hatched patches on the attribute values (GRANTED=Yes) AND (JOBLESS=Yes) AND (SEX=Female). Displaying the selected records (indicated by the polygonal lines) shows that only two applicants (evident from the statistic 2/125 in the status region) satisfy the query; they are both married and living in good areas. One uses the loan to buy a stereo and the other to buy jewelry, as indicated by the polygonal lines intersecting at BOUGHT=Stereo and BOUGHT=Jewel respectively.

Figure 6: Display of selected records

4.3 Information selection vs manipulation

WinViz also supports classifying the records and comparing classes visually. In Figure 7, two classes are formed for GRANTED=Yes (shown as darker boxes on the right of the axis) and GRANTED=No (shown as lighter boxes on the right of the axis). For all types of items bought except bikes, the chances of getting credit approval are higher, as is evident from the darker boxes being longer than the lighter boxes along the vertical axis for the attribute BOUGHT. We also observe that although the numbers of males and females are about the same (as indicated by the widths of the group bars on the left of the vertical axis for SEX), males are more successful at getting credit approval (as is evident from the longer darker box on the right of the axis at SEX=Male). The manipulation aspect is dealt with in Section 4.5.

Figure 7: Classification

4.4 Query as record analog vs table analog

This requirement is based on the assertion that "while tables are a very good way of visualizing databases, they are not the most natural of visualizing queries. This is because queries are descriptions of the conditions that relevant data must match, rather than a visualization of all the records that meet that description" [7]. The authors argue that forms corresponding to the paper versions used manually in an application domain provide the best way to express queries. This requirement, however, is not applicable to WinViz because KDD and exploratory data analysis are two very unstructured and dialectic activities. The use of application-specific forms would unnecessarily constrain the analysis endeavor.

4.5 Comparison with SQL

We now discuss the common database query operations such as projection, selection, join, and other operators (such as those in SQL) and show how they can be achieved in WinViz.

Projection. The select clause in SQL is used to list the data attributes in the relation representing the query result. This is achieved in WinViz by selecting and/or de-selecting attributes from the MDV display.

Selection. In SQL, the from clause specifies a list of relations to be used in executing the query. WinViz currently supports querying only a single dataset.

Join. The where clause in SQL is often used in queries involving more than one relation through the use of predicates on the attributes of such relations. An in-depth knowledge of the table structure and commands is necessary to formulate a query spanning more than a single relation. Joining relations is non-trivial and is beyond the scope of most except expert SQL users. Because WinViz supports querying only a single dataset, the effects of a join operation have to be explicitly executed. Not supporting the join operator is not necessarily a handicap, as the requirement to think in terms of join paths is more a design shortcoming than something relevant from a user's viewpoint.

Other operators. SQL operators such as average and sort are not supported.

While WinViz does not support many SQL-like operators, it scores well in several areas: there are no commands to remember and to compose; and the visual interface removes the need to remember attribute names, as mistypings and misspellings are common sources of frustration.
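For readers who want to see the SQL side of this comparison concretely, the sketch below expresses the Figure 6 query, (GRANTED=Yes) AND (JOBLESS=Yes) AND (SEX=Female), as an SQL statement run through Python's standard `sqlite3` module. The table name `credit`, the choice of columns, and the four rows are assumptions made up for illustration; they are not the actual credit screening data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE credit (granted TEXT, jobless TEXT, sex TEXT, bought TEXT)"
)
# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO credit VALUES (?, ?, ?, ?)",
    [
        ("Yes", "Yes", "Female", "Stereo"),
        ("Yes", "Yes", "Female", "Jewel"),
        ("Yes", "No", "Male", "Car"),
        ("No", "Yes", "Female", "Bike"),
    ],
)

# The select clause lists the attributes to show, the from clause names
# the relation, and the where clause carries the query conditions.
rows = conn.execute(
    "SELECT bought FROM credit "
    "WHERE granted = 'Yes' AND jobless = 'Yes' AND sex = 'Female'"
).fetchall()
print(sorted(rows))  # [('Jewel',), ('Stereo',)]
conn.close()
```

In WinViz the same query is formulated by clicking three group bars, with no attribute names or keywords to type.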

5. Implementation Status, Future Directions & Conclusion

WinViz has been implemented in C++ for the Microsoft Windows platform on the PC. It currently accepts dBase, Lotus 1-2-3 spreadsheet, and IBI Hold file formats. Work is underway to add to WinViz an intelligent analysis capability involving unsupervised (clustering) and supervised (inductive concept) learning. To achieve the latter, we have integrated the C4.5 machine learning program [8] with WinViz to allow the if-then rules generated by the former to be stepped through and displayed on MDV. Instead of having to manually explore the data using MDV, the user is now provided with a hypothesis generation capability. To handle very large databases, a client-server version of WinViz is also being developed to provide ODBC support and to offload computation to the server. Further work is also being done to handle complex compound queries.

A new and innovative approach to multi-dimensional data visualization (MDV) has been presented. The MDV interface has been designed for the display of multi-dimensional data and also for visual interactive query formulation. We have contrasted MDV with the original parallel coordinates concept. The two key aspects of MDV, display and query formulation, are both visually and interactively based. We also described WinViz's query operators by comparing them to those in SQL.

WinViz has been used in a number of KDD projects to identify patterns and trends hidden in a variety of databases. Compared to KDD tools employing inductive machine learning techniques, WinViz enjoys several advantages. While the former are good for automatically generating generalized knowledge, in the form of decision trees or production rules, the latter's visual multivariate display, visual query capability, and interactivity facilitate data exploration and information presentation. Synergy can be derived by combining the respective strengths of each approach.

Acknowledgments

The contributions of Lee-Hian Quek and Li-Ling Yit to the WinViz project are gratefully acknowledged.

References

1. Tuval Chomut, Exploratory Data Analysis Using Parallel Coordinates, MSc Thesis, UCLA Computer Science Dept., IBM LA Sc. Cen. Rep. No. 1987-2811.
2. J. S. Eickemeyer, Visualizing p-Flats in N-Space Using Parallel Coordinates, PhD Thesis, UCLA Computer Science Dept., 1991.
3. J. S. Eickemeyer, A New Approach to Multi-Dimensional Visualization Using Parallel Coordinates, Proc. IT Works '91, pp. 79-90, Singapore, 1991.
4. A. Inselberg and B. Dimsdale, Parallel Coordinates for Visualizing Multi-Dimensional Geometry, Proc. Computer Graphics Intl. Conf., 1987.
5. A. Inselberg and B. Dimsdale, Parallel Coordinates: A Tool for Visualizing Multi-Dimensional Geometry, Proc. IEEE Conf. on Visualization, pp. 361-378, 1990.
6. G. A. Miller, The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information, Psychological Review, Vol. 63, pp. 81-97, March 1956.
7. K. Parsaye and M. Chignell, Intelligent Database Tools & Applications, John Wiley & Sons, 1993.
8. J. Ross Quinlan, C4.5: Machine Learning Programs, Morgan Kaufmann Publishers, 1993.
