Joint Statistical Meetings - Section on Statistical Graphics
SELECTING AMONG CATEGORIES: Interactive Statistical Graphics Working on Databases

Martin Theus
Department of Computational Statistics and Data Analysis, University of Augsburg, 86159 Augsburg, Germany
[email protected]

Key Words: Interactive Statistical Graphics, Direct Manipulation Interfaces, Databases, SQL.

Abstract: Most statistical graphics and statistical methods do not scale well beyond thousands or tens of thousands of observations, but large databases exceed these limits easily. One exception are graphs for visualizing categorical data, e.g. counts represented by barcharts or mosaic plots. Fortunately the data in corporate databases are mostly categorical, which allows a visualization of even millions of records. Obviously classical analysis software is not able to handle files of that size, and an analyst is tempted to dump only a subgroup of the data in order to use his/her analysis tool of choice. But choosing the right subset of the data for a particular analysis is usually a hard task. This paper highlights how to work on large databases by facilitating displays, selection tools and techniques for categorical data. These techniques require a careful design of both the data in the database and the access via SQL queries. Using a two level data access (do not extract data from the database until the subset is small enough to handle) combined with hot set selections (as implemented in DataDesk), the analyst can work seamlessly on even very large databases within one tool.
1. The Curse of Flat Files
Statistics, especially mathematical statistics, grew up about 100 years ago. In those days data sets were necessarily very small, because any processing of the data was done manually. When statistical computing arose, mathematical statistics was complemented with EDA methods and lost its predominance. Still, electronic storage was very expensive and data set sizes were very limited. Since statistics is traditionally a topic "owned" by mathematics and not by computer science, the primary focus of statistics is on the underlying mathematical theory and less on the computing skills. This is certainly the right balance. Nevertheless, when designing and writing statistical software, a profound knowledge of computer science is necessary. So far, statistical software is usually good at reading proprietary data formats or at importing flat ASCII files. Computational statistics unfortunately ignored the presence of data sitting in databases for too long.

In the early 90s, the number of huge databases grew more and more. These data were mostly collected electronically or entered at distributed locations. Governmental agencies, big retailers or online traders now have to deal with huge databases, which are hard to analyze beyond simple summaries. At this point computer scientists took advantage of the lack of database knowledge among statisticians and "invented" a discipline called KDD (Knowledge Discovery in Databases). But is KDD a discipline competing with statistics? Since KDD is about gathering data, sampling data, experimental design, analyzing data, modeling, discovery, visualization, forecasting and classification, it "lives" in the statistics domain. A lack of appropriate statistical tools has brought computer scientists into play. As Daryl Pregibon [4] put it: "KDD = Statistics at Scale and Speed".

The Datamining Process

Although there are several "all-in-one" datamining tools on the market, the typical "datamining process" (of a statistician) is often still performed in the following steps, as illustrated in Figure 1.
1. Define the SQL query, which (probably) gets all the data out of the database we want to analyze.
2. Export the data to a flat file.
3. (Optionally convert the flat file to a format the data analysis tool is inclined to import.)
4. Import the data into the data analysis tool.
Figure 1: The usual datamining process. (Define Query, e.g. SELECT age, wage, gender, product_id, … FROM FCT_ALL_CUSTOMERS WHERE contract_length = 24 AND …, then Export Data, Convert Data, Import Data, Analyze Data, with a Redefine Query loop back; the first steps live in the computer science domain, the analysis in the statistics domain.)
5. Analyze the data in your preferred tool.

At the end of this process we might find out that we did not get all relevant data out of the database, so we are forced to redefine the query and redo the whole process. The same holds if the underlying data in the database change and we need to update the results. Although more and more packages can retrieve data from databases, they usually load all the data into the program's memory, running into the typical efficiency problems.

Interactivity vs. Databases

Exploratory Data Analysis (EDA) is an interactive process by its very nature. Thus software supporting EDA must help to enable this interactive process. When working with data inside databases, the usual interaction is "to wait". To achieve fast response times from databases, one must take a lot of care in setting up indexes and optimizing queries. But this is an operation which must take place inside the database and is hard to generalize to arbitrary data sets. Nevertheless, if a data set gets too big, one cannot handle it outside a database. Many graphical representations of data like barcharts, mosaic plots, histograms or even boxplots can be drawn from only a summary of the underlying data. The size of these summaries is constant, does not scale with the size of the data set, and can usually be collected easily with conventional database queries.
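To make this concrete, such a constant-size summary can be fetched with a single GROUP BY query. The following sketch uses Python's sqlite3 module with a tiny made-up customers table standing in for a large corporate database; table and column names are illustrative only.

```python
import sqlite3

# Hypothetical in-memory table; in practice the query runs on the server.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (gender TEXT, age INTEGER)")
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [("m", 34), ("f", 41), ("f", 29), ("m", 52), ("f", 33)])

# A barchart of gender needs only this breakdown; its size stays
# constant no matter how many rows the table holds.
summary = con.execute(
    "SELECT gender, COUNT(*) FROM customers GROUP BY gender ORDER BY gender"
).fetchall()
print(summary)  # [('f', 3), ('m', 2)]
```

The transferred result has one row per category, so the database interface never sees the O(n) raw data.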
2. Isn't the World categorical anyway?
Most variables are measured at only a very limited resolution. E.g. age is usually only recorded in integer values, so a customer database will not hold more than about 80 different levels of age. But 80 categories can still be handled in scrollable barcharts. Furthermore, an analysis might only be interested in certain age groups like (0, 20], (20, 35], (35, 65] and (65, 100], which would result in a categorical variable with only four classes. A discretization like this must be done very carefully: a solid knowledge of the underlying distribution of age as well as domain knowledge is needed to find sensible class definitions.

Census data are mostly categorical by nature. Information on gender, race, marital status, family type, number of children or education is recorded at a small number of predefined levels. Only income related information is measured on a continuous scale. E.g. the 1995 US Census data set from the book by Freedman et al. [1], available in the UCLA statistics department data set repository, has 23 variables, of which 18 are categorical and only 5 continuous. The data set can be accessed at http://www.stat.ucla.edu/data/fpp/.

If we look at the classical star model in database design, we find a fact table in the center with a lot of attribute data around it. These attributes are assumed to be categorical data as well.
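A sketch of such a discretization carried out inside the database: the age classes above are formed with a CASE expression, so only four counts leave the database (Python sqlite3 with made-up data; names are illustrative).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (age INTEGER)")
con.executemany("INSERT INTO customers VALUES (?)",
                [(18,), (25,), (40,), (70,), (33,), (64,), (81,)])

# Discretize age into (0,20], (20,35], (35,65], (65,100] in the database,
# then count per class -- again a constant-size result.
rows = con.execute("""
    SELECT CASE
             WHEN age <= 20 THEN '(0,20]'
             WHEN age <= 35 THEN '(20,35]'
             WHEN age <= 65 THEN '(35,65]'
             ELSE '(65,100]'
           END AS age_group,
           COUNT(*)
    FROM customers
    GROUP BY age_group
    ORDER BY MIN(age)
""").fetchall()
print(rows)
```

Changing the class boundaries only means changing the CASE expression, not re-exporting any data.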
Figure 2: Two ways of accessing a database. In the left schema the inefficient way is depicted, where all data is retrieved from the database and the summary is calculated within the statistics software. The right schema shows the more efficient way, where the database calculates the summary.
3. Talking to Databases
As already stated above, database connectivity is not very widespread among statistical software packages. Those packages which can retrieve data from a database handle the data internally in the same way as data from flat files. But "getting the data out" is usually not enough: efficiency problems will persist as long as the software accesses data of order O(n), growing with the size of the data set. When plotting barcharts, mosaic plots or histograms, only a summary of the data is needed, and this summary can easily be generated inside the database. Thus only the result of a breakdown has to be retrieved from the database. "Get out only the data you need" is obviously much more efficient, since the result of a query is now of constant size, no matter how big the data set is. These two ways of communicating with databases are depicted in Figure 2. The left example shows the inefficient way, where the database is only used as data storage, much data has to be transferred, and the statistical software has to calculate the summary. The right hand example shows the more balanced model, where only data chunks of constant size are transferred and the database calculates the summary.

Some projects go even further and implement as much of the analysis inside the database as possible. Whereas the idea illustrated above only assumes the use of standard SQL commands to generate statistical summaries, Lang [3] shows how to implement parts of the statistical analysis software R inside the Postgres database system. R functions can then be used as first class SQL functions. Although this is a much tighter coupling of analysis software and database, this solution is not very general: it only connects one specific statistical package to one specific database. Interfaces like ODBC/JDBC are not really fast. Nevertheless, these interfaces are universal and handle the connection to a variety of data sources, and their lack of speed is not a disadvantage as long as we only access summaries instead of complete variables.

In contrast to flat files, which usually are private to the analyst, databases can be accessed by many people and processes. Thus data in the database may change, and consistency problems will arise. E.g. plotting a barchart of gender may result in 988 cases in the bar for male customers. If meanwhile a customer representative has entered 3 new male customers, selecting all male customers in the barchart will result in a highlighted bar of 991 cases, which overshoots the initially drawn bar, which only included 988 cases. There is no obvious solution to this problem, but
several suggestions exist. One suggestion is to give every plot a time stamp, to indicate that it is temporally volatile. But linked highlighting will still show up inconsistencies, as illustrated in the example above. Another way would be to implement a trigger in the database, which flags changes in the data that should result in an update of the dependent plots. The simplest suggestion is to recalculate each plot whenever the highlighting changes, to avoid inconsistencies inside a single plot, although inconsistencies between different plots cannot be avoided with this simple solution.

Figure 3: The left plot shows a conventional scatterplot of 100,000 simulated standard normal vs. uniform [0, 1] data points. The right plot shows the same data binned on a 256 × 256 grid.
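The simplest suggestion, recounting at highlighting time, can be sketched as follows (Python sqlite3; the customers table and the gender bar are made-up stand-ins for the example above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (gender TEXT)")
con.executemany("INSERT INTO customers VALUES (?)", [("m",)] * 988)

def male_bar_height():
    # Always recount in the database instead of caching the value,
    # so bar and highlighting are computed from the same state.
    return con.execute(
        "SELECT COUNT(*) FROM customers WHERE gender = 'm'"
    ).fetchone()[0]

drawn = male_bar_height()  # 988 when the barchart is first drawn
# Meanwhile a customer representative enters 3 new male customers:
con.executemany("INSERT INTO customers VALUES (?)", [("m",)] * 3)
# Recalculating before highlighting avoids a 991 bar overshooting a 988 bar:
highlighted = male_bar_height()
```

This keeps each single plot internally consistent, at the price of an extra query per interaction; plots queried at different times can still disagree.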
4. Making Graphs Work on Categorical Data
All graphs which need only a summary of the data can work well on databases. Barcharts and mosaic plots only need a breakdown of categorical variables. Histograms, which plot continuous data, also only need a breakdown of the data over the different bins of the histogram. Although box plots are based on just a few parameters, these parameters are not easily obtained via standard SQL functions. Additionally, the number of outliers in a box plot grows with the number of cases and thus may cause volume problems in the database interface.

Even more problematic are plots which draw a single glyph for each observation. For these plots, all data points need to be retrieved from the database, which would obviously be very inefficient. One way to escape from this limitation for scatterplots is to introduce binned plots. In a binned plot the raw data are not plotted directly; instead, the data are discretized over a 2-dimensional grid. A binned plot can be seen as a 2-dimensional histogram. To visualize the number of cases in a bin, the gray shade of the bin is used: the more cases, the darker the bin is plotted. The finer the grid, the more closely we approximate the original data. Figure 3 shows a raw scatterplot of 100,000 simulated uniform [0, 1] vs. standard normal points; the right hand plot of Figure 3 shows the corresponding binned plot on a 256 × 256 grid. Whereas the number of points to retrieve and plot in the left plot grows with the number of cases, the number of bins to retrieve and plot in the right plot is constant, depending only on how fine the grid is chosen.

Although binned plots are a good approximation of scatterplots for large amounts of data, we cannot find such a solution for all plots. Parallel coordinate
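A sketch of how such a binned plot could be computed inside the database: the 256 × 256 grid indices are derived in SQL, so only bin counts are transferred (Python sqlite3 with simulated data; the formula assumes the normal values fall roughly in [-4, 4]).

```python
import sqlite3, random

random.seed(1)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pts (x REAL, y REAL)")
con.executemany("INSERT INTO pts VALUES (?, ?)",
                [(random.random(), random.gauss(0, 1)) for _ in range(100_000)])

# Compute the 2-dimensional histogram in the database; the result size
# depends only on the grid, not on the number of points.
bins = con.execute("""
    SELECT CAST(x * 256 AS INTEGER)                AS bx,
           CAST((y + 4.0) / 8.0 * 256 AS INTEGER)  AS by,
           COUNT(*)                                AS n
    FROM pts
    GROUP BY bx, by
""").fetchall()

# Each (bx, by, n) row becomes one gray-shaded bin; n drives the shade.
print(len(bins), "bins for 100,000 points")
```

Refining the grid trades resolution against the (still constant) number of rows transferred.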
plots need every value of every variable in order to be plotted. Additionally, parallel coordinate plots disintegrate into a black band of lines when more than about 10,000 cases are plotted. One solution is then to plot only subgroups of the data if specific plots cannot hold all the data. This leads to the idea of the two level data access.
5. Two Level Data Access
Working on summaries of the data is often too much of a restriction: whenever it is necessary to identify a specific observation, summaries cannot help. Two level data access borrows the concept of "hot selection sets", as found in DataDesk Version 6.1 [7], and works as follows. As long as the user works with the complete data set, only plots are available which can work on summaries of the data, i.e. barcharts, mosaic plots, histograms and binned plots. All other plots are only available for the currently selected subset. If the number of points currently selected is small enough, i.e. below a certain threshold which depends on the plot, the graph can be plotted for this subset. This procedure raises the question of whether it should be permissible to make selections in the second level plots, which would introduce a second set of selected points.

Figure 4 shows an example of two level data access. The complete data set is a subsample of the 1990 census data, which can be downloaded at http://www.ipums.umn.edu/. The three upper plots select all cases from California, being older than 65 years and having more than 2 children. This selection criterion is met by 304 cases out of the 2,500,052 cases in the complete sample. For these 304 cases a scatterplot can easily be plotted and interpreted. The selection condition is automatically composed by the software and reads as follows:

... WHERE (age >= 65 AND age < 75 OR age >= 75 AND age < 85 OR age >= 85 AND age < 95)
      AND (stateicp = 'California')
      AND (nchild = '3' OR nchild = '4' OR nchild = '5' OR nchild = '6' OR nchild = '7' OR nchild = '8' OR nchild = '9')

Obviously this query can be optimized to:

... WHERE age >= 65 AND stateicp = 'California' AND nchild > 2
Figure 4: Two level data access, from 2.5 million cases down to 304.

Optimization is much easier when performed separately for each kind of plot. The main goal of optimization is an increase in query speed; e.g. the optimization above already yields a speed-up by a factor of 2.
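A minimal sketch of the two level mechanism, under the assumption of a census-like table (table, column names and threshold are illustrative): the selection is kept as predicates, counted first, and raw cases are fetched only when the count is below the plot's threshold.

```python
import sqlite3

THRESHOLD = 1000  # plot-dependent limit for second level plots

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE census (age INTEGER, state TEXT, nchild INTEGER)")
con.executemany("INSERT INTO census VALUES (?, ?, ?)",
                [(70, "California", 3), (30, "Texas", 1),
                 (68, "California", 4), (80, "California", 0)])

# Parametric selection: kept as predicates, not as case ids,
# so it translates directly into a WHERE clause.
selection = ["age >= 65", "state = 'California'", "nchild > 2"]
where = " AND ".join(selection)

# First level: only a count is retrieved.
n = con.execute(f"SELECT COUNT(*) FROM census WHERE {where}").fetchone()[0]
if n <= THRESHOLD:
    # Second level: the subset is small enough for a scatterplot,
    # so the raw cases may now leave the database.
    cases = con.execute(
        f"SELECT age, nchild FROM census WHERE {where}").fetchall()
```

Only the second query is O(subset size); the complete data never leave the database.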
6. Implementation in Mondrian
Most of the concepts discussed in this paper are implemented in the software Mondrian, see http://stats.math.uni-augsburg.de/Mondrian. Two points are essential for the implementation of an advanced use of database connections in an interactive statistical graphics tool. The first is the way data are selected. If the selection mechanism relies on a casewise selection, i.e. a selection is converted into a set of case ids, this mechanism cannot easily be translated into a database query. Mondrian instead keeps a parametric representation of the user selection of
data. This parametric description can easily be converted into the WHERE clause of an SQL statement, as presented in the last section. A more detailed description of the advanced selection mechanism can be found in Hofmann [2], Theus et al. [5] and Theus [6].

A second challenge in using databases as the data source for interactive statistical graphics is the way data are accessed. Obviously the software should be able to work on both kinds of data sources: flat files and databases. The usual way to handle various data sources is to have abstract data classes. Mondrian supports a hierarchy of three data classes: a table is accumulated data of a dataset, and a dataset is a collection of variables.
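The abstract data class idea could look like the following sketch (class and method names are assumptions, not Mondrian's actual API): a plot asks only for a summary, and the concrete class decides whether to scan a flat file held in memory or to push a GROUP BY into the database.

```python
import sqlite3
from collections import Counter

class DataSet:
    """Base class: a collection of variables."""
    def summary(self, variable):
        raise NotImplementedError

class FlatFileDataSet(DataSet):
    """Flat file source: all data sit in memory, summary is an O(n) pass."""
    def __init__(self, rows):
        self.rows = rows                      # list of dicts, one per case
    def summary(self, variable):
        return dict(Counter(r[variable] for r in self.rows))

class DatabaseDataSet(DataSet):
    """Database source: the summary is computed by the database engine."""
    def __init__(self, con, table):
        self.con, self.table = con, table
    def summary(self, variable):
        q = f"SELECT {variable}, COUNT(*) FROM {self.table} GROUP BY {variable}"
        return dict(self.con.execute(q).fetchall())
```

Both sources answer the same call, e.g. ds.summary("gender"); only the database variant keeps the transferred data at constant size.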
When accessing data inside Mondrian, the different functions access data only through the table or dataset class. These classes then automatically select the right data source in the most efficient way. Figure 5 sketches the data abstraction layer in Mondrian.

Figure 5: The data abstraction layer in Mondrian (a Table built from a Data Set, which is a collection of variables Var 1, …, backed by the DB).

7. Conclusions

Many statistical graphics like barcharts, mosaic plots or histograms only need a summary of the underlying variables, which can easily be calculated inside the database. Although the usually perceived speed of a database and the speed needed for interactive statistical graphics might seem incompatible, interactive statistical graphics works quite well with data in databases. Speed issues are no longer a problem of the front-end software, but depend solely on the database engine. The key issue is not to work on a copy of the whole data from the database, but to get only specific summaries of constant size out of the database. Plots which need more than just summaries can often be modified to work on database queries delivering results of constant size. For plots which cannot be modified, the two level data access still allows the analysis of individual cases.

References

[1] D. Freedman, R. Pisani, and R. Purves. Statistics. Norton & Co., 1997.

[2] H. Hofmann. Selection sequences in MANET. Computational Statistics, 13(1):77–87, 1998.

[3] Duncan Temple Lang. Embedding S in other languages and environments. In K. Hornik and F. Leisch, editors, DSC 2001 Proceedings of the 2nd International Workshop on Distributed Statistical Computing, 2001.

[4] Daryl Pregibon. 2001: A statistical odyssey. KDD Conference '99, 1999.

[5] M. Theus, H. Hofmann, and A. Wilhelm. Selection sequences — interactive analysis of massive data sets. In Proceedings of the 29th Symposium on the Interface: Computing Science and Statistics, 1998.

[6] Martin Theus. Interactive Data Visualization using Mondrian. Journal of Statistical Software, submitted.

[7] Paul F. Velleman. DataDesk Version 6.0 — Statistics Guide. Data Description Inc., Ithaca, NY, 1997.