Architectural Perspectives

Generating Synthetic Data to Match Data Mining Patterns

Josh Eno and Craig W. Thompson • University of Arkansas

Synthetic data sets can be useful in a variety of situations, including repeatable regression testing and providing realistic — but not real — data to third parties for testing new software.1,2 Researchers, engineers, and software developers can test against a safe data set without affecting or even accessing the original data, insulating them from privacy and security concerns as well as letting them generate larger data sets than would be available using only real data.

For different purposes, it's desirable for synthetic data sets to exhibit more or less realism, reflecting selected properties of the original data sets. Synthetic data generators or simulations can achieve simple forms of realism using domain sampling within a field or preserving cardinality relationships among fields. Techniques for adding degrees of realism are explored in other work.3 But what about preserving hidden complex patterns within a data set? Using a mapping to transform a real data set into a synthetic one might (or might not) preserve such patterns, but at the risk that, if the transformation can be discovered, the original data set can be recovered. Model-driven generators can only preserve the patterns they encode — and, so far, using today's synthetic data generators, these models have been limited to simpler data distributions.

Practitioners use data mining technology to discover patterns in real data sets that aren't apparent at the outset. If it's desirable for a synthetic data set to realistically exhibit these kinds of hidden patterns, how can we accomplish this?
This column explores how to combine information derived from data mining applications with the descriptive ability of synthetic data generation software. Our goal is to demonstrate that at least some data mining techniques (in particular, a decision tree) can discover patterns that we can then use to inverse map into synthetic data sets. These synthetic data sets can be of any size and will faithfully exhibit the same (decision tree) patterns. Our work builds on two technologies: Synthetic Data Definition Language and Predictive Model Markup Language.

Synthetic Data Definition Language

SDDL, a language developed at the University of Arkansas,2,3 is an XML format that describes data generated by synthetic data generation software. While SDDL doesn't specify the generator, our work uses the Parallel Synthetic Data Generator (PSDG) to generate SDDL-specified data in grid computing environments.3

An SDDL file consists of a database element followed by zero or more pool elements and one or more table elements. Pools provide a flexible way to specify values and the probabilities of those values. For instance, a pool of possible colors at a traffic light might assign weights of 12, 2, and 6 to the choices red, yellow, and green, respectively.
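A minimal sketch of such a pool follows; the element and attribute names here are assumptions about SDDL's schema rather than text taken from its specification, but the weights are those of the stoplight example:

  <!-- assumed SDDL-style markup for a weighted value pool -->
  <pool name="colors">
    <choice name="red" weight="12"/>
    <choice name="yellow" weight="2"/>
    <choice name="green" weight="6"/>
  </pool>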

Weights are automatically normalized and converted to probabilities by the data generator.
In this example, the light would be red, yellow, or green with probability .6, .1, and .3, respectively, reflecting that a stoplight is red more often than it is green because, in this case, it faces a minor street crossing a major one.

Choices can have other attributes beyond weight, which are referenced using an object-like notation. If the stoplight example had a position attribute with values of "top," "middle," and "bottom" for red, yellow, and green, respectively, the formula colors[red].position would resolve to the value "top." Choices can also have pools nested within them, allowing references like states[CA].cities to resolve to one of several cities, each of which might in turn have its own attributes and sub-pools.

Table elements contain one or more field elements and zero or more variable elements. The only difference is that variable elements don't produce a generated value in the output; in all other respects, anything that holds for fields also holds for variables. Each field element contains sub-elements that specify how a value is generated for the field. These can take the form of minimum and maximum values, formulas that operate on other fields or variables, pool references, tiered distributions, queries from a database, or iterations similar to "for" loops. Additionally, a field may reference fields generated earlier in the table, so that a single random pool selection can drive later pool references and all of the fields draw from the same pool choice rather than selecting independently for each field.

One goal of synthetic data generation software is to simulate identified patterns. PSDG natively supports only uniform distributions but has a flexible plug-in mechanism that we have used to implement Gaussian univariate distributions.
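To make the attribute notation concrete, the stoplight pool could carry a position attribute on each choice, along these lines (again, the markup is a sketch; SDDL's actual element and attribute names may differ):

  <pool name="colors">
    <choice name="red" weight="12">
      <attribute name="position" value="top"/>
    </choice>
    <choice name="yellow" weight="2">
      <attribute name="position" value="middle"/>
    </choice>
    <choice name="green" weight="6">
      <attribute name="position" value="bottom"/>
    </choice>
  </pool>

A field could then draw a choice from colors, and a later field could use a formula such as colors[red].position, or a reference to the earlier field's choice, to stay consistent with that selection.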

In addition to SDDL/PSDG, several commercial and experimental synthetic data generators try to model the underlying system as well as simulate various distributions.4,5 These use semantic graphs, database schemas, or text representations to define fields and relationships. Most can generate various statistical distributions, and some can operate in parallel, but none provides a way to inject a pattern discovered through data mining back into a synthetic data set.

Predictive Model Markup Language

PMML is an open standard developed by the Data Mining Group, an independent, vendor-led group (www.dmg.org). Its purpose is to represent and transfer data mining models between software packages. A PMML file consists of a data dictionary followed by one or more mining models.

The data dictionary contains information about field types and data ranges and is independent of any specific mining model. Specific information about the distribution of values in a field is stored in the mining models, not in the data dictionary. Entries in the data dictionary take the form of data field elements with name, optype, and dataType attributes. Name is a string value denoting the field name in the data set. Optype refers to the type of operations that can be done on the field and may be categorical, ordinal, or continuous. The dataType attribute reuses names and semantics from the W3C XML Schema atomic types.6

The mining model schema varies by model. Model types include decision trees, association rules, cluster models, regressions, naïve Bayes, neural networks, rule sets, sequences, text models, and support vector machines. A single PMML file can include multiple models. Additionally, a PMML file can specify output from one model as input for a different model, allowing storage of data transformations.
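For example, the data dictionary for a data set like Iris could declare its fields roughly as follows; the element and attribute names come from the PMML specification, while the field names are illustrative:

  <DataDictionary numberOfFields="5">
    <!-- continuous measurement fields -->
    <DataField name="petal_length" optype="continuous" dataType="double"/>
    <DataField name="petal_width" optype="continuous" dataType="double"/>
    <DataField name="sepal_length" optype="continuous" dataType="double"/>
    <DataField name="sepal_width" optype="continuous" dataType="double"/>
    <!-- categorical class field with its allowed values -->
    <DataField name="species" optype="categorical" dataType="string">
      <Value value="Iris-setosa"/>
      <Value value="Iris-versicolor"/>
      <Value value="Iris-virginica"/>
    </DataField>
  </DataDictionary>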

Converting PMML to SDDL

Our work is a first step toward determining whether we can use patterns found by data mining models and reverse map them back into synthetic data sets of any size that will exhibit the same patterns. Our initial work focuses on PMML decision tree models. We developed an algorithm (described elsewhere7) that scans a decision tree stored as PMML to create an SDDL file that describes the data to be generated. The converter is organized in three layers. The top layer provides a user interface and acts as a driver for the lower levels. It takes user options to determine input and output files, as well as additional information that can be passed to lower levels for further processing. The middle layer handles the PMML parsing, creating model handlers as necessary. The bottom layer contains objects that implement the ModelHandler interface. This mirrors PMML, which provides several different XML specifications for various model types. In our implementation, the ModelHandler interface lets the PMML parser interact with several different mining model handlers without knowing any specifics of the individual models, similar to the Java SAX (Simple API for XML) parser interface.

A TreeModel handler is designed to generate SDDL based on a decision tree classification model. Within the TreeModel element are the mining schema and the tree nodes. The mining schema classifies the fields as active, predicted, or supplementary. The algorithm treats the active and predicted fields differently, while supplementary fields are for internal use by the data mining algorithm and thus are ignored. To convert a PMML tree model to SDDL, the algorithm first performs a depth-first search of the decision tree while collecting the information needed. It then builds Pool and Table objects, which generate the required SDDL.


Field constraints are passed along the paths in the tree down to the leaf nodes, and the fully constrained leaf nodes generate pool nodes. The SDDL table logic selects a node, then constrains its fields based on the auxiliary data and sub-pools of that node. Building the SDDL involves creating table and pool objects, adding them to the database object, and calling the database.writeSDDLFile method.

Experiment

To evaluate the algorithms and implementation, we used two data sets: a simpler flower classification (the Iris data set, described later in this section) and a more complex heart disease data set.7 Both data sets are publicly available, and the Data Mining Group has analyzed both using SPSS Clementine, a commercial data mining software package. Clementine created a decision tree model for each data set, which was stored as a PMML file. We then used these files to test the software using a four-step process:

1. Our parsing software used the PMML file to create an SDDL file.
2. PSDG generated a large data set based on the SDDL file.
3. We loaded the generated data into a relational database for analysis.
4. We analyzed the data through a series of SQL queries, which determined how many rows were generated for each tree node and whether the records generated for a leaf node have the correct ratio of correct to incorrect classifications.

The Iris data set is one of the most often cited data sets in the field of data mining. First published by Ronald Fisher in 1936,8 it consists of 150 measurements of flowers, including petal width, petal length, sepal width, and sepal length, divided evenly among three species.

Because the generating process for the data is the measurement of flowers, it's not surprising that the four measurements are each normally distributed within each species and aren't independent. Of the three species, Iris Setosa is linearly separable from the other two based on petal length, but Iris Virginica and Iris Versicolor aren't linearly separable from each other based on any single attribute.

Figure 1 shows the decision tree model generated for the Iris data set.9 The tree is reasonably simple, with all the splits being binary and a depth of only four levels. The tree has five leaf nodes, with 96 percent (144/150) of the records falling into three of the nodes. The conversion algorithm traverses this tree, propagating active-field predicate constraints from parent nodes down to leaf nodes. Because of the surrogate predicates used, some child predicates contradict constraints imposed by parent nodes. Child node predicates are given priority in such cases.

Each node includes five items of interest for this application: score, record count, child nodes, score distribution, and predicate. An example node from the Iris decision tree PMML (from the middle of the tree, with some extraneous tags removed) illustrates the various parts.

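The following sketch shows the general shape of such a node, using the counts and splits shown for node 3 in Figure 1; the element names follow the PMML TreeModel schema, but the field names and exact markup are illustrative rather than copied from the published file:

  <Node id="3" score="Iris-versicolor" recordCount="54">
    <!-- primary split plus surrogate predicates -->
    <CompoundPredicate booleanOperator="surrogate">
      <SimplePredicate field="petal_length" operator="lessOrEqual" value="1.75"/>
      <SimplePredicate field="petal_width" operator="lessOrEqual" value="4.75"/>
      <SimplePredicate field="sepal_length" operator="lessOrEqual" value="6.15"/>
      <SimplePredicate field="sepal_width" operator="lessOrEqual" value="2.95"/>
    </CompoundPredicate>
    <!-- score distribution: how many training records of each class reached this node -->
    <ScoreDistribution value="Iris-setosa" recordCount="0"/>
    <ScoreDistribution value="Iris-versicolor" recordCount="49"/>
    <ScoreDistribution value="Iris-virginica" recordCount="5"/>
    <!-- child nodes, elided -->
    <Node id="4" score="Iris-versicolor" recordCount="48"> ... </Node>
    <Node id="5" score="Iris-virginica" recordCount="6"> ... </Node>
  </Node>

The five items of interest are all visible here: the score and record count as attributes, the predicate (with surrogates), the score distribution, and the nested child nodes.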

The SDDL file generated for this decision tree is similarly simple. The node pool contains five choices, one for each leaf node. The weights are the record counts for each respective leaf node. For ease of reading the pool, the choice names are the node identifiers from the decision tree rather than sequential integers or some other scheme. Each choice contains eight auxiliary fields, specifying the minimum and maximum for each numeric field. The default minimum for each field is zero, and the default maximum is 10. In most nodes, any given field will have the default value for either the minimum or the maximum. Each choice also includes a sub-pool, which specifies the weights for each species.

The table itself consists of one variable field, four real fields, and one string field. The variable field chooses which node will be used to set minimum and maximum values for the real fields, as well as specifying the pool weights for the species pool. After that, the constraints for the fields are easily expressed in terms of pool references and the node ID variable.
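Heavily abbreviated, the generated SDDL might take roughly the following shape. The structure reflects the description above, but the element and attribute names are assumptions rather than exact SDDL syntax, and only the choice for leaf node 1 (the Setosa leaf) is spelled out:

  <pool name="node">
    <!-- one choice per leaf node; weights are the training record counts -->
    <choice name="1" weight="50">
      <!-- eight auxiliary min/max values; unspecified bounds default to 0 and 10 -->
      <attribute name="petal_length_max" value="2.45"/>
      <pool name="species">
        <choice name="Iris-setosa" weight="50"/>
      </pool>
    </choice>
    <choice name="4" weight="48"> ... </choice>
    <choice name="6" weight="3"> ... </choice>
    <choice name="7" weight="3"> ... </choice>
    <choice name="8" weight="46"> ... </choice>
  </pool>

  <table name="iris" size="150000">
    <variable name="leaf">
      <!-- one weighted draw from the node pool per generated row -->
      <poolref pool="node"/>
    </variable>
    <field name="petal_length" type="real">
      <min>node[leaf].petal_length_min</min>
      <max>node[leaf].petal_length_max</max>
    </field>
    <!-- petal_width, sepal_length, and sepal_width are constrained the same way -->
    <field name="species" type="string">
      <poolref pool="node[leaf].species"/>
    </field>
  </table>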


We chose a length of 150,000 records to create a large data set with easily comparable row counts. To analyze the data, we loaded it into a database and ran a series of queries to determine how well the generated data matched the training data. We used the first set of queries to determine the overall misclassification rate for the new data. In the original data, there was a two percent (3/150) rate of misclassification. Of these, .67 percent (1/150) were Versicolor misclassified as Virginica, while 1.33 percent (2/150) were Virginica misclassified as Versicolor. With the generated synthetic data set, which is a thousand times as large as the original, we would expect around 3,000 misclassifications: 1,000 Versicolor classified as Virginica and 2,000 the opposite. In fact, 1,019 Versicolor rows are misclassified and 1,983 Virginica rows are misclassified, for a total of 3,002 misclassifications. This is well within the expected random variation and results in misclassification rates of .68 percent, 1.32 percent, and 2.00 percent for Versicolor, Virginica, and total misclassifications, respectively.

The second test involved running a series of queries to determine the record counts at each node in the decision tree. This provides a much more granular view of where the misclassifications occur, as well as giving a large set of probabilities to examine for any irregularities that might exist in the data. Tables 1 and 2 summarize the results of this analysis. Table 1 includes the raw record counts for each node. From a quick examination, it's clear that the counts in the generated data are all close to 1,000 times the original record counts. Table 2 lists the record counts divided by total records to give the probabilities. Although the rates from the training data to the generated data do vary slightly, none of the differences is greater than 0.001, indicating the synthetic data set successfully preserves the tree classification pattern of the original data set.

Figure 1. The Iris decision tree model, showing decision predicates and record counts.9 Each node in the figure lists its ID, score, record count, and per-species counts (S: Iris Setosa; Vi: Iris Virginica; Ve: Iris Versicolor; RC: record count), along with predicates on petal length (PL), petal width (PW), sepal length (SL), and sepal width (SW).

Our work shows the viability of using data mining models and inverse mappings to inject realistic patterns into synthetic data sets. In a broader sense, by looking at data analysis techniques as data specification tools, researchers can find a wide range of tools to describe and generate data.

System simulations have relied on simulated statistical distributions to generate values before, but this research provides a technique that could work when data that doesn't conform to simple univariate or multivariate distributions is needed.

Table 1. Iris node record counts.

                  Training data                             Generated data
Node ID   Total   Setosa   Virginica   Versicolor   Total     Setosa   Virginica   Versicolor
0         150     50       50          50           150,000   50,015   50,091      49,894
1         50      50       0           0            50,015    50,015   0           0
2         100     0        50          50           99,985    0        50,091      49,894
3         54      0        5           49           53,905    0        5,030       48,875
4         48      0        1           47           47,890    0        993         46,897
5         6       0        4           2            6,015     0        4,037       1,978
6         3       0        3           0            3,047     0        3,047       0
7         3       0        1           2            2,968     0        990         1,978
8         46      0        45          1            46,080    0        45,061      1,019

Table 2. Iris node probabilities.

                  Training data                             Generated data
Node ID   Total   Setosa   Virginica   Versicolor   Total     Setosa   Virginica   Versicolor
0         1.000   0.333    0.333       0.333        1.000     0.333    0.334       0.333
1         0.333   0.333    0.000       0.000        0.333     0.333    0.000       0.000
2         0.667   0.000    0.333       0.333        0.667     0.000    0.334       0.333
3         0.360   0.000    0.033       0.327        0.359     0.000    0.034       0.326
4         0.320   0.000    0.007       0.313        0.319     0.000    0.007       0.313
5         0.040   0.000    0.027       0.013        0.040     0.000    0.027       0.013
6         0.020   0.000    0.020       0.000        0.020     0.000    0.020       0.000
7         0.020   0.000    0.007       0.013        0.020     0.000    0.007       0.013
8         0.307   0.000    0.300       0.007        0.307     0.000    0.300       0.007

More work is needed. So far, we have addressed only tree models and their simpler form, rule set models. This leaves inverse mappings for nine classifications in the PMML standard as future research. Some, such as regression models, should be relatively easy to address, but models such as neural networks could prove more difficult. On the theoretical side, a real-world data set might exhibit multiple data patterns. Inverting the mappings of several different data mining algorithms simultaneously is a complex constraint satisfaction problem that remains a challenge.

References

1. J. White, "American Data Set Generation Program: Creation, Applications, and Significance," master's thesis, Computer Science and Computer Engineering Dept., Univ. of Arkansas, 2005.
2. J. Hoag and C. Thompson, "A Parallel General-Purpose Synthetic Data Generator," ACM SIGMOD Record, vol. 36, no. 1, 2007, pp. 19–24.
3. J. Hoag, "Synthetic Data Generation," doctoral dissertation, Computer Science and Computer Engineering Dept., Univ. of Arkansas, 2008.
4. P. Lin et al., "Development of a Synthetic Data Set Generator for Building and Testing Information Discovery Systems," Proc. 3rd Int'l Conf. Information Technology: New Generations, IEEE CS Press, 2006, pp. 707–712.
5. K. Houkjaer, K. Torp, and R. Wind, "Simple and Realistic Data Generation," Proc. 32nd Int'l Conf. Very Large Data Bases (VLDB 06), VLDB Endowment, 2006, pp. 1243–1246.
6. XML Schema 1.1 Part 2: Datatypes, World Wide Web Consortium (W3C), Feb. 2006, www.w3.org/TR/xmlschema11-2/#built-in-primitive-datatypes.
7. J. Eno, "Generation of Synthetic Data to Conform to Constraints Derived from Data Mining Applications," master's thesis, Computer Science and Computer Engineering Dept., Univ. of Arkansas, 2007.
8. R.A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, vol. 7, part II, 1936, pp. 179–188.
9. Integral Solutions Limited, "Clementine 10.0, PMML 3.0 Heart Decision Tree," 2007, www.dmg.org/pmml_examples/IRIS_TREE.xml.

Josh Eno is a PhD candidate in computer science at the University of Arkansas. His interests include data mining, social networks, workflow, and healthcare supply chain logistics. Eno received his MS in computer science from the University of Arkansas. Contact him at [email protected].

Craig W. Thompson is a professor and the Charles Morgan Chair in Database in the computer science and computer engineering department at the University of Arkansas. His research interests include software architecture, middleware, data engineering, and agent technologies. Thompson has a PhD in computer science from the University of Texas at Austin. He is an IEEE fellow. Contact him at [email protected].