Development of a Synthetic Data Set Generator for Building and Testing Information Discovery Systems Pengyue J. Lin University of California Riverside, CA 92521

Behrokh Samadi Alan Cipolone Lucent Technologies Bell Laboratories Holmdel, NJ 07733

Daniel R. Jeske Sean Cox, Carlos Rendón, Douglas Holt, Rui Xiao University of California, Riverside CA 92521

Abstract

Data mining research has yielded many significant and useful results such as discovering consumer-spending habits, detecting credit card fraud, and identifying anomalous social behavior. Information Discovery and Analysis Systems (IDAS) extract information from multiple sources of data and use data mining methodologies to identify potentially significant events and relationships. This research designed and developed a tool called the IDAS Data and Scenario Generator (IDSG) to facilitate the creation, testing, and training of IDAS. IDSG focuses on building a synthetic data generation engine powerful and flexible enough to generate synthetic data based on complex semantic graphs.

Keywords: Client-Server, Data Generation, Data Mining, Java, Semantic Graph.

1. Introduction

Information Discovery and Analysis Systems (IDAS) perform an important role in turning large quantities of collected data into meaningful information. IDAS are designed to correlate multiple sources of data and use data mining techniques to find relationships within disparate data sets that could be used to predict events. IDAS extract information from data by finding patterns, threads, and relationships. IDAS have been a major asset in business applications such as fraud prevention [1,6] and are in use in the medical field in a wide variety of applications, including help in diagnosis [5] and analysis of medical videos [7]. A recent survey by the US General Accounting Office found that 52 Federal agencies are conducting or plan to conduct 199 separate data mining efforts, with 131 of these currently operational [2]. It is believed that IDAS could be equally effective for intelligence applications such as providing leading indicators of terrorist acts.

A critical technical issue with IDAS is their ability to make accurate inferences. Given the diversity of techniques used to develop IDAS, it is desirable to have a baseline approach for testing their inference accuracy as well as their ability to deal with large input data sets of varying quality. Training IDAS is also an active research area. An important part of IDAS development is therefore the generation of synthetic datasets for use in testing and training. Developing test and training cases for an IDAS requires background datasets onto which hypothetical future scenarios can be overlaid; the IDAS can then be measured in terms of its false positive and false negative error rates. Testing and training IDAS can be very challenging because test datasets are often protected by privacy laws, and because of the time and cost associated with collecting a diverse set of data sources.

In [3] an architecture was proposed for an IDAS Data and Scenario Generator (IDSG) that generates synthetic background datasets and provides functions for mixing in user-designed scenarios, resolving the difficulty described above. This paper reports on the development and implementation of that architecture. The task of developing realistic background data for all possible IDAS is formidable, and is not our goal. Instead, our approach is to develop a tool that synthesizes background datasets of sufficient quality to enable IDAS evaluation. IDSG synthesizes background data using all available information about the data, structured in such a way that additional information can easily be incorporated if and when it becomes available.

The rest of this paper is organized as follows. Section 2 describes the methodology used in our data generation.

Proceedings of the Third International Conference on Information Technology: New Generations (ITNG'06) 0-7695-2497-4/06 $20.00 © 2006 IEEE

Section 3 provides an overview of the IDSG architecture. Section 4 summarizes the IDSG features and benchmarks its performance. Section 5 presents future plans. Finally, in Section 6 we summarize the principal findings and the status of this project.

2. Methodology

As described, the objective is to generate sufficiently accurate data to test a data mining tool that is developed to search for specific patterns, events, groups, etc., in the dataset. We define a dataset as a set of one or more files/tables containing records of data; the records consist of attributes. Realistic or accurate data generation is a hard problem: numerous explicit and implicit rules govern the generation of a data element, or an attribute of a record. We categorize these rules as follows:

Independent rules – These govern the values of attributes independently of other attributes and other records. An example is the rule specifying the number of hours in a day.

Intra-record (horizontal) rules – These exist between the attributes of a record of the dataset. A horizontal rule can relate two or more, or all, attributes of a record. Discovery, representation, and manipulation of the multi-dimensional relationships for these rules can be extremely resource intensive. An example of a horizontal rule is one governing the generation of an individual's income based on the education, profession, age, and gender of the person.

Inter-record (vertical) rules – These define relationships among the various records and create another dimension of complexity. An example is the total number of records of individuals living in the same town.

In any of the above cases, the rules and dependencies are either socially or scientifically known or must be derived from data. For example, the rules defining the relationship between zip codes and home phone numbers are socially available. However, the distribution governing a person's income as a function of his/her age, gender, education, work location, etc., is not available without extensive studies conducted by agencies interested in obtaining these rules.
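To make the three rule categories concrete, the following Java sketch encodes one rule of each kind. All class names, formulas, and parameters here are illustrative assumptions of ours, not taken from IDSG.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the three rule categories (names and formulas are
// illustrative, not the IDSG implementation).
public class RuleDemo {
    static final Random RNG = new Random(42);

    // Independent rule: the value is drawn without reference to any other
    // attribute or record, e.g. an hour of the day in [0, 24).
    static int hourOfDay() {
        return RNG.nextInt(24);
    }

    // Intra-record (horizontal) rule: income depends on other attributes of
    // the same record, here years of education and age (made-up formula).
    static double income(int educationYears, int age) {
        double base = 15000 + 2500.0 * educationYears;
        return base * (1.0 + Math.min(age, 60) / 100.0);
    }

    // Inter-record (vertical) rule: a constraint across records, e.g. capping
    // the number of generated individuals who live in the same town.
    static List<String> assignTowns(int people, String town, int townCapacity) {
        List<String> towns = new ArrayList<>();
        for (int i = 0; i < people; i++) {
            towns.add(i < townCapacity ? town : "elsewhere");
        }
        return towns;
    }
}
```

The vertical rule is the awkward one in this style: it cannot be evaluated record by record, which is why, as discussed below, extra attributes may be needed to carry cross-record state.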
In this paper we discuss the problem of representing and manipulating these rules and relationships, and how it was done within IDSG. To do this, we define the structure used to represent the attributes, the rules governing their values, and the relationships they have with other attributes.

2.1 Data Representation and Generation

Knowledge about object attributes and their relationships can be represented through a semantic graph, with vertices representing the attributes and links representing the fact that two attributes are associated. Figure 1 shows a partial semantic graph for the attributes of a credit card transaction dataset. The independent attributes, which have no incoming links, are shown with a thicker border. The values of the other attributes are determined by the values of the preceding ones in the graph. For example, the income of a person depends on the gender, age, and occupation of that person. The idea is to provide as much connectivity in this graph as possible to obtain meaningful relationships. Furthermore, the tool should allow easy addition of new relationships as they are discovered.
[Figure 1 appears here: a directed graph whose nodes include Region, Age, Gender, Name, Education, Income, Occupation, Credit Card Limit, Purchase Amount, and Purchased Item.]

Figure 1 – Partial semantic graph for credit card application

The vertices of the graph represent the attributes and thus contain the generation functions for those attributes. A generation function takes the attribute values of the incoming links as input, and may consist of one or more distribution functions over a set of discrete or continuous values. To incorporate vertical rules, we often need to define attributes/vertices that are not necessarily part of the dataset. The task of generating a record now becomes a simple traversal of the graph along the directed edges, starting from the independent attributes. Typically, the values of the independent attributes are determined by user input or by a distribution function. For example, in Figure 1, the value of the attribute Gender is drawn from a default distribution that is locally or publicly known as the gender distribution. Alternatively, the user may input the distribution parameters for Gender.
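The traversal just described can be sketched in Java as follows. This is a hypothetical illustration; the class and method names are ours, not the IDSG API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch (our naming, not IDSG's): each vertex holds a generation function
// whose inputs are the values of its incoming links; a record is produced by
// visiting the vertices in dependency order.
public class GraphGen {

    // Attribute name, names of incoming links, and a generation function that
    // reads the values generated so far.
    public record Vertex(String name, List<String> parents,
                         Function<Map<String, Object>, Object> generate) {}

    // Visit vertices in topological order (independent attributes first), so
    // that each generation function sees the values it depends on.
    public static Map<String, Object> generateRecord(List<Vertex> ordered) {
        Map<String, Object> rec = new LinkedHashMap<>();
        for (Vertex v : ordered) {
            rec.put(v.name(), v.generate().apply(rec));
        }
        return rec;
    }

    public static void main(String[] args) {
        List<Vertex> graph = List.of(
            new Vertex("gender", List.of(), r -> "F"),        // independent
            new Vertex("age", List.of(), r -> 45),            // independent
            new Vertex("income", List.of("gender", "age"),    // horizontal rule
                       r -> 20000 + 1000 * (int) r.get("age")));
        System.out.println(generateRecord(graph)); // {gender=F, age=45, income=65000}
    }
}
```

In a real generator the independent vertices would draw from user-supplied or default distributions rather than return constants; the traversal logic stays the same.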


2.2 Semantic Graph Construction

To support the needs of different IDAS, we define an Application as an IDSG choice that can generate a specific type of dataset. For example, a user can select the Credit Card Application to generate a data set that contains people, credit card, and/or credit card transaction records. Each Application has its own semantic graph that determines the relationships among the attributes and the order of data generation. This order, or semantic graph, is a function of the dependencies among the different data generators. For example, Name depends upon Gender, which implies that Gender should be generated first.

For representation and manipulation of the semantic graph we used XML. This XML "Schema File" groups the data to be generated as properties of objects. For example, a person has a gender, a name, and an address; a car has a make, a model, and a year; an address has a city, a state, and a zip code. The Schema File then defines a person with its properties, a car with its properties, and an address with its properties. Each property has a name and a generator associated with it. The generators are given parameters that may include other properties; for example, the generator that produces a name is given the gender property as a parameter.

The generation engine generates values starting from the independent properties in the graph. It then generates values for dependent properties whose preceding properties have already been generated, and this process continues until all property values are generated. The generator could easily be streamlined by going through this process only once, recording the order in which properties are generated, and thereafter assuming that all dependencies will be met by generating data in that order. We used the reference implementation of the Java API for XML Processing (JAXP) to process the semantic graph.
The processed document structure is then converted into a set of classes that make the generation process described above very simple. Both the client and the server use the same method for interpreting the Schema File, so the Schema File serves as a simple means of communication between client and server, and as the foundation for future work allowing the client to dynamically create schemas and generate highly customized data sets beyond what is already possible.
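As an illustration of interpreting such a Schema File with JAXP, the following sketch parses a small property listing using the standard DOM builder. The element and attribute names (`object`, `property`, `name`, `generator`) are our assumptions for illustration, not the actual IDSG schema vocabulary.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch of Schema File interpretation with JAXP; element and attribute
// names are illustrative assumptions, not the actual IDSG schema.
public class SchemaReader {

    // Parse a schema fragment and list each property with its generator.
    public static List<String> properties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("property");
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element p = (Element) nodes.item(i);
                out.add(p.getAttribute("name") + " <- " + p.getAttribute("generator"));
            }
            return out;
        } catch (Exception e) {
            throw new RuntimeException(e); // wrap checked parser exceptions
        }
    }

    public static void main(String[] args) {
        String xml = "<object name=\"person\">"
                   + "<property name=\"gender\" generator=\"distribution\"/>"
                   + "<property name=\"name\" generator=\"nameGenerator\"/>"
                   + "</object>";
        System.out.println(properties(xml)); // [gender <- distribution, name <- nameGenerator]
    }
}
```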

A partial example of the Credit Card Application XML Schema File, showing a person with the properties Name and Gender and illustrating their dependency, follows:
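The excerpt itself is not reproduced in this chunk. Purely as an illustration of the structure described above, such a fragment might look like the following; every element and attribute name here is our assumption, not the actual IDSG schema:

```xml
<object name="person">
  <!-- Independent property: drawn from a gender distribution
       (values and weights are illustrative) -->
  <property name="gender" generator="discreteDistribution">
    <param name="values" value="female,male"/>
    <param name="weights" value="0.51,0.49"/>
  </property>
  <!-- Dependent property: the name generator takes gender as a parameter,
       so gender must be generated first -->
  <property name="name" generator="nameGenerator">
    <param property="gender"/>
  </property>
</object>
```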