Journal of Computer Science 6 (7): 775-784, 2010 ISSN 1549-3636 © 2010 Science Publications

Summarizing Relational Data Using Semi-Supervised Genetic Algorithm-Based Clustering Techniques

Rayner Alfred
School of Engineering and Information Technology, University Malaysia Sabah, Locked Bag 2073, 88999, Kota Kinabalu, Sabah, Malaysia

Abstract: Problem statement: In solving a classification problem in relational data mining, traditional methods, for example C4.5 and its variants, usually require data stored in multiple tables to be transformed into a single table. Unfortunately, we may lose some information when we join tables with a high degree of one-to-many association. Data transformation therefore becomes a tedious trial-and-error task and the classification result is often not very promising, especially when the number of tables and the degree of one-to-many association are large. Approach: We proposed a genetic semi-supervised clustering technique as a means of aggregating data stored in multiple tables to facilitate the task of solving a classification problem in relational databases. This algorithm is suitable for classification of datasets with a high degree of one-to-many associations. It can be used in two ways. One is user-controlled clustering, where the user may control the result of clustering by varying the compactness of the spherical cluster. The other is automatic clustering, where a non-overlap clustering strategy is applied. In this study, we use the latter method to dynamically cluster multiple instances, as a means of aggregating them, and illustrate the effectiveness of this method using the semi-supervised genetic algorithm-based clustering technique. Results: The experimental results showed that using the reciprocal of the Davies-Bouldin Index for cluster dispersion and the reciprocal of the Gini Index for cluster purity as the fitness function in the Genetic Algorithm (GA) finds solutions with much greater accuracy.
The results obtained in this study showed that automatic clustering (seeding), by optimizing the cluster dispersion or the cluster purity alone using GA, provides good results compared to traditional k-means clustering. However, the best result can be achieved by optimizing a combination of both the cluster dispersion and the cluster purity, with more weight placed on the cluster purity measurement. Conclusion: This study showed that semi-supervised genetic algorithm-based clustering techniques can be applied to summarize relational data more effectively and efficiently.

Key words: Data aggregation, clustering, semi-supervised clustering, genetic algorithm, relational data mining, data pre-processing

INTRODUCTION

Relational databases require effective and efficient ways to extract patterns from contents stored in multiple tables. In this process, significant features must be extracted from datasets stored in multiple tables with one-to-many relationships. In a relational database, a record stored in a target table can be associated with one or more records stored in another table due to the one-to-many association constraint. Traditional data mining tools require data in relational databases to be transformed into attribute-value format by joining multiple tables. However, with a large volume of relational data and a high degree of one-to-many associations, this process is not efficient, as the joined table can be too large to be processed and we may lose some information when the join operation is performed.

In a relational database, a record stored in the target table is often associated with one or more records stored in another, non-target table. We can treat these multiple instances of a record, stored in a non-target table, as a bag of terms. There are a few ways of transforming these multiple instances into a bag of terms. Once we have transformed the data into a representation applicable to clustering operations (Gautam and Chaudhuri, 2004; Basu et al., 2002), we can use any clustering technique to aggregate these multiple instances. The most common patterns extracted from relational databases are association rules. However, to extract classification rules from a relational database more effectively and efficiently, taking the multiple-instance problem into consideration, we need to aggregate these multiple instances. In this study, we use a genetic algorithm-based clustering technique to aggregate multiple instances of a single record in a relational database as a means of data reduction. Before a clustering technique can be applied, we transform the data to a suitable form.

Data transformation for relational data: In a relational database, a single record, Ri, stored in the target table can be associated with other records stored in the non-target table, as shown in Fig. 1. Let R denote a set of m records stored in the target table and let S denote a set of n records (T1, T2, T3,...,Tn) stored in the non-target table. Let Si be a subset of S, Si ⊆ S, associated through a foreign key with a single record Ra stored in the target table, where Ra∈R. Thus, the association of these records can be described as Ra←Si. In this case, we have a single record stored in the target table, T, that is associated with multiple records stored in the non-target table, NT. The target and non-target tables are defined as follows.

Definition: A target table, T, is a table that consists of rows of objects, where each row represents a single unique object; this is the table from which patterns are extracted.

Definition: A non-target table, NT, is a table that consists of rows of objects, where a subset of these rows can be linked to a single object stored in the target table.

The records stored in the non-target table that correspond to a particular record stored in the target table can be represented as vectors of patterns. As a result, based on the vector space model (Salton and Michael, 1984), a unique record stored in the non-target table can be represented as a vector of patterns. In other words, a particular record stored in the target table that is related to several records stored in the non-target table can be represented as a bag of patterns, i.e., by the patterns it contains and their frequencies, regardless of their order. The bag of patterns is defined as follows.

Definition: In a bag of patterns representation, each target record is represented by the set of patterns found in its associated records in the non-target table, NT, together with the pattern frequencies.

This definition follows the notion of an individual-centered representation defined by Lachiche and Flach (2000), where the data is described as a collection of individuals and the induced rules generalize over the individuals, mapping them to a class. For instance, individual-centered domains include classification problems in molecular biology where the individuals are molecules. In our approach, an individual is represented as a bag of patterns.

We use the DARA algorithm (Rayner, 2008; Davies and Bouldin, 1979) to summarize data stored in non-target tables that have many-to-one relationships with data stored in the target table. In the DARA algorithm, these patterns are encoded into binary numbers. The process of encoding these patterns into binary numbers depends on the number of attributes that exist in the non-target table. There are two different cases when encoding patterns for the data stored in the non-target table. In the first case (Case I), a non-target table may have a single attribute. In this case, the DARA algorithm transforms the representation of the data stored in a relational database without constructing any new feature to build the (n×p) TF-IDF (Salton and Michael, 1984) weighted frequency matrix, as only one attribute exists in the non-target table.
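As an illustration of the bag-of-patterns representation and the (n×p) TF-IDF weighted frequency matrix, the following is a minimal Python sketch rather than the DARA implementation. The `sales` rows, the function names, and the specific weighting tf × log(N/df) are assumptions made for this example; the paper does not state which TF-IDF variant it uses.

```python
import math
from collections import Counter, defaultdict

# Hypothetical non-target table (Sales): (trans_id, customer_fk, item).
sales = [
    (1, "c1", "bread"),
    (2, "c1", "milk"),
    (3, "c1", "bread"),
    (4, "c2", "milk"),
    (5, "c3", "tea"),
    (6, "c3", "milk"),
]

def build_bags(rows):
    """Group non-target rows by foreign key and count each pattern."""
    bags = defaultdict(Counter)
    for _trans_id, customer, pattern in rows:
        bags[customer][pattern] += 1
    return dict(bags)

def tfidf_matrix(bags):
    """Return (record ids, vocabulary, n x p matrix).

    Assumed weighting: term frequency times log(N / document frequency).
    """
    ids = sorted(bags)
    vocab = sorted({p for bag in bags.values() for p in bag})
    n = len(ids)
    df = {p: sum(1 for bag in bags.values() if p in bag) for p in vocab}
    matrix = [[bags[i][p] * math.log(n / df[p]) for p in vocab] for i in ids]
    return ids, vocab, matrix

bags = build_bags(sales)
ids, vocab, matrix = tfidf_matrix(bags)
```

Each row of `matrix` corresponds to one target record and each column to one pattern, so the rows can be passed to any clustering routine that accepts numeric vectors. Under this weighting, a pattern that occurs for every target record (here `milk`) receives a weight of zero.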

Case I: Table with a single attribute: In this case, it is assumed that exactly one attribute describes the contents of the non-target table that is associated with the target table. In Fig. 2, the Trans attribute is the Primary Key (PK) of the Sales table and the Customer attribute is the Foreign Key (FK) that associates records stored in this non-target table (the Sales table) with records stored in the target table (which consists of individual customers). First, the algorithm computes the cardinality of the attribute domain in the non-target table. The cardinality of an attribute is defined as the number of unique values that the attribute can take. If the data consists of continuous values, the data is discretized first and the number of bins is taken as the cardinality of the attribute domain.
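The cardinality step can be sketched in Python as follows. This is only an illustration: the paper does not say which discretization method is used, so equal-width binning is an assumption here, and the function names and sample values are hypothetical.

```python
def equal_width_bins(values, num_bins):
    """Map each continuous value to a bin index in [0, num_bins - 1].

    Equal-width binning is an assumption; any discretizer could be used.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / num_bins
    # The maximum value would land in bin num_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]

def cardinality(values, num_bins=None):
    """Number of distinct values, or the number of bins for discretized data."""
    if num_bins is not None:
        return num_bins
    return len(set(values))

items = ["bread", "milk", "bread", "tea"]   # categorical attribute
amounts = [2.5, 10.0, 7.5, 4.0]             # continuous attribute
bins = equal_width_bins(amounts, num_bins=3)
```

For the categorical attribute the cardinality is the number of distinct values; for the continuous attribute it is the chosen number of bins, regardless of how many bins end up occupied.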

Fig. 1: A one-to-many association between target and non-target relations

Case II: Table with multiple attributes: In this case, it is assumed that more than one attribute describes the contents of the non-target table associated with the target table. All continuous attribute values are discretized and the number of bins is taken as the cardinality of the attribute domain. After encoding the patterns as binary numbers, the algorithm determines a subset of the attributes to be used to construct a new feature. Here is an example of a simple algorithm that constructs features, without using feature scoring, to generate the patterns that serve as input to the DARA algorithm. Alfred has discussed in detail the process of data summarization with a genetic-based feature construction algorithm using feature scoring (Rayner, 2008).

For each record stored in the non-target table, concatenate the values of p columns, where p is less than or equal to the total number of attributes. For example, let F = (F1, F2, F3,...,Fk) denote the k field columns or attributes in the non-target table. Let dom(Fi) = (Fi,1, Fi,2, Fi,3, ..., Fi,n) denote the domain of attribute Fi, with n different values. So, one may have an instance of a record stored in the non-target table with the values (F1,a, F2,b, F3,c, F4,d, ..., Fk−1,b, Fk,n), where F1,a∈dom(F1), F2,b∈dom(F2), F3,c∈dom(F3), F4,d∈dom(F4), ..., Fk−1,b∈dom(Fk−1), Fk,n∈dom(Fk). Table 1 shows the list of patterns produced with different values of p. It is not natural to have features concatenated like F1,aF2,b but not F1,aF3,c when p = 2, since the attributes do not have a natural order. However, the GA approach (Davies and Bouldin, 1979) can be applied to solve this problem.

For each record, a bag of patterns is maintained to keep track of the patterns encountered and their frequencies. For each new pattern encoded, if the pattern already exists in the bag, its counter is incremented; otherwise, the pattern is added to the bag and its counter is set to 1. The resulting bag of patterns can be used to describe the characteristics of the record associated with it. For instance, Fig. 3 shows the data transformation for data stored in a non-target table with multiple attributes. In this example, the Trans attribute is the Primary Key (PK) of the Sales table and the Customer attribute is the Foreign Key (FK) that associates records stored in this non-target table (the Sales table) with records stored in the target table (which consists of individual customers). Based on this example, the format of the patterns produced depends on the parameter p (p = 1, p = 2 and p = k), where p is the number of attributes combined to generate these patterns and k is the total number of attributes. The algorithm is called PSingle when p = 1 and PAll when p = k.
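The pattern construction for a given p, together with the bag update rule, can be sketched as below. This is a minimal illustration under stated assumptions: attribute values are grouped in consecutive runs of p columns, a final shorter group is kept as its own pattern when k is not a multiple of p, and symbolic values stand in for the binary codes. The function names and sample rows are hypothetical.

```python
from collections import Counter

def record_patterns(values, p):
    """Concatenate consecutive groups of p attribute values into patterns."""
    if not 1 <= p <= len(values):
        raise ValueError("p must be between 1 and the number of attributes")
    return ["".join(values[i:i + p]) for i in range(0, len(values), p)]

def bag_of_patterns(records, p):
    """Count every pattern produced by the non-target rows of one target record."""
    bag = Counter()
    for values in records:
        for pattern in record_patterns(values, p):
            # Counter starts unseen patterns at 0, so this line covers both
            # the "add with counter 1" and the "increment" branches.
            bag[pattern] += 1
    return bag

# Hypothetical non-target rows linked to a single customer (k = 3 attributes).
records = [("a1", "b2", "c1"), ("a1", "b2", "c3")]

p_single = bag_of_patterns(records, p=1)   # PSingle
p_pairs = bag_of_patterns(records, p=2)
p_all = bag_of_patterns(records, p=3)      # PAll
```

With p = 1 each attribute value is its own pattern, while with p = k each row collapses into a single pattern, so larger p yields fewer but more specific patterns per record.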

Next, in order to encode the values into binary numbers, the algorithm finds the appropriate number of bits, n, such that it can represent all the different values of the attribute's domain, where 2^(n−1) < |dom(Fi)| ≤ 2^n.
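The bit-width computation can be sketched as finding the smallest n for which the domain size does not exceed 2^n. The code below is an illustrative sketch: assigning codes in enumeration order is an assumption, since the paper does not specify how values are mapped to binary numbers, and the function names are hypothetical.

```python
def bits_needed(cardinality):
    """Smallest n such that cardinality <= 2**n."""
    if cardinality < 1:
        raise ValueError("cardinality must be at least 1")
    return (cardinality - 1).bit_length()

def encode_domain(domain):
    """Assign each domain value a fixed-width binary code (assumed ordering)."""
    # Use at least one bit so a single-valued domain still gets a visible code.
    width = max(1, bits_needed(len(domain)))
    return {value: format(index, f"0{width}b") for index, value in enumerate(domain)}

codes = encode_domain(["bread", "milk", "tea"])
```

For example, a domain with five distinct values needs three bits, because four values fit in two bits but five do not.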
