SIMILARITY MEASURES FOR MULTI-VALUED ATTRIBUTES FOR DATABASE CLUSTERING

TAE-WAN RYU AND CHRISTOPH F. EICK
Department of Computer Science, University of Houston, Houston, Texas 77204-3475
{twryu, ceick}@cs.uh.edu

ABSTRACT: This paper introduces an approach to cope with the representational inappropriateness of the traditional flat file format for data sets extracted from databases, specifically in database clustering. After analyzing the problems the traditional flat file format has in representing related information, a better representation scheme called the extended data set, which allows attributes of an object to have multiple values, is introduced, and it is demonstrated how this representation scheme can capture structural information in databases for clustering. A unified similarity measure framework for mixed types of multi-valued and single-valued attributes is proposed. The query discovery system MASSON then takes each cluster and discovers a set of queries that represents discriminant characteristic knowledge for that cluster.

INTRODUCTION
Many data analysis and data mining tools, such as clustering tools, inductive learning tools, and statistical analysis tools, assume that the data sets to be analyzed are represented as a single flat file (or table) in which an object is characterized by attributes that have a single value.

Person
ssn        name   age  sex
111111111  Johny  43   M
222222222  Andy   21   F
333333333  Post   67   M
444444444  Jenny  35   F

Purchase
ssn        location   ptype  amount  date
111111111  Warehouse  1      400     02-10-96
111111111  Grocery    2      70      05-14-96
111111111  Mall       3      200     12-24-96
222222222  Mall       2      300     12-23-96
222222222  Grocery    3      100     06-22-96
333333333  Mall       1      30      11-05-96

(a) ptype (payment type): 1 for cash, 2 for credit, and 3 for check

Joined result
name   age  sex  ptype  amount  location
Johny  43   M    1      400     Mall
Johny  43   M    2      70      Grocery
Johny  43   M    3      200     Warehouse
Andy   21   F    2      300     Mall
Andy   21   F    3      100     Grocery
Post   67   M    1      30      Mall
Jenny  35   F    null   null    null

(b)

Figure 1.1: (a) an example of a Personal relational database; the cardinality ratio between Person and Purchase is 1:n. (b) a joined table obtained from Person and Purchase.

Recently, many of these data analysis approaches are being applied to data sets that have been extracted from databases. However, a database may consist of several related data sets (e.g., relations in the relational model¹), and the cardinality ratio of the relationships between data sets in such a database is frequently 1:n or n:m, which may cause significant problems when data extracted from a database have to be converted into a flat file in order to apply the above-mentioned tools. The flat file format is not appropriate for representing the related information that is commonly found in databases.

¹ In this paper, we specifically focus on data sets from relational databases, although our approach can easily be extended to the object-oriented database model.

For example, suppose we have a relational database as depicted in Figure 1.1 (a), consisting of Person and Purchase relations that store information about persons and their purchases, and we want to categorize the persons that occur in the database into several groups with similar characteristics. It is obvious that the attributes found in the Person relation alone are not sufficient to achieve this goal, because many important characteristics of persons are found in other "related" relations, such as the Purchase relation that stores the shopping history of each person. This raises the question of how the two tables can be combined into a single flat file so that traditional clustering and/or machine learning algorithms can be applied to it. Although some systems (Thompson91, Ribeiro95) attempt to discover knowledge directly from structured domains, it seems that the most straightforward approach for generating a single flat file is to join the related tables (Quinlan93). Figure 1.1 (b) depicts the result of the natural join of the two relations in Figure 1.1 (a), using ssn as the join attribute. The object "Andy" in the Person relation is represented by two different tuples in the joined table in Figure 1.1 (b). The main problem with this representation is that many clustering algorithms and machine learning tools would consider each tuple a different object; that is, they would interpret the table above as containing 7 person objects rather than 4 unique person objects. This representational discrepancy between a data set extracted from a structured database and a data set in the traditional flat file format assumed by many data analysis approaches seems to have been overlooked. This paper proposes a knowledge discovery and data mining framework to deal with this limitation of the traditional flat file representation. We specifically focus on the problems of clustering structured databases and of discovering a set of queries that describe the characteristics of the objects in each cluster.
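To make the problem concrete, the following minimal sketch materializes the joined table of Figure 1.1 (b) in code. It assumes pandas is available purely for illustration (the paper does not prescribe any tool); a left outer join reproduces the table, including the all-null row for the person without purchases.

```python
import pandas as pd

# The two relations of Figure 1.1 (a).
person = pd.DataFrame({
    "ssn":  ["111111111", "222222222", "333333333", "444444444"],
    "name": ["Johny", "Andy", "Post", "Jenny"],
    "age":  [43, 21, 67, 35],
    "sex":  ["M", "F", "M", "F"],
})
purchase = pd.DataFrame({
    "ssn":      ["111111111", "111111111", "111111111",
                 "222222222", "222222222", "333333333"],
    "location": ["Warehouse", "Grocery", "Mall", "Mall", "Grocery", "Mall"],
    "ptype":    [1, 2, 3, 2, 3, 1],
    "amount":   [400, 70, 200, 300, 100, 30],
})

# Joining on ssn yields the flat table of Figure 1.1 (b):
# 7 tuples for only 4 distinct persons (Jenny's purchase columns are null).
joined = person.merge(purchase, on="ssn", how="left")
print(len(joined), joined["ssn"].nunique())   # 7 4
```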

EXTENDED DATA SETS
The example in the previous section motivates the need for a better representation scheme for objects that are interrelated with other objects. One simple approach to cope with this problem would be to group all the related objects into a single object by applying aggregate operations (e.g., average) that replace the related values by a single value for the object. The problem with this approach is that the user has to make critical decisions (e.g., which aggregate function to use) beforehand; moreover, applying an aggregate function frequently loses valuable information (e.g., how many purchases a person has made, or what the maximum purchase of a person was). Tversky (1977) gives further examples illustrating that data analysis techniques, such as clustering, can benefit significantly from considering set and group similarities.

Name   age  sex  p.ptype  p.amount      p.location
Johny  43   M    {1,2,3}  {400,70,200}  {Mall, Grocery, Warehouse}
Andy   21   F    {2,3}    {300,100}     {Mall, Grocery}
Post   67   M    1        30            Mall
Jenny  35   F    null     null          null

Figure 2.1: A converted table with a bag of values

We propose another approach to cope with this problem, which allows the attributes of an object to be multi-valued. We call this generalization of the flat file format the extended data set. In an extended data set, objects are characterized by attributes that do not hold just a single value, but rather a bag of values. Unlike a set, a bag allows duplicate elements, but all its elements must come from the same domain. For example, the bag {200,200,300,300,100} for the amount attribute might represent five purchases of 100, 200, 200, 300, and 300 dollars by a person. Figure 2.1 depicts an extended data set that has been constructed from the two relations in Figure 1.1 (a). In this table, the related attributes, called structured attributes, p.ptype, p.amount, and p.location now carry path information (the prefix p stands for the Purchase relation) to make their semantics clearer, and they can hold a bag of values. Essentially, the groups of related tuples in Figure 1.1 (b) are combined into one unique object with a bag of values for each related attribute. In Figure 2.1, as well as throughout the paper, we use curly brackets to represent a set of values (e.g., {1,2,3}), we use null to denote an empty bag, and we write just the element itself if the bag contains a single value.
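As an illustration of how such an extended data set can be built mechanically from the joined tuples, consider the following sketch (our own illustration in plain Python, not the paper's data extraction tool); it collapses the 1:n groups of Figure 1.1 (b) into one object per person with a bag per structured attribute, reproducing the rows of Figure 2.1.

```python
# Joined tuples of Figure 1.1 (b): (name, age, sex, ptype, amount, location).
joined = [
    ("Johny", 43, "M", 1, 400, "Mall"),
    ("Johny", 43, "M", 2, 70, "Grocery"),
    ("Johny", 43, "M", 3, 200, "Warehouse"),
    ("Andy", 21, "F", 2, 300, "Mall"),
    ("Andy", 21, "F", 3, 100, "Grocery"),
    ("Post", 67, "M", 1, 30, "Mall"),
    ("Jenny", 35, "F", None, None, None),
]

# Collapse the 1:n groups into one object per person; the structured
# attributes p.ptype, p.amount, p.location become bags (here: lists).
extended = {}
for name, age, sex, ptype, amount, location in joined:
    obj = extended.setdefault(name, {"age": age, "sex": sex,
                                     "p.ptype": [], "p.amount": [], "p.location": []})
    if ptype is not None:                  # an empty bag plays the role of "null"
        obj["p.ptype"].append(ptype)
        obj["p.amount"].append(amount)
        obj["p.location"].append(location)

print(extended["Johny"]["p.amount"])   # [400, 70, 200]
print(extended["Jenny"]["p.ptype"])    # [] -> the null bag of Figure 2.1
```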

Most existing similarity-based clustering algorithms cannot deal with this representation, because the similarity metrics used in those algorithms expect an object to have a single value, not a bag of values, for each attribute. Accordingly, our approach of discovering useful sets of queries through database clustering faces the following problems:
• How to generalize data mining techniques (here, clustering algorithms) so that they can cope with multi-valued attributes.
• How to discover a set of useful queries that describe the characteristics of the objects in each cluster.
We therefore need more systematic and comprehensive approaches, both to measure group similarity (e.g., similarity between bags of values) for clustering and to discover a useful set of queries for each cluster.

GROUP SIMILARITY MEASURES FOR EXTENDED DATA SETS
In this paper, we broadly categorize attribute types into quantitative and qualitative types, review existing similarity measures for these two types, and generalize them to cope with extended data sets of mixed types.

Qualitative type
Tversky (1977) proposed the contrast model and the ratio model, which generalize several set-theoretical similarity models proposed at that time. Tversky considers objects as sets of features rather than as geometric points in a metric space. To illustrate his models, let a and b be two objects, and let A and B denote the sets of features associated with a and b, respectively. Tversky proposed the following family of similarity measures, called the contrast model:

S(a,b) = θf(A∩B) − αf(A−B) − βf(B−A), for some θ, α, β ≥ 0,

where f is usually the cardinality of the set. In earlier models, the similarity between objects was determined either only by their common features or only by their distinctive features. In the contrast model, the similarity of a pair of objects is expressed as a linear combination of the measures of the common and the distinctive features; that is, similarity is a weighted difference of the measures of the common and the distinctive features. The following family of similarity measures represents the ratio model:

S(a,b) = f(A∩B) / [f(A∩B) + αf(A−B) + βf(B−A)], α, β ≥ 0.

In the ratio model, the similarity value is normalized to the range 0 to 1. In Tversky's set-theoretic similarity models, a feature usually denotes a value of a binary or nominal attribute, but the notion can be extended to interval or ordinal types. Note that the sets in Tversky's model are crisp sets, not fuzzy sets. For qualitative multi-valued attributes, Tversky's set similarity can be used directly, since such an attribute of an object can be regarded as a group feature (i.e., a set of feature values).
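As a concrete reading of the ratio model, a minimal sketch over crisp sets might look as follows; the default α = β = 1, which reduces the measure to the Jaccard coefficient, and the convention that two empty bags are maximally similar are our own choices, not ones fixed by the paper.

```python
def tversky_ratio(a, b, alpha=1.0, beta=1.0):
    """Tversky's ratio model for two bags/sets of qualitative feature values.

    f is taken to be set cardinality; alpha = beta = 1 reduces the measure
    to the Jaccard coefficient. Returns a similarity in [0, 1].
    """
    A, B = set(a), set(b)
    if not A and not B:          # two empty (null) bags: treated as identical
        return 1.0               # (a convention of this sketch)
    common = len(A & B)
    return common / (common + alpha * len(A - B) + beta * len(B - A))

# p.location bags of Johny and Andy from Figure 2.1:
print(tversky_ratio({"Mall", "Grocery", "Warehouse"}, {"Mall", "Grocery"}))  # 0.666...
```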

Quantitative type
One simple way to measure inter-group distance is to substitute group means for the ith attribute of an object into the formulae for inter-object measures, such as Euclidean distance (Everitt93). The main problem with this group-mean approach is that it does not take the cardinality of the quantitative elements in a group into account. Another approach, known as group average, can also be used to measure inter-group similarity. In this approach, the between-group similarity is the average of all inter-object measures over those pairs of objects in which the two objects of a pair come from different groups. For example, the average dissimilarity between groups A and B can be defined as d(A,B) = ∑ᵢ d(a,b)ᵢ / n, where the sum runs over i = 1, …, n, n is the total number of object pairs, and d(a,b)ᵢ is the dissimilarity measure for the ith pair of objects a and b, with a ∈ A, b ∈ B. When computing group similarity based on the group average, one has to decide whether to average over every possible pair or only over a subset of the possible pairs. For example, suppose we have the pair of value sets {20,5}:{4,15} and use the city-block measure as the distance function. One way to compute the group average for this pair is to use every possible pair, (|20−4|+|20−15|+|5−4|+|5−15|)/4, and another is to use only the corresponding pairs after sorting each value set, (|5−4|+|20−15|)/2. In the latter approach, sorting may help reduce unnecessary computation, although it requires additional sorting time. For example, the total difference over every possible pair for the value sets {2,5} and {6,3} is 8, whereas the sum of the sorted element-wise differences for the same sets, {2,5} and {3,6}, is 2. This example shows that computing similarity after sorting the value sets may yield a better similarity index between multi-valued objects. We call the former the every-pair approach and the latter the sorted-pair approach. The group average approach takes both the cardinality and the quantitative variance of the elements in a group into account when computing the similarity between two groups of values.
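The two variants can be sketched as follows (our own illustration; the city-block distance on already-numeric bag elements is assumed, and the sorted-pair version is shown only for equal-sized bags).

```python
def every_pair_avg(A, B, dist=lambda x, y: abs(x - y)):
    """Group average over every possible cross pair (every-pair approach)."""
    pairs = [(a, b) for a in A for b in B]
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def sorted_pair_avg(A, B, dist=lambda x, y: abs(x - y)):
    """Group average over corresponding pairs after sorting (sorted-pair approach).

    Assumes equal-sized bags for simplicity; handling unequal cardinalities is
    left to the chosen group similarity function.
    """
    A, B = sorted(A), sorted(B)
    return sum(dist(a, b) for a, b in zip(A, B)) / len(A)

print(every_pair_avg([20, 5], [4, 15]))   # (16 + 5 + 1 + 10) / 4 = 8.0
print(sorted_pair_avg([20, 5], [4, 15]))  # (|5-4| + |20-15|) / 2 = 3.0
```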

A FRAMEWORK FOR SIMILARITY MEASURES
A similarity measure proposed by Gower (1971) is particularly useful for data sets that contain a variety of attribute types. It is defined as

S(a,b) = ∑ᵢ wᵢ sᵢ(aᵢ,bᵢ) / ∑ᵢ wᵢ, where both sums run over the m attributes, i = 1, …, m.

In this formula, sᵢ(aᵢ,bᵢ) is the normalized similarity index, in the range 0 to 1, between objects a and b as measured by the function sᵢ for the ith attribute, and wᵢ is the weight of the ith attribute. The weight wᵢ can also be used as a mask, depending on whether a similarity comparison on the ith attribute is valid; the attribute may be unknown or irrelevant for the similarity computation for a given pair of objects.
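A minimal sketch of Gower's coefficient for single-valued mixed-type attributes is shown below; the per-attribute similarity functions, the weights, and the assumed age range of 50 years are illustrative placeholders, not values taken from the paper.

```python
def gower(a, b, sims, weights):
    """Gower's general similarity coefficient.

    a, b    : attribute-value sequences for the two objects
    sims    : per-attribute similarity functions returning values in [0, 1]
    weights : per-attribute weights; a weight of 0 masks an attribute out
              (e.g., when one of the values is missing or irrelevant)
    """
    num = sum(w * s(x, y) for x, y, s, w in zip(a, b, sims, weights))
    den = sum(weights)
    return num / den if den else 0.0

# Example: (age, sex) with age normalized by an assumed range of 50 years.
age_sim = lambda x, y: 1.0 - abs(x - y) / 50.0
sex_sim = lambda x, y: 1.0 if x == y else 0.0
print(gower((43, "M"), (21, "F"), [age_sim, sex_sim], [1, 1]))  # (0.56 + 0) / 2 = 0.28
```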

We can extend Gower's similarity function to measure similarity for extended data sets with mixed types. The extended similarity function consists of two sub-functions: a similarity over the l qualitative attributes and a similarity over the q quantitative attributes. We assume that type information is available for each attribute, since a data analyst can easily provide it. The following formula represents the extended similarity function:

S(a,b) = [∑ᵢ wᵢ sl(aᵢ,bᵢ) + ∑ⱼ wⱼ sq(aⱼ,bⱼ)] / (∑ᵢ wᵢ + ∑ⱼ wⱼ),

where i ranges over the l qualitative attributes, j ranges over the q quantitative attributes, and m = l + q. The functions sl(a,b) and sq(a,b) are similarity functions for the qualitative attributes and the quantitative attributes, respectively.

For each type of attribute, the user chooses specific similarity measures and appropriate weights based on the attribute types and the application. For example, for the similarity function sl(a,b) we can use Tversky's set similarity measure over the l qualitative attributes, and for the similarity function sq(a,b) we can use the group similarity function over the q quantitative attributes. Quantitative multi-valued attributes have an additional property: besides their quantitative values, they carry a group feature property that includes cardinality information. Therefore, sq(a,b) may itself consist of two sub-functions, one measuring group features and one measuring group quantities, sq(a,b) = sl(a,b) + sg(a,b), where sl(a,b) and sg(a,b) can be Tversky's set similarity and the group average similarity function, respectively. The main objective of using Tversky's set similarity here is to give more weight to the features that a pair of objects has in common.
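Putting these pieces together, a hedged sketch of the extended measure might look as follows. The Tversky and sorted-pair helpers are repeated in compact form, the group-average distance is turned into a similarity by dividing by an assumed attribute range, and sq is taken as the plain average of sl and sg so that every component stays in [0, 1]; these normalization and weighting choices are ours, since the paper leaves them to the analyst.

```python
def tversky_ratio(A, B, alpha=1.0, beta=1.0):
    # Tversky's ratio model with f = set cardinality.
    A, B = set(A), set(B)
    if not A and not B:
        return 1.0
    c = len(A & B)
    return c / (c + alpha * len(A - B) + beta * len(B - A))

def sorted_pair_avg(A, B):
    # Sorted-pair group average; empty or unequal-sized bags are handled naively.
    A, B = sorted(A), sorted(B)
    n = min(len(A), len(B)) or 1
    return sum(abs(x - y) for x, y in zip(A, B)) / n

def unified_similarity(a, b, qual_attrs, quant_attrs, ranges, weights=None):
    """Extended Gower-style similarity over an extended data set (a sketch).

    qual_attrs  : qualitative multi-valued attributes (bags of labels)
    quant_attrs : quantitative multi-valued attributes (bags of numbers)
    ranges      : assumed value range per quantitative attribute, used to map
                  the group-average distance into a [0, 1] similarity
    weights     : per-attribute weights (default: all equal to one)
    """
    weights = weights or {name: 1.0 for name in list(qual_attrs) + list(quant_attrs)}
    num = den = 0.0
    for name in qual_attrs:                    # s_l: Tversky set similarity
        num += weights[name] * tversky_ratio(a[name], b[name])
        den += weights[name]
    for name in quant_attrs:                   # s_q = (s_l + s_g) / 2, a choice
        s_l = tversky_ratio(a[name], b[name])  # of this sketch to stay in [0, 1]
        s_g = 1.0 - min(1.0, sorted_pair_avg(a[name], b[name]) / ranges[name])
        num += weights[name] * (s_l + s_g) / 2.0
        den += weights[name]
    return num / den if den else 0.0

johny = {"p.location": {"Mall", "Grocery", "Warehouse"}, "p.amount": [400, 70, 200]}
andy  = {"p.location": {"Mall", "Grocery"}, "p.amount": [300, 100]}
print(unified_similarity(johny, andy, ["p.location"], ["p.amount"], {"p.amount": 400}))
```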

AN ARCHITECTURE FOR DATABASE CLUSTERING
The unified similarity measure requires basic information, such as attribute types (qualitative or quantitative), weights, and the value ranges of quantitative attributes, before it can be applied. Figure 3.1 shows the architecture of an interactive database clustering environment we are currently developing.

[Figure 3.1 shows the following components: DBMS, Data Extraction Tool, Extended Data Set, Similarity Measure Tool, Library of Similarity Measures, User Interface (type and weight information; default choices and domain information), Similarity Measure, Clustering Tool, A Set of Clusters, MASSON, and A Set of Discovered Queries.]

Figure 3.1: Architecture of a Database Clustering Environment

The data extraction tool generates an extended data set from a database based on user requirements. The similarity measure tool assists the user in constructing a similarity measure that is appropriate for his or her application. Relying on a library of similarity measures, it interactively guides the user through the construction process, asking for information about the types, weights, and other characteristics of the attributes and offering alternatives and choices to the user if more than one similarity measure seems appropriate. If the user cannot provide the necessary information, default assumptions and default choices are made, and occasionally the necessary information is retrieved directly from the database. For example, the unit vector (all weights equal to one) can be used as the default weighting, and as default similarity measures, Tversky's ratio model is used for qualitative types and Euclidean distance for quantitative types. The range information needed to normalize the similarity index for quantitative attributes can easily be retrieved from a given data set by scanning the column vectors of the quantitative attributes. The clustering tool takes the constructed similarity measure and the extended data set as its input and applies a clustering algorithm chosen by the user, such as nearest-neighbor clustering (Everitt93), to the extended data set.
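As a rough illustration of how the clustering tool could consume such a measure, the sketch below uses SciPy's single-linkage routine as a stand-in for the nearest-neighbor clustering mentioned above; `objects` and `similarity` are assumed to be an extended data set and a unified similarity function like the ones sketched earlier.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_objects(objects, similarity, k):
    """Group objects into k clusters, given a pairwise similarity function."""
    n = len(objects)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - similarity(objects[i], objects[j])   # similarity -> distance
            dist[i, j] = dist[j, i] = d
    # Single linkage serves here as a stand-in for nearest-neighbor clustering.
    Z = linkage(squareform(dist), method="single")
    return fcluster(Z, t=k, criterion="maxclust")          # one label per object

# Hypothetical usage, with persons and a unified similarity function in scope:
# labels = cluster_objects(persons, unified_similarity_fn, k=3)
```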

Finally, MASSON (Ryu96a) takes only the object IDs of the objects in each cluster and returns a set of discovered queries that describe the commonalities of the set of objects in the given cluster. MASSON is a query discovery system that uses database queries as its rule representation language (Ryu96b). It discovers a set of discriminant queries, i.e., queries that describe only the given set of objects in a cluster and no objects in any other cluster, in structured databases (Ryu98) using genetic programming (Koza90).

SUMMARY AND CONCLUSION
In this paper, we analyzed the problems of generating a single flat file to represent data sets extracted from structured databases and pointed out that this format is inappropriate for representing related information, a fact that has frequently been overlooked by recent data mining research. To overcome these difficulties, we introduced a better representation scheme, called the extended data set, which allows the attributes of an object to hold a bag of values, and discussed how existing similarity measures for single-valued attributes can be generalized to measure group similarity for extended data sets in clustering. We also proposed a unified framework for similarity measures that copes with extended data sets of mixed types by extending Gower's work. Once the target database has been grouped into clusters with similar properties, the discriminant query discovery system MASSON can discover useful characteristic information for the set of objects that belong to each cluster. We claim that the proposed representation scheme is well suited to handling related information and that it is more expressive than the traditional single flat file format. More importantly, the relationship information in a structured database is actually taken into account in the clustering process.

REFERENCES
Everitt, B.S. (1993). Cluster Analysis, 3rd edition, Edward Arnold; co-published by Halsted Press, an imprint of John Wiley & Sons Inc.
Gower, J.C. (1971). A general coefficient of similarity and some of its properties, Biometrics 27, 857-872.
Koza, J.R. (1990). Genetic Programming: On the Programming of Computers by Means of Natural Selection, Cambridge, MA: The MIT Press.
Quinlan, J. (1993). C4.5: Programs for Machine Learning, San Mateo, CA: Morgan Kaufmann.
Ribeiro, J.S., Kaufmann, K., and Kerschberg, L. (1995). Knowledge Discovery from Multiple Databases, in Proceedings of the 1st Int'l Conference on Knowledge Discovery and Data Mining, Montreal, Quebec.
Ryu, T.W. and Eick, C.F. (1996a). Deriving Queries from Results using Genetic Programming, in Proceedings of the 2nd Int'l Conference on Knowledge Discovery and Data Mining, Portland, Oregon.
Ryu, T.W. and Eick, C.F. (1996b). MASSON: Discovering Commonalities in Collection of Objects using Genetic Programming, in Proceedings of the Genetic Programming 1996 Conference, Stanford University, San Francisco.
Ryu, T.W. and Eick, C.F. (1998). Automated Discovery of Discriminant Rules for a Group of Objects in Databases, in Conference on Automated Learning and Discovery, Carnegie Mellon University, Pittsburgh, PA, June 11-13.
Thompson, K. and Langley, P. (1991). Concept Formation in Structured Domains, in Concept Formation: Knowledge and Experience in Unsupervised Learning, Fisher, D.H., Pazzani, M., and Langley, P. (eds.), Morgan Kaufmann.
Tversky, A. (1977). Features of Similarity, Psychological Review, 84(4): 327-352.
