
Artificial Binary Data Scenarios Sara Dolnicar Friedrich Leisch Andreas Weingessel Working Paper No. 20 September 1998


SFB ‘Adaptive Information Systems and Modelling in Economics and Management Science’ Vienna University of Economics and Business Administration Augasse 2–6, 1090 Wien, Austria in cooperation with University of Vienna Vienna University of Technology http://www.wu-wien.ac.at/am

This piece of research was supported by the Austrian Science Foundation (FWF) under grant SFB#010 (‘Adaptive Information Systems and Modelling in Economics and Management Science’).

Artificial Binary Data Scenarios

Sara Dolnicar*, Friedrich Leisch†, Andreas Weingessel†

1 Introduction

In order to evaluate the performance of different algorithms for data segmentation, data sets have to be used where the "correct" solution is known in advance. Since the data generating process of real-world data sets is not known, artificial data with a predefined structure have to be generated. In Dolnicar et al. (1998b) various artificial data scenarios based on experience with empirical data sets were defined. They were used to compare the performance of several "classical" cluster algorithms. In Dolnicar et al. (1998a) some new cluster algorithms were applied to these data.

In this manual we describe the data sets used in these experiments together with some new data sets which complete the range of provided scenarios. Our goal is to set up a benchmark set of scenarios of differing difficulty which can be used by members of the SFB and other researchers to evaluate their cluster algorithms and compare them with previous results.

The data sets described in this manual are available as packages for R (Splus) and as ASCII files under http://www.ci.tuwien.ac.at/SFB/. The descriptions of the scenarios (see Section 4) are written in the R help format language and can be converted to LaTeX (as in this manual) or HTML. Algorithms for generating artificial data scenarios are described in Leisch et al. (1998b).

* Institut für Tourismus und Freizeitwirtschaft, Wirtschaftsuniversität Wien, Augasse 2-6, A-1090 Wien, Austria. email: [email protected], http://www.wu-wien.ac.at/inst/tourism/locale.html
† Institut für Statistik, Wahrscheinlichkeitstheorie und Versicherungsmathematik, Technische Universität Wien, Wiedner Hauptstraße 8-10/1071, A-1040 Wien, Austria. email: firstname.lastname@ci.tuwien.ac.at, http://www.ci.tuwien.ac.at

2 Description of the Scenarios

The basic Scenario 1 consists of 12 binary variables which model answers to the questions of a questionnaire. These questions are grouped in 4 groups of 3 variables. Each group corresponds to one latent variable (which could, for example, describe the general interest in cultural activities during holidays) which is represented by 3 manifest variables (like interest in museums, theaters, and opera). In the data there are 6 types of 1000 data points each which model different answer behaviors. Each type has a high probability (0.8) of answering "yes" to the questions of 2 latent variables and a low probability (0.2) for the 2 other latent variables. This scenario is "easy" in the sense that no cluster algorithm we have considered thus far had difficulty finding the 6 clusters therein.

The basic scenario can be made more difficult (and thus more realistic) by varying the following parameters:

- Size of the clusters (Scenarios 5 & 8)
- Different number of manifest variables per latent variable (Scenario 2)
- Smaller difference between the "high" and "low" probabilities (Scenario 3)
- Asymmetric distribution of 0's and 1's (Scenarios 6 & 7)

Scenario 9 models the observation that in real-world data two common "yes" answers provide more similarity between two persons than two common "no" answers (Leisch et al., 1998a). Finally, Scenario 0 is random with no predefined structure. It can be used as a test scenario for algorithms which predict the number of clusters in a data set.

In all these scenarios each variable has been modeled independently. The dependence between manifest variables belonging to one latent variable is only modeled by the mixture of the different types. For Scenarios 1-3, 5, & 6 there are also data sets with a high correlation between the variables belonging to one latent variable. Scenario 4 is only generated in the dependent form; it is the same as Scenario 1, but with a different correlation structure. Table 1 gives an overview of all scenarios and shows whether there is a dependent or independent (or both) version of them.

Name        Description                        indep.  dep.
Scenario 0  Random Scenario                    X
Scenario 1  Basic Scenario                     X       X
Scenario 2  Unequal Latent Variable Scenario   X       X
Scenario 3  Medium Importance Scenario         X       X
Scenario 4  Bad Indicator Scenario                     X
Scenario 5  Niche Segment Scenario             X       X
Scenario 6  Answer Tendencies Scenario         X       X
Scenario 7  Asymmetric Scenario                X
Scenario 8  Extreme Segment Size Scenario      X
Scenario 9  Important "Yes" Scenario           X

Table 1: The Scenarios

The variable names in R and the filenames are of the form scen.x for the independent data sets and scendep.x for the dependent ones, where x is the number of the scenario. In Dolnicar et al. (1998a,b) slightly different names of the scenarios were used; Table 2 gives the relation between the old and new names.

Old   New
1a    1
1b    5
3a    3
3b    6

Table 2: Old/New Scenario Names

3 Results

Tables 4-7 show the results of several cluster algorithms applied to Scenarios 1-6. Table 3 lists these cluster algorithms. The values given in the tables are the number of classes found and the number of those clusters which were never found in 10 runs. "class. rate map." gives the classification rate computed only for the cluster centers which have been found; "class. rate all" gives the classification rate in percent of all data. "(center-type)^2" gives the Euclidean distance from the center to the specified mean values, and "comp. range" describes the variation of the resulting prototypes. The values are given in the form minimum/average/maximum. More details about the algorithms and a discussion of the results can be found in Dolnicar et al. (1998a,b).

HCL-ED    Hard Competitive Learning, Euclidean Distance
HCL-AD    Hard Competitive Learning, Absolute Distance
k-means   K-Means
NGAS-ED   Neural Gas, Euclidean Distance
NGAS-AD   Neural Gas, Absolute Distance
SOM       Self Organizing Maps
TRN       Topology Representing Network
HCL-BD    Hard Competitive Learning, Binary Distance
st-1      Improved Fixpoint Method, Euclidean Distance
st-2      Improved Fixpoint Method, f(.) = ||.||^1.4
st-3      Improved Fixpoint Method, Absolute Distance
st-4      Improved Fixpoint Method, f(.) = ln(cosh(||.||))

Table 3: Cluster Algorithms Used

4 The Scenarios

In the following we give a description of all the scenarios we have generated thus far. After a general description of the scenario, we give two empirical examples in which the structure of this scenario could be found in a real-world data set. In a summary section we give some basic information such as the number of cases, the number of variables, how the manifest variables correspond to the latent variables, and the number of classes (= clusters) the scenario is made of. The Bayes classification rate is the optimal classification rate for the corresponding data generating process under the assumption that the class structure is known. It serves as a measure of the difficulty of a scenario; see Dolnicar et al. (1998b) for details.

The section Class Distribution gives the sizes of the classes. Finally, we give a table of the probabilities that a certain variable is 1. For the independent data sets these are the probabilities of the data generating process; for the dependent data sets these are, for computational reasons, the mean values of the generated data. The variables are named LxIy, where x gives the number of the latent variable and y gives the number of the manifest variable for the particular latent variable.
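For the independent data sets, the Bayes classification rate described above can be computed exactly by enumerating all binary answer patterns. The following Python sketch illustrates this for Scenario 1, under the assumption that the variables are independent Bernoulli draws given the class and that the class priors are the relative class sizes. The function name bayes_rate and the constants are illustrative and not part of the R packages.

```python
from itertools import product

# P(latent variable answered "yes") per type for Scenario 1 (illustrative copy
# of the Types table; each latent variable has three manifest variables).
LATENT_PATTERN = [
    (0.8, 0.8, 0.2, 0.2),
    (0.2, 0.2, 0.8, 0.8),
    (0.2, 0.8, 0.8, 0.2),
    (0.8, 0.2, 0.2, 0.8),
    (0.2, 0.8, 0.2, 0.8),
    (0.8, 0.2, 0.8, 0.2),
]
TYPES = [[p for p in row for _ in range(3)] for row in LATENT_PATTERN]


def bayes_rate(types, priors):
    """Exact Bayes rate: sum over all patterns x of max_k prior_k * P(x | k)."""
    n_vars = len(types[0])
    total = 0.0
    for x in product((0, 1), repeat=n_vars):
        best = 0.0
        for prior, probs in zip(priors, types):
            joint = prior
            for xi, p in zip(x, probs):
                joint *= p if xi else 1.0 - p
            best = max(best, joint)
        total += best
    return total


rate = bayes_rate(TYPES, [1 / 6] * 6)
print(f"Bayes classification rate, Scenario 1: {rate:.4f}")
```

With 12 variables only 4096 patterns have to be visited, so full enumeration is cheap. The result comes out at roughly 0.83, in line with the rate listed in the Summary of Scenario 1. For the dependent data sets this sketch does not apply, because the variables are no longer independent within a class.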


scen.0

Scenario 0: Random Scenario

Description

The Random Scenario is not the result of modeling empirical data. The segment memberships of the respondents (cases) are determined at random.

Data generation: no correlations between variables modeled (see Working Paper # 7 for details)

Summary

Type of variables:                       binary
Number of cases:                         6000
Number of variables:                     12
Number of classes:                       6
Number of latent variables:              1
Manifest variables per latent variable:  12
Bayes classification rate:

Class Distribution

Class Nr.:         1     2     3     4     5     6
Number of cases:   1000  1000  1000  1000  1000  1000

Types

(mean values of data generating distribution)

Type  L1I1  L1I2  L1I3  L1I4  L1I5  L1I6  L1I7  L1I8  L1I9  L1I10  L1I11  L1I12
1     0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5    0.5    0.5
2     0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5    0.5    0.5
3     0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5    0.5    0.5
4     0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5    0.5    0.5
5     0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5    0.5    0.5
6     0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5    0.5    0.5

scen.1

Scenario 1: Basic Scenario

Description

The basic scenario is not based on typical experiences from empirical data sets. It is completely symmetric, with six different variables above and six below the average value of the entire data set. Furthermore, the same number of individuals (cases) is generated in each segment.

Data generation: no correlations between variables modeled (see Working Paper # 7 for details)

Empirical Examples

Marketing research example

In a survey among buyers of transistor radios, 6000 customers or potential customers were questioned about how important certain features are to them (e.g. technical features, color, size, ...). The goal of the study is to identify groups of customers with the same preferences that can be addressed as homogeneous market segments by the marketing management (for product design, product modification, advertising, ...). The questionnaire contains 12 questions (features of transistor radios). Six segments of customers with equal segment size are known to exist.

Service marketing research example

In a survey among tourists visiting Austria, 6000 travelers were questioned about how important certain vacation aspects are to them (e.g. security, comfort, landscape, ...). The goal of the study is to identify groups of tourists with the same vacation importances/expectations that can be addressed as homogeneous market segments by the destination management. The questionnaire contains 12 questions (vacation aspects). Six segments of travelers with equal segment size are known to exist.

Summary


Type of variables:                       binary
Number of cases:                         6000
Number of variables:                     12
Number of classes:                       6
Number of latent variables:              4
Manifest variables per latent variable:  3-3-3-3
Bayes classification rate:               82.98%

Class Distribution

Class Nr.:         1     2     3     4     5     6
Number of cases:   1000  1000  1000  1000  1000  1000

Types

(mean values of data generating distribution)

Type  L1I1  L1I2  L1I3  L2I1  L2I2  L2I3  L3I1  L3I2  L3I3  L4I1  L4I2  L4I3
1     0.8   0.8   0.8   0.8   0.8   0.8   0.2   0.2   0.2   0.2   0.2   0.2
2     0.2   0.2   0.2   0.2   0.2   0.2   0.8   0.8   0.8   0.8   0.8   0.8
3     0.2   0.2   0.2   0.8   0.8   0.8   0.8   0.8   0.8   0.2   0.2   0.2
4     0.8   0.8   0.8   0.2   0.2   0.2   0.2   0.2   0.2   0.8   0.8   0.8
5     0.2   0.2   0.2   0.8   0.8   0.8   0.2   0.2   0.2   0.8   0.8   0.8
6     0.8   0.8   0.8   0.2   0.2   0.2   0.8   0.8   0.8   0.2   0.2   0.2
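A data set with the structure of the independent Scenario 1 can be simulated directly from the type probabilities. The following Python sketch is one possible way to do this and is not the generator of Leisch et al. (1998b); the function name, the seed, and the list-of-lists data layout are assumptions made for illustration.

```python
import random

# P(answer = 1) per type and latent variable, copied from the Types table.
LATENT_PATTERN = [
    (0.8, 0.8, 0.2, 0.2),
    (0.2, 0.2, 0.8, 0.8),
    (0.2, 0.8, 0.8, 0.2),
    (0.8, 0.2, 0.2, 0.8),
    (0.2, 0.8, 0.2, 0.8),
    (0.8, 0.2, 0.8, 0.2),
]


def generate_scenario_1(cases_per_class=1000, seed=1998):
    """Draw every variable independently as Bernoulli(p) given the class."""
    rng = random.Random(seed)
    data, labels = [], []
    for class_nr, row in enumerate(LATENT_PATTERN, start=1):
        # Three manifest variables per latent variable share one probability.
        probs = [p for p in row for _ in range(3)]
        for _ in range(cases_per_class):
            data.append([1 if rng.random() < p else 0 for p in probs])
            labels.append(class_nr)
    return data, labels


data, labels = generate_scenario_1()
class_1 = [x for x, k in zip(data, labels) if k == 1]
mean_l1i1 = sum(x[0] for x in class_1) / len(class_1)
print(len(data), len(data[0]), round(mean_l1i1, 2))
```

The empirical class means of such a sample scatter around 0.8 and 0.2. Any correlation between manifest variables of one latent variable arises only through the mixture of the six types, as described in Section 2.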

scendep.1

Scenario 1 (Dependent): Basic Scenario (Dependent)

Description

The basic scenario is not based on typical experiences from empirical data sets. It is completely symmetric, with six different variables above and six below the average value of the entire data set. Furthermore, the same number of individuals (cases) is generated in each segment.

Data generation: autologistic model (correlations between variables modeled, see Working Paper # 7 for details)

Empirical Examples

Marketing research example

In a survey among buyers of transistor radios, 6000 customers or potential customers were questioned about how important certain features are to them (e.g. technical features, color, size, ...). The goal of the study is to identify groups of customers with the same preferences that can be addressed as homogeneous market segments by the marketing management (for product design, product modification, advertising, ...). The questionnaire contains 12 questions (features of transistor radios), where groups of variables represent a latent dimension and are therefore answered in the same way (e.g. color, size of buttons, and design are either all important to an individual or not; these variables could all be part of a more complex latent variable called "looks"). Here we suppose that every factor is represented by three questions. The variables within each group are correlated with each other. Six segments of customers with equal segment size are known to exist.

Service marketing research example

In a survey among tourists visiting Austria, 6000 travelers were questioned about how important certain vacation aspects are to them (e.g. security, comfort, landscape, ...). The goal of the study is to identify groups of tourists with the same vacation importances/expectations that can be addressed as homogeneous market segments by the destination management. The questionnaire contains 12 questions (vacation aspects), where groups of variables represent a latent dimension and are therefore answered in the same way (e.g. the variables landscape, environmental protection, and fresh air are either all important to an individual or not; these variables could all be part of a more complex latent variable called "nature"). Here we suppose that every factor is represented by three questions. The variables within each group are correlated with each other. Six equally sized segments of travelers are known to exist.

Summary


Type of variables:                       binary
Number of cases:                         6000
Number of variables:                     12
Number of classes:                       6
Number of latent variables:              4
Manifest variables per latent variable:  3-3-3-3
Bayes classification rate:               69.52%

Class Distribution

Class Nr.:         1     2     3     4     5     6
Number of cases:   1000  1000  1000  1000  1000  1000

Types

(mean values of classes)

Type  L1I1  L1I2  L1I3  L2I1  L2I2  L2I3  L3I1  L3I2  L3I3  L4I1  L4I2  L4I3
1     0.84  0.85  0.86  0.84  0.86  0.86  0.17  0.18  0.20  0.19  0.19  0.20
2     0.18  0.19  0.18  0.18  0.17  0.18  0.85  0.85  0.86  0.83  0.84  0.85
3     0.20  0.18  0.19  0.86  0.86  0.86  0.85  0.85  0.85  0.19  0.18  0.19
4     0.84  0.84  0.85  0.19  0.18  0.18  0.19  0.17  0.19  0.84  0.85  0.84
5     0.19  0.18  0.18  0.85  0.85  0.86  0.18  0.20  0.20  0.85  0.85  0.87
6     0.85  0.86  0.85  0.17  0.17  0.17  0.83  0.85  0.84  0.18  0.19  0.20

scen.2

Scenario 2: Unequal Latent Variable Scenario

Description

The Unequal Latent Variable Scenario is equivalent to the Basic Scenario as far as symmetry of variables and equality of segment sizes are concerned. As is the case with the Basic Scenario, groups of variables represent latent variables (or factors). In the Basic Scenario, three variables serve as indicators of each latent variable. In the Unequal Latent Variable Scenario, the size of the homogeneous variable groups ranges from five variables loading on one latent variable to one single variable representing the factor.

Data generation: no correlations between variables modeled (see Working Paper # 7 for details)

Empirical Examples

Marketing research example

In a survey among buyers of transistor radios, 6000 customers or potential customers were questioned about how important certain features are to them (e.g. technical features, color, size, ...). The goal of the study is to identify groups of customers with the same preferences that can be addressed as homogeneous market segments by the marketing management (for product design, product modification, advertising, ...). The questionnaire contains 12 questions (features of transistor radios), where groups of variables represent a latent dimension and are therefore answered in the same way (e.g. color, size of buttons, and design are either all important to an individual or not; these variables could all be part of a more complex latent variable called "looks"). As is often the case with empirical data, some latent variables are covered by more of the questions actually posed and some by fewer. Here we suppose that e.g. factor analysis indicated that five questions are caused by latent variable 1, four by latent variable 2, two by latent variable 3, and finally latent variable 4 is represented by only one single item of the questionnaire. Six equally sized segments of customers are known to exist.

Service marketing research example

In a survey among tourists visiting Austria, 6000 travelers were questioned about how important certain vacation aspects are to them (e.g. security, comfort, landscape, ...). The goal of the study is to identify groups of tourists with the same vacation importances/expectations that can be addressed as homogeneous market segments by the destination management. The questionnaire contains 12 questions (vacation aspects), where groups of variables represent a latent dimension and are therefore answered in the same way (e.g. the variables landscape, environmental protection, and fresh air are either all important to an individual or not; these variables could all be part of a more complex latent variable called "nature"). As is often the case with empirical data, some latent variables are covered by more of the questions actually posed and some by fewer. Here we suppose that e.g. factor analysis indicated that five questions are caused by latent variable 1, four by latent variable 2, two by latent variable 3, and finally latent variable 4 is represented by only one single item of the questionnaire. Six equally sized segments of travelers are known to exist.


Summary


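Type of variables:                       binary
Number of cases:                         6000
Number of variables:                     12
Number of classes:                       6
Number of latent variables:              4
Manifest variables per latent variable:  5-4-2-1
Bayes classification rate:               82.99%

Class Distribution

Class Nr.:         1     2     3     4     5     6
Number of cases:   1000  1000  1000  1000  1000  1000

Types

Type  L2I2  L2I3  L2I4  L3I1  L3I2  L4I1
1     0.8   0.8   0.8   0.2   0.2   0.2
2     0.2   0.2   0.2   0.8   0.8   0.8
3     0.8   0.8   0.8   0.8   0.8   0.2
4     0.2   0.2   0.2   0.2   0.2   0.8
5     0.8   0.8   0.8   0.2   0.2   0.8
6     0.2   0.2   0.2   0.8   0.8   0.2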
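The 5-4-2-1 layout of manifest variables and the LxIy naming scheme can be expanded mechanically from a latent pattern. The Python sketch below shows one way to do so. It assumes that all manifest variables of one latent variable share the same probability and that latent variable 1 follows the same high/low pattern as in the Basic Scenario; the function names are illustrative.

```python
# Manifest variables per latent variable in the Unequal Latent Variable Scenario.
GROUP_SIZES = [5, 4, 2, 1]

# P(answer = 1) per type and latent variable. The pattern of latent variable 1
# is an assumption taken over from the Basic Scenario.
LATENT_PATTERN = [
    (0.8, 0.8, 0.2, 0.2),
    (0.2, 0.2, 0.8, 0.8),
    (0.2, 0.8, 0.8, 0.2),
    (0.8, 0.2, 0.2, 0.8),
    (0.2, 0.8, 0.2, 0.8),
    (0.8, 0.2, 0.8, 0.2),
]


def variable_names(group_sizes):
    """Build names LxIy: x = latent variable, y = manifest variable within it."""
    return [
        f"L{x}I{y}"
        for x, size in enumerate(group_sizes, start=1)
        for y in range(1, size + 1)
    ]


def expand_types(latent_pattern, group_sizes):
    """Repeat each latent probability once per manifest variable of its group."""
    names = variable_names(group_sizes)
    rows = []
    for row in latent_pattern:
        probs = [p for p, size in zip(row, group_sizes) for _ in range(size)]
        rows.append(dict(zip(names, probs)))
    return rows


types = expand_types(LATENT_PATTERN, GROUP_SIZES)
print(variable_names(GROUP_SIZES))
print(types[2]["L3I2"], types[3]["L4I1"])
```

The same helper covers the 3-3-3-3 layout of the other scenarios by passing [3, 3, 3, 3] as group sizes.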

scendep.2

Scenario 2 (Dependent): Unequal Latent Variable Scenario (Dependent)

Description

The Unequal Latent Variable Scenario is equivalent to the Basic Scenario as far as symmetry of variables and equality of segment sizes are concerned. As is the case with the Basic Scenario, groups of variables represent latent variables (or factors). In the Basic Scenario, three variables serve as indicators of each latent variable. In the Unequal Latent Variable Scenario, the size of the homogeneous variable groups ranges from five variables loading on one latent variable to one single variable representing the factor.

Data generation: autologistic model (correlations between variables modeled, see Working Paper # 7 for details)

Empirical Examples

Marketing research example

In a survey among buyers of transistor radios, 6000 customers or potential customers were questioned about how important certain features are to them (e.g. technical features, color, size, ...). The goal of the study is to identify groups of customers with the same preferences that can be addressed as homogeneous market segments by the marketing management (for product design, product modification, advertising, ...). The questionnaire contains 12 questions (features of transistor radios), where groups of variables represent a latent dimension and are therefore answered in the same way (e.g. color, size of buttons, and design are either all important to an individual or not; these variables could all be part of a more complex latent variable called "looks"). The variables within each group are correlated with each other. As is often the case with empirical data, some latent variables are covered by more of the questions actually posed and some by fewer. Here we suppose that e.g. factor analysis indicated that five questions are caused by latent variable 1, four by latent variable 2, two by latent variable 3, and finally latent variable 4 is represented by only one single item of the questionnaire. Six segments of customers with equal segment size are known to exist.

Service marketing research example

In a survey among tourists visiting Austria, 6000 travelers were questioned about how important certain vacation aspects are to them (e.g. security, comfort, landscape, ...). The goal of the study is to identify groups of tourists with the same vacation importances/expectations that can be addressed as homogeneous market segments by the destination management. The questionnaire contains 12 questions (vacation aspects), where groups of variables represent a latent dimension and are therefore answered in the same way (e.g. the variables landscape, environmental protection, and fresh air are either all important to an individual or not; these variables could all be part of a more complex latent variable called "nature"). The variables within each group are correlated with each other. As is often the case with empirical data, some latent variables are covered by more of the questions actually posed and some by fewer. Here we suppose that e.g. factor analysis indicated that five questions are caused by latent variable 1, four by latent variable 2, two by latent variable 3, and finally latent variable 4 is represented by only one single item of the questionnaire. Six equally sized segments of travelers are known to exist.


Summary


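Type of variables:                       binary
Number of cases:                         6000
Number of variables:                     12
Number of classes:                       6
Number of latent variables:              4
Manifest variables per latent variable:  5-4-2-1
Bayes classification rate:               69.88%

Class Distribution

Class Nr.:         1     2     3     4     5     6
Number of cases:   1000  1000  1000  1000  1000  1000

Types

Type  L1I4  L1I5  L2I1  L2I2  L2I3  L2I4  L3I1  L3I2  L4I1
1     0.83  0.81  0.84  0.87  0.87  0.87  0.17  0.18  0.16
2     0.15  0.12  0.11  0.13  0.16  0.16  0.81  0.82  0.81
3     0.17  0.12  0.83  0.85  0.85  0.86  0.81  0.82  0.16
4     0.84  0.82  0.10  0.15  0.16  0.16  0.16  0.16  0.83
5     0.17  0.12  0.83  0.87  0.87  0.87  0.18  0.17  0.85
6     0.83  0.81  0.10  0.12  0.14  0.17  0.82  0.84  0.16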

scen.3

Scenario 3: Medium Importance Scenario

Description

The Medium Importance Scenario certainly comes closest to reality. Symmetry and equal segment size are retained from the Basic Scenario, but the restriction of variables to either below- or above-average answers is abandoned. Each segment rates six (different) variables higher than the average (of the entire data), but the remaining variables do not aid the distinction by being far below average.

Data generation: no correlations between variables modeled (see Working Paper # 7 for details)

Empirical Examples

Marketing research example

In a survey among buyers of transistor radios, 6000 customers or potential customers were questioned about how important certain features are to them (e.g. technical features, color, size, ...). The goal of the study is to identify groups of customers with the same preferences that can be addressed as homogeneous market segments by the marketing management (for product design, product modification, advertising, ...). The questionnaire contains 12 questions (features of transistor radios). Six segments of customers with equal segment size are known to exist, with each group attaching more than average importance to 6 features, while attaching medium importance to the remaining attributes of transistor radios (as opposed to low importance statements in the Basic Scenario). Analysis might show (to invoke a stereotype) that heavy metal listeners have some expectations about all features, but that only technical attributes such as loudness and the possibility to adjust the bass level are crucial to them. This assumption is more realistic than believing that every respondent answers in an extreme manner (don't care at all / very important) to all attributes listed.

Service marketing research example

In a survey among tourists visiting Austria, 6000 travelers were questioned about how important certain vacation aspects are to them (e.g. security, comfort, landscape, ...). The goal of the study is to identify groups of tourists with the same vacation importances/expectations that can be addressed as homogeneous market segments by the destination management. The questionnaire contains 12 questions (vacation aspects). Six segments of travelers with equal segment size are known to exist, with each group attaching more than average importance to 6 questions, while attaching medium importance to the remaining travel aspects (as opposed to low importance statements in the Basic Scenario). For example, the analysis could show that culture tourists do not attach very high importance to comfort and relaxation, but that cultural offers and interesting sights are crucial variables seen as extremely important for their stay in Austria. This assumption is more realistic than believing that every respondent answers in an extreme manner (don't care at all / very important) to all vacation aspects listed.


Summary


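Type of variables:                       binary
Number of cases:                         6000
Number of variables:                     12
Number of classes:                       6
Number of latent variables:              4
Manifest variables per latent variable:  3-3-3-3
Bayes classification rate:               48.93%

Class Distribution

Class Nr.:         1     2     3     4     5     6
Number of cases:   1000  1000  1000  1000  1000  1000

Types

Type  L3I1  L3I2  L3I3  L4I1  L4I2  L4I3
1     0.5   0.5   0.5   0.5   0.5   0.5
2     0.8   0.8   0.8   0.8   0.8   0.8
3     0.8   0.8   0.8   0.5   0.5   0.5
4     0.5   0.5   0.5   0.8   0.8   0.8
5     0.5   0.5   0.5   0.8   0.8   0.8
6     0.8   0.8   0.8   0.5   0.5   0.5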

scendep.3

Scenario 3 (Dependent): Medium Importance Scenario (Dependent)

Description

The Medium Importance Scenario certainly comes closest to reality. Symmetry and equal segment size are retained from the Basic Scenario, but the restriction of variables to either below- or above-average answers is abandoned. Each segment rates six (different) variables higher than the average (of the entire data), but the remaining variables do not aid the distinction with far below-average values.

Data generation: autologistic model (correlations between variables modeled, see Working Paper # 7 for details)

Empirical Examples

Marketing research example

In a survey among buyers of transistor radios, 6000 customers or potential customers were questioned about how important certain features are to them (e.g. technical features, color, size, ...). The goal of the study is to identify groups of customers with the same preferences that can be addressed as homogeneous market segments by the marketing management (for product design, product modification, advertising, ...). The questionnaire contains 12 questions (features of transistor radios), where groups of variables represent a latent dimension and are therefore answered in the same way (e.g. color, size of buttons, and design are either all important to an individual or not; these variables could all be part of a more complex latent variable called "looks"). Here we suppose that every factor is represented by three questions. The variables within each group are correlated with each other. Six segments of customers with equal segment size are known to exist, with each group attaching more than average importance to 6 questions, while attaching medium importance to the remaining transistor radio attributes (as opposed to low importance statements in the Basic Scenario).

Service marketing research example

In a survey among tourists visiting Austria, 6000 travelers were questioned about how important certain vacation aspects are to them (e.g. security, comfort, landscape, ...). The goal of the study is to identify groups of tourists with the same vacation importances/expectations that can be addressed as homogeneous market segments by the destination management. The questionnaire contains 12 questions (vacation aspects), where groups of variables represent a latent dimension and are therefore answered in the same way (e.g. the variables landscape, environmental protection, and fresh air are either all important to an individual or not; these variables could all be part of a more complex latent variable called "nature"). Here we suppose that every factor is represented by three questions. The variables within each group are correlated with each other. Six equally sized segments of travelers are known to exist, with each group attaching more than average importance to 6 questions, while attaching medium importance to the remaining travel aspects (as opposed to low importance statements in the Basic Scenario).


Summary


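Type of variables:                       binary
Number of cases:                         6000
Number of variables:                     12
Number of classes:                       6
Number of latent variables:              4
Manifest variables per latent variable:  3-3-3-3
Bayes classification rate:               41.87%

Class Distribution

Class Nr.:         1     2     3     4     5     6
Number of cases:   1000  1000  1000  1000  1000  1000

Types

Type  L2I1  L2I2  L2I3  L3I1  L3I2  L3I3  L4I1  L4I2  L4I3
1     0.84  0.83  0.83  0.46  0.48  0.51  0.49  0.50  0.51
2     0.48  0.49  0.50  0.84  0.85  0.85  0.84  0.83  0.83
3     0.84  0.85  0.83  0.84  0.83  0.84  0.47  0.49  0.50
4     0.46  0.48  0.49  0.49  0.50  0.52  0.85  0.86  0.86
5     0.83  0.85  0.84  0.48  0.49  0.50  0.83  0.83  0.85
6     0.49  0.51  0.51  0.83  0.83  0.83  0.46  0.49  0.51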

scendep.4

Scenario 4 (Dependent): Bad Indicator Scenario (Dependent)

Description

All dependent scenarios model correlation between those variables that result from the same latent variable. In the case of the Bad Indicator Scenario, one of the variables within each group of three is correlated with the remaining pair of items to a lower extent.

Data generation: autologistic model (correlations between variables modeled, see Working Paper # 7 for details)

Empirical Examples

Marketing research example

In a survey among buyers of transistor radios, 6000 customers or potential customers were questioned about how important certain features are to them (e.g. technical features, color, size, ...). The goal of the study is to identify groups of customers with the same preferences that can be addressed as homogeneous market segments by the marketing management (for product design, product modification, advertising, ...). The questionnaire contains 12 questions (features of transistor radios), where groups of variables represent a latent dimension and are therefore answered in the same way (e.g. color, size of buttons, and design are either all important to an individual or not; these variables could all be part of a more complex latent variable called "looks"). Here we suppose that every factor is represented by three questions. The variables within each group are correlated with each other, but one variable is correlated to a lower extent, indicating that the question does not represent the underlying latent variable as well as the remaining two questions do (e.g. color and design represent the "look of the radio" better than the button size does). Six segments of customers with equal segment size are known to exist.

Service marketing research example

In a survey among tourists visiting Austria, 6000 travelers were questioned about how important certain vacation aspects are to them (e.g. security, comfort, landscape, ...). The goal of the study is to identify groups of tourists with the same vacation importances/expectations that can be addressed as homogeneous market segments by the destination management. The questionnaire contains 12 questions (vacation aspects), where groups of variables represent a latent dimension and are therefore answered in the same way (e.g. the variables landscape, environmental protection, and fresh air are either all important to an individual or not; these variables could all be part of a more complex latent variable called "nature"). Here we suppose that every factor is represented by three questions. The variables within each group are correlated with each other, but one variable is correlated to a lower extent, indicating that the question does not represent the underlying latent variable as well as the remaining two questions do (e.g. fresh air and landscape represent "nature" better than environmental protection does). Six equally sized segments of travelers are known to exist.


Summary


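Type of variables:                       binary
Number of cases:                         6000
Number of variables:                     12
Number of classes:                       6
Number of latent variables:              4
Manifest variables per latent variable:  3-3-3-3
Bayes classification rate:               81.2%

Class Distribution

Class Nr.:         1     2     3     4     5     6
Number of cases:   1000  1000  1000  1000  1000  1000

Types

Type  L2I1  L2I2  L2I3  L3I1  L3I2  L3I3  L4I1  L4I2  L4I3
1     0.83  0.85  0.86  0.19  0.17  0.12  0.20  0.21  0.15
2     0.18  0.19  0.16  0.86  0.86  0.89  0.85  0.86  0.90
3     0.84  0.85  0.89  0.84  0.85  0.87  0.19  0.19  0.15
4     0.19  0.21  0.13  0.16  0.19  0.14  0.84  0.87  0.87
5     0.84  0.84  0.88  0.18  0.18  0.12  0.85  0.87  0.89
6     0.19  0.21  0.15  0.86  0.86  0.89  0.20  0.21  0.14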

scen.5

Scenario 5: Niche Segment Scenario

Description

The symmetry of above and below average variables is the same as for the Basic Scenario: In each segment generated six (dierent) variables are above / below average. As opposed to the Basic Scenario, the number of individuals (cases) varies over the segments, with the tiny segment number 2 representing a niche. Data generation: no correlations between variables modeled (see Working Paper # 7 for details) Empirical Examples Marketing research example

In a survey among buyers of transistor radios, 6000 customers or potential customers were questioned about how important certain features are to them (e.g. technical features, color, size, ...). The goal of the study is to identify groups of customers with the same preferences that can be addressed as homogeneous market segments by the marketing management (for product design, product modification, advertising, ...). The questionnaire contains 12 questions (features of transistor radios). Six segments of customers are known to exist. As is the case in reality, the size of these consumer segments varies.

Service marketing research example

In a survey among tourists visiting Austria, 6000 travelers were questioned about how important certain vacation aspects are to them (e.g. security, comfort, landscape, ...). The goal of the study is to identify groups of tourists with the same vacation importances / expectations that can be addressed as homogeneous market segments by the destination management. The questionnaire contains 12 questions (vacation aspects). Six segments of travelers are known to exist. As is the case in reality, the size of these consumer segments varies.

Summary


binary 6000 12 6 4 3-3-3-3 88.85%

Class Distribution


1 1000

2 300

3 700

4 3000

5 500

6 500

Types

1 2 3 4 5 6

L1I1 0.8 0.2 0.2 0.8 0.2 0.8

L1I2 0.8 0.2 0.2 0.8 0.2 0.8

L1I3 0.8 0.2 0.2 0.8 0.2 0.8

L2I1 0.8 0.2 0.8 0.2 0.8 0.2

L2I2 0.8 0.2 0.8 0.2 0.8 0.2

L2I3 0.8 0.2 0.8 0.2 0.8 0.2

L3I1 0.2 0.8 0.8 0.2 0.2 0.8

L3I2 0.2 0.8 0.8 0.2 0.2 0.8

L3I3 0.2 0.8 0.8 0.2 0.2 0.8

L4I1 0.2 0.8 0.2 0.8 0.8 0.2

L4I2 0.2 0.8 0.2 0.8 0.8 0.2

L4I3 0.2 0.8 0.2 0.8 0.8 0.2
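As a minimal sketch of how a data set with this structure could be drawn, the following Python code generates independent binary answers from the segment sizes and type means tabulated above. It is an illustration only and not the generator used for the published data sets (see Leisch et al., 1998b, for those algorithms); the function name `generate`, the constants and the seed are hypothetical.

```python
import random

# Segment sizes from the Class Distribution table of Scenario 5.
SIZES = [1000, 300, 700, 3000, 500, 500]

# One row per latent factor (L1..L4), one column per segment (1..6).
# In the Types table the three items of a factor share the same means.
FACTOR_MEANS = [
    [0.8, 0.2, 0.2, 0.8, 0.2, 0.8],  # L1
    [0.8, 0.2, 0.8, 0.2, 0.8, 0.2],  # L2
    [0.2, 0.8, 0.8, 0.2, 0.2, 0.8],  # L3
    [0.2, 0.8, 0.2, 0.8, 0.8, 0.2],  # L4
]
ITEMS_PER_FACTOR = 3


def generate(sizes, factor_means, items_per_factor, seed=0):
    """Draw independent Bernoulli answers for every case of every segment."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    data, labels = [], []
    for seg, n_cases in enumerate(sizes):
        # Expand factor means to item means: L1I1, L1I2, L1I3, L2I1, ...
        probs = [row[seg] for row in factor_means for _ in range(items_per_factor)]
        for _ in range(n_cases):
            data.append([1 if rng.random() < p else 0 for p in probs])
            labels.append(seg + 1)
    return data, labels


data, labels = generate(SIZES, FACTOR_MEANS, ITEMS_PER_FACTOR)

# Column means of the niche segment (number 2) should lie close to its type.
niche = [row for row, lab in zip(data, labels) if lab == 2]
niche_means = [sum(col) / len(niche) for col in zip(*niche)]
print([round(m, 2) for m in niche_means])
```

With only 300 cases, the column means of the niche segment scatter more widely around 0.2 and 0.8 than those of the large segment 4, which is one way to see why a niche is harder to recover.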

scendep.5

Scenario 5 (Dependent): Niche Segment Scenario (Dependent)

Description

The symmetry of above and below average variables is the same as for the Basic Scenario: in each generated segment, six (different) variables are above average and the remaining six are below average. As opposed to the Basic Scenario, the number of individuals (cases) varies over the segments, with the tiny segment number 2 representing a niche.

Data generation: autologistic model (correlations between variables modeled, see Working Paper No. 7 for details)

Empirical Examples

Marketing research example

In a survey among buyers of transistor radios, 6000 customers or potential customers were questioned about how important certain features are to them (e.g. technical features, color, size, ...). The goal of the study is to identify groups of customers with the same preferences that can be addressed as homogeneous market segments by the marketing management (for product design, product modification, advertising, ...). The questionnaire contains 12 questions (features of transistor radios), where groups of variables represent a latent dimension and are therefore answered in the same way (e.g. color, size of buttons, and design are either all important to an individual or they are not; these variables could all be part of a more complex latent variable called "looks"). Here we suppose that every factor is represented by three questions. The variables within each group are correlated with each other. Six segments of customers are known to exist. As is the case in reality, the size of these consumer segments varies.

Service marketing research example

In a survey among tourists visiting Austria, 6000 travelers were questioned about how important certain vacation aspects are to them (e.g. security, comfort, landscape, ...). The goal of the study is to identify groups of tourists with the same vacation importances / expectations that can be addressed as homogeneous market segments by the destination management. The questionnaire contains 12 questions (vacation aspects), where groups of variables represent a latent dimension and are therefore answered in the same way (e.g. the variables landscape, environmental protection, and fresh air are either all important to an individual or they are not; these variables could all be part of a more complex latent variable called "nature"). Here we suppose that every factor is represented by three questions. The variables within each group are correlated with each other. Six segments of travelers are known to exist. As is the case in reality, the size of these consumer segments varies.

Summary


binary 6000 12 6 4 3-3-3-3 78.23%

Class Distribution


1 1000

2 300

3 700

4 3000

5 500

6 500

Types


L2I1 0.84 0.20 0.84 0.17 0.83 0.18

L2I2 0.86 0.18 0.86 0.17 0.84 0.17

L2I3 0.86 0.22 0.85 0.18 0.84 0.20

L3I1 0.16 0.90 0.84 0.17 0.16 0.85

L3I2 0.16 0.90 0.84 0.18 0.16 0.84

L3I3 0.16 0.91 0.85 0.17 0.17 0.86

L4I1 0.17 0.87 0.18 0.85 0.85 0.16

L4I2 0.16 0.86 0.17 0.84 0.85 0.16

L4I3 0.15 0.85 0.19 0.85 0.86 0.14
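The tabulated means of the dependent scenario differ slightly from the nominal 0.8 / 0.2 pattern of the independent Scenario 5. The sketch below measures the largest deviation among the rows listed above. It assumes that the nominal pattern of Scenario 5 is the appropriate reference for the dependent variant; the names `NOMINAL`, `TABLED` and `largest_deviation` are hypothetical.

```python
# Nominal segment means per latent factor, taken from the Types table of
# the independent Scenario 5 (assumed to be the reference pattern here).
NOMINAL = {
    "L2": [0.8, 0.2, 0.8, 0.2, 0.8, 0.2],
    "L3": [0.2, 0.8, 0.8, 0.2, 0.2, 0.8],
    "L4": [0.2, 0.8, 0.2, 0.8, 0.8, 0.2],
}

# Means tabulated for the dependent Scenario 5 (rows available in this text).
TABLED = {
    "L2I1": [0.84, 0.20, 0.84, 0.17, 0.83, 0.18],
    "L2I2": [0.86, 0.18, 0.86, 0.17, 0.84, 0.17],
    "L2I3": [0.86, 0.22, 0.85, 0.18, 0.84, 0.20],
    "L3I1": [0.16, 0.90, 0.84, 0.17, 0.16, 0.85],
    "L3I2": [0.16, 0.90, 0.84, 0.18, 0.16, 0.84],
    "L3I3": [0.16, 0.91, 0.85, 0.17, 0.17, 0.86],
    "L4I1": [0.17, 0.87, 0.18, 0.85, 0.85, 0.16],
    "L4I2": [0.16, 0.86, 0.17, 0.84, 0.85, 0.16],
    "L4I3": [0.15, 0.85, 0.19, 0.85, 0.86, 0.14],
}


def largest_deviation(tabled, nominal):
    """Return (deviation, item, segment) for the largest absolute deviation."""
    best = (0.0, None, None)
    for item, row in tabled.items():
        reference = nominal[item[:2]]  # "L3I1" -> factor "L3"
        for seg, (value, ref) in enumerate(zip(row, reference), start=1):
            dev = abs(value - ref)
            if dev > best[0]:
                best = (dev, item, seg)
    return best


dev, item, seg = largest_deviation(TABLED, NOMINAL)
print(round(dev, 2), item, seg)  # 0.11 L3I3 2
```

The largest deviation occurs in the niche segment 2, which is consistent with its small number of cases.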

scen.6

Scenario 6: Answer Tendencies Scenario

Description

In reality, numerous answer tendencies occur in empirical data sets, usually caused by personality traits of the respondents. In the Answer Tendencies Scenario, two segments are modeled that give essentially the same answer to each of the 12 questions. The symmetry of the Basic Scenario is thus given up, while the segment sizes remain equal.

Data generation: no correlations between variables modeled (see Working Paper No. 7 for details)

Empirical Examples

Marketing research example

In a survey among buyers of transistor radios, 6000 customers or potential customers were questioned about how important certain features are to them (e.g. technical features, color, size, ...). The goal of the study is to identify groups of customers with the same preferences that can be addressed as homogeneous market segments by the marketing management (for product design, product modification, advertising, ...). The questionnaire contains 12 questions (features of transistor radios). Six equally sized segments of customers are known to exist. While four segments have a differentiated view about the importance of the attributes of transistor radios stated in the questionnaire, one group of consumers believes that not a single one of these attributes is important (imagine buyers who need a little radio in the kitchen only to listen to the news from time to time when washing the dishes) and another group attaches high importance to each aspect (maybe the group of music lovers).

Service marketing research example

In a survey among tourists visiting Austria, 6000 travelers were questioned about how important certain vacation aspects are to them (e.g. security, comfort, landscape, ...). The goal of the study is to identify groups of tourists with the same vacation importances / expectations that can be addressed as homogeneous market segments by the destination management. The questionnaire contains 12 questions (vacation aspects). Six equally sized segments of travelers are known to exist. While four segments have a differentiated view about the importance of the vacation aspects stated in the questionnaire, one tourist group believes that not a single one of these attributes is important (they might be visiting friends and therefore primarily care about talking with them a lot) and another group attaches high importance to each aspect (this could be a group of tourists that does not spend a vacation in a foreign country very often and therefore has the highest expectations concerning everything).

Summary


binary 6000 12 6 4 3-3-3-3 81.25%

Class Distribution


1 1000

2 1000

3 1000

4 1000

5 1000

6 1000

Types


L3I1 0.8 0.2 0.2 0.2 0.8 0.2

L3I2 0.8 0.2 0.2 0.2 0.8 0.2

L3I3 0.8 0.2 0.2 0.2 0.8 0.2

L4I1 0.8 0.2 0.2 0.2 0.2 0.8

L4I2 0.8 0.2 0.2 0.2 0.2 0.8

L4I3 0.8 0.2 0.2 0.2 0.2 0.8

scendep.6

Scenario 6 (Dependent): Answer Tendencies Scenario (Dependent)

Description

In reality, numerous answer tendencies occur in empirical data sets, usually caused by personality traits of the respondents. In the Answer Tendencies Scenario, two segments are modeled that give essentially the same answer to each of the 12 questions. The symmetry of the Basic Scenario is given up, while the segment sizes remain equal.

Data generation: autologistic model (correlations between variables modeled, see Working Paper No. 7 for details)

Empirical Examples

Marketing research example

In a survey among buyers of transistor radios, 6000 customers or potential customers were questioned about how important certain features are to them (e.g. technical features, color, size, ...). The goal of the study is to identify groups of customers with the same preferences that can be addressed as homogeneous market segments by the marketing management (for product design, product modification, advertising, ...). The questionnaire contains 12 questions (features of transistor radios), where groups of variables represent a latent dimension and are therefore answered in the same way (e.g. color, size of buttons, and design are either all important to an individual or they are not; these variables could all be part of a more complex latent variable called "looks"). Here we suppose that every factor is represented by three questions. The variables within each group are correlated with each other. Six equally sized segments of customers are known to exist. While four segments have a differentiated view about the importance of the attributes of transistor radios stated in the questionnaire, one group of consumers believes that not a single one of these attributes is important (imagine buyers who need a little radio in the kitchen only to listen to the news from time to time when washing the dishes) and another group attaches high importance to each aspect (maybe the group of music lovers).

Service marketing research example

In a survey among tourists visiting Austria, 6000 travelers were questioned about how important certain vacation aspects are to them (e.g. security, comfort, landscape, ...). The goal of the study is to identify groups of tourists with the same vacation importances / expectations that can be addressed as homogeneous market segments by the destination management. The questionnaire contains 12 questions (vacation aspects), where groups of variables represent a latent dimension and are therefore answered in the same way (e.g. the variables landscape, environmental protection, and fresh air are either all important to an individual or they are not; these variables could all be part of a more complex latent variable called "nature"). Here we suppose that every factor is represented by three questions. The variables within each group are correlated with each other. Six equally sized segments of travelers are known to exist. While four segments have a differentiated view about the importance of the vacation aspects stated in the questionnaire, one tourist group believes that not a single one of these attributes is important (they might be visiting friends and therefore primarily care about talking with them a lot) and another group attaches high importance to each aspect (this could be a group of tourists that does not spend a vacation in a foreign country very often and therefore has the highest expectations concerning everything).

Summary


binary 6000 12 6 4 3-3-3-3 71.08%

Class Distribution


1 1000

2 1000

3 1000

4 1000

5 1000

6 1000

Types


L2I1 0.92 0.16 0.18 0.92 0.20 0.17

L2I2 0.93 0.19 0.21 0.93 0.20 0.18

L2I3 0.93 0.20 0.22 0.93 0.21 0.21

L3I1 0.83 0.07 0.18 0.18 0.83 0.17

L3I2 0.83 0.07 0.18 0.20 0.83 0.18

L3I3 0.83 0.08 0.21 0.21 0.84 0.20

L4I1 0.79 0.06 0.18 0.17 0.18 0.81

L4I2 0.80 0.08 0.18 0.20 0.20 0.81

L4I3 0.81 0.08 0.20 0.21 0.21 0.81
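To illustrate what "correlations between variables modeled" means for the three items of one latent factor, the following sketch uses a simple shared-draw mechanism. This is explicitly not the autologistic model used for the dependent data sets; it is a swapped-in toy construction that keeps the marginal mean p of every item and induces a pairwise correlation of `copy_prob` squared. All names and parameter values are hypothetical.

```python
import random


def correlated_group(p, n_items, copy_prob, rng):
    """Draw n_items binary answers with marginal mean p and positive correlation.

    Each item copies a shared Bernoulli(p) draw with probability copy_prob and
    is drawn independently from Bernoulli(p) otherwise (toy mechanism only).
    """
    shared = 1 if rng.random() < p else 0
    return [
        shared if rng.random() < copy_prob else (1 if rng.random() < p else 0)
        for _ in range(n_items)
    ]


def pearson(xs, ys):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(1)  # hypothetical seed
rows = [correlated_group(0.8, 3, 0.6, rng) for _ in range(20000)]

mean_first = sum(r[0] for r in rows) / len(rows)
corr_12 = pearson([r[0] for r in rows], [r[1] for r in rows])
print(round(mean_first, 2), round(corr_12, 2))
```

Under this mechanism the item means stay near 0.8 while the first two items show a correlation near 0.6 squared, i.e. about 0.36; items of an independent scenario would show a correlation near zero within a segment.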

scen.7

Scenario 7: Asymmetric Scenario

Description

The Asymmetric Scenario gives up the symmetry restriction of the Basic Scenario while retaining the assumption that each segment contains the same number of respondents (cases).

Data generation: no correlations between variables modeled

Empirical Examples

Marketing research example

In a survey among buyers of transistor radios, 6000 customers or potential customers were questioned about how important certain features are to them (e.g. technical features, color, size, ...). The goal of the study is to identify groups of customers with the same preferences that can be addressed as homogeneous market segments by the marketing management (for product design, product modification, advertising, ...). The questionnaire contains 12 questions (features of transistor radios). Segments do not attach great importance to six and low importance to the remaining six features listed in the questionnaire. Instead, up to nine product attributes are rated important by the different segments. Six equally sized segments of customers are known to exist.

Service marketing research example

In a survey among tourists visiting Austria, 6000 travelers were questioned about how important certain vacation aspects are to them (e.g. security, comfort, landscape, ...). The goal of the study is to identify groups of tourists with the same vacation importances / expectations that can be addressed as homogeneous market segments by the destination management. The questionnaire contains 12 questions (vacation aspects). Segments do not attach great importance to six and low importance to the remaining six questions. Instead, up to nine vacation aspects are rated important by the different segments. Six equally sized segments of travelers are known to exist.

Summary


binary 6000 12 6 4 3-3-3-3

Class Distribution


1 1000

2 1000

3 1000

4 1000

5 1000

6 1000

Types

1 2 3 4 5 6

L1I1 0.8 0.2 0.2 0.8 0.2 0.8

L1I2 0.8 0.2 0.2 0.8 0.2 0.8

L1I3 0.8 0.2 0.2 0.8 0.2 0.8

L2I1 0.8 0.8 0.8 0.2 0.2 0.2

L2I2 0.8 0.8 0.8 0.2 0.2 0.2

L2I3 0.8 0.8 0.8 0.2 0.2 0.2

L3I1 0.2 0.8 0.8 0.2 0.2 0.8

L3I2 0.2 0.8 0.8 0.2 0.2 0.8

L3I3 0.2 0.8 0.8 0.2 0.2 0.8

L4I1 0.2 0.8 0.2 0.8 0.8 0.8

L4I2 0.2 0.8 0.2 0.8 0.8 0.8

L4I3 0.2 0.8 0.2 0.8 0.8 0.8
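The asymmetry can be read directly off the Types table by counting, for each segment, the variables with an above-average mean. The short sketch below does this for the values tabulated above; the function name `high_item_counts` and the threshold of 0.5 are hypothetical choices for the illustration.

```python
# One row per latent factor (L1..L4), one column per segment (1..6),
# as listed in the Types table of Scenario 7.
FACTOR_MEANS = [
    [0.8, 0.2, 0.2, 0.8, 0.2, 0.8],  # L1
    [0.8, 0.8, 0.8, 0.2, 0.2, 0.2],  # L2
    [0.2, 0.8, 0.8, 0.2, 0.2, 0.8],  # L3
    [0.2, 0.8, 0.2, 0.8, 0.8, 0.8],  # L4
]
ITEMS_PER_FACTOR = 3


def high_item_counts(factor_means, items_per_factor, threshold=0.5):
    """Number of items per segment whose mean exceeds the threshold."""
    n_segments = len(factor_means[0])
    return [
        items_per_factor * sum(1 for row in factor_means if row[seg] > threshold)
        for seg in range(n_segments)
    ]


counts = high_item_counts(FACTOR_MEANS, ITEMS_PER_FACTOR)
print(counts)  # [6, 9, 6, 6, 3, 9]
```

Segments 2 and 6 rate nine of the twelve variables as important and segment 5 only three, in contrast to the six / six split of the Basic Scenario.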

scen.8

Scenario 8: Extreme Segment Size Scenario

Description

The Basic Scenario assumes that the same number of respondents (cases) exists (and is thus generated) in each segment. The Niche Segment Scenario gives up this restriction by defining six groups of respondents of different sizes. In the Extreme Segment Size Scenario, three small consumer groups (with the same size, n = 300, as the tiny segment in the Niche Segment Scenario) are defined. The remaining three segments are equally sized large segments of 1700 respondents each.

Data generation: no correlations between variables modeled

Empirical Examples

Marketing research example

In a survey among buyers of transistor radios, 6000 customers or potential customers were questioned about how important certain features are to them (e.g. technical features, color, size, ...). The goal of the study is to identify groups of customers with the same preferences that can be addressed as homogeneous market segments by the marketing management (for product design, product modification, advertising, ...). The questionnaire contains 12 questions (features of transistor radios). Six segments of customers are known to exist, with three of them representing niche markets (with 300 customers each) and three representing mass markets (with 1700 respondents each).

Service marketing research example

In a survey among tourists visiting Austria, 6000 travelers were questioned about how important certain vacation aspects are to them (e.g. security, comfort, landscape, ...). The goal of the study is to identify groups of tourists with the same vacation importances / expectations that can be addressed as homogeneous market segments by the destination management. The questionnaire contains 12 questions (vacation aspects). Six segments of travelers are known to exist, with three of them representing niche markets (with 300 customers each) and three representing mass segments (with 1700 respondents each).

Summary


binary 6000 12 6 4 3-3-3-3

Class Distribution


1 300

2 300

3 300

4 1700

5 1700

6 1700
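As a quick arithmetic check of this class distribution, the sketch below converts the segment sizes into shares of the 6000 cases. The variable names are hypothetical.

```python
# Segment sizes of the Extreme Segment Size Scenario (segments 1..6).
SIZES = [300, 300, 300, 1700, 1700, 1700]

total = sum(SIZES)
shares = [n / total for n in SIZES]

# Combined share of the three niche segments and size ratio large / small.
niche_share = sum(shares[:3])
size_ratio = max(SIZES) / min(SIZES)

print(total)                          # 6000
print([round(s, 3) for s in shares])  # [0.05, 0.05, 0.05, 0.283, 0.283, 0.283]
print(round(niche_share, 2))          # 0.15
```

Each niche segment holds 5% of the cases, so the three niches together account for 15% of the data set, while each mass segment is more than five times as large as a niche.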

Types


L3I1 0.2 0.8 0.8 0.2 0.2 0.8

L3I2 0.2 0.8 0.8 0.2 0.2 0.8

L3I3 0.2 0.8 0.8 0.2 0.2 0.8

L4I1 0.2 0.8 0.2 0.8 0.8 0.2

L4I2 0.2 0.8 0.2 0.8 0.8 0.2

L4I3 0.2 0.8 0.2 0.8 0.8 0.2

scen.9

Scenario 9: Important "Yes" Scenario

Description

The Basic Scenario is not based on typical experiences from empirical data sets. It is completely symmetric and assumes the same number of individuals (cases) in each of the clusters generated. The Important "Yes" Scenario is more realistic with respect to both restrictions: the average segment ratings do not have to be high in six variables and low in the remaining six, so the design is asymmetric. In addition, the segment sizes differ from each other.

Data generation: no correlations between variables modeled

Empirical Examples

Marketing research example

In a survey among buyers of tiny portable transistor radios, 6000 customers or potential customers were questioned about the usage of these tiny portable radios (e.g. in the car, in the office, at home, ...). The goal of the study is to identify groups of customers with the same kinds of radio usage that can be addressed as homogeneous market segments by the marketing management (for product design, product modification, advertising, ...). The questionnaire contains 12 questions (features of transistor radios). Six segments of customers with different segment sizes are known to exist. Of course, a large number of respondents will probably not use these tiny portable transistor radios at all. The positive answers are thus more informative for the desired behavioral segmentation, and a large segment of non-users will occur.

Service marketing research example

In a survey among tourists visiting Austria, 6000 travelers were questioned about what vacation activities they engage in (e.g. tennis, jogging, skiing, sightseeing, ...). The goal of the study is to identify groups of tourists with the same vacation activities that can be addressed as homogeneous market segments by the destination management. The questionnaire contains 12 questions (activities). Each segment shows a different pattern of activities. The fact that tourists state an activity is more informative than the statement that they do not, e.g., play tennis. It is therefore more important to identify segments by specific activity combinations than to define them by a lack of certain activities. Six segments of travelers with different segment sizes are known to exist, where a large number of respondents does not indicate any activity at all.

Summary


binary 4000 10 5 4 2-3-2-3

Class Distribution


1 200

2 800

3 200

4 800

5 2000

Types

(mean values of data generating distribution)

1 2 3 4 5

L1I1 0.9 0.9 0.1 0.1 0.1

L1I2 0.9 0.9 0.1 0.1 0.1

L2I1 0.9 0.2 0.1 0.1 0.1

L2I2 0.9 0.2 0.1 0.1 0.1

L2I3 0.9 0.2 0.1 0.1 0.1

L3I1 0.1 0.1 0.9 0.9 0.1

L3I2 0.1 0.1 0.9 0.9 0.1

L4I1 0.1 0.1 0.9 0.2 0.1

L4I2 0.1 0.1 0.9 0.2 0.1

L4I3 0.1 0.1 0.9 0.2 0.1
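Because the large segment 5 answers "no" to nearly everything, positive answers are rare in the pooled data. The sketch below computes the expected overall share of "yes" answers per variable as the size-weighted average of the segment means tabulated for this scenario. It is an illustration only; the names `SIZES`, `MEANS` and `overall_yes_share` are hypothetical.

```python
# Segment sizes from the Class Distribution table of Scenario 9.
SIZES = [200, 800, 200, 800, 2000]

# Mean values of the data generating distribution: variable -> segments 1..5.
MEANS = {
    "L1I1": [0.9, 0.9, 0.1, 0.1, 0.1],
    "L1I2": [0.9, 0.9, 0.1, 0.1, 0.1],
    "L2I1": [0.9, 0.2, 0.1, 0.1, 0.1],
    "L2I2": [0.9, 0.2, 0.1, 0.1, 0.1],
    "L2I3": [0.9, 0.2, 0.1, 0.1, 0.1],
    "L3I1": [0.1, 0.1, 0.9, 0.9, 0.1],
    "L3I2": [0.1, 0.1, 0.9, 0.9, 0.1],
    "L4I1": [0.1, 0.1, 0.9, 0.2, 0.1],
    "L4I2": [0.1, 0.1, 0.9, 0.2, 0.1],
    "L4I3": [0.1, 0.1, 0.9, 0.2, 0.1],
}


def overall_yes_share(sizes, means):
    """Size-weighted average of the segment means for every variable."""
    total = sum(sizes)
    return {
        name: sum(n * p for n, p in zip(sizes, row)) / total
        for name, row in means.items()
    }


shares = overall_yes_share(SIZES, MEANS)
for name, share in shares.items():
    print(name, round(share, 2))
```

The expected share of "yes" answers is 0.30 for the L1 and L3 variables and 0.16 for the L2 and L4 variables, so a positive answer carries more information about segment membership than a negative one.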

Acknowledgement

This piece of research was supported by the Austrian Science Foundation (FWF) under grant SFB#010 ('Adaptive Information Systems and Modelling in Economics and Management Science'). The authors wish to thank Klaus Grabler, Kurt Hornik, Josef Mazanec, Klaus Pötzelberger, and Helmut Strasser for helpful discussions and Christian Buchta for generating the dependent data sets.

References

Dolnicar, S., Leisch, F., Steiner, G., & Weingessel, A. (1998a). A Comparison of Several Cluster Algorithms on Artificial Binary Data Scenarios from Tourism Marketing: Part 2. Working Paper Series 19, SFB "Adaptive Information Systems and Modeling in Economics and Management Science", http://www.wu-wien.ac.at/am.

Dolnicar, S., Leisch, F., Weingessel, A., Buchta, C., & Dimitriadou, E. (1998b). A Comparison of Several Cluster Algorithms on Artificial Binary Data Scenarios from Tourism Marketing. Working Paper Series 7, SFB "Adaptive Information Systems and Modeling in Economics and Management Science", http://www.wu-wien.ac.at/am.

Leisch, F., Weingessel, A., & Dimitriadou, E. (1998a). Competitive learning for binary valued data. In Niklasson, L., Bodén, M., & Ziemke, T. (eds.), Proceedings of the 8th International Conference on Artificial Neural Networks (ICANN 98), vol. 2, pp. 779–784, Skövde, Sweden. Springer.

Leisch, F., Weingessel, A., & Hornik, K. (1998b). On the Generation of Correlated Artificial Binary Data. Working Paper Series 13, SFB "Adaptive Information Systems and Modeling in Economics and Management Science", http://www.wu-wien.ac.at/am.

Table 4: Summary of the Results on the Independent Data Sets, Part I

Scenario  Algorithm  found               class. rate map.    class. rate all     (center-type)^2     comp. range
1         HCL-ED     6.00 / 6.00 / 6.00  0.82 / 0.82 / 0.83  0.82 / 0.82 / 0.83  0.13 / 0.14 / 0.15  0.03 / 0.07 / 0.12
1         HCL-AD     6.00 / 6.00 / 6.00  0.82 / 0.82 / 0.82  0.82 / 0.82 / 0.82  0.13 / 0.14 / 0.15  0.03 / 0.06 / 0.10
1         k-means    5.00 / 5.80 / 6.00  0.70 / 0.80 / 0.83  0.64 / 0.79 / 0.83  0.13 / 0.15 / 0.25  0.04 / 0.13 / 0.32
1         NGAS-ED    6.00 / 6.00 / 6.00  0.82 / 0.82 / 0.83  0.82 / 0.82 / 0.83  0.14 / 0.15 / 0.16  0.04 / 0.08 / 0.14
1         NGAS-AD    6.00 / 6.00 / 6.00  0.82 / 0.82 / 0.82  0.82 / 0.82 / 0.82  0.13 / 0.14 / 0.15  0.04 / 0.08 / 0.13
1         SOM        6.00 / 6.00 / 6.00  0.81 / 0.81 / 0.82  0.81 / 0.81 / 0.82  0.40 / 0.41 / 0.42  0.06 / 0.17 / 0.31
1         TRN        6.00 / 6.00 / 6.00  0.82 / 0.82 / 0.83  0.82 / 0.82 / 0.83  0.14 / 0.15 / 0.16  0.04 / 0.08 / 0.15
1         HCL-BD     6.00 / 6.00 / 6.00  0.82 / 0.82 / 0.83  0.82 / 0.82 / 0.83  0.66 / 0.67 / 0.68  0.00 / 0.04 / 0.15
1         st-1       6.00 / 6.00 / 6.00  0.82 / 0.83 / 0.83  0.82 / 0.83 / 0.83  0.13 / 0.13 / 0.13  0.01 / 0.06 / 0.12
1         st-2       6.00 / 6.00 / 6.00  0.82 / 0.82 / 0.83  0.82 / 0.82 / 0.83  0.13 / 0.13 / 0.13  0.02 / 0.07 / 0.12
1         st-3       6.00 / 6.00 / 6.00  0.82 / 0.82 / 0.83  0.82 / 0.82 / 0.83  0.13 / 0.13 / 0.13  0.01 / 0.05 / 0.11
1         st-4       6.00 / 6.00 / 6.00  0.82 / 0.82 / 0.83  0.82 / 0.82 / 0.83  0.13 / 0.13 / 0.13  0.01 / 0.07 / 0.12
2         HCL-ED     6.00 / 6.00 / 6.00  0.82 / 0.82 / 0.82  0.82 / 0.82 / 0.82  0.12 / 0.12 / 0.13  0.03 / 0.08 / 0.13
2         HCL-AD     6.00 / 6.00 / 6.00  0.82 / 0.82 / 0.82  0.82 / 0.82 / 0.82  0.12 / 0.13 / 0.13  0.04 / 0.07 / 0.13
2         k-means    4.00 / 5.50 / 6.00  0.71 / 0.79 / 0.82  0.55 / 0.74 / 0.82  0.08 / 0.13 / 0.20  0.00 / 0.08 / 0.37
2         NGAS-ED    6.00 / 6.00 / 6.00  0.82 / 0.82 / 0.82  0.82 / 0.82 / 0.82  0.10 / 0.11 / 0.11  0.01 / 0.04 / 0.08
2         NGAS-AD    6.00 / 6.00 / 6.00  0.82 / 0.82 / 0.82  0.82 / 0.82 / 0.82  0.10 / 0.11 / 0.11  0.01 / 0.04 / 0.09
2         SOM        5.00 / 5.80 / 6.00  0.78 / 0.79 / 0.79  0.67 / 0.76 / 0.79  0.27 / 0.33 / 0.35  0.01 / 0.07 / 0.17
2         TRN        6.00 / 6.00 / 6.00  0.82 / 0.82 / 0.83  0.82 / 0.82 / 0.83  0.12 / 0.12 / 0.13  0.03 / 0.07 / 0.18
2         HCL-BD     4.00 / 5.40 / 6.00  0.77 / 0.80 / 0.80  0.49 / 0.71 / 0.80  0.46 / 0.62 / 0.69  0.00 / 0.00 / 0.00
2         st-1       6.00 / 6.00 / 6.00  0.82 / 0.82 / 0.82  0.82 / 0.82 / 0.82  0.10 / 0.10 / 0.11  0.00 / 0.03 / 0.08
2         st-2       6.00 / 6.00 / 6.00  0.82 / 0.82 / 0.82  0.82 / 0.82 / 0.82  0.10 / 0.11 / 0.11  0.00 / 0.03 / 0.08
2         st-3       6.00 / 6.00 / 6.00  0.82 / 0.82 / 0.82  0.82 / 0.82 / 0.82  0.10 / 0.11 / 0.11  0.00 / 0.03 / 0.08
2         st-4       6.00 / 6.00 / 6.00  0.79 / 0.82 / 0.82  0.79 / 0.82 / 0.82  0.10 / 0.12 / 0.16  0.00 / 0.04 / 0.21
3         HCL-ED     0.00 / 0.40 / 2.00  0.35 / 0.36 / 0.37  0.04 / 0.07 / 0.12  0.13 / 0.15 / 0.21  0.00 / 0.16 / 0.62
3         HCL-AD     0.00 / 0.70 / 3.00  0.35 / 0.39 / 0.42  0.04 / 0.07 / 0.16  0.13 / 0.17 / 0.33  0.00 / 0.02 / 0.62
3         k-means    0.00 / 0.80 / 2.00  0.34 / 0.39 / 0.43  0.04 / 0.09 / 0.13  0.13 / 0.16 / 0.20  0.00 / 0.13 / 0.61
3         NGAS-ED    0.00 / 0.80 / 3.00  0.35 / 0.36 / 0.40  0.05 / 0.08 / 0.17  0.10 / 0.14 / 0.32  0.00 / 0.08 / 0.61
3         NGAS-AD    0.00 / 0.40 / 2.00  0.32 / 0.35 / 0.38  0.06 / 0.08 / 0.11  0.10 / 0.14 / 0.22  0.00 / 0.07 / 0.64
3         SOM        0.00 / 1.00 / 2.00  0.37 / 0.38 / 0.39  0.04 / 0.09 / 0.12  0.08 / 0.12 / 0.15  0.00 / 0.08 / 0.38
3         TRN        0.00 / 0.90 / 2.00  0.32 / 0.35 / 0.38  0.06 / 0.08 / 0.12  0.10 / 0.14 / 0.23  0.00 / 0.09 / 0.62
3         HCL-BD     0.00 / 0.00 / 0.00  0.00 / 0.00 / 0.00  0.00 / 0.00 / 0.00  0.00 / 0.00 / 0.00  0.00 / 0.00 / 0.00
3         st-1       0.00 / 0.50 / 1.00  0.35 / 0.37 / 0.40  0.06 / 0.07 / 0.07  0.09 / 0.10 / 0.10  0.00 / 0.05 / 0.36
3         st-2       0.00 / 0.20 / 1.00  0.36 / 0.37 / 0.39  0.07 / 0.07 / 0.07  0.09 / 0.09 / 0.09  0.00 / 0.00 / 0.00
3         st-3       0.00 / 0.40 / 1.00  0.34 / 0.37 / 0.39  0.04 / 0.06 / 0.07  0.09 / 0.10 / 0.13  0.00 / 0.07 / 0.63
3         st-4       0.00 / 0.40 / 1.00  0.33 / 0.36 / 0.39  0.05 / 0.06 / 0.07  0.09 / 0.10 / 0.12  0.06 / 0.27 / 0.61

never (row assignment not recoverable): 1-4, 4, 1-4-6, 4-5-6, 2, 1-2-3-4-5-6, 1-2-4, 1-4-5-6, 1-4-6, 1-2-4-5-6, 1-2-4-6

Table 5: Summary of the Results on the Independent Data Sets, Part II

Scenario  Algorithm  found               class. rate map.    class. rate all     (center-type)^2     comp. range
5         HCL-ED     5.00 / 5.80 / 6.00  0.79 / 0.79 / 0.81  0.71 / 0.78 / 0.79  0.18 / 0.21 / 0.23  0.03 / 0.11 / 0.21
5         HCL-AD     6.00 / 6.00 / 6.00  0.83 / 0.84 / 0.84  0.83 / 0.84 / 0.84  0.16 / 0.18 / 0.19  0.05 / 0.11 / 0.22
5         k-means    4.00 / 4.70 / 6.00  0.76 / 0.79 / 0.82  0.51 / 0.67 / 0.79  0.15 / 0.19 / 0.27  0.02 / 0.17 / 0.36
5         NGAS-ED    5.00 / 5.00 / 5.00  0.79 / 0.80 / 0.81  0.71 / 0.71 / 0.72  0.16 / 0.19 / 0.21  0.04 / 0.12 / 0.27
5         NGAS-AD    5.00 / 5.70 / 6.00  0.83 / 0.84 / 0.85  0.74 / 0.81 / 0.85  0.16 / 0.18 / 0.21  0.04 / 0.12 / 0.29
5         SOM        2.00 / 3.40 / 4.00  0.79 / 0.84 / 0.90  0.51 / 0.66 / 0.74  0.16 / 0.26 / 0.33  0.05 / 0.19 / 0.39
5         TRN        5.00 / 5.90 / 6.00  0.79 / 0.79 / 0.80  0.72 / 0.78 / 0.79  0.20 / 0.20 / 0.21  0.02 / 0.10 / 0.27
5         HCL-BD     4.00 / 4.60 / 5.00  0.85 / 0.87 / 0.89  0.63 / 0.72 / 0.80  0.46 / 0.53 / 0.58  0.00 / 0.00 / 0.01
5         st-1       4.00 / 4.30 / 5.00  0.77 / 0.79 / 0.81  0.53 / 0.59 / 0.67  0.15 / 0.18 / 0.20  0.04 / 0.13 / 0.26
5         st-2       4.00 / 4.50 / 5.00  0.74 / 0.77 / 0.80  0.49 / 0.61 / 0.71  0.16 / 0.17 / 0.19  0.03 / 0.11 / 0.30
5         st-3       4.00 / 4.70 / 5.00  0.73 / 0.76 / 0.80  0.54 / 0.64 / 0.71  0.17 / 0.19 / 0.20  0.03 / 0.11 / 0.20
5         st-4       4.00 / 4.70 / 5.00  0.73 / 0.76 / 0.79  0.54 / 0.63 / 0.71  0.17 / 0.19 / 0.20  0.02 / 0.09 / 0.22
6         HCL-ED     6.00 / 6.00 / 6.00  0.81 / 0.81 / 0.82  0.81 / 0.81 / 0.82  0.12 / 0.12 / 0.13  0.02 / 0.05 / 0.10
6         HCL-AD     6.00 / 6.00 / 6.00  0.81 / 0.81 / 0.81  0.81 / 0.81 / 0.81  0.12 / 0.12 / 0.13  0.03 / 0.05 / 0.09
6         k-means    5.00 / 5.50 / 6.00  0.73 / 0.77 / 0.81  0.67 / 0.74 / 0.81  0.11 / 0.14 / 0.15  0.00 / 0.10 / 0.37
6         NGAS-ED    6.00 / 6.00 / 6.00  0.81 / 0.81 / 0.82  0.81 / 0.81 / 0.82  0.12 / 0.12 / 0.13  0.02 / 0.05 / 0.09
6         NGAS-AD    5.00 / 5.60 / 6.00  0.73 / 0.78 / 0.81  0.69 / 0.77 / 0.81  0.12 / 0.13 / 0.15  0.02 / 0.09 / 0.25
6         SOM        5.00 / 5.40 / 6.00  0.72 / 0.75 / 0.77  0.65 / 0.69 / 0.77  0.35 / 0.38 / 0.44  0.00 / 0.12 / 0.34
6         TRN        6.00 / 6.00 / 6.00  0.81 / 0.81 / 0.81  0.81 / 0.81 / 0.81  0.12 / 0.13 / 0.15  0.04 / 0.07 / 0.12
6         HCL-BD     5.00 / 5.00 / 5.00  0.67 / 0.67 / 0.67  0.59 / 0.61 / 0.62  0.58 / 0.58 / 0.58  0.00 / 0.00 / 0.00
6         st-1       5.00 / 5.90 / 6.00  0.75 / 0.81 / 0.82  0.67 / 0.80 / 0.82  0.11 / 0.11 / 0.14  0.00 / 0.05 / 0.25
6         st-2       6.00 / 6.00 / 6.00  0.81 / 0.81 / 0.82  0.81 / 0.81 / 0.82  0.11 / 0.11 / 0.11  0.00 / 0.02 / 0.06
6         st-3       6.00 / 6.00 / 6.00  0.81 / 0.81 / 0.81  0.81 / 0.81 / 0.81  0.12 / 0.12 / 0.12  0.00 / 0.02 / 0.05
6         st-4       5.00 / 5.90 / 6.00  0.72 / 0.80 / 0.81  0.72 / 0.80 / 0.81  0.12 / 0.13 / 0.25  0.00 / 0.06 / 0.52

never (row assignment not recoverable): 2, 2, 2, 2, 2, 2, 2-5, 2

Table 6: Summary of the Results on the Dependent Data Sets, Part I

Scenario  Algorithm  found               class. rate map.    class. rate all     (center-type)^2     comp. range
1         HCL-ED        ? / 6.00 / 6.00  0.69 / 0.70 / 0.70  0.69 / 0.70 / 0.70  0.33 / 0.34 / 0.35  0.02 / 0.15 / 0.27
1         HCL-AD        ? / 6.00 / 6.00  0.69 / 0.70 / 0.70  0.69 / 0.70 / 0.70  0.32 / 0.33 / 0.35  0.02 / 0.17 / 0.27
1         k-means       ? / 4.70 / 6.00  0.62 / 0.68 / 0.71  0.45 / 0.54 / 0.70  0.20 / 0.27 / 0.37  0.03 / 0.22 / 0.46
1         NGAS-ED       ? / 6.00 / 6.00  0.69 / 0.70 / 0.70  0.69 / 0.70 / 0.70  0.33 / 0.34 / 0.35  0.02 / 0.15 / 0.28
1         NGAS-AD       ? / 6.00 / 6.00  0.69 / 0.70 / 0.70  0.69 / 0.70 / 0.70  0.32 / 0.34 / 0.37  0.03 / 0.19 / 0.31
1         SOM           ? / 5.80 / 6.00  0.67 / 0.67 / 0.69  0.57 / 0.65 / 0.69  0.41 / 0.49 / 0.53  0.21 / 0.31 / 0.44
1         TRN           ? / 6.00 / 6.00  0.69 / 0.70 / 0.70  0.69 / 0.70 / 0.70  0.33 / 0.34 / 0.37  0.02 / 0.18 / 0.26
1         HCL-BD        ? / 5.40 / 6.00  0.67 / 0.69 / 0.70  0.46 / 0.63 / 0.70  0.39 / 0.53 / 0.58  0.00 / 0.00 / 0.00
1         st-1          ? / 6.00 / 6.00  0.70 / 0.70 / 0.70  0.70 / 0.70 / 0.70  0.30 / 0.32 / 0.33  0.02 / 0.18 / 0.25
1         st-2          ? / 6.00 / 6.00  0.70 / 0.70 / 0.70  0.70 / 0.70 / 0.70  0.30 / 0.32 / 0.33  0.02 / 0.18 / 0.25
1         st-3          ? / 6.00 / 6.00  0.70 / 0.70 / 0.70  0.70 / 0.70 / 0.70  0.30 / 0.32 / 0.33  0.02 / 0.18 / 0.25
1         st-4          ? / 6.00 / 6.00  0.70 / 0.70 / 0.70  0.70 / 0.70 / 0.70  0.30 / 0.32 / 0.33  0.02 / 0.18 / 0.25
2         HCL-ED        ? / 5.30 / 6.00  0.61 / 0.68 / 0.69  0.44 / 0.60 / 0.68  0.27 / 0.37 / 0.42  0.01 / 0.06 / 0.44
2         HCL-AD        ? / 5.60 / 6.00  0.61 / 0.67 / 0.68  0.44 / 0.64 / 0.68  0.29 / 0.39 / 0.42  0.01 / 0.08 / 0.44
2         k-means       ? / 4.80 / 6.00  0.51 / 0.64 / 0.70  0.41 / 0.54 / 0.68  0.28 / 0.37 / 0.43  0.00 / 0.10 / 0.50
2         NGAS-ED       ? / 6.00 / 6.00  0.68 / 0.68 / 0.68  0.68 / 0.68 / 0.68  0.41 / 0.41 / 0.42  0.00 / 0.03 / 0.08
2         NGAS-AD       ? / 5.20 / 6.00  0.60 / 0.67 / 0.68  0.25 / 0.60 / 0.68  0.14 / 0.36 / 0.42  0.00 / 0.03 / 0.08
2         SOM           ? / 5.60 / 6.00  0.58 / 0.63 / 0.66  0.56 / 0.62 / 0.66  0.39 / 0.42 / 0.45  0.00 / 0.13 / 0.27
2         TRN           ? / 6.00 / 6.00  0.68 / 0.68 / 0.68  0.68 / 0.68 / 0.68  0.40 / 0.42 / 0.43  0.00 / 0.04 / 0.09
2         HCL-BD        ? / 4.60 / 6.00  0.66 / 0.68 / 0.69  0.44 / 0.52 / 0.69  0.36 / 0.43 / 0.56  0.00 / 0.00 / 0.00
2         st-1          ? / 6.00 / 6.00  0.68 / 0.68 / 0.68  0.68 / 0.68 / 0.68  0.41 / 0.41 / 0.41  0.00 / 0.00 / 0.02
2         st-2          ? / 6.00 / 6.00  0.68 / 0.68 / 0.68  0.68 / 0.68 / 0.68  0.41 / 0.41 / 0.41  0.00 / 0.00 / 0.02
2         st-3          ? / 6.00 / 6.00  0.68 / 0.68 / 0.68  0.68 / 0.68 / 0.68  0.41 / 0.41 / 0.41  0.00 / 0.00 / 0.02
2         st-4          ? / 6.00 / 6.00  0.68 / 0.68 / 0.68  0.68 / 0.68 / 0.68  0.41 / 0.41 / 0.41  0.00 / 0.00 / 0.02
3         HCL-ED        ? / 1.10 / 2.00  0.39 / 0.44 / 0.46  0.04 / 0.05 / 0.10  0.17 / 0.19 / 0.31  0.00 / 0.01 / 0.07
3         HCL-AD        ? / 2.10 / 3.00  0.40 / 0.45 / 0.47  0.08 / 0.09 / 0.15  0.31 / 0.34 / 0.46  0.00 / 0.12 / 0.61
3         k-means       ? / 1.50 / 2.00  0.45 / 0.47 / 0.48  0.04 / 0.07 / 0.09  0.16 / 0.27 / 0.35  0.00 / 0.06 / 0.21
3         NGAS-ED       ? / 1.10 / 2.00  0.38 / 0.44 / 0.46  0.04 / 0.05 / 0.11  0.17 / 0.19 / 0.30  0.00 / 0.04 / 0.56
3         NGAS-AD       ? / 2.10 / 3.00  0.39 / 0.44 / 0.47  0.08 / 0.09 / 0.15  0.32 / 0.34 / 0.47  0.00 / 0.14 / 0.58
3         SOM           ? / 2.00 / 2.00  0.36 / 0.40 / 0.46  0.08 / 0.10 / 0.12  0.12 / 0.16 / 0.18  0.00 / 0.10 / 0.42
3         TRN           ? / 1.10 / 2.00  0.39 / 0.44 / 0.47  0.04 / 0.05 / 0.11  0.17 / 0.19 / 0.31  0.00 / 0.05 / 0.58
3         HCL-BD        ? / 2.50 / 4.00  0.47 / 0.50 / 0.52  0.04 / 0.11 / 0.18  0.21 / 0.53 / 0.85  0.00 / 0.00 / 0.00
3         st-1          ? / 0.90 / 1.00  0.43 / 0.44 / 0.46  0.04 / 0.04 / 0.04  0.17 / 0.17 / 0.18  0.00 / 0.00 / 0.00
3         st-2          ? / 0.90 / 1.00  0.43 / 0.44 / 0.46  0.04 / 0.04 / 0.04  0.17 / 0.17 / 0.18  0.00 / 0.00 / 0.00
3         st-3          ? / 0.90 / 1.00  0.43 / 0.44 / 0.46  0.04 / 0.04 / 0.04  0.17 / 0.17 / 0.18  0.00 / 0.00 / 0.00
3         st-4          ? / 0.90 / 1.00  0.43 / 0.44 / 0.46  0.04 / 0.04 / 0.04  0.17 / 0.17 / 0.18  0.00 / 0.00 / 0.00

never (row assignment not recoverable): 3-4, 3-4, 3-4, 3-4, 6

6

5

Scenario 4


found                 class. rate map.      class. rate all       (center-type)^2
     / 6.00 / 6.00    0.79 / 0.79 / 0.79    0.79 / 0.79 / 0.79    0.19 / 0.20 / 0.21
     / 6.00 / 6.00    0.78 / 0.79 / 0.79    0.78 / 0.79 / 0.79    0.19 / 0.20 / 0.21
     / 4.60 / 6.00    0.74 / 0.77 / 0.79    0.49 / 0.59 / 0.79    0.15 / 0.17 / 0.19
     / 6.00 / 6.00    0.78 / 0.79 / 0.79    0.78 / 0.79 / 0.79    0.20 / 0.20 / 0.21
     / 6.00 / 6.00    0.79 / 0.79 / 0.79    0.79 / 0.79 / 0.79    0.19 / 0.20 / 0.20
     / 6.00 / 6.00    0.74 / 0.75 / 0.76    0.74 / 0.75 / 0.76    0.49 / 0.50 / 0.51
     / 6.00 / 6.00    0.79 / 0.79 / 0.79    0.79 / 0.79 / 0.79    0.19 / 0.20 / 0.21
     / 5.60 / 6.00    0.75 / 0.78 / 0.79    0.52 / 0.73 / 0.79    0.36 / 0.51 / 0.55
     / 6.00 / 6.00    0.79 / 0.79 / 0.79    0.79 / 0.79 / 0.79    0.18 / 0.19 / 0.20
     / 6.00 / 6.00    0.79 / 0.79 / 0.79    0.79 / 0.79 / 0.79    0.18 / 0.19 / 0.20
     / 6.00 / 6.00    0.79 / 0.79 / 0.79    0.79 / 0.79 / 0.79    0.18 / 0.19 / 0.20
     / 6.00 / 6.00    0.79 / 0.79 / 0.79    0.79 / 0.79 / 0.79    0.18 / 0.19 / 0.20
     / 4.00 / 5.00    0.68 / 0.75 / 0.84    0.49 / 0.54 / 0.61    0.16 / 0.28 / 0.42
     / 4.70 / 5.00    0.72 / 0.75 / 0.78    0.58 / 0.62 / 0.67    0.24 / 0.32 / 0.37
     / 4.00 / 6.00    0.68 / 0.75 / 0.83    0.48 / 0.56 / 0.69    0.17 / 0.28 / 0.43
     / 4.40 / 5.00    0.68 / 0.72 / 0.76    0.52 / 0.55 / 0.58    0.26 / 0.33 / 0.42
     / 4.40 / 5.00    0.72 / 0.76 / 0.79    0.59 / 0.62 / 0.67    0.22 / 0.28 / 0.37
     / 3.60 / 4.00    0.71 / 0.74 / 0.81    0.49 / 0.58 / 0.64    0.29 / 0.37 / 0.43
     / 4.20 / 5.00    0.68 / 0.73 / 0.82    0.48 / 0.55 / 0.58    0.17 / 0.32 / 0.42
     / 3.10 / 4.00    0.79 / 0.83 / 0.85    0.49 / 0.52 / 0.57    0.28 / 0.29 / 0.38
     / 5.40 / 6.00    0.62 / 0.70 / 0.79    0.57 / 0.64 / 0.69    0.23 / 0.41 / 0.53
     / 5.40 / 6.00    0.62 / 0.70 / 0.79    0.57 / 0.64 / 0.69    0.23 / 0.41 / 0.53
     / 5.40 / 6.00    0.62 / 0.70 / 0.79    0.57 / 0.64 / 0.69    0.23 / 0.41 / 0.53
     / 5.40 / 6.00    0.62 / 0.70 / 0.79    0.57 / 0.64 / 0.69    0.23 / 0.41 / 0.53
     / 6.00 / 6.00    0.71 / 0.71 / 0.71    0.71 / 0.71 / 0.71    0.28 / 0.28 / 0.30
     / 5.00 / 6.00    0.60 / 0.63 / 0.72    0.47 / 0.57 / 0.70    0.21 / 0.34 / 0.39
     / 4.60 / 5.00    0.56 / 0.62 / 0.67    0.34 / 0.54 / 0.61    0.22 / 0.29 / 0.35
     / 5.40 / 6.00    0.64 / 0.71 / 0.73    0.57 / 0.64 / 0.72    0.20 / 0.26 / 0.31
     / 5.00 / 6.00    0.54 / 0.65 / 0.73    0.51 / 0.60 / 0.72    0.21 / 0.29 / 0.35
     / 4.60 / 5.00    0.63 / 0.64 / 0.65    0.45 / 0.53 / 0.59    0.27 / 0.32 / 0.35
     / 6.00 / 6.00    0.71 / 0.71 / 0.72    0.71 / 0.71 / 0.72    0.28 / 0.29 / 0.29
     / 5.00 / 5.00    0.59 / 0.60 / 0.61    0.54 / 0.57 / 0.59    0.52 / 0.52 / 0.53
     / 6.00 / 6.00    0.71 / 0.71 / 0.72    0.71 / 0.71 / 0.72    0.27 / 0.27 / 0.28
     / 6.00 / 6.00    0.71 / 0.71 / 0.72    0.71 / 0.71 / 0.72    0.27 / 0.27 / 0.28
     / 6.00 / 6.00    0.71 / 0.71 / 0.72    0.71 / 0.71 / 0.72    0.27 / 0.27 / 0.28
     / 6.00 / 6.00    0.71 / 0.71 / 0.72    0.71 / 0.71 / 0.72    0.27 / 0.27 / 0.28

Table 7: Summary of the Results on the Dependent Data Sets, Part II

2

2-5

5

5

never

comp. range
0.02 / 0.08 / 0.18
0.02 / 0.08 / 0.17
0.08 / 0.14 / 0.27
0.03 / 0.11 / 0.19
0.02 / 0.10 / 0.18
0.08 / 0.28 / 0.41
0.02 / 0.10 / 0.20
0.00 / 0.00 / 0.00
0.03 / 0.11 / 0.16
0.03 / 0.11 / 0.16
0.03 / 0.11 / 0.16
0.03 / 0.11 / 0.16
0.00 / 0.12 / 0.48
0.01 / 0.22 / 0.48
0.01 / 0.23 / 0.46
0.01 / 0.05 / 0.22
0.00 / 0.14 / 0.43
0.00 / 0.14 / 0.36
0.00 / 0.07 / 0.40
0.00 / 0.00 / 0.00
0.02 / 0.20 / 0.59
0.02 / 0.20 / 0.59
0.02 / 0.20 / 0.59
0.02 / 0.20 / 0.59
0.01 / 0.09 / 0.23
0.00 / 0.20 / 0.57
0.01 / 0.21 / 0.50
0.03 / 0.18 / 0.35
0.08 / 0.21 / 0.37
0.00 / 0.15 / 0.33
0.01 / 0.10 / 0.23
0.00 / 0.00 / 0.00
0.00 / 0.08 / 0.19
0.00 / 0.08 / 0.19
0.00 / 0.08 / 0.19
0.00 / 0.08 / 0.19
