Collective Data Mining From Distributed, Vertically Partitioned Feature Space

Hillol Kargupta, Erik Johnson, Eleonora Riva Sanseverino, Byung-Hoon Park, Luisa Di Silvestre, and Daryl Hershberger
School of Electrical Engineering and Computer Science
Washington State University, Pullman, WA 99164-2752, USA
{hillol,bhpark,erivasan,ejohnso1,dhershbe}@eecs.wsu.edu, [email protected]

Abstract


This paper develops collective data mining, a unique approach for finding patterns from a network of databases, each with a distinct feature space. The paper addresses both distributed cooperative learning at the global level and learning at the local data sites. In addition to developing the foundation of collective data mining, it also presents BODHI, a distributed data mining (DDM) system that implements the collective data mining approach. Although the architecture is ideal for accommodating different inductive learning algorithms for data analysis at different sites, this paper suggests one scalable approach using a gene expression based evolutionary algorithm. This approach is used for distributed fault detection in an electrical power distribution network. Experimental results demonstrating the success of the developed system are also presented.

Introduction

Knowledge discovery and data mining (KDD) (Fayyad, Piatetsky-Shapiro, & Smyth 1996) deals with the difficult problem of extracting "interesting" patterns from large amounts of data. This problem is further complicated by the fact that in many cases the data is not stored in a monolithic manner; rather, it is distributed in a heterogeneous environment. Moreover, the datasets collected at different sites may not be defined over the same set of features. Consider the problem of detecting unknown patterns that associate disease outbreaks with weather and industrial emission related features. These datasets are very unlikely to be stored in a single database. For example, in the United States, disease databases are likely to be found in organizations like the CDC (Centers for Disease Control and Prevention) or the FDA (Food and Drug Administration). On the other hand, environmental datasets are likely to be maintained by organizations like the EPA (Environmental Protection Agency). It is very unlikely that these databases will ever be put together at a single site. Similar situations can easily be found in different business, engineering, and scientific practices such as drug development, multi-media disease marker detection, and military battlefield and installation management. Although several Distributed Data Mining (DDM) efforts have been directed toward distributed databases defined over the same feature space, little work has been done for databases with distinctly different feature spaces. This paper considers this little explored problem.

This paper presents collective data mining, a new approach for mining data from a network of databases with distinct feature spaces. It also describes BODHI (Beseizing knOwledge through Distributed Heterogeneous Induction), a distributed data mining (DDM) system that implements the collective data mining approach. Although the architecture is designed to accommodate different inductive learning algorithms for data analysis at different sites, this paper makes use of a scalable approach based on a gene expression based evolutionary algorithm, and reports the result of applying the overall collective data mining approach to distributed fault detection in a large electrical power distribution network.

The paper is organized as follows. First it describes the general DDM problem and provides related background information. The theoretical foundation and the collective data mining technique for mining data from multiple databases with different but possibly related features are presented next. This is followed by an overview of BODHI, a distributed data mining system being developed by the DIADIC group at Washington State University. Although the proposed collective data mining architecture works independently of the local learning algorithms at the individual data sites, the overall performance of the system depends on the scalability of the algorithms chosen at the sites. The paper next concentrates on this issue. It questions the scalability of a popular data mining technique, the genetic algorithm (GA) (De Jong 1975; Goldberg 1989b; Holland 1975), and identifies a major problem of simple GAs that restricts their scalability for problems with a large number of features. It also presents the so called gene expression messy genetic algorithm (GEMGA) (Kargupta 1996a; 1996b; Bandyopadhyay, Kargupta, & Wang 1998; Kargupta & Bandyopadhyay 1998), which has been reported to offer scalable performance on several occasions. Next it develops the overall GEMGA based learning algorithm. The application domain and the experimental results are presented next. This is followed by the conclusion of the paper.

Background

This section presents the related background material. It first introduces the different possible data distribution scenarios in a DDM environment. Next, related existing literature is reviewed.

Distributed feature space

Distributed Data Mining (DDM) is the process of extracting knowledge from distributed data using distributed computation. Although by definition all DDM algorithms share the common problem of mining from distributed data, the very nature of the data may demand different learning architectures and different techniques. For example, the data can be collected and stored in different databases representing either a horizontally or a vertically partitioned feature space. Figure 1 illustrates these two cases. In the case of horizontal partitioning, the feature space is common across the different databases. On the other hand, in the case of vertical partitioning, different databases define different feature spaces.

[Figure 1: Distributed data sites with (top) horizontally and (bottom) vertically partitioned feature space.]

In this paper we focus on a vertically partitioned feature space. In this case, we assume that the different features of an event are observed and stored in a distributed fashion at different data sites. Each such event can be uniquely labeled so that the corresponding feature values across the different databases can be identified just by matching the corresponding indices. In this paper we consider the supervised learning problem for a vertically partitioned feature space. Supervised learning requires each data point to be labelled, providing the corresponding class membership information. In this paper we shall assume that the class membership labels are available at all the data sites.
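To make the two scenarios concrete, the following sketch (illustrative only, not from the paper) encodes the same labeled events both ways; note how the vertical partition relies on the shared event index:

```python
# Hypothetical illustration of horizontally vs. vertically partitioned data.
# Each event carries a unique index so that feature values recorded at
# different sites can be matched up, as assumed in the text.

events = {
    1: {"X1": 1, "X2": 1, "X3": 0, "X4": 0, "label": 1},
    2: {"X1": 1, "X2": 0, "X3": 0, "X4": 1, "label": 0},
    3: {"X1": 0, "X2": 1, "X3": 1, "X4": 1, "label": 1},
}

# Horizontal partitioning: every site observes all features, disjoint events.
site_1 = {i: events[i] for i in (1, 2)}
site_2 = {i: events[i] for i in (3,)}

# Vertical partitioning: each site observes different features of every
# event; the shared index (and here the class label) ties them together.
site_a = {i: {"X1": e["X1"], "X2": e["X2"], "label": e["label"]}
          for i, e in events.items()}
site_b = {i: {"X3": e["X3"], "X4": e["X4"], "label": e["label"]}
          for i, e in events.items()}
```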

The following section reviews related work in DDM.

Related work

Although distributed data mining is a fairly new field, it has been enjoying a growing amount of attention since its inception. As noted earlier, most of the work in this area deals with horizontally partitioned feature spaces. This section briefly reviews some of these efforts and other related work. The JAM system (Stolfo et al. 1997) is a Java based multi-agent system in which different data mining agents are allowed to have different machine learning algorithms for learning classifiers. Classifiers generated on different data sets with potentially different algorithms are collected, and inductive learning algorithms are run on this collection to generate new classifiers called meta-classifiers. This process may be continued iteratively, resulting in a hierarchy of meta-classifiers. Further research on meta-learning can be found elsewhere (Chan & Stolfo 1993c; 1993a; 1993b; 1995; 1996b; 1996a; 1998). The PADMA system (Kargupta et al.; Kargupta, Hamzaoglu, & Stafford 1997) achieves scalability by locating agents with the distributed data sources. An agent coordinating facilitator gives user requests to local agents, which then access and analyze local data, returning analysis results to the facilitator, which merges the results. The high level results returned by the local agents are much smaller than the original data, thus allowing economical communication and enhancing scalability. The authors report on a PADMA implementation for unstructured text mining but note that the architecture is not domain specific. There are several examples of agent based systems for information discovery on the World Wide Web (Lesser & others 1998; Menczer & Belew 1998; Moukas 1996). In (Yamanishi 1997) the author presents two models of distributed Bayesian learning. Both models employ distributed agent learners, each of which observes a sequence of examples and produces an estimate of the parameter specifying the target distribution, and a population learner, which combines the outputs of the agent learners in order to produce a significantly better estimate of the parameter of the target distribution. One model applies to situations in which the agent learners observe data sequences generated according to the identical target distribution, while the second model applies when the data sequences may not have the identical target distribution over all agent learners. An automated distributed meeting scheduler is described in (Sen 1997). This system employs distributed intelligent agents which negotiate among themselves to schedule meetings satisfying user constraints. In a more general treatment of the same basic goal, the Challenger system (Chavez, Alexandros, & Maes 1997) employs multiple agents to manage distributed resources; results of an example application to CPU load balancing in a network of computers are presented.

Effective coordination and communication among groups of distributed agents is important to the group's performance on the task at hand. In (Mammen & Lesser 1998) the authors investigate the inter-related issues of the timing of agent communication and the amount of information that should be communicated. A more robust method of communicating inductive inferences is presented in (Davies & Edwards 1996). The authors suggest including in the communication the context from which the inference was formed, in the form of the version space boundary set. This allows the receiving agent to better integrate the information with the inferences it has induced from local information. The method of combining local knowledge to optimize some global objective is another aspect of general distributed machine learning research which applies to DDM. A method of forming committees of decision trees in a multi-agent environment is presented in (Heath, Kasif, & Salzberg 1996). In (Davies & Edwards) the authors compare the relative performance of incremental theory revision and knowledge integration. They conclude that there is no difference in the accuracy of the results and confirm the superior speed of knowledge integration for the data set evaluated. Agent learning research concerns the ability of agents to independently improve their ability to perform their assigned tasks. A majority of agent learning approaches are based on reinforcement learning methods. A general survey of reinforcement learning methods is provided in (Kaelbling, Littman, & Moore 1996). Another example of a DDM system (Aronis et al. 1996) is the WoRLD system for inductive rule-learning from multiple distributed databases. WoRLD uses spreading activation, instead of item-by-item matching, as the basic operation of the inductive engine. Database items are labeled with markers (indicating in or out of concept), which are then propagated through databases looking for values where in-concept or out-of-concept markers accumulate. The authors note that WoRLD currently relies on manually maintained links between distributed databases and the assumption of a standardized vocabulary across heterogeneous databases. The fragmented approach to mining classifiers from distributed data sources is suggested by (Cho & Wüthrich 1998). In this method, a single best rule is generated in each distributed data source. These rules are then ranked using some criterion, and some number of the top ranked rules are selected to form the rule set. In (Lam & Segre 1997) the authors extend efforts to automatically produce a Bayesian belief network from discovered knowledge by developing a distributed approach to this exponential time problem. A formal treatment of distributed databases is presented in (Nowak 1998). The author asserts that the information contained in an individual database gives rise to a theory, and that given multiple databases, a set of theories results. He then casts this idea in terms of partial reasoning, which he relates to knowledge discovery. A basic requirement of algorithms employed in DDM is the ability to scale up. A survey of methods for scaling up inductive learning algorithms is presented in (Provost & Venkateswarlu 1998).

Since this work makes use of a class of genetic algorithms (GAs), it is also interesting to review the DDM related GA literature. As noted elsewhere (Booker, Goldberg, & Holland 1989), the population based approach makes GAs quite attractive for parallel/distributed implementations. Although most of the available parallel/distributed GAs deal with optimization problems, co-evolutionary problem solving has drawn considerable interest since the inception of genetic algorithms and classifier systems (Holland 1975). The Cooperative Co-evolutionary Genetic Algorithm (Potter & De Jong 1994) extends the traditional GA model by explicitly modeling the co-evolution of cooperative species, each of which represents a sub-component of a potential solution. Each species evolves separately, with credit assignment at the species level defined in terms of the fitness of the complete solutions in which the species members participate. In (Seredynski 1994) the author presents a model of Loosely Coupled Genetic Algorithms, in which GAs evolve local populations attempting to maximize some local fitness function while at the same time modeling global behavior in the sense of searching for a global optimum. The author presents the model in terms of a non-cooperative N-player game with limited interaction but notes its general applicability. Another approach to using distributed GAs is presented in (Venkateswaran, Obradovic, & Raghavendra 1996). Here the distributed problems are homogeneous while the GAs are heterogeneous, in the sense that they employ differing crossover and mutation probabilities. Each of the distributed GAs works on a complete version of the same problem, with occasional exchanges of information among tasks. In (Neri & Giordana 1995), (Anglano et al. 1997), and (Anglano et al. 1998) the authors present a two level genetic algorithm designed for concept induction in propositional and first order logics. The system exploits niches and species for learning multi-modal concepts, in part through a distributed lower level GA architecture. The lower level GAs promote the growth of species in the distributed population. The higher level GA updates and evaluates a global description based on individuals chosen from the different species evolved at the first level, and provides a mechanism for favoring the evolution of those species which best fit the current global description. The following section formulates the underlying distributed inductive learning problem for the vertically partitioned feature space.

Distributed Inductive Learning For Vertically Partitioned Feature Space

Detecting patterns using data mining techniques first requires defining the notion of a pattern. Patterns can be viewed in terms of relations. In the case of unsupervised learning, patterns are interpreted as relations among the different members of the domain. In the case of supervised learning, relations among the different members of the domain and the corresponding members of the range (class labels or the output value to be predicted) are desired. Detecting such relations in a distributed fashion requires first taking a look at the basic steps of inductive learning.

Let us consider supervised learning. In supervised inductive learning, the goal is to learn a function $\hat{f}: X \rightarrow Y$ from the data set $S = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$ generated by an underlying function $f: X \rightarrow Y$, such that $\hat{f}$ approximates $f$. Any point $x = x_1, x_2, \ldots, x_n$ from the domain is an n-tuple, and the $x_i$-s correspond to the individual dimensions of the domain. In supervised learning, the range $Y$ denotes the space of class labels. Typically an inductive learning algorithm comes with a set of basis functions that are used for generating a representation of the underlying function,

$$f(x) = \sum_k w_k \Psi_k(x) \qquad (1)$$

where $\Psi_k(x)$ denotes the k-th basis function and $w_k$ denotes the corresponding coefficient. The objective of the learning algorithm is to generate the approximate function

$$\hat{f}(x) = \sum_k \hat{w}_k \Psi_k(x) \qquad (2)$$

where $\hat{w}_k$ denotes the approximate estimate of the coefficient $w_k$. Different learning algorithms use different basis functions defined through the chosen representation. For example, linear regression makes use of linear basis functions; similarity based evolutionary rule learning systems and decision trees can be viewed as learning processes using Walsh (Bethke 1976; Goldberg 1989a) and discrete Fourier transformations (Kushilevitz & Mansour 1991). Polynomial basis functions are also frequently used in modeling and system identification. Once the learning algorithm chooses its basis functions, the learning task reduces to estimating the coefficients of the significant basis functions. In a distributed data mining problem, the task is to do that using distributed observations of the learning dataset $S$.

When the feature space is vertically partitioned between, say, two data sites A and B, $S = \{S_a, S_b\}$, where $S_a = \{(x_a^{(1)}, y^{(1)}), (x_a^{(2)}, y^{(2)}), \ldots, (x_a^{(n)}, y^{(n)})\}$. Here $x_a^{(i)}$ denotes the i-th data point comprised of only the features observed at site A, a subset of all the features of $x^{(i)}$. We assume that the class label $y^{(i)}$ of the i-th data point is available at all the data sites. Since the dataset $S_b$ is non-local to site A, site A can only detect significant basis functions over the features observed at site A and estimate their corresponding coefficients. Similarly, site B can accomplish the same; however, basis functions of feature variables taken from both site A and site B cannot be evaluated without some kind of exchange of information between A and B. Let us illustrate the issue using monomial basis functions. We can write

$$\hat{f}(x) = \sum_i w_i x_i + \sum_{i,j} w_{i,j} x_i x_j + \sum_{i,j,k} w_{i,j,k} x_i x_j x_k + \cdots$$

If $x_1$ and $x_2$ belong to site A and site B respectively, evaluation of $x_1 x_2$ will not be possible without sharing some information among the data sites, whereas evaluation of the basis function $x_2 x_3$ is possible if $x_2$ and $x_3$ belong to the same data site. Clearly, the task of learning the significant basis functions and the corresponding coefficients is restricted by the partitioning of the overall feature space. Our goal is therefore to develop a learning algorithm that accurately detects the significant basis functions and their corresponding coefficients from data in a vertically partitioned feature space.
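To make the restriction concrete, here is a small hypothetical sketch (all names are illustrative): each site can evaluate monomials over its own features for every stored event, but a cross-site monomial such as $x_1 x_2$ can only be computed after some feature values are communicated:

```python
# Hypothetical sketch: evaluating monomial basis functions under a vertical
# partition. Site A stores x1 and x3; site B stores x2 and x4.
site_a = {"x1": [1.0, 0.0, 1.0], "x3": [0.5, 1.0, 0.0]}
site_b = {"x2": [0.0, 1.0, 1.0], "x4": [1.0, 0.5, 0.5]}

def product(values):
    p = 1.0
    for v in values:
        p *= v
    return p

def monomial(columns, names):
    """Evaluate the product of the named feature columns, row by row."""
    cols = [columns[n] for n in names]
    return [product(row) for row in zip(*cols)]

local_term = monomial(site_a, ["x1", "x3"])    # fine: both live at site A
# monomial(site_a, ["x1", "x2"]) would fail: x2 is not observed at site A.
merged = {**site_a, **site_b}                  # requires communication
cross_term = monomial(merged, ["x1", "x2"])
```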

From Equations 1 and 2 we can write

$$f - \hat{f} = \sum_k (w_k - \hat{w}_k) \Psi_k(x)$$

Now, summing over all the data points in the training set $S$,

$$\sum_{x \in S} (f - \hat{f})^2 = \sum_{j,k} (w_j - \hat{w}_j)(w_k - \hat{w}_k) \sum_{x \in S} \Psi_j(x) \Psi_k(x)$$

where we abbreviate $\Psi_j(x)\Psi_k(x)$ as $\Psi_{j,k}$. Note that the basis functions are assumed to be orthonormal: $\sum_x \Psi_j(x) \Psi_k(x) = 0$ when the sum is over all $x$-s in the space under consideration and $j \neq k$, while $\sum_x \Psi_j(x) \Psi_j(x) = 1$. Let us define a random variable $Z = \Psi_j(x_i) \Psi_k(x_i)$ for a randomly drawn $x_i$; then $E[Z] = 0$ when $j \neq k$. By the law of large numbers, the sample average $\frac{1}{n} \sum_{x_i \in S} \Psi_j(x_i) \Psi_k(x_i)$ approaches $E[Z] = 0$ as $n$ increases. Therefore, for large $n$, we can write

$$\sum_{x \in S} (f - \hat{f})^2 = \sum_j (w_j - \hat{w}_j)^2 \qquad (3)$$

Clearly, the overall sum square error is minimized when $\hat{w}_j = w_j$ for all $j$.
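The orthonormality argument also yields a direct estimator: each coefficient is approximately the sample average of $f(x)\Psi_j(x)$. Below is a minimal numerical check (an illustrative sketch, not the paper's code, using a Walsh/parity style basis over features coded as -1/+1, normalized so that $E[\Psi_j^2] = 1$ under uniform sampling):

```python
import random

def psi(x, subset):
    """Walsh basis function: product of the features indexed by `subset`."""
    p = 1
    for i in subset:
        p *= x[i]
    return p

def f(x):
    # A "true" function with known coefficients: f = 2*x0 - 1.5*x1*x2.
    return 2.0 * psi(x, (0,)) - 1.5 * psi(x, (1, 2))

random.seed(0)
sample = [tuple(random.choice((-1, 1)) for _ in range(3)) for _ in range(5000)]

# Cross terms vanish on average, so each coefficient can be estimated
# independently as the sample mean of f(x) * psi_j(x).
for subset in [(0,), (1,), (1, 2), (0, 1, 2)]:
    w_hat = sum(f(x) * psi(x, subset) for x in sample) / len(sample)
    print(subset, round(w_hat, 2))   # roughly 2.0, 0.0, -1.5, 0.0
```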

This derivation assumes that all the feature variables are observed and available for model building at the same time. Let us now investigate whether the situation changes when the feature space is vertically partitioned. Let us assume that the feature space is divided into two sets A and B with feature spaces $S_a$ and $S_b$ respectively. Let $B$ be the set of all basis functions under consideration, let $B_a$ and $B_b$ be the sets of all basis functions defined by feature variables in $S_a$ and $S_b$ respectively, and let $B_{ab}$ be the set of those basis functions in $B$ that use feature variables from both $S_a$ and $S_b$. Therefore $B = B_a \cup B_b \cup B_{ab}$. We write $j \in B_a$ to denote a basis function $\Psi_j(x) \in B_a$; we also write $j \notin B_a$ to denote a basis function $\Psi_j(x) \in B_b \cup B_{ab}$. Now let us explore what happens when one of these sites tries to learn $f(x)$ using only its local features. Let us define

$$\hat{f}_a(x) = \sum_{j \in B_a} \hat{w}_j \Psi_j(x) \qquad (4)$$

From Equations 1 and 4 we can write

$$f(x) - \hat{f}_a(x) = \sum_{j \in B_a} (w_j - \hat{w}_j) \Psi_j(x) + \sum_{j \notin B_a} w_j \Psi_j(x)$$

Using the above equation we can write

$$\sum_x (f(x) - \hat{f}_a(x))^2 = \sum_{i \in B_a,\, j \in B_a} (w_i - \hat{w}_i)(w_j - \hat{w}_j) \sum_x \Psi_{i,j} + 2 \sum_{i \in B_a,\, j \notin B_a} w_j (w_i - \hat{w}_i) \sum_x \Psi_{i,j} + \sum_{i \notin B_a,\, j \notin B_a} w_i w_j \sum_x \Psi_{i,j}$$

Now, again using the law of large numbers, we can write

$$\sum_x (f - \hat{f}_a)^2 = \sum_{j \in B_a} (w_j - \hat{w}_j)^2 + \sum_{j \notin B_a} w_j^2 \qquad (5)$$

Equation 5 tells us that $\sum_x (f - \hat{f}_a)^2$ takes its minimum value of $\sum_{j \notin B_a} w_j^2$ when $\hat{w}_j = w_j$. Although the minimum value of the error is non-zero, the optimal solution value of $w_i$, $\forall i \in B_a$, remains correct in the global context, even when all the features are considered together. The only difference between the global learning process and the local learning process is the error term $\sum_{j \notin B_a} w_j^2$ introduced by the basis functions defined by the feature variables not observed at site A. Therefore, in a vertically partitioned feature space the task of inductive learning can be divided into the following stages:

1. Evaluate the local basis functions ($B_a$, $B_b$) and estimate their corresponding coefficients using the local data.

2. Identify the portion of the data incorrectly modeled by each of the locally generated data models at the different sites. This contributes the error term $\sum_{j \notin B_a} w_j^2$ to the mean square error.

3. Identify the set of data ($I$) that is incorrectly modeled even after considering all the basis functions generated from the local feature spaces of all the data sites. This error is due to the lack of consideration of the basis functions ($B_{ab}$) defined by feature variables from multiple data sites.

4. Collect all the feature values corresponding to $I$ from all the data sites at a single data site and generate the basis functions ($B_{ab}$) defined by the cross-terms from different feature spaces across the different sites.

This process generates a distributed model of the data and requires minimal data traffic, without making the assumption of site specific problem decomposability. This is essentially the main concept behind the collective data mining technique proposed in this paper; a minimal code sketch of the four stages is given below. The following section further discusses the implementation specific issues of the collective data mining architecture.
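A minimal, hypothetical sketch of the four stages for two sites and a facilitator follows; `fit_local` and `predict` stand in for whatever inductive learner each site uses (the names are illustrative, not BODHI's API):

```python
# Hypothetical sketch of the four collective data mining stages.
def collective_mining(site_a, site_b, labels, fit_local, predict):
    # Stage 1: each site fits a model over its own local basis functions.
    model_a = fit_local(site_a, labels)
    model_b = fit_local(site_b, labels)

    # Stage 2: each site reports the indices of events it models incorrectly.
    wrong_a = {i for i in labels if predict(model_a, site_a[i]) != labels[i]}
    wrong_b = {i for i in labels if predict(model_b, site_b[i]) != labels[i]}

    # Stage 3: the facilitator keeps only the events that no local model
    # explains; these are attributed to cross-site basis functions.
    cross_idx = wrong_a & wrong_b

    # Stage 4: only those rows are shipped to a single site, where a model
    # over cross terms is learned. Communication is limited to cross_idx.
    merged = {i: {**site_a[i], **site_b[i]} for i in cross_idx}
    model_ab = fit_local(merged, {i: labels[i] for i in cross_idx})
    return model_a, model_b, model_ab
```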

Collective Data Mining

Collective data mining is a new approach for inductive learning from a distributed, vertically partitioned feature space. The collective mining architecture exploits the fact that, in general, the function to be learned can be inherently represented in a distributed fashion using a set of appropriate basis functions. The overall architecture of the algorithm is independent of the specific inductive learning algorithm used at the different sites for detecting patterns. Given a choice of different learning algorithms for different data sites, this approach offers a way to generate a global but distributed model of the overall dataset without necessarily assuming that the overall problem is decomposable according to the site specific partitioning of the feature space.

The overall architecture of collective mining is based on the decomposition of inductive learning into local and non-local basis function evaluation, discussed in the previous section. Each data site is provided with a program that analyzes the local data and evaluates the basis functions defined by the locally observed feature variables. Each of these programs is given a certain degree of autonomy in terms of choice of learning algorithms, communication, and data manipulation. We therefore choose to call them software agents. The agents are independent; however, they cooperate when needed. Agents cooperate with each other through the facilitator.

Figure 2 shows the overall learning architecture.

[Figure 2: The collective data mining architecture. A facilitator coordinates the DDM agents at Data Sites A, B, and C; each agent maintains the local components of the model, while the facilitator maintains the globally non-linear terms of the model.]

Although the figure shows only two levels of hierarchy, the proposed collective data mining architecture can be generalized to a multi-level implementation. The facilitator agent at a higher hierarchical level makes the overall decisions. The facilitator is also equipped with learning algorithms, and each of the agents can be a facilitator for lower level agents. The interaction among the agents during the learning task is briefly described in the following.

Like most machine learning algorithms, the collective mining architecture works in two modes, a learning phase and a testing phase. During the learning phase, first all the agents learn from their own local data. Once the local basis functions and their corresponding coefficients are identified by the individual agents, the indices corresponding to the data subsets incorrectly predicted by each of the agents are sent to the facilitator. In addition, the agents also send the strength, or confidence level, of a class prediction to the facilitator. The strength or confidence level of a prediction for a specific class may be evaluated by computing the percentage of correctly predicted outputs against the total number of outputs for the same class label in the learning databases. The facilitator in turn identifies the common data subset that is incorrectly predicted by all the agents and requests this data subset from all the agents. Once this data subset arrives, the facilitator runs an inductive learning algorithm of its choice to learn the basis functions defined using feature variables from different data sites.

During the testing phase, each of the agents makes its own local observation and its individual prediction; the agents send the predictions to the facilitator together with the associated confidence. The facilitator ranks the results according to the confidence level of each agent on that particular task. On the basis of a user-defined threshold on the confidence level, the agent predictions are identified as reliable or not reliable. If all the agents are considered unreliable by the facilitator on that particular task, the facilitator takes charge of that event and requests the corresponding observed feature values from the individual agents. Once the feature values arrive, the facilitator applies its own model/rules to make the final prediction. On the other hand, if the facilitator receives a prediction from an agent with high confidence, that becomes the predicted output of the overall system. A sketch of this decision rule is given below. The following section describes an experimental system, BODHI, under development, that implements the collective data mining architecture for scalable DDM from a vertically partitioned feature space.
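This testing-phase arbitration amounts to a simple thresholded decision rule. A minimal sketch (all names are illustrative; `query_features` stands in for the facilitator's request for the observed feature values):

```python
# Hypothetical sketch of the facilitator's testing-phase decision rule.
# Each agent reports a (prediction, confidence) pair; `threshold` is the
# user-defined confidence threshold mentioned in the text.
def facilitator_predict(agent_reports, threshold, global_model,
                        query_features):
    # Rank the agent predictions by reported confidence, best first.
    ranked = sorted(agent_reports, key=lambda r: r[1], reverse=True)
    best_prediction, best_confidence = ranked[0]
    if best_confidence >= threshold:
        # A reliable agent exists: its prediction is the system output.
        return best_prediction
    # All agents are deemed unreliable: pull the raw feature values from
    # the agents and apply the facilitator's own cross-term model.
    features = query_features()
    return global_model(features)
```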

The BODHI: An Architecture for Heterogeneous DDM

BODHI (Beseizing knOwledge through Distributed Heterogeneous Induction) is an agent based system under development for performing knowledge discovery from large, distributed databases with a vertically partitioned feature space. The BODHI system works independently of the specific learning algorithms used by the data site agents or the facilitators for mining the data. Furthermore, the communication system is intended to be platform independent. To this end, the system is being implemented in Java (rev 1.1). The system consists of four primary types of components: the Communication Control Process (CCP), the Agent Communication Processes (ACPs), the Machine Learning Agents (MLAs), and the User Interface (UI). In general, the CCP is responsible for coordinating communication and control flow between the various ACPs, which in turn act as the interface between the communication system and the MLAs. The UI is the interface between the user and the CCP. All control and communication commands issued by the user go through the UI, which communicates directly with the CCP only. Figure 3 presents the level one DFD/CFD for this system.

[Figure 3: Overall architecture of the BODHI system.]

Communication Control Process (CCP)

The Communication Control Process (CCP) is the central component of the communication system. It is responsible for passing control directives along from the User Interface (UI) to the Agent Communication Processes (ACPs), coordinating communication between the ACPs, and returning the status and results of the various Machine Learning Agents (MLAs), via the ACPs, to the UI. The primary component of the CCP is the Communication Manager. This component is responsible for all communication between the CCP and the ACPs. Each ACP has a child process (a Java thread) associated with it, referred to as a Communication Manager Child (CMChild), that is spawned off at runtime from the Communication Manager. When a control statement is issued via the UI, the control stream is passed through the CCP's Communication Manager to the appropriate child process, and is then passed along to the associated ACP. At this point the thread blocks and waits for a signal from the associated ACP, or for a control signal passed from the UI indicating that the action of the MLA is to be interrupted. When a given MLA has completed the tasks assigned in the control directives passed to it via the ACP, it sends a response to the child process of the Communication Manager. This response is some combination of a status indication and a presentation of results. The Communication Manager then sends the response to the UI, which in turn either sends a response directing the next action for the given MLA or prompts the user for further input. Figure 4 presents the details of the Communication Manager and its child processes.

[Figure 4: Architecture of the Communication Control Process (CCP).]

All communication between the CCP and the UI and the various ACPs is accomplished using streams routed via ssh (secure shell), specifically scp. scp is a file transfer protocol built on top of ssh, and should provide a level of security not available via direct stream transfers. The secondary task of the CCP is the initialization of the ACPs and their associated MLAs. To this end, the CCP contains an initialization module responsible for this task, referred to as the InitModule. The InitModule maintains a file of secure hosts that contain installed ACPs, and the locations of the local MLAs accessible to the ACPs at the various sites. When the user starts the process, the user is presented with a list of ACPs available to the system, and is allowed to choose which ACPs to use and what network topology is to be used. Furthermore, the associated MLAs are displayed, so as to allow the user to choose which MLA to use for a given ACP. Note that the MLAs must be installed on the client machine, the interface module for the given MLA must be present in the UI (see the section on the UI for details), and, finally, ssh and scp must be installed on the client machine. Once the user has indicated the configuration for the ACPs and their associated MLAs, the InitModule attempts to initialize the ACPs and, thereby, the MLAs. If there is a failure to do so, a message is sent to the UI, in order to allow the user to either reconfigure the system or remove the problem node from the system.

Agent Communication Processes (ACPs)

The Agent Communication Processes (ACPs) act as the interface between the Communication Control Process (CCP) and the Machine Learning Agents (MLAs). The ACPs serve two primary functions: to communicate shared data to other MLAs via their associated ACPs, and to pass control information between the MLAs and the CCP. When a given MLA is directed by the CCP to share data with another MLA, this task is accomplished via direct communication between the ACPs. This communication is accomplished via scp (secure copy). When so directed, the ACP receives a stream from the MLA, and then opens an scp session with the second ACP involved in the process. The data is then passed, and the session closed. The ACP is also responsible for passing control directives received via the CCP to the MLA, and returning status and results to the CCP for routing to the User Interface (UI). When the ACP receives a control directive, it passes it along to the MLA, and then waits for a response from the MLA. When it receives a response, it passes the response back to the CCP. While in general it is of no importance to the ACP what the content of the stream to be passed represents, it does matter in two cases: start-up and shutdown of the MLA. The former is taken care of when the ACP is initialized; one of the arguments passed to the ACP for initialization is the name of the associated MLA along with any command line arguments required for the MLA. When an ACP is directed to shut down, it first stops the MLA process, and then terminates itself, closing all streams it has opened when it does so. It should be noted that when an ACP is associated with an MLA that is acting as a facilitator, it is still required to receive its instructions from the CCP. Therefore, if a facilitator MLA needs to communicate with those agents for which it is responsible, the control sequences must still originate with the CCP. This keeps the control of the system centralized in the CCP, and also allows for real time updates in the UI.

Machine Learning Algorithms (MLAs)

This system overview is primarily concerned with the communication system, and not the specifics of the Machine Learning Algorithms (MLAs). However, there are several necessary criteria for all MLAs used in this system: the ability to send and receive data to be shared with other MLAs as a stream, the ability to halt and wait for a control stream, and the ability to send status as a stream. As this is a distributed system, it is important that the MLAs be able to communicate. This is accomplished via streams transferred through this system. Therefore, each MLA must be able to send and receive pattern data in the form of a stream. Furthermore, each MLA must be able to halt and wait for a control stream once it has completed its current set of instructions. The format of this control stream is determined by the UI module associated with the MLA (see the UI section for details); this format is not important to the ACP associated with the MLA. Finally, the MLAs must be able to send two types of data in the form of a stream: status data and results data. The status data must be in a format recognizable by the UI module associated with the MLA (see the UI section for details).
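Although BODHI itself is written in Java, the shape of an MLA that satisfies these three criteria can be sketched briefly; the line-oriented JSON message format below is an assumption made for illustration, not BODHI's actual stream format:

```python
import sys, json

# Hypothetical sketch of an MLA wrapper meeting the criteria in the text:
# it receives pattern data and control directives as streams, halts and
# waits for the next directive, and emits status/results as streams.
def run_mla(learn, predict):
    model = None
    for line in sys.stdin:                  # halt and wait for a directive
        msg = json.loads(line)
        if msg["cmd"] == "learn":
            model = learn(msg["data"])
            sys.stdout.write(json.dumps({"status": "done"}) + "\n")
        elif msg["cmd"] == "predict":
            result = predict(model, msg["data"])
            sys.stdout.write(json.dumps({"results": result}) + "\n")
        elif msg["cmd"] == "shutdown":      # shutdown is handled specially
            break
        sys.stdout.flush()                  # stream the response back
```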

The User Interface (UI)

The User Interface (UI) module of this system is both the high level control module and the user interface for the system. It is responsible for all user input, reports to the user, and coordinating the tasks performed by the Communication Control Process (CCP). The UI is intended to provide a basis for communication with the end user, and to be open to expansion as more types of Machine Learning Algorithms (MLAs) are added to the system. Furthermore, the UI is responsible for coordinating which control sequences to send to the CCP. The UI is therefore comprised of four types of modules: the Primary User Interface (PUI), the Sequence Control Module (SCM), the Machine Learning Algorithm User Interfaces (MLAUIs), and the CCP Interface (CCPInterface).

The PUI is the main interface with the end user. It has the ability to display the network topology and the overall results obtained by the system as a whole. Global start up and shut down are accomplished directly through the PUI module, as are the network topology and the communication sequencing between the MLAs.

The MLAUIs are the control and status interfaces for the various MLAs in the system. Each MLA in the system must have an associated MLAUI. The MLAUI is responsible for the configuration of each of the individual MLAs in the system, and is the interface used to specify all necessary parameters for a given MLA, including the role of a given MLA (e.g., single agent or facilitator). Note that if a node representing an MLA in the communication tree is not a leaf node, it is assumed by default to be a facilitator node. All parameters specific to a given MLA are entered into the MLAUI associated with the MLA, and results specific to the MLA are displayed in the MLAUI. The MLAUIs are accessible through the PUI, and appear as separate windows on the screen. Furthermore, the MLAUIs are responsible for decoding the results and status streams received from the MLAs. For the purpose of extensibility, a base MLAUI Java class is being developed that has the skeleton of the control structure and a set of abstract functions for the author of the MLA to overload in order to accomplish the control of the MLA.

The Sequence Control Module (SCM) is responsible for determining what each module should do at each step of the process. The SCM receives status reports from the CCP, and then determines, given the parameters entered into the PUI and MLAUIs, what should be done in the current state of the world. While results and status are still passed back to the appropriate module (PUI and/or MLAUIs), unless otherwise directed, the SCM is responsible for coordinating the run until completion. This includes the actions taken by a facilitator MLA when it is required to communicate with those agents that are under the facilitator MLA in the network structure. Once the parameters for the run are entered, the SCM is responsible for passing along the appropriate command strings and arranging data sharing between the MLAs. The SCM routes incoming messages through the PUI module, which routes them to the appropriate MLAUI module for decoding. The MLAUI module then responds with two messages: one for the SCM and one for the given MLA. The SCM schedules data transfers as designated by the former message, and routes the latter to the appropriate MLA.

The CCPInterface is responsible for all direct communication with the CCP. As with all of the other distributed processes, communication between the UI module and the CCP is accomplished via ssh and scp. For the purpose of routing information, a system of message tags is used; the tag for a specific process is comprised of the site name of the MLA concatenated with the process name. Figure 5 presents the overall structure of the UI module.

[Figure 5: Architecture of the user interface module.]

So far we have primarily addressed the issue of scalable learning from a vertically partitioned feature space. The proposed collective data mining architecture has been presented in a way that is independent of the specific choice of inductive learning algorithms used by the data site agents and the facilitators. However, the scalability of the overall DDM system also depends on the scalability of these learning algorithms. The following sections address this issue and develop a scalable gene expression based genetic algorithm.

Scalable Data Mining And The Genetic Algorithms

As noted earlier in this paper, the word "pattern" in the context of data mining is typically used to mean relations, captured in terms of rules, similarity based subsets, or associations among the search space dimensions. Therefore, a data mining algorithm can also be viewed as a search for appropriate rules, similarities, or other kinds of associations. Genetic Algorithms (GAs) (De Jong 1975; Goldberg 1989b; Holland 1975) fit quite well into this application: GAs can be used for finding any of these pattern types. Apart from these, the data mining process typically requires feature selection, model optimization, and system identification techniques, and GAs are also suitable for such applications. There exists a growing body of literature on the application of GAs to data analysis/mining problems. The following part of this section reviews some of this work.

Genetic Algorithm based data mining

Since machine learning algorithms find frequent applications in data mining, it is appropriate to review some of the early GA Based Machine Learning (GBML) systems. LS-1 (Smith 1980; 1983; 1984) is an example of one such early GBML system that used simple GA-like genetic operators to manipulate a population of production rules. It manipulated the representation at different levels of granularity reflecting the semantics of the representation, showing that results for genetic algorithms still remained valid. In GABIL (DeJong, Spears, & Gordon 1993), Disjunctive Normal Form (DNF) concept descriptions are evolved using an LS-1 style approach. This work is aimed at a single class learning application, and the goodness of a concept description is measured as the square of the number of examples correctly classified. The COGIN approach developed elsewhere (Greene & Smith 1993; 1994) addresses multi-class problem domains by introducing competition for coverage of training examples, encouraging the population to co-operatively solve the concept learning task. Each rule is a conjunction of attribute/value sets in binary coding. In this approach, the rules newly created using GA operators, together with the existing population of rules, are ranked in order of fitness and inserted one by one, in this rank order, into the next generation of the population, provided they cover some example in the training set which has not already been covered by a previously inserted rule. Any such redundant rule is discarded. The population size thus changes dynamically according to the number of rules required to cover the entire set of training examples. Fitness is based on an entropy measure, modified according to classification accuracy. Both single point and uniform crossover have been used. Recombination is applied to the left hand sides of rules only; the right hand side of a rule is assigned to be the majority class found within the training examples covered by the rule. The REGAL system (Neri & Giordana 1995) uses a similar coverage based approach for multi-concept learning. Each rule is evolved on its own, and co-operation within the population is encouraged through competition for coverage. This work introduced the Universal Suffrage selection operator, which selects together rules providing larger coverage. A parallel GA based approach (Cui, Fogarty, & Gammack 1993) is used for the identification of `good' and `bad' customers in a credit-scoring application. Solutions are evaluated using either a generic classification accuracy measure or an application-specific measure of profitability, and the results are compared with other classification algorithms (Bayes, k-nearest neighbors, and ID3). Typically, modified versions of the simple GA (De Jong 1975; Goldberg 1989b; Holland 1975) are used for most data mining applications. However, the scalability of the simple GA (e.g., the growth of the number of objective function evaluations with respect to an increasing number of features) has been seriously questioned on different occasions (Goldberg, Korb, & Deb 1989; Kargupta 1995; Thierens & Goldberg 1993). As a result, interest in data analysis using scalable genetic approaches (Kargupta et al. 1998; Whitley et al. 1997) is growing. The following section addresses the issue of scalable genetic algorithms.

Decomposing blackbox search/optimization

Scalability (the variation of performance quality with respect to growing problem difficulty, desired accuracy, reliability, and computational resources) of DDM algorithms is an important issue, since large databases and high dimensional feature spaces are typical characteristics of many common databases. Therefore the scalability of GAs is likely to play a critical role in their success in large scale DDM applications. This section focuses on the scalability issue of the GAs. It takes a serious look at the fundamental underlying search processes in GAs, points out some serious bottlenecks of the frequently used simple GA (De Jong 1975), and presents a new scalable genetic algorithm.

Understanding genetic algorithms (GAs) requires first understanding the foundation of non-enumerative blackbox search/optimization (BBO) algorithms. The goal of a BBO algorithm is to find solution(s) from the search space that extremize the objective function value beyond an acceptable criterion. Since for most interesting problems the search space is quite large, BBO algorithms, like GAs, depend on non-enumerative search, which is actually an inductive process. This section makes a note of that and offers a decomposition of the underlying processes in a non-enumerative BBO algorithm using the Search Envisioned As Relation and Class Hierarchizing (SEARCH) framework.

The SEARCH framework proposed elsewhere (Kargupta 1995; Kargupta & Goldberg 1996) studies the fundamental issues in BBO by decomposing it into searches in (1) relation, (2) class, and (3) sample spaces (Figure 6). The foundation of SEARCH is the fact that induction is an essential part of non-enumerative BBO: in the absence of any analytic information about the objective function structure, a BBO algorithm must guess based on the samples it takes from the search space. SEARCH also notes that induction is no better than table look-up unless we restrict the scope of the inductive search algorithm to a finite set of relations(1) among the search space members.

[Figure 6: BBO decomposition in relation, class, and sample spaces, illustrated with similarity based equivalence relations; f denotes a position of equivalence and the # character matches any binary value.]

(1) A relation is defined as a set of ordered tuples. A class is a tuple of elements taken from the domain under consideration. In this document we are primarily concerned with tuples taken from the space of n-ary Cartesian products of the search domain with itself. Equivalence relations are reflexive, symmetric, and transitive; similarity based equivalence relations among a space of binary sequences define equivalence based on similarity among the sequences.

If relations are important to consider, then we should pay careful attention to determining which relation is "appropriate" and which is not. Let us take an example to illustrate the idea. Say we have a few people sitting in a room and we would like to identify the person with the highest amount of money in his or her pocket. If we want to do any better than enumeration, i.e. exhaustively picking every person in the room and checking the person's pocket for the amount of money the person has, we must make intelligent guesses by observing certain features of the people (e.g. quality of dress, shoes, etc.). If we consider "all possible features" we are again back to enumeration (Watanabe 1969; Mitchell 1980). We must instead consider a certain finite set of features, which defines the bias of the process. Features like quality of dress define relations among the set of people. Depending on what we mean by "quality of dress", such a relation may divide the set of people into different classes, such as cheaply dressed people, very expensively dressed people, and so on. We consider hypotheses defined by the feature set, use them to divide the search space into different classes, and evaluate the hypotheses using samples taken from the search domain. The set of features that we restrict our attention to may be pre-determined or dynamically constructed during the course of induction. The decomposition of BBO in SEARCH into relation, class, and sample spaces essentially captures this idea.

Two important underlying processes of a BBO algorithm are (1) the construction of a partial ordering, followed by selection, among relations, and (2) the construction of a partial ordering, followed by selection, among classes. The former step is essential since some relations are inherently good and some are not. For example, "quality of dress" may be a good relation for the above mentioned problem, whereas "color of the hair" may not. In SEARCH, relations that are inherently good for decision making are said to properly delineate the search space. If we construct a partial ordering among the classes defined by a relation of order k (the logarithm of the number of classes defined by the relation), select the "top" ranked classes for further exploration, and the class containing the optimal solution is among those selected classes, then we say that the order-k relation properly delineates the search space. The search for appropriate relations and classes can be viewed as decision making processes in the relation and class spaces respectively; SEARCH offers a general probabilistic and approximate framework for doing so. If the relation space provided a priori to the search algorithm contains all the relations needed to solve a problem, and the order of all of these suitable relations is bounded from above by some constant k, then the given problem can be solved with sample complexity (loosely, the number of samples taken to solve the problem) polynomial in problem size, solution quality, and success probability. This class of problems is called the class of order-k delineable problems.
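These two decision processes can be rendered as a toy sketch (illustrative only, not the SEARCH machinery itself): sample the space, group the samples into the classes (schemata) defined by each order-2 similarity relation, and rank the classes of each relation by mean sample fitness:

```python
import itertools, random

def fitness(s):
    # Any blackbox objective over bit strings; "count of 1-s" is a stand-in.
    return sum(s)

random.seed(1)
samples = [tuple(random.randint(0, 1) for _ in range(4)) for _ in range(64)]

# Each pair of string positions defines an order-2 similarity relation;
# its classes (schemata) are ranked by the mean fitness of their samples.
for positions in itertools.combinations(range(4), 2):
    classes = {}
    for s in samples:
        schema = tuple(s[p] for p in positions)
        classes.setdefault(schema, []).append(fitness(s))
    ranked = sorted(classes, key=lambda c: -sum(classes[c]) / len(classes[c]))
    print("relation", positions, "top class", ranked[0])
```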

SEARCH points out that, since induction is an essential part of BBO, the search for appropriate relations is critical. Instead of looking for better solutions from the beginning, SEARCH advocates that a BBO algorithm (1) detect the structure of the search space, inducing relations and classes to capture it, and (2) identify desired quality solutions by guiding the search along the detected structure. A detailed description of each of these processes can be found elsewhere (Kargupta 1995). The following section briefly describes the simple GA and identifies at least one major problem of the simple GA that may lead to poor scalability.

The Simple Genetic Algorithms

The simple Genetic Algorithm (sGA) (De Jong 1975; Goldberg 1989b; Holland 1975) is a popular class of genetic algorithms. The simple GA uses operators like selection, crossover, and mutation to explore the search space adaptively in order to maximize or minimize the given objective function (sometimes called the fitness function). The simple GA typically uses a sequence representation: the search variables are represented as a sequence (often called a chromosome) of symbols chosen from some given alphabet set, and the slot corresponding to any entry in the sequence is called a gene. Popular approaches include binary, Gray, and real valued codings of the search variables. A simple GA starts from a randomly generated population and iteratively applies the search operators (selection, crossover, and mutation) to this population to produce a new population of chromosomes. The main search operators are briefly described below.

1. Selection: Compute the objective (fitness) function values of all the chromosomes. Make more copies of the chromosomes with higher fitness and use these additional copies to replace the chromosomes of the population that have worse objective function values.

2. Crossover: The crossover operator is usually applied to the population with a high probability. There are several types of crossover operators in the GA literature. A simple one-point crossover first picks two chromosomes from the population at random. Next, it picks a random crossing point (i.e. a slot in the chromosome) that divides each chromosome into two halves. This is followed by swapping either the left or the right halves.

3. Mutation: Simple mutation is usually a low profile operator that changes the value of a gene with some low probability.

Although the sGA explicitly processes a population of chromosomes, a better understanding of the underlying search may be obtained by investigating the processing of schemata (similarity based equivalence classes) in the population (Holland 1975). In a sequence representation, a similarity based partition divides the space of all sequences (chromosomes) into different schemata. For example, in a 4-bit representation the schema (singular of schemata) 11## denotes the set of all strings that start with 11 (i.e. the set {1100, 1101, 1110, 1111}). The corresponding partition can be represented by ff##, where f denotes a position of equivalence and # denotes the wild card character. Partition ff## divides all strings into four schemata, namely 00##, 01##, 10##, and 11##. The effect of selection, crossover, and mutation applied to the population can also be interpreted in the space of partitions and schemata. For a given population of strings and given GA operators, the so called schema theorem (Holland 1975) can be used to determine an expected bound on the growth or destruction of schemata from generation to generation.

The simple GA has been quite successful in solving many different problems (Goldberg 1989b; Mitchell 1996); however, it is no magic. Like any other BBO algorithm, the sGA is fundamentally based on an inductive search process, so the observations made by the SEARCH framework apply equally to the sGA. The success of GAs depends on several factors, including (1) detection of schemata that capture the desired solution(s), (2) interaction between schemata and genetic operators, (3) population size, and (4) problem difficulty. Although schemata and partitions are often used as tools to understand the underlying behavior of GAs, the sGA has no mechanism for explicit processing of schemata and partitions, and this implicit perspective alone falls far short of delivering reliable performance for schema and partition detection.

A simple illustration can be given using the deceptive trap function (Ackley 1987): f(x) = k if u = k, and f(x) = k - 1 - u otherwise, where the unitation u is the number of 1-s in the string x and k is the length of the sub-function. This function is widely reported to be difficult for the simple GA, since low order partitions lead the sGA in the wrong direction. If we observe this trap function carefully, we note that it has two peaks: one corresponds to the string of all 1-s and the other to the string of all 0-s. The solution with all 1-s is the optimal one. Let us construct a test function by concatenating k-bit trap functions one after another, so that each consecutive group of k bits defines a separate trap function. The overall concatenated function can be linearly decomposed into order-k sub-functions. To solve this problem efficiently, the sGA must either be informed of the related bits, or the related bits must be adaptively detected by selecting the appropriate partitions and schemata. The lack of a mechanism for explicit partition and schema detection makes the sGA perform very poorly on such problems. Figure 7 shows the result of a typical sGA run for a 36-bit objective function comprised of six trap sub-functions, each of size six bits; the sGA fails to obtain the optimal solution. A minimal sketch of the trap objective is given below.
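A minimal sketch of the concatenated trap objective just described (here with k = 6, matching the 36-bit experiment of Figure 7):

```python
# Minimal sketch of the concatenated deceptive trap objective: each
# consecutive k bits form one trap; u is the number of 1-s in the block.
def trap(bits, k):
    u = sum(bits)
    return k if u == k else k - 1 - u

def concatenated_trap(x, k=6):
    """Objective for a bit string x whose length is a multiple of k."""
    return sum(trap(x[i:i + k], k) for i in range(0, len(x), k))

assert concatenated_trap([1] * 36) == 36    # global optimum (all 1-s)
assert concatenated_trap([0] * 36) == 30    # deceptive peak (all 0-s)
```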

[Figure 7: The variation of the best fitness value of an sGA population over 300 generations on the 36-bit concatenated trap function. The optimal solution has a fitness value of 36. The sGA used population size = 100, crossover probability = 0.7, mutation probability = 0.001, and binary tournament selection.]

There exist many real life problems in which the underlying non-linearity of the problem results in higher order problem delineability. As a result, success demands an effective search for appropriate partitions and schemata. This problem of detecting appropriate relations and classes is traditionally called linkage learning in the GA literature. Although by definition linkage learning is not necessarily restricted to similarity based relations and classes, in the remainder of this paper linkage learning will be restricted to only those special cases. Linkage learning is essentially the problem of detecting appropriate bases of the underlying problem representation; a general definition of linkage learning can be found elsewhere (Kargupta 1998; Kargupta & Bandyopadhyay). There is a growing consensus that scalable linkage learning is essential for the success of GAs in search, machine learning, optimization, and data mining problems. However, the need for linkage learning was realized even at the early inception of the GAs. The following section describes the related work since the dawn of genetic algorithm research.

Linkage learning in Simple GA

The efficacy of the implicit processing has been questioned since the inception of the GAs, and several efforts have been made to design GAs capable of explicit detection of significant partitions and schemata. The history of linkage learning efforts dates back to Bagley's dissertation (Bagley 1967). Bagley used a representation in which the gene explicitly contains both the position and the allele value. For example, the string ((0 1)(2 0)(1 1)) corresponds to the string 110 in a fixed-locus representation of the simple GA. Bagley used the so-called inversion operator for adaptively clustering the related genes that define good partitions and schemata. The inversion operator works by reversing the order of the genes lying between a pair of randomly chosen points along the chromosome. Although this mechanism was used for generating new tightly coded partitions, Bagley's work provides no mechanism for accurate evaluation of the partitions. Moreover, the introduction of the inversion operator restricted the use of the GA crossover operator, and Bagley did not conclude in favor of the use of inversion. Rosenberg (Rosenberg 1967) also investigated the possibility of learning linkage by evolving the probability of choosing a location for crossover. Although this approach does not rigorously search for appropriate partitions, an adaptive crossover point may be able to process schemata with widely separated fixed bits better than a single-point crossover. Frantz (Frantz 1972) investigated the utility of the inversion operator and, like Rosenberg, reported that inversion is too slow and not very effective. Holland (Holland 1975) also realized the role of linkage learning and suggested the use of the inversion operator despite its reported failure in earlier studies. Goldberg and Lingle (Goldberg & Lingle 1985) introduced a new PMX crossover operator that could combine the ordering information of the selected regions of the parent chromosomes. They concluded that this approach has more potential than the earlier approaches. Schaffer and Morishima (Schaffer & Morishima 1987) introduced a set of flags in the representation, used for identifying the set of genes to serve as crossover points. For different test problems, they noted the formation of certain favorite crossover points in the population, which corroborated their hypothesis regarding the need for detecting gene linkage. Goldberg and Bridges (Goldberg & Bridges 1990) confirmed that lack of linkage knowledge can lead to failure of GAs on difficult classes of problems, such as deceptive problems. Additional efforts on linkage learning GAs can be found elsewhere (Levenick 1991; Paredis 1995; Smith & Fogarty 1996). Harik introduced the LLGA (Harik 1997), which made an effort to learn linkage by introducing the so-called exchange crossover operator and a probabilistic expression based representation. In addition to the growing empirical evidence of the need for explicit linkage learning algorithms in GAs, theoretical advancements have also started corroborating these observations. The efficacy of such implicit processing of relations has been seriously questioned on theoretical grounds elsewhere (Goldberg, Korb, & Deb 1989; Kargupta 1995; Thierens & Goldberg 1993). Thierens and Goldberg (Thierens & Goldberg 1993) showed that the simple GA fails to scale up for the class of problems with only order-k significant partitions, unless information about the appropriate partitions is provided by the user.

The main reasons behind this lack of scalability are the merger of the relation, class, and sample spaces into a single population and the lack of adequate effort to methodically search for the appropriate order-k partitions. The sGA also has some additional problems in the context of efficient partition search. A single sample from the search space can be used for the evaluation of all the relations under consideration, because that sample must belong to some schema defined by any partition. This is often called implicit parallelism in the GA literature. Although this can be exploited in a very systematic manner when relations are methodically processed, the implicit processing of schemata makes this quite noisy in the sGA. These observations regarding the problems of the simple GA in searching for appropriate partitions and schemata resulted in the development of a different, promising class of evolutionary algorithms that pay attention to the linkage learning issue. The following section describes one such algorithm, called the gene expression messy genetic algorithm.


The Gene Expression Messy GA

The Gene Expression Messy Genetic Algorithm (GEMGA) (Kargupta 1996a; 1996b; Bandyopadhyay, Kargupta, & Wang 1998) offers an efficient approach to learning linkage. In GEMGA, the problem of associating similarities among chromosomes with similarities of their corresponding fitness values is posed as a problem of detecting approximate symmetry. Symmetry can be defined as an invariance in the pattern under observation when some transformation is applied to it. Similarities among the fitness values of the members of a schema can be viewed as a kind of approximate symmetry that remains invariant under any transformation satisfying the closure property of the schema. The GEMGA identifies the fitness symmetry (in an approximate sense) by identifying symmetry-breaking dimensions, and uses them to detect the underlying linkage. This approach appears to be quite successful in properly delineating the partitions of an optimization problem. An additional strength of this method is its intelligent crossover, which explicitly works to preserve linkage information detected earlier in the algorithm. Unlike other approaches (e.g., those of Deb and Harik), which relied on a position-independent coding and tried to obtain a "tight linkage", this approach makes no such attempt; rather, it starts by detecting local fitness symmetry and uses that for a bottom-up construction of global linkage information. The following sections review the recently proposed version (v.1.3) (Bandyopadhyay, Kargupta, & Wang 1998) of the GEMGA.

Representation

The GEMGA uses a sequence representation. Each sequence is called a chromosome, and every member of this sequence is called a gene. A gene is a data structure that contains a value and a capacity. The value, as is obvious, contains the value of the gene, which can be any member of the alphabet set Σ. The capacity associated with every gene takes a positive real value. The chromosome also contains a dynamic list of lists called the linkage set. The linkage set of a chromosome is a list of weighted lists. Each member consists of a list, termed the locuslist, which defines a set of related genes, and three factors: the weight, the goodness, and the trials. The weight is a measure of the number of times that the genes in the locuslist are found to be related in the population. The goodness value indicates how good the linkage of the genes is in terms of its contribution to the fitness; this value is normalized between 0 and 1, and is initialized to 0. The trial field indicates the number of times this linkage set has been tried. (Note that if the trial of any element of the linkage set is zero, then its goodness is temporarily assumed to be 1, unless proved otherwise.) The linkage set space over all genes defines the relation space of the GEMGA. Figure 8 shows the structure of a chromosome in GEMGA. Unlike the original messy GA (Deb 1991; Goldberg, Korb, & Deb 1989), no under- or overspecifications are allowed. A population in GEMGA is a collection of such chromosomes.

Figure 8: Structure of a chromosome in GEMGA. Each chromosome holds genes 0 through l-1 (each with a locus, a value, and a capacity) and a linkage set whose elements each contain a locuslist of loci together with weight, goodness, and trial fields.
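The chromosome structure just described (and drawn in Figure 8) maps naturally onto a small data structure. The sketch below is our own rendering of that description, assuming a binary alphabet; the class and field names are illustrative.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Gene:
        locus: int               # position of the gene in the chromosome
        value: int               # current value, drawn from the alphabet set
        capacity: float = 1.0    # positive real; 0 marks a symmetry-breaking dimension

    @dataclass
    class LinkageElement:
        locuslist: List[int]     # loci of the genes believed to be related
        weight: float = 1.0      # how often this grouping was observed
        goodness: float = 0.0    # normalized contribution to fitness, in [0, 1]
        trials: int = 0          # number of times this grouping was tried

        def effective_goodness(self):
            # An untried linkage element is optimistically assumed good.
            return 1.0 if self.trials == 0 else self.goodness

    @dataclass
    class Chromosome:
        genes: List[Gene]
        linkage_set: List[LinkageElement] = field(default_factory=list)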

Operators

The GEMGA has three primary operators, namely: (1) Transcription, (2) PreRecombinationExpression, and (3) RecombinationExpression. Each of them is briefly described below.

Transcription: The GEMGA transcription operator plays an important role in the detection of schemata. It detects local symmetry in the fitness landscape by noting the relative invariance of the fitness values of chromosomes under transformations that perturb the chromosome one gene at a time. It changes the current value of a gene to a different value, randomly chosen from the alphabet set, and notes the change in fitness value. If the fitness deteriorates because of the change in gene value, that gene is identified as a symmetry-breaking dimension and the corresponding gene capacity is set to zero, indicating that the value at that gene cannot be changed. On the other hand, if the fitness improves or does not change at all, the capacity of that gene is set to one, indicating that this dimension offers symmetry with respect to the pattern of improvement in fitness. Finally, the value of the gene is restored to its original value and the fitness of the chromosome is restored to the original fitness. This process continues for all the genes, and finally all the genes whose capacities are reduced to zero are collected in one set, called the initial locuslist. This is stored as the first element of the linkage set associated with the chromosome, with its weight, goodness, and trial factors initialized to 1, 0, and 0 respectively. The transcription operator does not change anything in a chromosome except the capacities, and it initiates the formation of the linkage sets. Any symmetry that remains true over the whole search space also remains true within a local domain; locally detected schemata are therefore next evaluated in a population-wide global sense, as described below.

PreRecombinationExpression: The PreRecombinationExpression stage detects schemata that capture symmetry beyond the small local neighborhood defined by the gene-wise perturbation of transcription. It determines the clusters of genes precisely defining the relations among those instances of genes. It consists of two steps: ResolveLinkage and GetFinalLinkage.

ResolveLinkage: Each chromosome in the population undergoes a fixed number (No_Linkage_Exp) of ResolveLinkage operations with different members of the population other than itself. During a ResolveLinkage operation, those genes which are members of the initial locuslist (constructed during transcription) of both chromosomes and have the same value and capacity are grouped together in a new set. If the set is already present in the linkage set of the first chromosome, then the weight of the corresponding locuslist is increased by an amount INCR_WEIGHT; otherwise it is included as a new element of the linkage set.

GetFinalLinkage: A conditional probability matrix P is then constructed, where the entry P[i, j] (i ≠ j) denotes the probability of finding gene i in a locuslist given that gene j is already present; the diagonal entry P[i, i] denotes the probability of a locuslist containing gene i only. The maximum probability max[i] in each row is calculated, and those entries which are less than max[i] - EPSILON are replaced with 0. A new linkage set is then calculated by collecting the nonzero entries of each row into a locuslist and using the mean of the corresponding probabilities as its weight. The set is added to the new linkage set if its weight is greater than WEIGHT_THRESH. The addition of a new locuslist is done in the same way as during the resolution phase, by checking for the existence of another locuslist with the same members as the new one.

RecombinationExpression: The RecombinationExpression phase is the selecto-recombinative phase of the algorithm and is run a fixed number (No_Gen) of times. It also consists of two steps. First a mating pool is created by performing binary tournament selection on the population. Then the GEMGA recombination operator is applied iteratively over pairs of chromosomes. First, copies of the given pair are made, and one of them is marked. An element of the linkage set of the marked chromosome is selected for swapping, based on a linearly combined factor of its weight and goodness. The corresponding genes are swapped between the two chromosomes provided the goodness values of the disrupted locuslists of the unmarked chromosome are less than that of the selected one. The linkage sets of the two chromosomes are adjusted accordingly. Depending on whether the fitness of the unmarked chromosome decreases or not, the goodness of the selected linkage set element is decreased or increased. Finally, only two of the four chromosomes (including the two original copies) are retained (Bandyopadhyay, Kargupta, & Wang 1998).
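As an illustration, a minimal sketch of the transcription step follows, written against the Gene/Chromosome/LinkageElement structures sketched earlier; the whole-chromosome fitness signature and the binary ALPHABET constant are assumptions made here for concreteness.

    import random

    ALPHABET = (0, 1)   # assumed binary alphabet

    def transcription(chrom, fitness):
        """Perturb one gene at a time and record symmetry-breaking dimensions."""
        base = fitness(chrom)
        locuslist = []
        for gene in chrom.genes:
            original = gene.value
            # Change the gene to a different, randomly chosen alphabet member.
            gene.value = random.choice([a for a in ALPHABET if a != original])
            if fitness(chrom) < base:
                gene.capacity = 0.0          # symmetry-breaking dimension
                locuslist.append(gene.locus)
            else:
                gene.capacity = 1.0          # this dimension offers symmetry
            gene.value = original            # restore the original value
        # The initial locuslist becomes the first linkage set element,
        # with weight, goodness, and trials initialized to 1, 0, and 0.
        chrom.linkage_set.insert(0, LinkageElement(locuslist, 1.0, 0.0, 0))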

Population sizing

In order to detect a schema, the GEMGA requires that the population contain at least one instance of that schema. If we consider the population size to be a constant and the population to be randomly initialized, then this can be guaranteed only when the order (the number of fixed positions) of the schema to be detected is bounded by some constant k. For a sequence representation with alphabet set Σ of cardinality Λ, a randomly generated population of size Λ^k is expected to contain one instance of every order-k schema. The population size in GEMGA is therefore m = cΛ^k, where c is a constant. Although we treat c as a constant, c is likely to depend on the variation of the fitness values of the members of the schema. Note that the population size is independent of the problem size ℓ. For all the experiments reported in this work, the population size is kept constant.
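As a worked example of this sizing rule (our own arithmetic, not from the original experiments): for a binary representation (Λ = 2) and sub-problems of bounded order k = 6, such as the six-bit trap functions of Figure 7, m = cΛ^k = c·2^6 = 64c chromosomes suffice in expectation, whether the overall problem length is ℓ = 36 or ℓ = 36,000.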


The algorithm

The overall structure of the GEMGA is summarized below:

1. Initialization: Randomly initialize the duly sized population.

2. PrimordialExpression: Detect schemata that capture local fitness symmetry using the so-called transcription operator. Since the population size is m = cΛ^k, this can be done in O(Λ^k ℓ) time.

3. PreRecombinationExpression: Identify schemata that capture fitness symmetry over a larger domain. This only requires comparing the chromosomes with each other; no additional function evaluation is needed.

4. RecombinationExpression: (a) Create a mating pool using binary tournament selection. (b) GEMGA recombination: the GEMGA uses a recombination operator, designed with motivation from the cell meiosis process, that combines the effects of selection and crossover and reconstructs and modifies the schema linkage sets and their parameters. (c) Mutation: low-probability mutation, as in the simple GA. All the experiments reported in this work used a zero mutation probability.

The primordial expression stage requires O(Λ^k ℓ) objective function evaluations. The PreRecombinationExpression stage requires O(2^k) pair-wise similarity computations (and no objective function evaluations). The length of the recombination stage can be roughly estimated as follows. Let t be the total number of generations in the juxtapositional stage and let selection give s copies to the best chromosome. If selection dominates the crossover, every chromosome of the population converges to the same instance when s^t = m, so that

t = log m / log s.    (6)

Substituting m = cΛ^k, we get

t = (log c + k log Λ) / log s.    (7)

Therefore, the number of generations in the recombination expression stage is O(k). This result holds when selection is allowed to give an exponentially increasing number of copies. The overall number of function evaluations is bounded by O(Λ^k ℓ). This analysis assumes that the cardinality of the alphabet set of the chosen representation is bounded by a small constant (e.g., two in the case of a binary representation).
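To make equations (6) and (7) concrete (again, our own arithmetic): with a binary alphabet (Λ = 2), c = 1, and order k = 6, the population size is m = 2^6 = 64; if selection gives s = 2 copies to the best chromosome per generation, convergence takes t = log 64 / log 2 = 6 generations, i.e., exactly k, illustrating the O(k) bound.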

The GEMGA has been applied to solve a feature selection problem using a power distribution network fault-diagnosis data set. The following sections describe a GEMGA based feature selection algorithm that makes use of a nearest-neighbor classifier for predicting unknown class labels.

The Learning Algorithm: Combining The kNN and The GEMGA

The application problem considered in this paper requires the prediction of unknown class labels using observations of a large number of features, made in a distributed fashion. Although the collective learning framework works fine with different learning algorithms, in this paper we consider all agents equipped with a GEMGA based learning algorithm. Details of this approach are given below.

The simple kNN

The k-Nearest Neighbor (kNN) classification algorithm assumes the existence of an appropriate distance metric in the feature space. In this work we use the Euclidean distance metric. Given a new sample with an unknown class label, the algorithm first determines the k nearest neighbors based on the chosen distance metric. The class label is then predicted as the majority class label among those k nearest neighbors.
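A minimal sketch of this prediction rule, assuming the training data is given as a list of (feature_vector, label) pairs:

    from collections import Counter
    import math

    def knn_predict(train, x, k=3):
        # Euclidean distance between two feature vectors.
        def dist(u, v):
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
        # The k training samples closest to x under the chosen metric.
        neighbors = sorted(train, key=lambda sample: dist(sample[0], x))[:k]
        # Predict the majority class label among the k nearest neighbors.
        return Counter(label for _, label in neighbors).most_common(1)[0][0]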

The GEMGA based kNN

In a simple kNN, all the feature variables are considered in the computation of the distance values. However, in a high dimensional space typically not all the features are required for classification, and moreover the data may be noisy. Therefore a kNN using a selected feature subset often works much better than a simple kNN. However, selecting the appropriate subset from a large set of features is a difficult search problem. An example of a simple GA based approach to kNN feature selection can be found elsewhere (Punch et al. 1993). In this work we use the GEMGA for kNN feature selection. A GEMGA chromosome is a binary string in which each binary variable is associated with one unique feature; if a binary variable takes the value 1, the corresponding feature is considered in the kNN, and if it is 0 it is not considered. The length of the chromosome is therefore the same as the total number of features available in the given problem. The objective function of a chromosome is evaluated by measuring the performance of the kNN using the feature subset suggested by the chromosome. The objective function is chosen to be f(x) = (TotPats - CorrectPats) / TotPats, where TotPats is the total number of patterns to be examined and CorrectPats is the number of patterns correctly classified by the kNN algorithm. In order to make the nearest-neighbor search efficient, the training data is clustered by class and each class is in turn represented by the mean vector of the class members. The nearest-neighbor search is first performed among these mean vectors. Once the closest class mean vectors are identified, those class members are then searched for the k nearest neighbors. The following section presents the experimental results for a diagnostic model of electrical distribution systems.
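The objective function above can be sketched directly on top of the knn_predict routine from the previous section; the binary feature mask plays the role of a GEMGA chromosome's value string. The leave-one-out evaluation below is our own assumption, made only to keep the example self-contained:

    def feature_subset_error(mask, data, k=3):
        # mask: 0/1 sequence; feature i enters the distance only if mask[i] == 1.
        def select(x):
            return [xi for xi, mi in zip(x, mask) if mi == 1]
        masked = [(select(x), y) for x, y in data]
        correct = 0
        for i, (x, y) in enumerate(masked):
            rest = masked[:i] + masked[i + 1:]        # leave one pattern out
            if knn_predict(rest, x, k) == y:
                correct += 1
        # f(x) = (TotPats - CorrectPats) / TotPats: the fraction misclassified.
        return (len(masked) - correct) / len(masked)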


Application Of Collective Data Mining For Power Distribution Systems Diagnostics

This section presents a distributed approach to the problem of fault prediction in electrical distribution networks. As these systems are large and complex, distributed data analysis may reduce costs considerably through a large reduction in operating times. In the following sections, a review of fault diagnosis in electrical distribution systems is first presented. Next, the current scope of the application problem is described. Finally, the experimental results on a radial distribution network are presented.

Background

Failure of power distribution lines can cause serious problems for modern society. Faults occurring in large distribution systems must be identified quickly, and a fast decision-making process for implementing the service restoration strategy has to be carried out. Moreover, the growing size of distribution systems, e.g., in large cities, does not always allow full data collection for centralized fault identification and location detection. For these systems, as for many other complex physical systems, distributed information processing is a necessity more than a viable option. An efficient approach to the problem of fault identification and location detection in large distribution systems is proposed here. For these problems, it is often necessary to update the knowledge of the diagnostic systems in accordance with the increasing amount of data. One of the aims of machine learning in this field is to provide an adaptive algorithm which is able to perform fault diagnosis without extensive help from human beings. An increasing amount of work is being reported that makes use of AI based techniques for addressing the diagnostic problem in distribution systems. In (Momoh, Dias, & Laird 1997) the implementation of an integrated design for fault diagnosis in grounded and ungrounded distribution systems is presented. The whole architecture is comprised of different modules. One of them uses a hybrid Artificial Neural Network based approach for phase classification and fault location; the other modules also perform feature extraction. All these functions are performed at the High Voltage (HV)/Medium Voltage (MV) sub-station. The type of fault studied is single-line-to-ground. Wen (Wen & Chang 1997) and Chang (Chang et al. 1997) deal with the uncertainties in fault identification and location. The former uses a probabilistic approach together with a refined genetic algorithm, using local search operators and different crossover operators within the same strategy. Here, the genetic algorithm is used for generating and progressively evolving better hypotheses until one can explain the input signals reported from the alarm signals. In the latter, an expert system based on a fuzzy logic ranking of hypotheses is used. (Teo 1995) presents a learning system for fault identification and location detection; the learning is carried out on the basis of fault patterns provided by a network simulator. Elsewhere (Zhu, Lubkeman, & Girgis 1997), the information obtained from transient recordings is integrated with a knowledge base related to the feeder. The first estimate of the fault location is then adjusted by means of real-time measurements combined with a fault diagnosis algorithm. In this way, the approximations inherent in the system's modeling are taken into account. In all the approaches described above, the diagnostic systems work in a centralized way. Real-time information regarding the changing topology and loading condition of the network therefore has to be conveyed to the HV/MV sub-station, where any decision has to be taken; from this sub-station the control signals are then sent to the remotely controlled circuit-breakers. In the present paper, the system learns by example in a physically distributed fashion from a set of adequate input-output tuples generated by means of a simulation model developed in (Augugliaro et al. 1996). The input features are those electrical features that can be measured at different measurement points in an MV radial distribution network. Each control point provides a set of measured electrical features to the local fault diagnosis system, by means of which the local diagnosis system, either in a self-sufficient fashion or in a collective way, is able to identify all the considered faults. In the present approach, the overall architecture is organized in different levels. The first level is the local one, at which each metering station is able to identify and locate a certain subset of the considered faults. For the faults that cannot be identified and located at a single metering point, a collective approach is used. Therefore, some of the Medium Voltage/Low Voltage sub-stations are forced to co-operate for the sake of accurate identification of all the considered faults. In the following section, the modeling of the diagnostic problem is described.

Problem description

The distribution network under consideration is a typical distribution system comprised of a transmission distribution sub-station providing supply to the MV/LV nodes via a number of main feeders. Distribution system faults can be categorized into two main types: faults due to the breakdown of the insulating system of the lines, and faults due to the mechanical breakdown of a line. The first type is further divided into single-phase faults, phase-to-phase faults, and phase-to-phase-to-ground faults, whereas the second is mostly expected for three kinds of mechanical breakdown of the overhead lines: direct (conductor broken, source end on the ground, load end hanging in the air), inverse (conductor broken, load end on the ground, source end hanging in the air), and double (both ends of the line on the ground). The parameters most influencing the electrical features in fault conditions are: 1. the neutral grounding system; 2. the fault resistance; 3. the supplied load entity at the moment the fault occurred; 4. the fault location in the network. The electrical features considered useful here for the diagnosis are the zero, positive, and negative components of voltage and current, and the negative and zero components of real and reactive power, measured at the output sections of the MV/LV sub-stations (referred to as control points or data sites). The elementary 'event' which can take place in the defined system is one of the distribution system faults mentioned above occurring between two adjacent MV/LV sub-stations. Once the parameters influencing any of the considered faulty conditions had been defined, extensive simulation studies were carried out in order to produce a training set for the learning system. In the present application, and for the considered faulty events, the parameters are those listed above. The simulation software for the system's behavior is the one set up in (Augugliaro et al. 1996). In the following section, the distributed diagnostic model is described.


The distributed diagnostic model

Performing distributed data mining for fault diagnosis in automated distribution systems speeds up the implementation of the service restoration strategy. This also benefits customers and electrical utilities through better service quality and reduced costs. The proposed distributed model allows the classification and location of certain faults occurring in large-scale distribution networks. In large electrified centers and in large distribution systems, the opportunity to cluster operations and to take local decisions is certainly a viable and efficient option. Therefore, the proposed application aims at highlighting the opportunity offered by distributed data-mining algorithms and architectures in power distribution system fault diagnosis. In the present work, the simulation software outputs fictitious measurements of f electrical features at each measuring point, for e different possible fault scenarios. These constitute the local training databases of input-output tuples. The training set comprises 438 input-output tuples, whereas the test database consists of 101 tuples. In the present formulation, each fault class is identified by its type and location. Therefore, two different fault classes can correspond to the same fault type occurring at two different sections of the network. This formulation allows a good definition of the diagnostic problem in a physically distributed fashion. The proposed distributed architecture allows classification and decision making on the current situation by means of small local databases. Each possible scenario can be identified either using local information or by means of co-operation. The real-time collected sets of data are measured in a distributed fashion and are assessed by the local classifiers. If any of the data sites is able to identify the current situation, then the output signal of the local classifier can be promptly used for carrying out the required operations, after a positive signal has been received from the facilitator. If none of the data sites is able to identify the current situation, the facilitator gathers a sufficient amount of learning data from all the data sites, as described earlier in this paper, and performs learning. In this way, data flows from the terminal sites toward a central point if and only if the fault cannot be predicted locally. Moreover, each of the data sites is alerted of the current situation so that any temporary or permanent restorative reconfiguration strategy can be carried out immediately. When a fault occurs, in order to restore supply to affected customers in a short period of time, the new restorative configuration is required to be close to the pre-fault configuration. Therefore, the post-fault configuration can often be attained by means of a few locally assessed switching operations. In the following section, the system on which the diagnostic application is performed is described in detail. Then the results are reported and commented upon.

Figure 9: The test system, a radial distribution network in which an HV/MV sub-station supplies MV/LV nodes numbered 1 through 9.
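The local-versus-collective decision flow just described can be made explicit in a few lines; the site, facilitator, and model interfaces below are hypothetical, introduced only to illustrate the control flow:

    def handle_fault(sites, facilitator, measurements):
        # Each data site first tries to classify the fault from local features.
        for site, x in zip(sites, measurements):
            label, confident = site.classify(x)
            if confident and facilitator.confirm(site, label):
                return label                      # fault identified locally
        # No site succeeded: the facilitator gathers learning data and learns.
        model = facilitator.learn(facilitator.gather_samples(sites))
        label = model.predict(measurements)
        facilitator.broadcast(label, sites)       # alert every data site
        return label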

Application and test results

The test electrical system on which the present application has been carried out is the radial distribution network represented in Figure 9; its electrical and topological features can be found in (Di Silvestre 1998). The studied distribution system has the following characteristics: 1. a single HV/MV sub-station with a Y-Y transformer supplying the main feeders; 2. each MV main feeder is either an overhead line, a cable line, or mixed; 3. LV loads are supplied through MV/LV Δ-Y transformers installed along the MV main feeders; 4. the measurements are performed at each of the MV/LV derivations; 5. the load entity varies with daily, weekly, and monthly cadence. The system comprises different types of lines: 1. 2 lines of A type, overhead lines, total length 21 km, supplying 10 MV/LV sub-stations whose rated power is 250 kVA; 2. 5 lines of B type, cable lines, total length 8 km, supplying 22 MV/LV sub-stations whose rated power is 250 kVA; 3. 2 lines of C type, mixed cable-overhead lines, total length 21 km, supplying 22 MV/LV sub-stations whose rated power is 250 kVA.

For the sake of simplicity, our application considers the set of features for one of the C-type lines when a fault occurs in it. Since the simulation of measured features at each control point concerned the ten electrical features listed above, we have ten electrical features per control point for the identification of the diagnostic model. The number of events to be distinguished is thirty-six. We consider the following six different faulty events: 1. insulation breakdown (single-phase and double single-phase); 2. mechanical breakdown, direct; 3. mechanical breakdown, inverse; 4. three-phase; 5. phase-to-phase; 6. phase-to-phase-to-ground. The described fault diagnosis modeling problem has been dealt with using the proposed distributed architecture. For the test results, the system has been considered as having four data and elaboration sites out of seven measuring stations. The data are therefore collected and elaborated at four physically distributed sites. From the results we see that the system is able to predict 68% of the faults correctly. Different runs have been performed for the current system, and the issue of scalability has been considered. For the current system we have compared the performance of systems having different numbers of data sites, and therefore of agents. Figures 10 and 11 show the linear behavior of the GEMGA algorithm as the input data size grows at each of the sites. Three different cases for the same system have been considered. The first is related to the situation in which data collection can be performed at only two control sites; as a result, the different measuring systems are placed at sites near the agents. The second and third cases refer to systems having three and four data sites respectively. Figures 10 and 11 show the accuracies achieved and the number of function evaluations performed in these three cases. The following section concludes this paper.

Figure 10: Accuracy result. The x-axis represents the number of data sites used in the experiment; the y-axis shows the accuracy (%).

Figure 11: Total number of function evaluations. The x-axis represents the number of data sites used in the experiment.

Conclusions

Distributed data mining is an emerging field with many potential applications in different areas of science, business, and engineering. This paper addressed one important class of the application domain: vertically partitioned feature spaces. It proposed the so-called collective data mining approach, which exploits the fact that any function can be represented in a distributed fashion by decomposing it using a set of basis functions. Moreover, the proposed approach works independently of the data mining algorithms used at the individual data sites and facilitators. This paper also presented the BODHI system, which implements the collective data mining architecture developed in this paper. Although the proposed collective data mining architecture does not depend on the specific learning algorithm, this paper recognizes the importance of scalable data mining at the individual data sites. It developed a gene expression messy GA based learning algorithm that has been successfully applied to predict faults in a large electrical power distribution network. We hope that this work will take a small step toward filling the lacuna in the literature on mining vertically partitioned feature spaces, and make the field of distributed data mining stronger and capable of handling both vertically and horizontally partitioned datasets.

Acknowledgments

This work was supported by the School of Electrical Engineering and Computer Science, Washington State University. Part of this work was also supported by a grant from the American Cancer Society. The authors would like to thank Prof. Angelo Campoccia, Kakali Sarkar, Prof. Luigi Dusonchet, and Prof. Antonino Augugliaro for their collaboration and help, and for furnishing the electrical data.

References

Ackley, D. H. 1987. A connectionist machine for genetic hill climbing. Boston: Kluwer Academic.
Anglano, C.; Giordana, A.; Lo Bello, G.; and Saitta, L. 1997. A network genetic algorithm for concept learning. In Bäck, T., ed., Proceedings of the Seventh International Conference on Genetic Algorithms, 434-441. San Francisco: Morgan Kaufmann.
Anglano, C.; Giordana, A.; Lo Bello, G.; and Saitta, L. 1998. Coevolution, distributed search for inducing concept descriptions. In Nedellec, C., and Rouveirol, C., eds., Machine Learning: ECML-98, number 1398 in Lecture Notes in Computer Science: Lecture Notes in Artificial Intelligence, 322-333. New York: Springer-Verlag. 10th European Conference on Machine Learning, Chemnitz, Germany, April 1998.
Aronis, J. M.; Kolluri, V.; Provost, F. J.; and Buchanan, B. G. 1996. The WoRLD: Knowledge discovery from multiple distributed databases. Technical Report ISL-96-6, Intelligent Systems Laboratory, Department of Computer Science, University of Pittsburgh, Pittsburgh, PA.
Augugliaro, A.; Campoccia, A.; Di Silvestre, M. L.; and Dusonchet, L. 1996. Characterization of the inverse fault in MV distribution networks in order to improve protection efficiency. In Proceedings of the 7th International Symposium on Short-Circuit Currents in Power Systems, Warsaw, Poland, 1.15.1-1.15.8.
Bagley, J. D. 1967. The behavior of adaptive systems which employ genetic and correlation algorithms. Dissertation Abstracts International 28(12):5106B. (University Microfilms No. 68-7556).
Bandyopadhyay, S.; Kargupta, H.; and Wang, G. 1998. Revisiting the GEMGA: Scalable evolutionary optimization through linkage learning. In Proceedings of the IEEE International Conference on Evolutionary Computation, 603-608. IEEE Press.
Bethke, A. D. 1976. Comparison of genetic algorithms and gradient-based optimizers on parallel processors: Efficiency of use of processing capacity. Tech. Rep. No. 197, University of Michigan, Logic of Computers Group, Ann Arbor.
Booker, L.; Goldberg, D.; and Holland, J. 1989. Classifier systems and genetic algorithms. Artificial Intelligence 40:235-282.
Chan, P., and Stolfo, S. 1993a. Experiments on multistrategy learning by meta-learning. In Proceedings of the Second International Conference on Information and Knowledge Management, 314-323.
Chan, P., and Stolfo, S. 1993b. Meta-learning for multistrategy and parallel learning. In Proceedings of the Second International Workshop on Multistrategy Learning, 150-165.
Chan, P., and Stolfo, S. 1993c. Toward parallel and distributed learning by meta-learning. In Working Notes of the AAAI Workshop on Knowledge Discovery in Databases, 227-240. AAAI.

Chan, P. K., and Stolfo, S. J. 1995. A comparative evaluation of voting and meta-learning on partitioned data. In Proceedings of the Twelfth International Conference on Machine Learning, 90-98.
Chan, P., and Stolfo, S. 1996a. On the accuracy of meta-learning for scalable data mining. Journal of Intelligent Information Systems 8:5-28.
Chan, P. K., and Stolfo, S. J. 1996b. Sharing learned models among remote database partitions by local meta-learning. In Simoudis, E.; Han, J.; and Fayyad, U., eds., The Second International Conference on Knowledge Discovery and Data Mining, 2-7. AAAI Press.
Chan, P., and Stolfo, S. 1998. Toward scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining. AAAI Press.
Chang, C.; Chen, J.; Srinivasan, D.; Wen, F.; and Liew, A. 1997. Fuzzy logic approach in power system fault section identification. IEE Proceedings - Generation, Transmission and Distribution, 406-414.
Chavez, A.; Moukas, A.; and Maes, P. 1997. Challenger: A multi-agent system for distributed resource allocation. In Proceedings of the International Conference on Autonomous Agents.
Cho, V., and Wüthrich, B. 1998. Toward real time discovery from distributed information sources. In Wu, X.; Kotagiri, R.; and Korb, K. B., eds., Research and Development in Knowledge Discovery and Data Mining, number 1394 in Lecture Notes in Computer Science: Lecture Notes in Artificial Intelligence, 376-377. New York: Springer-Verlag. Second Pacific-Asia Conference, PAKDD-98, Melbourne, Australia, April 1998.
Cui, J.; Fogarty, T. C.; and Gammack, J. 1993. Searching databases using parallel genetic algorithms on a transputer computing surface. Future Generation Computer Systems (9):33-40.
Davies, W., and Edwards, P. 1995. Distributed learning: An agent-based approach to data-mining. In Proceedings of the Machine Learning-95 Workshop on Agents That Learn.
Davies, W., and Edwards, P. 1996. The communication of inductive inferences. In Distributed Artificial Intelligence Meets Machine Learning: Learning in Multi-Agent Environments, number 1221 in Lecture Notes in Computer Science: Lecture Notes in Artificial Intelligence. New York: Springer-Verlag.
De Jong, K. A. 1975. An analysis of the behavior of a class of genetic adaptive systems. Dissertation Abstracts International 36(10):5140B. (University Microfilms No. 76-9381).

Deb, K. 1991. Binary and floating-point function optimization using messy genetic algorithms. IlliGAL Report No. 91004 and doctoral dissertation, University of Alabama, Tuscaloosa; University of Illinois at Urbana-Champaign, Illinois Genetic Algorithms Laboratory, Urbana.
DeJong, K. A.; Spears, W. M.; and Gordon, D. 1993. Using genetic algorithms for concept learning. Machine Learning (13):161-188.
Di Silvestre, M. 1998. Modelli di analisi diagnostica delle reti elettriche di distribuzione finalizzati al miglioramento della qualità del servizio. Ph.D. Thesis, University of Palermo, Italy.
Frantz, D. R. 1972. Non-linearities in genetic adaptive search. Dissertation Abstracts International 33(11):5240B-5241B. (University Microfilms No. 73-11,116).
Goldberg, D. E., and Bridges, C. L. 1990. An analysis of a reordering operator on a GA-hard problem. Biological Cybernetics 62:397-405. (Also TCGA Report No. 88005).
Goldberg, D. E., and Lingle, R. 1985. Alleles, loci, and the traveling salesman problem. In Grefenstette, J. J., ed., Proceedings of an International Conference on Genetic Algorithms and Their Applications, 154-159.
Goldberg, D. E.; Korb, B.; and Deb, K. 1989. Messy genetic algorithms: Motivation, analysis, and first results. Complex Systems 3(5):493-530. (Also TCGA Report 89003).
Goldberg, D. E. 1989a. Genetic algorithms and Walsh functions: Part I, a gentle introduction. Complex Systems 3(2):129-152. (Also TCGA Report 88006).
Goldberg, D. E. 1989b. Genetic Algorithms in Search, Optimization, and Machine Learning. New York: Addison-Wesley.
Greene, D. P., and Smith, S. F. 1993. Competition-based induction of decision models from examples. Machine Learning (13):229-257.
Greene, D. P., and Smith, S. F. 1994. Using coverage as a model building constraint in learning classifier systems. Evolutionary Computation 2(1):67-91.
Harik, G. 1997. Learning Linkage to Efficiently Solve Problems of Bounded Difficulty Using Genetic Algorithms. Ph.D. Dissertation, Department of Computer Science, University of Michigan, Ann Arbor.
Heath, D.; Kasif, S.; and Salzberg, S. 1996. Committees of decision trees. In Cognitive Technology: In Search of a Humane Interface, 305-317.
Holland, J. H. 1975. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press.
Kaelbling, L.; Littman, M.; and Moore, A. 1996. Reinforcement learning: A survey. Journal of Artificial Intelligence Research 4:237-285.

Kargupta, H., and Bandyopadhyay, S. A perspective on the foundation and evolution of the linkage learning genetic algorithms. In communication.
Kargupta, H., and Bandyopadhyay, S. 1998. Further experimentations on the scalability of the GEMGA.
Kargupta, H., and Goldberg, D. E. 1996. SEARCH, blackbox optimization, and sample complexity. In Belew, R., and Vose, M., eds., Foundations of Genetic Algorithms, 291-324. San Mateo, CA: Morgan Kaufmann.
Kargupta, H.; Hamzaoglu, I.; Stafford, B.; Hanagandi, V.; and Buescher, K.
Kargupta, H.; Riva Sanseverino, E.; Johnson, E.; and Agrawal, S. 1998. The genetic algorithms, linkage learning, and scalable data mining.
Kargupta, H.; Hamzaoglu, I.; and Stafford, B. 1997. Scalable, distributed data mining using an agent based architecture. In Heckerman, D.; Mannila, H.; Pregibon, D.; and Uthurusamy, R., eds., Proceedings of Knowledge Discovery and Data Mining, 211-214. Menlo Park, CA: AAAI Press.
Kargupta, H. 1995. SEARCH, Polynomial Complexity, and the Fast Messy Genetic Algorithm. Ph.D. Dissertation, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA. Also available as IlliGAL Report 95008.
Kargupta, H. 1996a. Computational processes of evolution: The SEARCH perspective. Presented at the SIAM Annual Meeting, 1996, as the winner of the 1996 SIAM Annual Best Student Paper Prize.
Kargupta, H. 1996b. The gene expression messy genetic algorithm. In Proceedings of the IEEE International Conference on Evolutionary Computation, 814-819. IEEE Press.
Kargupta, H. 1998. Gene expression and large scale evolutionary optimization. In Computational Aerosciences in the 21st Century. Kluwer Academic Publishers.
Kushilevitz, E., and Mansour, Y. 1991. Learning decision trees using the Fourier spectrum. In Proc. 23rd Annual ACM Symp. on Theory of Computing, 455-464.
Lam, W., and Segre, A. M. 1997. Distributed data mining of probabilistic knowledge. In Proceedings of the 17th International Conference on Distributed Computing Systems, 178-185. Washington: IEEE Computer Society Press.
Lesser, V., et al. 1998. A next generation information gathering agent.
Levenick, J. R. 1991. Inserting introns improves genetic algorithm success rate: Taking a cue from biology. In Belew, R. K., and Booker, L. B., eds., Proceedings of the Fourth International Conference on Genetic Algorithms, 123-127. San Mateo, CA: Morgan Kaufmann.
Mammen, D., and Lesser, V. 1998. Problem structure and subproblem sharing in multi-agent systems. In Proceedings of the Third International Conference on Multi-Agent Systems, 174-181. IEEE Computer Society.

Menczer, F., and Belew, R. 1998. Adaptive information agents for distributed textual environments. In Sycara, K. P., and Wooldridge, M., eds., Proceedings of the Second International Conference on Autonomous Agents, 157-164. New York: ACM.
Mitchell, T. M. 1980. The need for biases in learning generalizations. Rutgers Computer Science Tech. Rept. CBM-TR-117, Rutgers University.
Mitchell, M. 1996. An Introduction to Genetic Algorithms. USA: MIT Press, 1st edition.
Momoh, J.; Dias, L.; and Laird, D. 1997. An implementation of a hybrid intelligent tool for distribution system fault diagnosis. IEEE Trans. on Power Delivery 12(2):1035-1040.
Moukas, A. 1996. Amalthaea: Information discovery and filtering using a multiagent evolving ecosystem. Autonomous Agents Group, MIT Media Laboratory.
Neri, F., and Giordana, A. 1995. A parallel genetic algorithm for concept learning. In Eshelman, L. J., ed., Proceedings of the Sixth International Conference on Genetic Algorithms, 436-443. San Francisco: Morgan Kaufmann.
Nowak, C. 1998. Multiple databases, partial reasoning, and knowledge discovery. In Wu, X.; Kotagiri, R.; and Korb, K. B., eds., Research and Development in Knowledge Discovery and Data Mining, number 1394 in Lecture Notes in Computer Science: Lecture Notes in Artificial Intelligence, 403-404. New York: Springer-Verlag. Second Pacific-Asia Conference, PAKDD-98, Melbourne, Australia, April 1998.
Paredis, J. 1995. The symbolic evolution of solutions and their representations. In Eshelman, L., ed., Proceedings of the Sixth International Conference on Genetic Algorithms, 359-365. San Mateo, CA: Morgan Kaufmann.
Potter, M. A., and De Jong, K. A. 1994. A cooperative coevolutionary approach to function optimization. In Davidor, Y.; Schwefel, H.-P.; and Männer, R., eds., Parallel Problem Solving from Nature - PPSN III, 249-257. Berlin: Springer-Verlag.
Provost, F., and Venkateswarlu, K. 1998. A survey of methods for scaling up inductive learning algorithms. In communication.
Punch, W. F.; Goodman, E. D.; Pei, M.; Lai, C.-S.; Hovland, P.; and Enbody, R. 1993. Further research on feature selection and classification using genetic algorithms. In ICGA93, 557-564.
Rosenberg, R. S. 1967. Simulation of genetic populations with biochemical properties. Dissertation Abstracts International 28(7):2732B. (University Microfilms No. 67-17,836).

Schaffer, J. D., and Morishima, A. 1987. An adaptive crossover distribution mechanism for genetic algorithms. In Grefenstette, J. J., ed., Proceedings of the Second International Conference on Genetic Algorithms, 36-40.
Sen, S. 1997. Developing an automated distributed meeting scheduler. IEEE Expert 12(4):41-45.
Seredynski, F. 1994. Loosely coupled distributed genetic algorithms. In Davidor, Y.; Schwefel, H.-P.; and Männer, R., eds., Parallel Problem Solving from Nature - PPSN III, number 866 in Lecture Notes in Computer Science, 514-523. New York: Springer-Verlag. International Conference on Evolutionary Computation, The Third Conference on Parallel Problem Solving from Nature, Jerusalem, Israel, 1994.
Smith, J., and Fogarty, T. 1996. Recombination strategy adaptation via evolution of gene linkage. In Proceedings of the IEEE International Conference on Evolutionary Computation, 826-831. IEEE Press.
Smith, S. F. 1980. A learning system based on genetic adaptive algorithms. Dissertation Abstracts International 41:4582B. (University Microfilms No. 81-12638).
Smith, S. F. 1983. Flexible learning of problem solving heuristics through adaptive search. Proceedings of the 8th International Joint Conference on Artificial Intelligence, 422-425.
Smith, S. F. 1984. Adaptive learning systems. In Forsyth, R., ed., Expert Systems: Principles and Case Studies, 169-189. New York: Chapman and Hall.
Stolfo, S.; Prodromidis, A.; Tselepis, S.; Lee, W.; Fan, D.; and Chan, P. 1997. JAM: Java agents for meta-learning over distributed databases. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, 74-81. AAAI Press.
Teo, C. 1995. Machine learning and knowledge building for fault diagnosis in distribution network. Electric Power and Energy Systems 17(2):119-122.
Thierens, D., and Goldberg, D. 1993. Mixing in genetic algorithms. In Forrest, S., ed., Proceedings of the Fifth International Conference on Genetic Algorithms, 38-45. San Mateo, CA: Morgan Kaufmann.
Venkateswaran, R.; Obradovic, Z.; and Raghavendra, C. 1996. Cooperative genetic algorithms for optimization problems in distributed computer systems. In Second Online Workshop on Evolutionary Computation.
Watanabe, S. 1969. Knowing and Guessing - A Formal and Quantitative Study. New York: John Wiley & Sons, Inc.
Wen, F., and Chang, C. 1997. Probabilistic approach for fault-section estimation in power systems based on a refined genetic algorithm. IEE Proceedings - Generation, Transmission and Distribution 144(2):160-168.
Whitley, D.; Beveridge, R.; Guerra, C.; and Graves, C. 1997. Messy genetic algorithms for subset feature selection. In Punch, B., ed., Proceedings of the Seventh International Conference on Genetic Algorithms. San Mateo, CA: Morgan Kaufmann.

Yamanishi, K. 1997. Distributed cooperative Bayesian learning strategies. In Proceedings of COLT 97, 250-262. New York: ACM.
Zhu, J.; Lubkeman, D.; and Girgis, A. 1997. Automated fault location and diagnosis on electric power distribution feeders. IEEE Trans. on Power Delivery 12(2):801-809.
