COLLECTIVE APPROACH FOR BAYESIAN ... - Semantic Scholar

6 downloads 0 Views 536KB Size Report
tg. Ground temperature. K. 12 ustar. Surface stress velocity m/s. 13 vintuq vertically averaged uwnd*sphu. (m/s)(g/kg). 14 vintvq vertically averaged vwnd*sphu.
COLLECTIVE APPROACH FOR BAYESIAN NETWORK LEARNING FROM DISTRIBUTED HETEROGENEOUS DATABASE

By RONG CHEN

A dissertation submitted in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

WASHINGTON STATE UNIVERSITY School of Electrical Engineering and Computer Science December 2003

To the Faculty of Washington State University: The members of the Committee appointed to examine the dissertation of RONG CHEN find it satisfactory and recommend that it be accepted.

Chair

ii

ACKNOWLEDGMENT

Above all, I would like to thank Dr. Krishnamoorthy Sivakumar, who for the last four and half years has been my advisor. Dr. Sivakumar guided me in the research work, gave me support when I need it the most, also taught me a lot how to become a successful and independent researcher. His discussion with me helped me to form the key component of this dissertation. Special thanks for Dr. Hillol Kargupta, who introduced me to Bayesian network and distributed data mining and supported my research work. I would also like to thank Dr. Benjamin Joseph Belzer and Dr. Thomas Fischer; their lectures helped me to build a solid research background. They also gave me the IRL, which is important to me. I would also like to thank my parents and my wife for the endless love and support they provided me. This work was partially supported by NASA, under Cooperative agreement NCC 2-1252.

iii

COLLECTIVE APPROACH FOR BAYESIAN NETWORK LEARNING FROM DISTRIBUTED HETEROGENEOUS DATABASE Abstract by Rong Chen , Ph.D. Washington State University December 2003

Chair: Krishnamoorthy Sivakumar In this dissertation we concentrate on learning Bayesian Networks (BN) from distributed heterogeneous databases. We need to develop distributed techniques that save communication overhead, offer better scalability, and require minimal communication of possibly secure data. The objective of this work is to learn a collective BN from data that is distributed among geographically diverse sites. The data distribution is heterogeneous. The collective BN must be close to a BN learned by a centralized method and must require only a small amount of data transmission among different sites. In general, the collective learning algorithms have four steps: local learning, sample selection, cross learning, and combination. The key points in the proposed methods are: (1)use the BN decomposability property; (2)identify the samples that are most likely to be evidence of cross terms. We show that low-likelihood samples in each site are most likely to be the evidence of cross terms. One collective structure learning and two collective parameter learning methods iv

are proposed. For structure learning, the collective method can find the correct structure of local variables by choosing a base structure learning algorithm with the decomposability property. Some extra links may be introduced due to the hidden variable problem. Sample selection chooses low-likelihood samples in local sites and transmits them to a central site. In cross learning, the structure of cross variables and cross set are identified. In combination, we add all cross links and remove extra local links. For parameter learning, Collective Method 1 (CM1) and Collective Method 2 (CM2) can learn a BN which is close to Bcntr using a small portion of samples. Local learning learns parameters for local variables. Cross learning learns the parameters of cross variables. The combination step aggregates the parameters of local variables and cross variables. In order to handle applications with real-time constraints, we have developed CM2. Using a notion of cross set, CM2 chooses a subset of features in a local site to do the likelihood computation and data selection. This can greatly reduce the local computation and the data transmission overhead. Experimental results demonstrate the efficiency and accuracy of these methods.

v

Contents

1 Introduction

1

2 Background

9

2.1

Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9

2.2

Distributed Data Mining . . . . . . . . . . . . . . . . . . . . . . . . .

15

2.3

Collective Learning Framework . . . . . . . . . . . . . . . . . . . . .

22

2.4

Bayesian Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

24

2.4.1

Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

24

2.4.2

Parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

29

2.4.3

Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

30

2.4.4

Collective Bayesian Network Learning . . . . . . . . . . . . . .

32

3 Distributed Bayesian Network Structure Learning 3.1

Structure Learning in Centralized Case . . . . . . . . . . . . . . . . .

vi

34 35

3.2

3.3

3.1.1

Dirichlet Distribution and Multinomial Sampling . . . . . . .

36

3.1.2

Scoring Based Method, K2 Algorithm . . . . . . . . . . . . . .

38

3.1.3

Dependence Analysis Method . . . . . . . . . . . . . . . . . .

42

3.1.4

Equivalent BN Structure . . . . . . . . . . . . . . . . . . . . .

44

Structure Learning in Distributed Case . . . . . . . . . . . . . . . . .

46

3.2.1

Cross Node, Cross Link, and Cross Set . . . . . . . . . . . . .

46

3.2.2

Overview of Collective Structure Learning Algorithm . . . . .

47

3.2.3

Local Learning . . . . . . . . . . . . . . . . . . . . . . . . . .

49

3.2.4

Sample Selection . . . . . . . . . . . . . . . . . . . . . . . . .

53

3.2.5

Cross Learning . . . . . . . . . . . . . . . . . . . . . . . . . .

56

3.2.6

Combination . . . . . . . . . . . . . . . . . . . . . . . . . . .

58

Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

59

4 Distributed Parameter Learning

61

4.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

61

4.2

Parameter Learning in Centralized Case . . . . . . . . . . . . . . . .

63

4.3

Collective Bayesian Network Algorithm: CM1 . . . . . . . . . . . . .

66

4.3.1

Collective Method 1 . . . . . . . . . . . . . . . . . . . . . . .

66

4.3.2

Performance Analysis . . . . . . . . . . . . . . . . . . . . . . .

68

Collective Bayesian Network Algorithm: CM2 . . . . . . . . . . . . .

74

4.4

vii

4.5

4.4.1

Data selection in CM2 . . . . . . . . . . . . . . . . . . . . . .

76

4.4.2

Comparison between CM1 and CM2 . . . . . . . . . . . . . .

79

Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

80

5 Experimental Results

82

5.1

Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . . . . .

83

5.2

DistrBN system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

84

5.3

Parameter Learning: CM1 . . . . . . . . . . . . . . . . . . . . . . . .

85

5.3.1

ASIA Model . . . . . . . . . . . . . . . . . . . . . . . . . . . .

86

5.3.2

ALARM Network . . . . . . . . . . . . . . . . . . . . . . . . .

89

Parameter Learning: CM2 . . . . . . . . . . . . . . . . . . . . . . . .

96

5.4.1

ASIA model . . . . . . . . . . . . . . . . . . . . . . . . . . . .

96

5.4.2

ALARM Network . . . . . . . . . . . . . . . . . . . . . . . . .

98

Structure Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . .

99

5.4

5.5

5.6

5.5.1

ASIA Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

5.5.2

ALARM Network . . . . . . . . . . . . . . . . . . . . . . . . . 103

Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

6 Applications 6.1

108

Application: NASA DAO and NOAA AVHRR Pathfinder Datasets . 109 6.1.1

Description of the Datasets . . . . . . . . . . . . . . . . . . . 109 viii

6.2

6.1.2

Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

6.1.3

Distributed BN Learning . . . . . . . . . . . . . . . . . . . . . 115

Distributed Web Log Mining . . . . . . . . . . . . . . . . . . . . . . . 116 6.2.1

Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

6.2.2

Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . 119

6.2.3

Bayesian Network Learning . . . . . . . . . . . . . . . . . . . 120

7 Conclusion and Future Work

123

A Notation

129

ix

List of Figures 1.1

Knowledge Discovery and Data Mining . . . . . . . . . . . . . . . . .

4

2.1

Dataset in Relation Model Form . . . . . . . . . . . . . . . . . . . . .

18

2.2

Bayesian Network: ASIA Model . . . . . . . . . . . . . . . . . . . . .

27

2.3

D-separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

28

3.1

Distributed ASIA Model: {A, T, E, X} in site A and {S, L, B, D} in site B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

3.2

47

Extra Links in Local Learning: (a)Hidden Path Extra Link (b)Hidden Parent Extra Link. (Node Ordering {X, Y, Z})

. . . . . . . . . . . .

52

4.1

Local Dataset in Collective Learning . . . . . . . . . . . . . . . . . .

79

4.2

Distributed Parameter Learning Algorithm CM2 Framework . . . . .

79

5.1

Distributed ASIA Model: {A, T, E, X, D} in Site A and {S, L, B} in Site B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

x

86

5.2

Performance of Collective BN in CM1 . . . . . . . . . . . . . . . . . .

89

5.3

KL Distance between Conditional Probabilities for ASIA Model in CM1 90

5.4

ALARM Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

92

5.5

KL Distance between Conditional Probabilities for ALARM Network

94

5.6

KL Distance Between Conditional Probabilities for ALARM Network for Different Splitting Cases . . . . . . . . . . . . . . . . . . . . . . .

95

5.7

ASIA Model: CKL Distance for Cross Variables in CM2 . . . . . . .

97

5.8

CKL of CM1, CM2, and Random method . . . . . . . . . . . . . . . 100

5.9

Variance of CM2 and Random method . . . . . . . . . . . . . . . . . 100

5.10 Distributed ASIA Model Structure Learning . . . . . . . . . . . . . . 103 5.11 ALARM Network Structure Learning Experiment: 3 Sites . . . . . . 105 6.1

Earth Science Data model . . . . . . . . . . . . . . . . . . . . . . . . 109

6.2

Histogram of f8 and f20

6.3

Bcntr of March Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 117

6.4

NASA DAO/NOAA Structure Learning

6.5

Schematic Illustrating Preprocessing and Mining of Web Log Data . . 121

6.6

Bayesian Network Structure learnt from Web Log Data . . . . . . . . 122

6.7

KL Distance between Joint Probabilities in Distributed Web Log Min-

. . . . . . . . . . . . . . . . . . . . . . . . . 115

. . . . . . . . . . . . . . . . 118

ing Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

xi

List of Tables 2.1

Homogeneous case: Site A with a table for credit card transaction records. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2.2

19

Homogeneous case: Site B with a table for credit card transaction records. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

19

2.3

Heterogeneous case: Site X with weather data. . . . . . . . . . . . . .

20

2.4

Heterogeneous case: Site Y with holiday toy sales data. . . . . . . . .

20

2.5

Merged dataset using City-State mapping. . . . . . . . . . . . . . . .

20

2.6

The conditional probability of node E for the ASIA model . . . . . .

29

3.1

The r value of ASIA model links

. . . . . . . . . . . . . . . . . . . .

53

3.2

Functionality of local learning, cross learning, and combining . . . . .

59

5.1

Distributed learning system: DistrBN . . . . . . . . . . . . . . . . . .

85

5.2

(Left) The conditional probability of node E and (Right) All conditional probabilities for the ASIA model . . . . . . . . . . . . . . . . . xii

87

5.3

The conditional probabilities of local site A and local site B . . . . .

88

5.4

Splitting cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

95

5.5

Comparison of local computation time (in second) for CM1 and CM2 in ASIA model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

5.6

97

Comparison of local computation time (in second) for CM1 and CM2 in ALARM network . . . . . . . . . . . . . . . . . . . . . . . . . . . .

99

5.7

Cross learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

5.8

Scalability experiment results . . . . . . . . . . . . . . . . . . . . . . 106

6.1

NASA DAO features . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

6.2

NOAA features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

xiii

Chapter 1 Introduction During the last couple of decades, our ability to collect and store data is becoming increasingly overwhelming. Advancement in storage, technology, and reduction in cost are going to maintain this trend. The bottleneck of turning data into success is the difficulty of extracting usable knowledge from the data. People try to seek models and patterns in data. Analyzing and extracting knowledge from such overwhelming amounts of data by human being is almost impossible. So a new field called Knowledge Discovery in Databases (KDD) has emerged to handle the problem. KDD denotes the complex process of identifying valid and useful patterns in data. These patterns can then be used for decision making, model building, or understanding the overall process. Some examples in scientific and commercial applications are as follows.

1

1. Web Log Mining. The World Wide Web (WWW) is growing at an astounding rate. In order to optimize tasks such as Web site design and Web server design, and to simplify navigation through a Web site, the analysis of how the Web is used is very important. Usage information can be used to optimize web site design. For example, if we find that 80% of users who buy a computer model A in a web shop also visit links to a specific peripheral device B, or software package C, we can set up appropriate dynamic links for such users. Another example is for web server design. The different resources like html, jpeg, midi etc. are typically distributed among a set of servers. If we find a significant fraction of users requesting resources from server A also request some resource from server B, we can either keep a copy of those resources in server A or redistribute the resources among the servers in a different fashion to minimize the communication between servers. Web server log contains records of user interactions when request for the resources in the servers is received. This contains a wealth of data for the analysis of web usage and identifying different patterns. 2. NASA Earth Observing System (EOS). NASA EOS generates more than 100 gigabytes of image data per hour, which are stored, managed, and distributed by the EOS data and information system (EOSDIS). A pair of Terra

2

spacecraft and Landsat 7 alone produces huge amounts of EOSDIS data per day. NASA Data Assimilation Office provides many research quality data sets. They are comprehensive and dynamically consistent datasets which represent the best estimate of the state of the atmosphere at a time. Having collected these datasets, we try to find the models and patterns in data to model the earth system and climate change. These models could be a climate model of a specific region such as Pacific Northwest, the feature relations between El Nino phenomenon which is used to predict and understand the phenomenon, and identification of the key feature that affects many other features and is central to the climate model. These models and patterns are invaluable in understanding global changes in weather patterns. 3. Bioinformatics. Bioinformatics applies computational techniques to biology and genomics related data. This kind of data is being collected in abundance in recent years. For example, the GenBank repository of nucleic acid sequences contained 8,214,000 entries on August 2000 and is doubling about every 15 months. When we have the data, the more important issue is to understand the data. Genomes dataset has 40 complete genomes and each has 1.6 millon to 3 billion entries. One of the objectives is to find the repeated structure assignments, extract feature patterns, and relate these feature patterns to dis-

3

Raw Data

Preprocessing

Transformation Preprocessed Data Data Mining Transformed Data

Knowledge

Figure 1.1: Knowledge Discovery and Data Mining eases. This knowledge could be used in understanding the cause, diagnosis, and treatment of diseases. 4. Financial Data. In recent years, an increasing number of financial data is available online. Using some simple script file, we can download the stock market data from sites like www.nasdaq.com or finance.yahoo.com. These data streams change frequently. By analyzing the data stream, we can find some patterns such as interrelationship between the prices of different stocks, periodic or other trends in stick prices, or the impact of news events on the stock market. These models can then be used to predict changes on stock prices and could be crucial to decision making. Figure 1.1 depicts the KDD process. Some researchers view data mining as the actual process of extracting knowledge hidden within the data and knowledge discovery to including pre and post-processing of data in addition to data mining. Now 4

it is more common to refer to data mining as the complete KDD process. In this dissertation, we use data mining to include the complete KDD process. An example of the KDD process is our web log mining application. The raw web log file was obtained from the web server of the School of EECS at Washington State University — http://www.eecs.wsu.edu. There are four steps in our processing. First we preprocess the raw web log file to transform it to a session form which is useful to our application. This involves cleaning the data - removing some image file and music file log, identifying a sequence of logs as a single session, based on the IP address (or cookies if available) and time of access. Each session corresponds to the logs from a single user in a single web session. We consider each session as a data sample. Then we categorize the resource requested from the server into different categories. For our example, based on the different resources on the EECS web server, we considered eight categories: E-EE Faculty, C-CS Faculty, L-Lab and facilities, T-Contact Information, A-Admission Information, U-Course Information, H-EECS Home, and R-Research. These categories are our features. This categorization has to be done carefully, and would have to be automated for a large web server. Each feature value in a session is set to one or zero, depending on whether the user requested resources corresponding to that category. An 8-feature, binary dataset was thus obtained, which was used to learn a BN. After that, we get the transformed data which can be used to find the patterns. Finally, we use the dataset to learn a probabilistic graph model to represent 5

the user patterns. One of the main challenges in data mining is the development of algorithms that can handle large and physically distributed data sets. For example, the data managed by EOSDIS are available from eight EOSDIS Distributed Active Archive Centers (DAACs). Each data center holds and provides data pertaining to a particular Earth science discipline, and they collectively provide a physically distributed but logically integrated database to support interdisciplinary research into global climate change. Mining web server log data and financial data are common examples of distributed data in the non-scientific domain. Web server log contains records of user interactions when request for the resources in the servers is received. This contains a wealth of data for the analysis of web usage and identifying different patterns. The advent of large distributed environments in commercial domains (e.g. the Internet and corporate intranets) introduces a new dimension to this process — a large number of distributed sources of data that can be used for discovering knowledge. Cost of data communication between the distributed databases is a significant factor in an increasingly mobile and connected world with a large number of distributed data sources. This cost consists of several components like (a) Limited network bandwidth, (b) data security, and (c) existing organizational structure of the applications environment. The field of Distributed Knowledge Discovery and Data Mining studies algorithms, systems, and human-computer interaction issues for knowledge discovery applications 6

in distributed environments. In this dissertation, we address Bayesian network learning from distributed heterogenous database. A Bayesian Network (BN) is a probabilistic model based on a directed acyclic graph. It is considered to be a promising model to represent uncertain reasoning in Artificial Intelligence (AI), expert systems, and medical diagnosis. Bayesian networks offer very useful information about the independencies and conditional independencies among the features in the application domain. Such information can be used for gaining better understanding about the dynamics of the process under observation. Financial data analysis, manufacturing process monitoring, sensor data analysis, and web mining are a few examples where mining Bayesian networks has been quite useful. Since Bayesian network is a powerful and promising model for many problem domains, we want to use it in distributed scenario. Constructing a Bayesian network only by experts needs lot of time and effort. Sometimes it may impossible because we do not understand the problem very well. So learning (or constructing) a Bayesian network from the observed data is crucial for the real world application of Bayesian network. There has been an enormous amount of work in the area of BN learning. We concentrate on the BN learning in distributed heterogenous case. The details of the motivation for this is given in chapter 2. In this dissertation, we propose two distributed parameter learning algorithms and one distributed structure learning algorithm in the framework of collective data mining. These algorithms can handle 7

the distributed learning problem. The efficiency and accuracy of them are verified by the experiment results and real world applications. The organization of the dissertation is as follows. Chapter 2 provides an overview of data mining, distributed data mining, background on BN, and introduces the collective learning framework. It also discusses related literatures. In chapter 3 the distributed structure learning algorithm for BN is presented. It can obtain the correct BN structure under some constraints. The work in this chapter is reported in [CSK03b, CS03]. Chapter 4 concentrates on the distributed parameter learning algorithm. Two collective methods that can handle distributed BN parameter learning are proposed in this chapter. Theoretical analysis is given. These algorithms are published in [CSK01b, CSK01a, CSK02, CS02]. Chapter 5 gives the experimental results to support proposed algorithms. These experiments are designed to test the accuracy, efficiency, and scalability etc. of the proposed algorithms. It also addresses the real world implementation problem and introduces the DistrBN system. Chapter 6 gives two real world applications. Chapter 7 summarizes the work we have done and provides directions for future research.

8

Chapter 2 Background The layout of this chapter is as follows. In section 2.1, we give a short review of some data mining algorithms. Section 2.2 addresses a special data mining task: learning models and patterns from distributed datasets. It also classifies the distributed learning into homogeneous and heterogeneous cases. A distributed learning framework is given in section 2.3. Finally, in section 2.4, we give an introduction to Bayesian networks, structure and parameter, and related literatures.

2.1

Data Mining

In [HMS01], data mining is defined as “the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways

9

that are both understandable and useful to the data owner.” This definition points out some characteristics of the data mining process. First, the data used in data mining process is already collected and we cannot change the collection method. On the other hand, statistical analysis methods can control the data collection process such as using different experimental design procedures. This makes the data mining process more data-driven instead of being model-driven. We should carefully select the models based on different data sets. Secondly, data mining tries to find novel patterns and models. That is, to find the knowledge that will help us understand the scientific, engineering, or business process and make decisions. We also should emphasize that data mining process is an automatic learning process. In this process, the machine can learn the clearly defined models based on the learning procedure given by a human being. Human being works as an expert and the main computation task should be handled by machine. Data mining discipline has its roots in many areas such as statistics, artificial intelligence, machine learning, and database. Data mining tasks include learning the predictive and descriptive models [Dun03]. In a predictive model, we have a set of variables of interest (dependent variables) and we try to represent/explain these variables by other variables (independent variables). Then we can the predict the dependent variable values for new data based on the model learned from historical data. Predictive model learning is supervised learning. Classification, regression, and time series prediction are examples of predictive model 10

learning. In descriptive model learning, we try to find the underlying models and patterns in the dataset which describe the overall or local relationship among domain problem variables. It is unsupervised learning; clustering, Bayesian network learning, association rule learning are typical descriptive model learning tasks. Some widely used data mining techniques are regression, decision tree, artificial neural networks, clustering, association rule learning, and Bayesian networks. Regression analysis is a very useful model to explain the relationship between a dependent variable and other independent variables [Kac82, BDB95]. A p-order multiple linear regression model can be represented as

Y = β0 + β1 X1 + . . . + βp Xp + 

(2.1)

where Y is the dependent variable and {Xi } are independent variables. Variable  is the random error which is the amount of variation in Y that cannot be accounted for by the linear combination of {Xi }. One key assumption is that the errors  in different observations are independent and each has normal distribution N (0, σ 2 ). The estimate of coefficient {βˆi } is determined by solving the equation X  X βˆ = X  Y . Regression analysis is widely used in curve fitting and prediction. For example, in time series analysis, we can use linear regression to remove the trend of the series. Logistic regression [Kac82] is designed to handle problems in which the dependent 11

variable is discrete. The linear probability model is

Y = a + BX + 

(2.2)

In this model, Y = 1 if event happens and Y = 0 otherwise. If we use linear regression, then the predicted value of Y will be non-binary. Also  cannot be normal since Y is binary. Moreover,  is heteroskedastic. To handle these problems, logistic regression model is developed: ln

p = a + BX +  1−p

(2.3)

where p = P (Y = 1). The coefficient a and matrix B can be estimated using maximum likelihood estimation method. Logistic regression is a popular tool in economics and finances. It can also be used in classification. Decision tree [Mit97, HMS01, Web] is a procedure for classification and regression using a tree-based structure. There are three different decision tree algorithms: CART, C4.5, and CHAID. These algorithms share similar ideas. The dataset has the form (x1 , x2 , . . . , xi , y). In each step, the decision tree algorithm chooses the best variable to split the data into two or more groups. A measure function is introduced to measure how well the given variable separates the samples according to the target classification. One commonly used measure function is information gain. CART

12

has exactly two branches from each nominal node. C4.5 has the number of branches as the category number. CHAID uses statistical significance test to determine the branch size. Pruning technique is used in decision trees and cross-validation is used to estimate the performance of the tree model on data. Feedforward Multilayer Perceptrons (MLPs) [Mit97, HMS01] is the most widely used artificial neural network. MLPs can provide a nonlinear mapping from the independent variable {Xi } to a dependent variable Y . So it can be used for nonlinear classification and regression. It can also handle problems on which logistic regression fails. MLP has a network structure. Each node is associated with a transformation function and each link has an associated weight. The weights can be determined by the back-propagation algorithm, which is a conjugate gradient technique to minimize the mean-squared error on the training dataset. However, it is hard to determined the structure of a MLP such as how many layers and how many hidden nodes should be included in the model. In real world applications, we often use trial-and-error and prior experience to determine the network structure. Clustering [Ber02] as a unsupervised learning technique has been developed by researchers from different area such as AI, image processing, and pattern recognition. K-means method is a type of partition-based clustering method. For each cluster (or category), we can define its centroid. The clustering process starts from an initial set of clusters and associated centroids and computes the distance between each data 13

point and the centroids to decide the label for that data point. After that, the clustering algorithm iteratively updates the centroid and data point labels. Several issues in clustering are how to evaluate the result, decide the number of clusters, and choice of the distance function. Association rule [AIS93, AS94] is probabilistic description about the co-occurrence of certain events in a dataset. An association rule may have the form: If X = x and Y = y, then Z = z with probability p. Note that this rule is a probabilistic rule. Conditional probability p(Z = z | X = x, Y = y) is the accuracy of the rule and p(X = x, Y = y, Z = z) is called the support of the rule. Association rule is widely used to find the local patterns of a dataset. For example, in a supermarket, the sales manager might want to check the sales data and find patterns like “many customers who buy coffee will also purchase sugar.” Then he may want to arrange these items together (or use other marketing strategies) to increase the cross-sale. The key part of association rule is how to search the patterns in a huge dataset. Principle Component Analysis (PCA) can be used for data reduction and visualization [KHK+ 00]. It defines a new basis for the dataset based on the eigenvalues and eigenvectors. The dimension of dataset can be effectively reduced by choosing the eigenvectors which account for most of the variation in the dataset. This could also help us do clustering, regression analysis, and decision tree analysis. However, interpreting the transformed data in a physical sense may not be easy. 14

There are a number of data mining software, systems, tools, and products. For example, there exists about one hundred software packages for Bayesian network learning or inference. SAS not only provides many built-in statistical algorithms to help users implement their own algorithm, it also has a product called Enterprise Miner. SAS Enterprise Miner divides the data mining process into several steps and users can choose different data input, transformations, and mining algorithms to build a system. Enterprise Miner gives algorithms including decision trees, neural networks, regression, memory based reasoning, bagging and boosting ensembles, twostage models, clustering, time series, and associations. Other products include SPSS Clementine, Microsoft SQL Server, IBM Intelligent Miner etc.

2.2

Distributed Data Mining

Distributed Data Mining (DDM) embraces the growing trend of merging computation with communication. The primary motivation for DDM is the mining of very large databases, that are geographically distributed. Some DDM application are as follows, 1. NASA EOSDIS. The data of EOSDIS are distributed in eight EOSDIS Distributed Active Archive Centers (DAACs). Each data center holds and provides data pertaining to a particular Earth science discipline. The features in each site are different. We not only analyze the local data in each site to get a pat15

tern for a particular Earth science area, but also need to collectively analyze all datasets in order to carry out the interdisciplinary research to find global climate change. 2. Web Log Mining. In an increasingly mobile world, the log data for an individual user would be distributed among many servers. Consider a subscriber of a wireless network. This person travels frequently and uses her palm-top computer and cell phone to do business and personal transactions. Her transactions go through different servers depending upon her location during the transaction. Now let us say her wireless service provider wants to offer more personalized service to her by paying careful attention to her needs and tastes. This may be useful for choosing the instant messages appropriate for her taste and needs. For example, if she is visiting the Baltimore area the company may choose to send her instant messages regarding the area Sushi and Italian restaurants that she usually prefers. Since too many of such instant messages are likely to be considered a nuisance, accurate personalization is very important. This is indeed quite well appreciated by the business community and use of Bayesian techniques for personalizing web sites has already been reported elsewhere [PB97, PMB96, BP96, BP97].

16

3. Financial data. Several financial organizations want to cooperate to prevent fraudulent intrusion. They need to use some way to share their information. However they cannot merge their database since it is sensitive. One naive way of doing distributed data mining is to aggregate all local datasets into one site to build a global dataset and then apply centralized data mining techniques to it. This is referred to as the centralized method. However in some cases the centralized method may infeasible. First, it requires a lot of data transmission. This may take a long time in a network with limited bandwidth and may also add to the cost in commercial applications. Second, the global dataset could be a very large and a centralized data mining technique may not scale well on this huge dataset. Large memory requirements and other computing resource may also make it infeasible. Third, some applications have security constraints, which make it impossible to aggregate all datasets in one site. Due to above reasons, DDM techniques need to be developed to save communication overhead, offer better scalability, with minimal communication of possibly secure data. DDM must deal with different possibilities of data distribution. In this dissertation, we assume the data is in relation model form. The dataset is assumed to be in a tabular form. Rows correspond to records or observations and columns correspond to features. Figure 2.1 provides an example of a dataset. The data sites may be

17

homogeneous. That is, there exists consistent database schemas across the sites. Tables 2.1 and 2.2 illustrate this case using an example from a hypothetical credit card transaction domain. There are two data sites A and B, connected by a network. The DDM-objective in such a domain may be to find patterns of fraudulent transactions. Note that both the tables have the same schema. The underlying distribution of the data may or may not be identical across different data sites. name

sex

age

grade

#1

Joe

20

M

B

#2

Mary

18

F

A

#3

Kevin

21

M

A

Figure 2.1: Dataset in Relation Model Form

In a more general case the data sites may be heterogeneous. In other words, sites may contain tables with different schema. Different features are observed at different sites. We assume that there exists a “key” that can link the observations across sites. We also assume that there exists a mechanism that can coordinates all distributed local datasets. The mechanism requires that there exists a one-to-one mapping among primary keys in all distributed sites. Let us illustrate this case with relational data. Consider a distributed heterogenous dataset which is distributed among two sites X and Y. Site X has a table containing weather-related data (see Table 2.3), whereas site Y contains holiday toy sales data (see Table 2.4). Each row is a sample and each

18

Table 2.1: Homogeneous case: Site A with a table for credit card transaction records. Account Amount Location Previous Unusual Number record transaction 11992346 -42.84 Seattle Poor Yes 12993339 2613.33 Seattle Good No 45633341 432.42 Portland Okay No 55564999 128.32 Spokane Okay Yes Table 2.2: Homogeneous case: Site B with a table for credit card transaction records. Account Amount Location Previous Unusual Number record transaction 87992364 446.32 Berkeley Good No 67845921 978.24 Orinda Good Yes 85621341 719.42 Walnut Okay No 95345998 -256.40 Francisco Bad Yes column is a feature (or attribute). Feature City is a primary key in site X and site Y has a primary key State. Using a City–State mapping, we can merge two records (Seattle, 63, 88, 4) from site X and (WA, Snarc Action Figure, 47.99, 23K) from site Y into one record (Seattle, 63, 88, 4, WA, Snarc Action Figure, 47.99, 23K). Table 2.5 is the merged dataset using City–State mapping. In the general heterogeneous case the tables may be related through different sets of key indices. We consider the heterogenous data scenario in this thesis. We would like to mention that heterogenous databases, in general, could be more complicated than the above scenario. For example, there maybe a set of overlapping features that are observed at more than one site. Moreover, the existence of a key that can be used to link together observations across sites is crucial to our approach. For 19

Table 2.3: Heterogeneous case: Site X with weather data. City Temp. Humidity Wind Chill Boise 20 24% 10 Seattle 63 88% 4 Portland 51 86% 4 Vancouver 47 52% 6 Table 2.4: Heterogeneous case: Site State Best Selling Item WA Snarc Action Figure ID Power Toads BC Light Saber OR Super Squirter City Boise Seattle Portland Vancouver

Y with holiday toy sales data. Price # Item Sold 47.99 23K 23.50 2K 19.99 5K 24.99 142K

Table 2.5: Merged dataset using City-State mapping. Temp. Humidity Wind State Best Selling Price # Item Item Sold Chill 20 24% 10 ID Power Toads 23.50 2K 63 88% 4 WA Snarc Action 47.99 23K 51 47

86% 52%

4 6

OR BC

Figure Super Squirter Light Saber

24.99 19.99

142K 5K

example, in a web log mining application, the key that can be used to link together observations across sites could be produced using either a “cookie” or the user IP address (in combination with other log data like time of access). However, these assumptions are not overly restrictive, and are required for a reasonable solution to the distributed Bayesian learning problem. The volume of DDM literature is growing fast. There exists a reasonably large body of work on DDM architectures and techniques for the homogeneous and hetero-

20

geneous cases. In the following, we review only the existing literature for heterogeneous DDM. Mining from heterogeneous data constitutes an important class of DDM problems. This issue is discussed in [PB95] from the perspective of inductive bias. The WoRLD system [AKPB97] addressed the problem of concept learning from heterogeneous sites by developing an “activation spreading” approach that is based on first order statistical estimation of the underlying distribution. A novel approach to learn association rules from heterogeneous tables is proposed in [CS99]. This approach exploits the foreign key relationships for the case of a star schema to develop decentralized algorithms that execute concurrently on the separate tables, and subsequently merge the results. An order statistics-based technique for combining high-variance models generated from heterogeneous sites is proposed in [TG00a]. Kargupta and his colleagues [KPHJ00] also considered the heterogenous case and proposed the Collective framework to address data analysis for heterogeneous environments. They proposed the Collective Data Mining (CDM) framework for predictive data modelling that makes use of orthonormal basis functions for correct local analysis. They proposed a technique for distributed decision tree construction [KPHJ00] and wavelet-based multi-variate regression [HK01]. The distributed decision tree learning algorithm for data stream is in [KP02]. Several distributed clustering techniques based on the Collective framework are proposed elsewhere [JK99, KHSJ00a]. 21

They also proposed the collective PCA technique [KHSJ00a, KHK+ 00] and its extension to a distributed clustering application [KHSJ00b]. A privacy-preserving distributed data mining technique using multiplicative random projection-based noise is proposed in [LKR03]. A distributed data stream monitor system called VEDAS is reported in [KBL+ 04]. Additional work on distributed decision tree learning [BS97], clustering [MSG00, PO00, SS00], genetic learning [KPJ+ 01] DDM design optimization [TG00b], classifier pruning [PS00], DDM architecture [KN96], and problem decomposition and local model selection in DDM [LPO00], are also reported.

2.3

Collective Learning Framework

Kargupta et. al. [KPHJ00] proposed the Collective Data Mining (CDM) framework to address data analysis for heterogeneous environments. The problem is supervised inductive learning. The goal is to learn a function fˆ : X n → y from the data set D = {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )} which is generated from an underlying function f . In distributed heterogeneous scenario, different local sites contain different features. However, y is available to all sites. The foundation of CDM is based on the fact that any function can be represented in a distributed fashion using an appropriate basis.

22

It can be represented as follows,

f (x) =



ωk ψk (x)

k

where ψk (x) is the k-th basis function and ωk is the corresponding coefficient. The ˆ =  ωˆk ψk (x) is objective of CDM is to get a ωˆk , an estimation of ωk , such that f (x) k close to f . The CDM notes that if we choose an orthonormal basis, the orthonormal spectrum can be accurately learnt in distributed fashion. The orthonormality is critical to the correct and independent local analysis. In general, the main steps of CDM are (1)Choose an appropriate orthonormal representation of the data model. (2)Construct the orthonormal representation from the data in distributed fashion. (3)Build the data model from orthonormal representation. A practical CDM algorithm often involves two steps: learn the local models and generate cross terms involving features from different sites. Distributed decision tree construction [KPHJ00] and wavelet-based multi-variate regression [HK01] were proposed. The orthonormal basis of distributed decision tree is Fourier basis. How to build a Fourier spectrum of a decision tree and reconstruct a decision tree from the Fourier coefficients is provided. The key point is the spectrum of the decision tree can be approximated by computing only a small number of low-order Fourier coefficients. It computes the low-order Fourier coefficients from the local sites, collects them in a single site, then use them 23

to generate a collective decision tree.

2.4

Bayesian Network

Probabilistic graphical model is combination of probability model and graph theory. Its probability theory part gives it ability to represent the uncertainty of problem domain. Graph theory makes the representation clear and gives a data structure to help the probability computation. Bayesian network, Markov random fields, factor graph, and decision diagram are widely used probabilistic graphical models. A Bayesian Network (BN) is a compact and efficient form to represent the joint distribution of a large number of variables. Let X denote a random variable in the problem domain and x is the state of X. V a(X) denotes the set of possible values for X or the support set of X. X denotes a set of variables and x denotes the state (or configuration) of X. X can be a discrete or continuous random variable. In this dissertation, we assume X to be discrete. Then a BN can be defined as a pair (G, θ). G is the structure and θ is the parameter of a BN.

2.4.1

Structure

G = (V, E) represents the structure of a BN. Here V is a finite and nonempty set whose elements represent variables (or nodes, features) in the problem domain. E is 24

the set of ordered pairs of distinct value in V. A element in E is called an edge (or link) which denotes probabilistic relationships among the variables. If there exists an edge between node X and Y , we say X and Y are adjacent. A sequence of nodes [X1 , X2 , . . . , Xn ] is called a path if (Xi−1 , Xi ) ∈ E(2 ≤ i ≤ n). If we omit the direction of the path, then it is a chain. A directed cycle is a path from a node to itself. A directed acyclic graph (DAG) is a graph containing no directed cycles. A BN structure G must be a DAG. Variable X is referred to as a parent of Y and Y is a child of X if there exists an edge X → Y . The set of parents of X is denoted by pa(X). If there exists a path from X to Y , then X is called an ancestor of Y and Y is called a descendent of X. The set of descendents of X is denoted by de(X) and the set of non-descendents is denoted by nd(X) = {V \ (X ∪ de(X))}. Definition 1 (Markov Condition) For a joint distribution P of all variables in V and a DAG G, we say (G, P ) satisfies the Markov condition if for each variable X, X is conditionally independent of its non-descendent given the values of its parent variables. We represent this symbolically as Ip (X, nd(X) | pa(X)). Another form of conditional independence is P (X | nd(X)) = P (X | pa(X)). Markov condition is crucial to the application of BN. Using this condition, we can represent a complicated joint distribution in a compact form. And it can also simplify the 25

computations in a BN model. For example, using this condition, we can write the joint distribution of the set of all variables in V as a product of conditional probabilities as follows: Theorem 1 (Factorization) If (G, P ) is a Bayesian network, then the joint probability distribution P is equal to the product of conditional distributions of all nodes given values of their parents.

P (V) =



P (X | pa(X)).

(2.4)

X∈V

The conditional independence between variables is either obtained from a priori expert knowledge or discerned from data, or a combination of both [Jen96]. Figure 2.2 is a BN called the ASIA model (adapted from [LS88]). The variables are Dyspnoea, Tuberculosis, Lung cancer, Bronchitis, Asia, X-ray, Either, and Smoking. They are all binary variables. The joint distribution of all variables is

P (A, S, T, L, B, E, X, D) = P (A)P (S)P (T | A)P (L | S)P (B | S) (2.5) P (E | T, L)P (X | E)P (D | B, E). One might wonder whether there exists other types of conditional independence besides that given by the Markov condition. For example, in ASIA model, we have I(X, {A, S} | E) from Markov condition. Is the statement I(X, {A, S} | {T, L}} also 26

A

S

T

L

E

B

X

D

Figure 2.2: Bayesian Network: ASIA Model true? To answer this question, we should use the notion of d-separation. D-separation [Pea88] plays a very important role in BN structure. Before we introduce this notion, we first review more graph theory. Given an edge X → Y , X is the head and Y is the tail. For an ordered tuple (X, Z, Y ), we say the following: (1)X → Z → Y is a head-to-tail meeting, and Z is a head-to-tail node. (2)X ← Z → Y is a tail-to-tail meeting, and Z is a tail-to-tail node. (3)X → Z ← Y is a head-to-head meeting, and Z is a head-to-head node. Sometimes X → Z ← Y is referred to as a v-structure. Definition 2 (D-separation) For a DAG, X, Y ∈ V, and C ⊂ V{X, Y }, X and Y are d-separated given C in G if and only if there exists no such a adjacency path p between X and Y that 1. There exist node Z ∈ C on p, and the edge incident to Z on p meet head-to-tail

27

at Z. 2. There exist node Z ∈ C on p, and the edge incident to Z on p meet tail-to-tail at Z. 3. There exists node Z, such that Z and all of Z’s descendants are not in C, on p, and the edge incident to Z on p meet head-to-head at Z. If X and Y are d-separated given C in G, we say that X and Y are d-connected by C. Figure 2.3 shows these three cases. It is shown in [Pea88] that the concept of dseparation encodes all the conditional independence, no other criterion can do better. Using the notion of d-separation, we can see that the statement I(X, {A, S} | {T, L}} is true in the ASIA model. C (1)

X

Z

Y

(2)

X

Z

Y

(3)

X

Z

Y

Figure 2.3: D-separation

Node ordering is a kind of domain knowledge that specifies a causal order of the

28

No. 1 2 3 4 5 6 7 8

T F T F T F T F T

L F F T T F F T T

E F F F F T T T T

Probability 0.9 0.1 0.1 0.01 0.1 0.9 0.9 0.99

Table 2.6: The conditional probability of node E for the ASIA model

nodes so that any parent node cannot appear earlier than its descendent in the order. Order A, S, T, L, B, E, X, D is a valid node order for the ASIA model.

2.4.2

Parameter

Conditional probability θijk = P (Xi = k | pa(Xi ) = j) is the probability of variable Xi to be in state k when pa(Xi ) is in state j. If a variable Xi has no parents (root node), then θijk corresponds to the marginal probability of node Xi . We will denote by θij the distribution of a variable Xi with a fixed parent state and by θi the Conditional Probability Table (CPT) of node Xi . We will denote by θ the set of all θijk , called parameters of a BN. For ASIA model, p(A), p(S), p(T | A), p(L | S), p(B | S), p(E | T, L), p(X | E), p(D | E, B) need to be specified. These conditional distributions can be specified in any form: implicitly by some parametric probability distributions, or explicitly as tables. The CPT of node E in ASIA model is given in table 2.4.2.

29

From conditional independence, we can see the compactness of BN. The number of values in joint distribution is exponential in terms of the number of variables. However, each value in the joint distribution can represented by the conditional distribution θ using equation (2.4). If each node does not have too many parents, there are not many values in θi . For example, consider a BN with n binary variables and each node has at most k parent, then we need at most 2k n values in θ. However, there are 2n − 1 values in the joint distribution. Typically, k 0. Random variables F1 , . . . , Fr are said to have a

Dirichlet distribution if their joint distribution has the Dirichlet density function form. Note that fr = 1 −

r−1 k=1

fk .

We now have the following Lemma.

36

Lemma 1 If a discrete random variable X has r possible values and F1 , . . . , Fr are random variables such that for all k

p(X = k | Fk = fk ) = fk ,

then p(X = k) = E(Fk ), where E(.) denotes expectation. We know that for a Dirichlet distribution, E(Fk ) = αk /α. We then have the following theorem: Theorem 2 If F1 , . . . , Fr have a Dirichlet distribution with parameter α1 , . . . , αr , then p(X = k) =

αk . α

(3.2)

A multinomial distribution is defined next. Definition 3 (Multinomial Distribution) Suppose random variable X is discrete and takes r possible values x1 , . . . , xr . If we take N independent samples of X and let Zk denote the random variable denoting the number of times X is in state k, then Z1 , . . . , Zr is a multinomial distribution with parameters N , p1 , p2 , . . . , pr ; i.e., p(Z) ∼ M u(N, p1 , . . . , pr ) if

kr p(Z1 , . . . , Zr ) = cpk1 1 . . . pr ,

37

(3.3)

where

r i=1

pi = 1,

r i=1

ki = N , and c = N !/k1 !k2 ! . . . kr !.

For a BN with multinomial local distribution, each local distribution is a collection of multinomial distributions, one for each configuration of pa(Xi ). Namely, θij = P (xi | pa(xi ) = j, S) is a multinomial distribution. A dataset D is called a multinomial sample of size M with parameter F = (f1 , f2 , . . . , fk ) if: Definition 4 (Multinomial Sample) Suppose D = {X 1 , X 2 , . . . , X M } is a set of random variables, each taking the same r possible values and F = {F1 , F2 , . . . , Fr } is a set of random variables such that for all h and all k,

p(X h = k | F1 = f1 , . . . , Fr = fr ) = fk .

Furthermore, suppose X h (1 ≤ h ≤ M ) are mutually independent, conditional on F . Then we call D a multinomial sample of size M with parameter F .

3.1.2

Scoring Based Method, K2 Algorithm

Score based methods define a score metric that describes the fitness of each possible structure to the observed data. Then the structure learning becomes an optimization problem: find the structure Sopt that maximizes or minimizes the score metric. There are many score metrics corresponding to different structure learning algorithms.

38

An important characteristic of some score metrics is decomposability. That is, the score function can be decomposed as follows:

Score(S, D) =



Score({Xi , pa(Xi )}, D(Xi , pa(Xi ))),

(3.4)

i

where D(Xi , pa(Xi )) is the data involving only Xi and pa(Xi ). This characteristic is similar to the factorization of a joint distribution 2.4. So this kind of score metric retains the Markov property of a BN. Decomposability plays an important role in our collective structure learning algorithm. Suppose given a structure S, the prior distribution of θij is Dirichlet:

p(θij | S) ∼ Dir(αij1 , αij2 , . . . , αijri ).

The data generating process is assumed to be multinomial:

p(D | θij , S) ∼ M u(Mij , θij1 , θij2 , . . . , θijri ).

Then the posterior distribution of θ is also Dirichlet

p(θij | D, S) ∼ Dir(αij1 + Nij1 , αij2 + Nij2 , . . . , αijri + Nijri ),

39

where Nijk is the number of samples in which xi = k and pa(xi ) = j. It is a sufficient statistic for θijk or for the state (xi = k, pa(xi ) = j). Predictive distribution denotes the distribution of the next case cm+1 to be observed after having observed a sample of m previous cases. We have

p(cm+1 | D, S) =

qi n   αijk + Nijk i=1 j=1

αij + Nij

,

where D = {c1 , c2 , . . . cm } and Nijk are computed from D and Nij =

(3.5)

ri k=1

Nijk is a

sufficient statistic for θij . Since

p(D | S) =

N 

p(cl | c1 , c2 , . . . , cl−1 , S),

(3.6)

l=1

using equations (3.5) and (3.6), we have

p(D | S) =

qi n  

ri  Γ(αij ) Γ(αijk + Nijk ) Γ(αij + Nij ) k=1 Γ(αij ) i=1 j=1

(3.7)

and qi n  

ri  Γ(αij ) Γ(αijk + Nijk ) . p(S, D) = p(S)p(D | S) = p(S) Γ(αij + Nij ) k=1 Γ(αij ) i=1 j=1

(3.8)

The score metric in equation (3.8) is called the K2 score. It was first proposed in

40

[CH92]. This scoring function is decomposable. The K2 score for a node is qi 

ri  Γ(αij ) Γ(αijk + Nijk ) K2(Xi , pa(Xi )) = Γ(αij + Nij ) k=1 Γ(αij ) j=1

(3.9)

Having defined a score metric, the next step is to identify a network structure with highest score. In general, this search problem is NP-hard. So we should use heuristic search methods, including greedy search and greedy search with restarts. Most widely used search methods for BN structure learning use the decomposability property. These search methods make a successive arc change. Possible arc changes are: for two nodes X and Y , add an arc (XY ) ⇒ (X → Y ), delete an arc (X → Y ) ⇒ (XY ), reverse the direction (X → Y ) ⇒ (X ← Y ). After each arc change, we must check whether the resulting graph S is a valid DAG. For each arc change, we have a score Scoreb for the DAG Sb before the change and Scorea for DAG Sa after the change. Acceptance of the change depends on the difference between the two scores. If a score has decomposability property, we can do the search node by node. For each node i, only Score(Xi , P a(Xi )a , D(Xi , pa(Xi )a )) needs to be evaluated and not the whole score metric. This can simplify the computations. K2 is the most widely used structure learning algorithm. The K2 algorithm is described in algorithm 1. The input to K2 algorithm is a dataset D, a node ordering, and output is the structure S. 41

Algorithm 1 K2 Algorithm 1: for i = 0 to n do 2: pa(Xi ) = φ 3: P old = K2(Xi , pa(Xi )) 4: OKToProceed := true 5: while OKToProceed == true do  6: z := the node in P red(X i ) \ pa(Xi ) that maximizes K2(Xi , pa(Xi ) z) 7: P new = K2(Xi , pa(Xi ) z) 8: if P new > P old then 9: P old := P new  10: pa(Xi ) := pa(Xi ) z 11: else 12: OKToProceed := false; 13: end if 14: end while 15: end for Here P red(Xi ) is the node set containing nodes that appear before Xi in the node order. K2 algorithm searches the structure node by node. That is, it finds the local structure of a node (parent set) that maximizes K2(Xi , pa(Xi )) and combines all these local structures to get the BN structure S. All structure learning algorithms do not search the optimal structure in the same way. The local searching is a kind of problem decomposition. We will use K2 algorithm as the centralized structure learning algorithm in our collective structure learning method.

3.1.3

Dependence Analysis Method

Conditional independence relationship is critical in dependence analysis structure learning algorithms. As we introduced in chapter 2, all valid conditional indepen42

dence relationships can be derived from the topology of a BN using the concept of d-separation. In this kind of learning algorithm, first a metric that measures the conditional independence is introduced, then we use d-separation to infer the structure of a BN. Here we review the algorithm proposed in [CBL97]. It uses mutual information and conditional mutual information to test the conditional independence. The mutual information of node Xi and Xj is defined as

I(Xi , Xj ) =



p(xi , xj )log

xi ,xj

p(xi , xj ) p(xi )p(xj )

(3.10)

p(xi , xj | c) p(xi | c)p(xj | c)

(3.11)

and the conditional independence is defined as

I(Xi , Xj | C) =



p(xi , xj , c)log

xi ,xj ,c

here C is a set of variables. Clearly, if I(Xi , Xj ) is a small value, then Xi and Xj are independent. If I(Xi , Xj | C) is a small value, then Xi and Xj are conditional independent given C. The input to algorithm A is a node ordering and a complete dataset, the output is the BN structure. Its learning procedure is as follows [CBL97]: 1. Phase I (Drafting). (1) Initiate the structure as a empty graph. Initiate an 43

empty list L. (2) For each pair of nodes, compute the mutual information. If the mutual information is greater than a threshold, put the node pair into L. Sort L and set pointer p to the head of L. (3) Get the first two pairs from L and add it to structure S. Remove these two pairs. (4) Get the pair from L using p. If there is no open path between the two nodes, add the arc to S and remove it from L. (5) Move p to next pair and go to step 4 unless p is in end of L. 2. Phase II (Thickening). (6) Move p to the head of L. (7) Find the cut-set that can d-separate the nodes in this pair. Compute the conditional mutual information. If these two nodes are conditionally independent given the cut-set, go to next step. If not, connect these two nodes. (8) Move p to next pair and go to step 7. 3. Phase III (Thinning) (9) For each arc in S, if there are other paths besides this arc, remove this arc temporarily and test whether these two nodes are conditionally independent in this graph. If yes, remove the arc permanently. If not, add the arc back to S.

3.1.4

Equivalent BN Structure

BN structure learning is confounded by the equivalent structure problem. This notion is given in [VP90, Hec98, Chi02]. There exists different types of structure equivalences. The first one is called Markov equivalence. 44

Definition 5 (Markov Equivalence) Two BN structures for a set of variables V are Markov equivalent if they represent the same set of conditional-independence assertions for V. In [VP90] it has been shown that two structures are independence equivalent if and only if they have the same structure ignoring arc direction and the same v-structure. A v-structure is an ordered tuple (X, Y, Z) such that there is an arc from X to Y and from Z to Y , but no arc between X and Z. Let S h be the hypothesis associated with structure S. For example, consider a BN h problem domain with two binary variables X and Y . Then Sxy which is associated

with structure S = {XY } is that D has two binomial distributions; one for X and the other for Y . Score equivalence is defined next. Definition 6 (Score Equivalence) Given two network structures S1 and S2 , S1 and S2 are score equivalent if Score(D, S1h ) = Score(D, S2h ). Score equivalence means we cannot identify S1 from S2 using a score metric. This kind of equivalence is important for structure learning. In general, we prefer a structure learning algorithm that gives a single best score structure if we use the scoring based method. Due to score equivalence, many of these evaluation criteria assign the same score to equivalent structures. So the best score structure is a set of structures. Sometimes this causes troubles in evaluating the learning algorithm and BN inference. 45

3.2

Structure Learning in Distributed Case

We first introduce some notions used in the proposed distributed learning algorithm.

3.2.1

Cross Node, Cross Link, and Cross Set

At each local site, we classify variables into local variables and cross variables. The links (or edges) in the whole BN are also classified as local links and cross links. This would help the identification of samples to be transmitted to the single site. There are two types of variables in a distributed BN. Definition 7 (Cross Variable) If a variable X and some of its parents are in different sites, then X is called a cross variable. Otherwise X is called a local variable. Next we define a cross link. Definition 8 (Cross Link) If an edge X → Y whose parent variable X and child variable Y are in different sites, this edge is referred to as a cross link. Otherwise it is a local link. An important notion in our collective learning is cross set, which is defined next. Definition 9 (Cross Set) Cross set of a distributed Bayesian Network is the node  set CS={cross node} {local node which is the parent of cross node}

46

For example, in the distributed ASIA model in figure 3.1, the local variables in site A are A, T, X and cross variable is E. Local variables in site B are S, L, B and cross variable is D. Cross links are L → E and E → D. Other links are local links. {T, L, B, E, D} is the cross set. A

S

T

L

B

E

X

D

Figure 3.1: Distributed ASIA Model: {A, T, E, X} in site A and {S, L, B, D} in site B

3.2.2

Overview of Collective Structure Learning Algorithm

The collective structure learning algorithm is described in algorithm 2. From the description of the collective learning procedure, our algorithm includes four steps: 1: local learning, 2-5: data selection, 6-7: cross learning, 8-9: combination. Local learning and cross learning steps are both similar to a centralized BN learning problem. In this dissertation, the structure learning algorithm used for local learning and cross learning step is called the base structure learning algorithm. We use the K2 algorithm as the base structure learning algorithm. Local learning tries to find the correct structure of local variables. Data selection chooses a subset of samples to 47

Algorithm 2 Collective Structure Learning Algorithm 1: Learn local BN Blocal involving the variables observed at each local site. 2: Compute likelihood of samples based on local BNs. 3: Transmit the index set of low likelihood samples from each local site to the central site. 4: Compute the intersection of these index sets at the central site. 5: Transmit samples corresponding to the intersection set from all local sites. At the central site, a limited number of observations Dcoll of all the variables are now available. 6: Learn a new BN Bcross using Dcoll in central site. 7: Identify cross links, cross nodes, and cross sets based on Bcross . 8: Get the collective BN structure by combining the structure of local BNs and cross links. 9: Remove extra links introduced during local learning. learn the structure of cross variables. Cross learning can find the structure of cross variables and identify cross variable and cross set. Combination step will merge the local BNs with cross BN and remove extra local links. We will now elaborate on the various steps described above. In a real world application, real BN Breal is unobservable. Therefore, we compare the BN Bcoll learnt from the collective method with Bcntr to evaluate the performance of our algorithm, where Bcntr is the BN learnt when all the data are available at a central site. So the correct local structure of a node means the parent set of this node in Bcoll is same as that in Bcntr .

48

3.2.3

Local Learning

Local learning step applies the base structure learning algorithm to local datasets to get the the local BNs involving the variables observed at local sites. We can aggregate local local all local BN links to build structure Scoll . What is the difference between Scoll and local local Sreal ? Here Sreal is the real BN structure without cross links. To answer this

question, we analyze different type of nodes. For local nodes, we have the following theorem. Theorem 3 (Local Node Theorem) For a local node X, a structure learning algorithm with decomposability property can find the correct structure SX for this variable. Proof: Let X be a local node. For Bcntr , the search space S(X) of structures of X is the set of all subsets of P red(X) and the dataset is D. Let pabest be the parent set that maximizes score(X, pa(X), D). Suppose variable X is in site A. The search space S local (X) of local structure of X in local site A is the set of all subsets of {P red(X) \ {Y : Y ∈ / N odeSet(A)}}. The dataset is Dlocal (A). For a structure learning algorithm with the decomposability property, the score of any candidate parent set in S local (X) with respect to Dlocal (A) is the same as that in S(X) with respect to D. This is true for any base learning algorithms with decomposability property. For a local node, all parents are in the same site, so pabest is also in Slocal (X). 49

Since pabest can maximize/minimize score(X, pa(X), D) and score(X, pa(X), D) = scorelocal (X, pa(X), Dlocal ), the optimization result of local structure learning is also pabest . That is, local learning step can find the correct structure for X. This completes the proof. For a cross node Y , the situation is much more complicated than that for a local node. Some parent nodes of Y are not in the same site as Y . The parent set for Y can then be split into two sets: palocal (Y ) is the set of parent nodes in the same site and paother (Y ) is the set of parent nodes in a different site. Since we cannot observe any nodes in paother (Y ), the links from a node in paother (Y ) to Y (cross links) will be missing during local learning. The nodes in paother also introduce the extra link problem in local learning. local local Extra (local) links are links which appear in Scoll but not in Sreal . There are two

type of extra links. Type I is called “hidden path” extra link. An example is given in figure 3.2 (a). In this dissertation, when we depict a distributed BN model, the nodes with same shape are in same site and cross links are denoted by dotted edges. For the distributed BN model in figure 3.2 (a), the observations corresponding to variables X and Z are available at site A and those corresponding to variables Y are available at site B. In this case, when we learn a local BN at site A, we would expect an extra link X → Z, because of the path X → Y → Z in the overall BN and the fact that node Y is unobserved at site A. Type II extra links are caused by hidden 50

parents. As figure 3.2 (b) shows, Y and Z are conditionally independent given X. X is unobservable during local learning. So the local learning algorithm will introduce an extra link Y → Z since there exists some dependence between Y and Z. This extra link comes from the fact that Y and Z both are children of a hidden node X. For the extra link, we have the following corollary. Corollary 1 (Cross Set) The extra links introduced by a structure learning algorithm with decomposability property in local learning step are inside the cross set. The parents of extra local links are in the cross set and the children of extra links are cross nodes. Proof: From local node theorem, local learning can get the correct structure of a local node. So a local node cannot be a child of an extra local link. That means an extra link must end in a cross node. In local learning step, from the Markov property of Bayesian network, given the parents in cross set, a cross node will be conditionally independent of its ancestor in the same site. So the parents of extra local links cannot be outside the cross set. This completes the proof. We now know that the local learning step would introduce some extra local links due to the hidden variable problem. These extra links are inside the cross set. Cross set corollary tells us the trace of the extra links. Using this corollary, we can identify the links with hidden path or hidden parent patterns inside the cross set as possible 51

extra local links. (a)

(b)

X

Y

X

Extra link

Y Extra link

Z

Z

Figure 3.2: Extra Links in Local Learning: (a)Hidden Path Extra Link (b)Hidden Parent Extra Link. (Node Ordering {X, Y, Z})

Another question about local learning is whether the links from nodes in palocal (Y ) to a cross node Y can be correctly detected during local learning. To answer this question, we introduce the notion of strong and weak links. Definition 10 (Strong Link) A link in a Bayesian network is called a strong link if in structure learning process, adding this link will cause a relatively big improvement in the score metric. Otherwise, it is a weak link. In K2 algorithm, we can compute improvement rate r = |(scorenew −scoreold )/scoreold | when a link is added. We then set a threshold β. If r ≥ β, label this link is as a strong link. Otherwise, label it as a weak link. Table 3.1 gives the r values of some links in ASIA model. If we set β = 0.1, then S → B will be a weak link and other two links are strong links. Strong links are the main pattern of a Bayesian network and consistently appear in the learning process. Weak links may be affected by data sampling, processing, or the searching process. In the structure learning problem, we 52

have a dataset D which is faithful to the real BN Breal . If a link is a weak link, the BN resulting from removing that weak link is still close to Breal . For a cross variable Y , a local structure learning algorithm with decomposability property can find all the strong links from nodes in palocal (Y ) to Y . Table 3.1: The r value Link scoreold A → T -2532.1408 S → L -4162.4503 S → B -2504.1134

3.2.4

of ASIA model links scorenew r -2058.2958 0.1871 -1921.8099 0.5383 -2437.4035 0.0266

Sample Selection

The important objective is to correctly identify the coupling between variables that belong to two (or more) sites. In the following, we describe our approach to select observations at the local sites that are most likely to be evidence of strong coupling between variables at two different sites. At each local site, a local BN can be learnt using only samples in this site. This would give a BN structure involving only the local variables at each site and the associated conditional probabilities. Let pA (.) and pB (.) denote the estimated probability function involving the local variables. This is the product of the conditional probabilities as indicated by (2.4). Since pA (x), pB (x) denote the probability or likelihood of obtaining observation x at sites A and B, we would call these probability 53

functions the likelihood functions lA (.) and lB (.), for the local model obtained at sites A and B, respectively. The observations at each site are ranked based on how well it fits the local model, using the local likelihood functions. The observations at site A with large likelihood under lA (.) are evidence of “local relationships” between site A variables, whereas those with low likelihoods under lA (.) are possible evidence of “cross relationships” between variables across sites. Let IndexLow(A) denote the set of keys associated with the latter observations (those with low likelihood under lA (.)). In practice, this step can be implemented in different ways. For example, we can set a threshold ρA and if lA (x) ≤ ρA , then Indexx ∈ IndexLowA . The sites A and B transmit the set of keys IndexLowA , IndexLowB , respectively, to a central site, where the intersection IndexLow = IndexLowA ∩ IndexLowB is computed. The observations corresponding to the set of keys in IndexLow are then obtained from each of the local sites by the central site. The following argument justifies our selection strategy. Assumed conditional independence in the BN of Figure 2.2 and site A has variables {A, T, E, X, D} and {S, L, B} is in site B, using the rules of probability, it is easy to show that:

P (V) = P (A, B) = P (A | B)P (B) = P (A | nb(A))P (B),

(3.12)

where nb(A) = {B, L} is the set of variables in B, which have a link connecting it to 54

a variable in A. In particular,

P (A | nb(A)) = P (A)P (T | A)P (X | E)P (E | T, L)P (D | E, B).

(3.13)

Note that, the first three terms in the right-hand side of (3.13) involve variables local to site A, whereas the last two terms are the so-called cross terms, involving variables from sites A and B. Similarly, it can be shown that

P (V) = P (A, B) = P (B | A)P (A) = P (B | nb(B))P (A),

(3.14)

where nb(B) = {E, D} and

P (B | nb(B)) = P (S)P (B | S)P (L | S)P (E | T, L)P (D | E, B).

(3.15)

Therefore, an observation {A = a, T = t, E = e, X = x, D = d, S = s, L = l, B = b} with low likelihood at both sites A and B; i.e. for which both P (A) and P (B) are small, is an indication that both P (A | nb(A)) and P (B | nb(B)) are large for that observation (since observations with small P (V) are less likely to occur). Notice from (3.13) and (3.15) that the terms common to both P (A | nb(A)) and P (B | nb(B)) are precisely the conditional probabilities that involve variables from both sites A and B. In other words, this is an observation that indicates a coupling of variables between 55

sites A and B and should hence be transmitted to a central site to identify the specific coupling links and the associated conditional probabilities. In a sense, our approach to learning the cross terms in the BN involves a selective sampling of the given dataset that is most relevant to the identification of coupling between the sites. This is a type of importance sampling, where we select the observations that have high conditional probabilities corresponding to the terms involving variables from both sites. Naturally, when the values of the different variables (features) from the different sites, corresponding to these selected observations are pooled together at the central site, we can learn the coupling links as well as estimate the associated conditional distributions. These selected observations will, by design, not be useful to identify the links in the BN that are local to the individual sites.

3.2.5

Cross Learning

In this section, we concentrate on finding the structure of a cross node. In section 3.2.3, it is clearly shown that local learning can find the correct local structure for local variables. But local learning tells nothing about the the edges from paother (Y ) to Y for a cross variable. Cross learning is used to deal with this problem. In section 3.2.4, we explained the sample selection method used in the collective structure learning algorithm as a selective sampling of the given dataset D to get the samples which are

56

most relevant to the identification of coupling between the sites. Using this sample selection method, we can get a dataset in the central site that can be used to learn the cross links. These selected observations will, by design, not be useful to identify the links in the BN that are local to the individual sites. So the useful result in cross learning are the cross links. The observations corresponding to the set of keys in IndexLow are obtained from each of the local sites by the central site. This dataset is called Dcoll . Dataset Dcoll has samples involving all nodes so we can learn the coupling links as well as estimate the associated conditional distributions from Dcoll . These selected observations will, by design, not be useful to identify the links in the BN that are local to the individual sites. Cross learning is designed to find the cross links of cross nodes. In cross learning step, since Dcoll is just a small subset of D, some spurious links maybe introduced. These links do not consistently exist in D. Using the notion of weak link, we can check the r value of these cross links and remove these weak links. These links can be considered as noise in cross learning. Another important function of cross learning is to identify the cross nodes and cross set. In cross learning, if we find a link whose parent and child nodes are in different sites, then we will label the child node as a cross node and add parent and child node to the cross set. The information about cross nodes and cross set will be used in the combination step to remove the extra local links. 57

3.2.6

Combination

The last step of our collective learning is to combine the local BNs and Bcross learnt from cross learning into Bcoll . From the discussions in Sections 3.2.3 and 3.2.5, we conclude that (1) Local learning can find the correct structure of local nodes. (2) Cross learning can find the cross links of cross nodes. (3) If we only assemble all local BNs together and do not use any information from cross learning, there will be two types of errors: extra local link due to the hidden variable problem and missing cross links for cross variables. In the combining step, we focus on the last problem. That is, how to add the cross links and remove the extra local links. all As a first step, we assemble all local BNs together and get BN Blocal . Then add all all the cross links from Bcross to Blocal . Now the task is to remove the extra local

links. After cross learning, we have the information of cross node, cross set, and cross links. By combining Bcross and Blocal , we can identify the type I and type II extra link patterns. In particular, these links must end in a cross node and the parent node must be in the cross set. If there does exist this kind of hidden variable structure, we check the cross learning result for this cross node. An extra local link appearing due to the hidden variable problem will not be supported by the result of cross learning. Then we can remove it. We summarize the functionality of each of our steps in Table 3.2.

58

Table 3.2: Functionality of local learning, cross learning, and combining Local learning Cross learning Combining Find the structure Find the Put together local of local nodes cross links BNs and the cross BN Find the local Detect cross links of cross nodes nodes and cross set

3.3

Remove extra local links

Summary

In this chapter, we addressed the distributed structure learning problem. If we choose a base structure learning algorithm that has the decomposability property, we showed that the local structure of local variables can be correctly identified. Also the strong links from palocal (Y ) to Y can be identified for cross variable X. These are done in local learning. The key step of the collective method is to identify the samples Dcoll that are evidence of coupling between local and non-local variables. We showed that low likelihood samples in local sites are evidence of coupling terms with high probability. A subset of these samples are transmitted to a central site and dataset Dcoll is built. Cross learning step was used to get the cross links and identify cross all nodes and cross set based on Dcoll . Combination step adds the cross links to Blocal and

removes extra local links. This algorithm is verified experimentally in chapter 5. In particular, the collective method can learn the correct structure with a small fraction of samples transmitted to the central site. However, there are some limitations of the collective method. It may not perform well for densely connected networks; i.e., 59

if there are many cross links in the distributed model. This is because, in a densely connected distributed model, the local structure in site Li is quite different from the structure in real model involving all variables in site Li . This will make the local sample selection process to not perform well.

60

Chapter 4 Distributed Parameter Learning Two important issues in using a Bayesian network are: learning a BN and probabilistic inference. Learning a BN involves learning the structure of the network and obtaining the conditional probabilities (parameters). With the fixed structure G, learning parameters (conditional probabiliies) θ from data is called parameter learning. We considered the distributed structure learning problem in Chapter 3. We now address the parameter learning problem.

4.1

Introduction

For the BN parameter learning in a distributed scenario, the centralized solution is to download all datasets from distributed sites into a single site. Kenji [Ken97] worked

61

on BN parameter learning in the homogeneous distributed learning scenario. In this case, every distributed site has the same feature set but different observations. They proposed an approach called distributed cooperative BN learning in which different Bayesian agents estimate the parameters of the target distribution and a population learner combines the outputs of those Bayesian models. The agents and population learner are based on Gibbs algorithm. For the distributed heterogeneous case, we propose two collective parameter learning methods. These algorithms have been reported in [CSK01b, CSK01a, CSK02, CS02]. The basic idea of the collective parameter learning method is to decompose the parameter learning in distributed heterogeneous case into local learning and cross learning. Since we do not want to transmit all local datasets to a site as in a centralized learning scenario, the collective method tries to make a trade-off between the learning performance and data transmission overhead. In order to reduce the transmission overhead, we analyze the centralized parameter learning algorithm and use some properties to achieve less transmission. Local learning can get the parameter for local nodes. Then we choose a subset of samples to transmit to a central site based on the local BN models and use this subset to learn the parameters of cross nodes. Note that the best BN we can get in distributed heterogeneous case is the BN learnt from centralized learning algorithm which is referred to as Bcntr . For performance analysis, we compare the BN Bcoll learnt from the collective method with Bcntr . Our 62

objective is to make Bcoll as close to Bcntr as possible with only a small portion of samples transmitted to a single site. This chapter is organized as follows. Section 4.2 reviews the centralized BN parameter learning algorithms. Section 4.3 proposes distributed parameter learning algorithm CM1 and gives the performance analysis for this algorithm. Section 4.4 proposes another distributed parameter learning algorithm CM2 which is designed to handle the distributed learning task with real-time constraint. We summarize this chapter in section 4.5.

4.2

Parameter Learning in Centralized Case

Before we introduce the distributed learning method, we first review the centralized parameter learning algorithm. This is the base learning algorithm for our collective learning algorithm. Assuming the dataset is complete (no missing value) and θij are mutually independent, that is, p(θ | S) =

  i

j

p(θij | S). Mathematically, learning

the probabilities in BN can be stated as: given a random sample D, compute the posterior distribution p(θ | S, D). The distribution p(Xi | pa(Xi ), D, S), which is a function of θi , is referred to as the local distribution function. It can be considered as a probabilistic classification or regression function. In principle, any classification/regression forms can be used to learn probabilities in BN. In this dissertation,

63

we assume all variables are discrete and finite values. So we use unrestricted multinomial model. Two widely used algorithms are maximum likelihood (ML) and maximum a posterior (MAP) method. The learning problem is to estimate θ from the observed data and prior knowledge. Let D denote the dataset containing M cases and has the form c1 = {x11 , x12 , . . . , x1n } c2 = {x21 , x22 , . . . , x2n } ... M M cM = {xM 1 , x2 , . . . , x n }

D = {c1 ∪ c2 · · · ∪ cM } where each case contains an instantiation of features. If this dataset D is from multinomial sampling with parameter θ or the local distribution of the BN has multinomial form , then the sample likelihood is

p(D | θ) =

qi r i n   

N

θijkijk

(4.1)

i=1 j=1 k=1

here Nijk is the sufficient statistic of θijk or of state (xi = k, pa(xi ) = j). The ML estimate of θ which maximizes the sample likelihood with respect to θ is

Nijk ML θˆijk = . Nij 64

(4.2)

where Nij =

ri k=1

Nijk is the sufficient statistics of θij .

The ML method suffers from so-called sparse data problem. For instance, if we have Nij = 0 for some parent configuration pa(xi ) = j, then we cannot get the ML estimate because the sample likelihood does not exist. This could be a severe problem if the BN is very complicated and we have lot of parameters to estimate but the sample size is small. In this case, we can use MAP method to handle this problem. The well known Bayes theorem is stated next.

p(θ | D) =

p(D | θ)p(θ) . p(D)

(4.3)

Here p(θ) is the prior distribution of θ and p(D | θ) is the sample likelihood. If we assume the prior distribution of θij is Dirichlet with parameter {αij1 , αij2 , . . . , αijri }, then the posterior distribution p(θij | D) is also a Dirichlet distribution.

p(θij | D) ∼ Dir(Nij1 + αij1 , . . . , Nijri + αijri )

The MAP estimate of θij which maximizes posterior distribution p(θ | D) is as follows:

αijk + Nijk M AP = . θˆijk αij + Nij

65

(4.4)

4.3

Collective Bayesian Network Algorithm: CM1

We now present a collective strategy to learn a BN when data is distributed among different sites. We address here the heterogenous case, where each distributed site has all the observations for only a subset of the features. This section will concentrate on the parameter learning. The collective method proposed in this section is Collective Method 1 (CM1) [CSK01b].

4.3.1

Collective Method 1

The proposed distributed BN parameter learning algorithm, CM1, is described in algorithm 3. Algorithm 3 Collective Parameter Learning Algorithm: CM1 1: Learn local BN Blocal involving the variables observed at each local site. 2: Compute likelihood of samples based on local BNs. 3: Transmit the index set of low likelihood samples from each local site to the central site. 4: Compute the intersection of these index sets in central sites. 5: Transmit samples corresponding to the intersection set from all local sites. At the central site, a limited number of observations Dcoll of all the variables are now available. 6: Learn a new BN Bcross using Dcoll in central site. 7: Set the cross node parameters using Bcross and local node parameters using Blocal .

The primary steps in our approach are: 1: local learning, 2-5: sample selection, 6: cross learning, 7: combination. The non-local BN Bcross thus constructed would be effective in identifying associations between variables across sites, whereas the local 66

BNs would detect associations among local variables at each site. The conditional probabilities that involve only variables from a single site can be estimated locally, whereas the ones that involve variables from different sites can be estimated at the central site. Same methodology could be used to update the network based on new data. First, the new data is tested for how well it fits with the local model. If there is an acceptable statistical fit, the observation is used to update the local conditional probability estimates. Otherwise, it is also transmitted to the central site to update the appropriate conditional probabilities (of cross terms). Finally, a collective BN can be obtained by taking the union of nodes and edges of the local BNs and the nonlocal BN, along with the conditional probabilities from the appropriate BNs. Probabilistic inference can now be performed based on this collective BN. Note that transmitting the local BNs to the central site would involve a significantly lower communication as compared to transmitting the local data. It is quite evident that learning probabilistic relationships between variables that belong to a single local site is straightforward and does not pose any additional difficulty as compared to a centralized approach. In our collective learning method , we first learn a local BN involving the variables observed at each site based on local i data set. Let Blocal denote local BN at site i. If we use maximum likelihood (ML) to

estimate parameters, from equation (4.2), the estimate of θijk is entirely determined by Nijk . Then we have the following property [HGC95]: 67

Theorem 4 (Parameter modularity) Given two BNs B1 and B2 , if a variable Xi has same set of parents pa(Xi ), then CPT θi is the same in these two BNs. In [HGC95], the authors describe this property as an assumption for MAP learning. That is, if we assume the prior has the parameter modularity property, then the parameters learnt by MAP also have this property. For ML learning, parameter modularity can be derived from equation (4.2). From parameter modularity, we can i conclude that: CPT of local variable in Blocal is same as that in Bcntr . This guarantees

that local learning will produce the same result as the centralized learning approach for local variables.

4.3.2

Performance Analysis

In the following, we present a brief theoretical analysis of the performance of the proposed collective learning method. We compare the performance of Bcoll with Bcntr . There are two types of errors involved in learning a BN: (a) Error in BN structure and (b) Error in parameters (probabilities) of the BN. The structure error is defined as the sum of the number of correct edges missed and the number of incorrect edges detected. For parameter error, we need to quantify the “distance” between two probability distributions. We only consider learning error in the parameters, assuming

68

that the structure of the BN has been correctly determined (or is given). A widely used metric is the Kullback-Leibler (KL) distance (cross-entropy measure) dKL (p, q) between two discrete probabilities, {pi }, {qi }, i = 1, 2, . . . , N

dKL (p, q) =

N  i=1

pi pi ln( ) qi

(4.5)

where N is the number of possible outcomes. Indeed, if p∗ is the empirically observed distribution for data samples {si , 1 ≤ i ≤ M } and h is a hypothesis (candidate probability distribution for the underlying true distribution), then [ATW91]

dKL (p∗ , h) =

M  i=1

p∗ (si ) ln(

 1  1 p∗ (si ) 1 )= ln − ln(h(si )) h(si ) M M M i=1 i=1 M

M

M 1 1  = ln ln(h(si )). − M M i=1

(4.6)

Therefore, minimizing the KL distance with respect to the empirically observed distribution is equivalent to find the maximum likelihood solution h∗ of

M i=1

ln(h(si )).

Since the BN provides a natural factorization of the joint probability in terms of the conditional probabilities at each node (see (2.4)), it is convenient to express the KL distance between two joint distributions in terms of the corresponding conditional distributions. Let h and c be two possible (joint) distributions of the variables in a

69

BN. For i = 1, 2, . . . , n, let hi (xi | πi ), ci (xi | πi ) be the corresponding conditional distribution at node i, where xi is the variable at node i and πi is the set of parents of node i. Following [Das97], define a distance dCP (P, ci , hi ) between hi and ci with respect to the true distribution P :

dCP (P, ci , hi ) =

 πi

P (πi )



P (xi | πi ) ln(

xi

ci (xi | πi ) ). hi (xi | πi )

(4.7)

It is then easy to show that

dKL (P, h) − dKL (P, c) =

n 

dCP (P, ci , hi ).

(4.8)

i=1

Equations (4.7) and (4.8) provide a useful decomposition of the KL distance between the true distribution P and two different hypotheses c, h. This will be useful in our analysis of sample complexity in the following sub-section.

Sample Complexity We now derive a relationship between the accuracy of collective BN and the number of samples transmitted to the central site. We consider the unrestricted multinomial class BN, where all the node variables are Boolean. The hypothesis class H is determined by the set of possible conditional distributions for the different nodes. Given

70

a BN of n variables and a hypothesis class H, we need to choose a hypothesis h ∈ H which is close to a unknown distribution P . Given an error threshold  and a confidence threshold δ, we are interested in constructing a function N (, δ), such that if the number of samples M is larger than N (, δ)

P rob(dKL (P, h) < dKL (P, hopt ) + ) > 1 − δ,

(4.9)

where hopt ∈ H is the hypothesis that minimizes dKL (P, h). If smallest value of N (, δ) that satisfies this requirement is called the sample complexity. This is usually referred to as the probably approximately correct (PAC) framework. Friedman and Yakhini [FY96] have examined the sample complexity of the maximum description length principle (MDL) based learning procedure for BNs. Dasgupta [Das97] gave a thorough analysis for the multinomial model with Boolean variables. Suppose the BN has n nodes and each node has at most k parents. Given  and δ, an upper bound of sample complexity is

N (, δ) =

288n2 2k 2 3n 18n2 2k ln(1 + 3n/) ln (1 + ln ). 2  δ

(4.10)

Equation (4.10) gives a relation between the sample size and the (, δ) bound. For

71

the conditional probability hi (xi | πi ) = P (Xi = xi | Πi = πi ), we have (see (4.7))

dCP (P, hopt , h) ≤

 n

(4.11)

We now use the above ideas to compare the performance of the collective learning method with the centralized method. We fix the confidence δ and suppose that an cen can be found for the centralized method, for a given sample size M using (4.10). Then, following the analysis in [Das97, Section 5],

cen dCP (P, hcen )≤ opt , h

cen , n

(4.12)

cen where hcen is the hypothesis obtained based on a opt is the optimal hypothesis and h

centralized approach. Then from (4.8)

dKL (P, hcen ) − dKL (P, hcen opt ) =

n 

cen dCP (P, hcen i,opt , hi ) ≤

i=1

n  cen i=1

n

= cen .

(4.13)

From (4.9), with probability at least 1 − δ,

cen dKL (P, hcen ) ≤ dKL (P, hcen opt ) + 

(4.14)

For the collective BN learning method, the set of nodes can be split into two parts.

72

Let Vl be the set of local nodes and Vc be the set of cross nodes. For ASIA model in figure 3.1, Vl = {A, S, T, L, B, X} and Vc = {E, D}. We use nl and nc to denote the cardinality of the sets Vl and Vc . If a node x ∈ Vl , the collective method can learn the conditional probability P (x | pa(x)) using all data because this depends only on the local variables. Therefore, for x ∈ Vl ,

col dCP (P, hcol opt , h ) ≤

col cen 1 = , n n

(4.15)

cen . For the nodes in Vc , only the data transmitted where, for the local terms, col 1 = 

to the central site can be used to learn its conditional probability. Suppose Mc data samples are transmitted to the central site, and the error threshold col 2 satisfies (4.10), for the same fixed confidence 1 − δ. Therefore, for x ∈ Vc , we have from (4.11) col that dCP (P, hcol opt , h ) ≤

col 2 , n

cen where col , in general, since the in the collective 2 ≥ 

learning method, only Mc ≤ M samples are available at the central site. Then from (4.8) and (4.15)

dKL (P, hcol opt )

− dKL (P, h ) = col

n 

col dCP (P, hcol i,opt , hi )

i=1

=



col dCP (P, hcol i,opt , hi ) +

i∈Vl

=

 i∈Vc

nl cen nc col  + 2 n n

73

col (4.16) dCP (P, hcol i,opt , hi )

Comparing (4.13) and (4.16), it is easy to see that the error threshold of the collective method is col =

nl cen  n

+

nc col  . n 2

The difference of the error threshold between the

collective and the centralized method is

col − cen =

nc col ( − cen ) n 2

(4.17)

Equation (4.17) shows two important properties of the collective method. First, the difference in performance is independent of the variables in Vl . This means the performance of the collective method for the parameters of local variables is same as that of the centralized method. Second, the collective method is a tradeoff between accuracy and the communication overhead. The more data we communicate, more cen cen closely col . When Mc = M , col , and col − cen = 0. 2 will be to  2 = 

4.4

Collective Bayesian Network Algorithm: CM2

In this section we introduce the Collective Method 2 (CM2) [CS02], which uses the notion of cross set to reduce the number of variables whose observations need to be transmitted from each local site to the central site. We then discuss a method to select a subset of the observations to be transmitted to the central site. Finally, a comparison between CM1 and CM2 is presented.

74

One drawback of CM1 is that the likelihood computation in local site is not a trivial job and introduces some computational overhead. This may not be acceptable for real-time applications such as online monitoring of stock market data. For real time or online learning applications, we need to considerably reduce the amount of computation at the local sites. Towards that end, we propose a new collective learning method called CM2 for learning the parameters of a BN (CM2 assumes that the structure of the BN is known). Using the parameter modularity property, CM2 provides a new data selection method, which dramatically reduces the local computation time and whose performance is almost identical to that of CM1. The main steps in CM2 are similar to that of CM1 and are as follows: Algorithm 4 Collective Parameter Learning Algorithm: CM2 1: Learn local BN Blocal involving the variables observed at each local site. 2: Compute likelihood of variables in cross set CS(Ci ) based on local BNs. 3: Transmit the index set of low likelihood samples from each local site to the central site. 4: Compute the intersection of these index sets in central sites. 5: Transmit variables in cross set corresponding to the intersection set from all local sites. At the central site, a limited number of observations Dcoll of all the cross set variables are now available. 6: Learn a new BN Bcross using Dcoll in central site. 7: Set the cross node parameters using Bcross and local node parameters using Blocal .

The main steps in CM1 and CM2 are similar. CM2 is different from CM1 in two main respects: (a) In CM2, only the observations of variables in the cross set CS(Si ) at site Si is transmitted. This reduces the data transmission overhead. (b) In CM2, 75

the selection of samples to be transmitted is based on the joint distribution of the cross set variables (not the joint distribution of all site variables). As we shall discuss in the following subsection, this results in significant computational savings.

4.4.1

Data selection in CM2

We now discuss details of data selection and cross learning in the CM2 algorithm. As noted earlier, in CM2, the selection of samples to be transmitted is based on the joint distribution of the cross set variables (not the joint distribution of all site variables). We shall first discuss why this is justified. Later, we will describe the actual details of the data selection process. Most of local computation in CM1 can be attributed to the likelihood computation step. This involved computing the joint distribution of all local variables after the local BN is constructed. It corresponds to an inference process, which is computationally intensive. Consider a local site A with cross set variables CS and local set variables LS. The joint distribution of all variables in site A can be denoted as p(CS, LS). Consider the marginal distribution of P (CS): p(CS) =

 LS

p(CS, LS).

Therefore, if a sample configuration for CS variables has low likelihood value under p(CS), then there is at least one configuration of LS variables such that the corresponding joint configuration (CS, LS) has low likelihood value under p(CS, LS). So

76

the samples with low likelihood in p(CS) is a subset of those with low likelihood in p(CS, LS). This is depicted in figure 4.1. Therefore, the identification of samples for transmission can be based on the likelihood under p(CS) instead of p(CS, LS). In general, the cardinality of CS is much smaller than the total number of variables at the local site, which results in faster computation. Moreover, the total number of configurations (CS, LS) is usually prohibitively large, so that we cannot save all the p(CS, LS) values and create a lookup table — the likelihoods have to be computed sample by sample in this case. However, the total number of configurations (CS) is relatively small. Therefore, we can create a table of p(CS) values and use a table lookup method to get the likelihood for each sample. Recall that distributed learning is a kind of problem decomposition, that is, a large problem is decomposed into several relatively small sub-problems. In CM2, ee use this decomposition idea again for each of the sub-problems at the local sites. We divide the local variables into CS and LS. The CPT of cross variables is entirely determined by the cross set. We reduce the problem domain and use fewer variables. This can save lot of computations. As observed in CM1, those samples with low likelihood under p(CS, LS) are possible evidence of “cross relationships” between variables across sites. In CM1, we set a threshold for each local site and choose the samples whose likelihood is smaller 77

than this threshold. The set of those samples is denoted as dataset D2 in Figure 4.1. In CM2, we consider the set of samples with low likelihood under p(CS). This is denoted as dataset D3 in Figure 4.1. As discussed earlier, D3 ⊆ D2 . Some of samples with low likelihood under p(CS, LS) are in D2 but not in D3 . Therefore, we introduce a random sampling for the samples not in D3 . That is, if a sample has likelihood (under p(CS)) larger than the local threshold, we select it with some probability α. For example, if α = 0.1, 10% of samples not in D3 will be selected (along with the set D3 ) for transmission. The main steps in the data selection process can now be described as follows: • At each local site, identify the cross set CS and local set LS. Based on the local BN, compute joint distribution p(CS) of variables in cross set. • Based on p(CS) and local threshold, identify set D3 of samples with low likelihood. • Select all samples in D3 and a fraction α of the samples (uniform random sampling) in D3c . Get the index set of selected samples and transmit it to the central site. Figure 4.2 depicts the local site data filtering process in CM2 framework for distributed BN parameter learning.

78

(D1)Local Dataset

(D2)Samples with low likelihood in p(CS,LS) (D3)Samples with low likelihood in p(CS)

Figure 4.1: Local Dataset in Collective Learning Removing variables not in cross set

Sample filtering: Identification and Selecting

Local dataset

Figure 4.2: Distributed Parameter Learning Algorithm CM2 Framework

4.4.2

Comparison between CM1 and CM2

In a sense, the collective approach to learning the parameters of cross variables involves a selective sampling of the given dataset that is most relevant to the identification of coupling between the sites. This is a type of importance sampling, where we select the samples that have high conditional probabilities corresponding to the terms involving variables from both sites. Naturally, when the samples selected are pooled together at the central site, we can estimate the associated CPTs of cross variables. In both CM1 and CM2, we try 79

to choose the samples corresponding to the coupling between the sites. In our description of CM1 [CSK02], we showed that samples with low likelihood under p(CS, LS) are possible evidence of the coupling between variables across sites. However, this step is computationally intensive. The proposed method CM2 is a computationally simpler approach. It selects a subset of samples with low likelihood under p(CS, LS) (dataset D3 ) and uses a fraction of the remaining samples (chosen at random). This random sampling of observations in D3c is slightly worse than the earlier importance sampling in CM1. However, as verified by our experiments, the performance of CM2 is comparable to that of CM1. One slight drawback of CM2 that due to random sampling, the performance varies over different runs. Since we use the random sampling, we may get a different subset of samples in different runs. However, as verified by experiments, the variance of this performance is small.

4.5

Summary

In this chapter, two collective methods for distributed parameter learning were proposed. Both of them are based on the importance sampling of samples that correspond to coupling between the local and non-local variables. Variables in the problem domain are classified into local and cross variables. The parameters of local variables

80

can be correctly learnt in the local learning step and that of cross variables can be learnt in the cross learning step with high accuracy. Using these algorithms, a collective BN which is close to Bcntr can be achieved with a small portion of samples transmitted to a single site. When applying CM1 and CM2 to parameter learning, we assume the structure is known. However, the steps in CM1 can be used to learn the structure and parameter at the same time. Thus CM1 is more powerful than CM2. CM2 is designed to achieve fast local learning capability so that it can be used in the application with real-time constraint.

81

Chapter 5 Experimental Results In this chapter, we present experimental results to evaluate the proposed collective learning methods. We have proposed two collective parameter learning methods, CM1 and CM2, and one collective structure learning method. They all have four steps: local learning, data selection, cross learning, and combination. The key point of these algorithms is to identify the low likelihood samples in local site and use them for cross learning. The collective structure learning algorithm and CM1 have similar data selection step but uses the selected data for different purposes. CM2 uses cross set for data selection. In section 5.1, we define the error metrics used to evaluate the parameter and structure learning algorithms. Sections 5.3 and 5.4 present the experimental results for algorithm CM1 and CM2 for two distributed BN models: ASIA model and ALARM 82

network. Section 5.5 focuses on the structure learning experiment. We summarize the results in section 5.6.

5.1

Evaluation Methodology

The performance analysis of the collective method is based on the comparison of the BN Bcoll learnt from the collective method with the BN Bcntr learnt by centralizing the all data. In collective parameter learning, we use the conditional probabilities from local BN for the local nodes and those estimated at the global site for the cross nodes. A BN represents a joint distribution. So the difference between two BNs can be evaluated by the difference between two joint distributions represented by BNs. In distributed parameter learning, these are between the joint probabilities computed using our collective approach and the one computed using a centralized approach. A widely used metric is the Kullback-Leibler (KL) distance (cross-entropy measure) dKL (p, q) between two discrete probability distributions, {pi } and {qi }, i = 1, 2, . . . , N : dKL (p, q) =

N  i=1

pi pi ln( ), qi

where N is the number of possible outcomes. A more important test of the collective approach is the error in estimating the conditional probabilities of the cross terms, estimated at the global site, based on a 83

selective subset of data. A metric called conditional KL distance is defined as follows:

DCKL (i, Bcntr , Bcoll ) =



ij pcntr (j) · DKL (pij cntr , pcoll ).

(5.1)

j

DCKL (i, Bcntr , Bcoll ) is the distance between two conditional probability tables (CPT) of node xi . Note that each row of CPT is a distribution with fixed parent configuration ij pa(i) = j. So DKL (pij cntr , pcoll ) is the KL distance of variable xi with a specific parent

configuration j. For structure learning, the error metric between Bcoll and Bcntr is the sum of the number of links in Bcoll but not in Bcntr (extra links) and links not in Bcoll but in Bcntr (missing links).

5.2

DistrBN system

A software package called DistrBN has been developed to implement the distributed Bayesian network learning algorithms. DistrBN defines a Bayesian network learning agent. Each learning agent associates with a dataset, BN learning algorithms (parameter and structure), learning task description, and evaluation metric. If we load the data into an agent and set the task type, it can automatically learn a BN. The software partially uses a BN C++ library called SMILE (http://www2.sis.pitt.

84

edu/~genie/). DistrBN has a client/server architecture. At each local site, a client program is run. The server can run in one of the local sites or any other site acting as the central site. The functionality of client/server is described in table 5.1. When we start a distributed BN learning task, the clients in local sites take charge of the local BN learning and likelihood compution. The server decides the data needed to be transmitted and gets these samples from each site. Then server learns the collective BN. Server Sample selecting Data transmission control (including data and local BN) Cross learning Combination

Client Local learning Likelihood computing

Table 5.1: Distributed learning system: DistrBN

5.3

Parameter Learning: CM1

We tested our approach on two datasets — ASIA model and ALARM network. The results for the two cases are presented in the following subsections.

85

5.3.1

ASIA Model

Our experiments were performed on a dataset that was generated from the BN depicted in Figure 5.1 (ASIA Model). The conditional probability of a variable is a multidimensional array, where the dimensions are arranged in the same order as ordering of the variables, viz. {A, S, T, L, B, E, X, D}. Table 5.2 (left) depicts the conditional probability of node E. It is laid out such that the first dimension toggles fastest. From Table 5.2, we can write the conditional probability of node E as a single vector as follows: [0.9, 0.1, 0.1, 0.01, 0.1, 0.9, 0.9, 0.99]. The conditional probabilities (parameters) of ASIA model are given in Table 5.2 (right) following this ordering scheme. We generated n = 6000 observations from this model, which were split into two sites as illustrated in Figure 5.1 (site A with variables A, T, E, X, D and site B with variables S, L, B). Note that there are two edges (L → E and B → D) that connect variables from site A to site B, the rest of the six edges being local. A

S

T

L

B

E

X

D

Figure 5.1: Distributed ASIA Model: {A, T, E, X, D} in Site A and {S, L, B} in Site B

86

No. 1 2 3 4 5 6 7 8

T F T F T F T F T

L F F T T F F T T

E F F F F T T T T

Probability 0.9 0.1 0.1 0.01 0.1 0.9 0.9 0.99

A 0.99 S 0.5 T 0.1 L 0.3 B 0.1 E 0.9 X 0.2 D 0.9

0.01 0.5 0.9 0.6 0.8 0.1 0.6 0.1

0.9 0.1 0.7 0.4 0.9 0.2 0.1 0.01 0.8 0.4 0.1 0.01

0.1

0.9

0.9

0.99

0.1

0.9

0.9

0.99

Table 5.2: (Left) The conditional probability of node E and (Right) All conditional probabilities for the ASIA model The estimated parameters of these two local Bayesian networks are depicted in Table 5.3. Clearly, the estimated probabilities at all nodes, except nodes E and D, are close to the true probabilities given in Table 5.2. In other words, the parameters that involve only local variables have been successfully learnt at the local sites. A fraction of the samples, whose likelihood are smaller than a selected threshold T , were identified at each site. In our experiments, we set

Ti = µi + ασi ,

i ∈ {A, B},

(5.2)

for some constant α, where μ_i is the (empirical) mean of the local likelihood values and σ_i is the (empirical) standard deviation of the local likelihood values. The samples with likelihood less than the threshold (T_A at site A and T_B at site B) at both sites were sent to a central site. The central site learned a global BN based on these samples. Finally, a collective BN was formed by taking the union of the parameters learnt locally and those learnt at the central site.

Local site A:
A   0.99  0.01
T   0.10  0.84  0.90  0.16
E   0.50  0.05  0.50  0.95
X   0.20  0.60  0.80  0.40
D   0.55  0.05  0.45  0.95

Local site B:
S   0.49  0.51
L   0.30  0.59  0.70  0.41
B   0.10  0.81  0.90  0.19

Table 5.3: The conditional probabilities of local site A and local site B

Now we assess the accuracy of the estimated conditional probabilities. Figure 5.2 (right) depicts the KL distance D(pcntr(V), pcoll(V)) between the joint probability computed using our collective approach and the one computed using a centralized approach. Clearly, even with a small communication overhead, the estimated conditional probabilities based on our collective approach are quite close to those obtained from a centralized approach. The KL distance between the conditional probabilities of Bcoll and Bcntr was also computed for the cross terms p(E | T, L) and p(D | E, B). Figure 5.3 (top left) depicts the CKL distance for node E and Figure 5.3 (top right) depicts the CKL distance for node D. Clearly, even with a small data communication, the estimates of the conditional probabilities of the cross terms, based on our collective approach, are quite close to those obtained by the centralized approach.
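As a concrete illustration of the sample-selection step above, the threshold rule of equation (5.2) can be implemented as in the following minimal sketch; the function name is ours, and the likelihood values are assumed to come from evaluating each sample under the local BN.

    import numpy as np

    def select_low_likelihood(likelihoods, alpha):
        """Indices of samples whose local likelihood falls below T = mu + alpha*sigma (Eq. 5.2)."""
        likelihoods = np.asarray(likelihoods, dtype=float)
        threshold = likelihoods.mean() + alpha * likelihoods.std()
        return np.flatnonzero(likelihoods < threshold)

    # Example: 'local_likelihoods' would come from scoring each local sample under
    # the local BN; the selected indices are the samples transmitted to the center.
    local_likelihoods = np.random.default_rng(0).random(1000)
    to_transmit = select_low_likelihood(local_likelihoods, alpha=-0.5)

Smaller values of α give a lower threshold and hence fewer transmitted samples.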


Figure 5.2: Performance of Collective BN in CM1 (left: errors in BN structure; right: KL distance between the joint probabilities, each plotted against the fraction of observations communicated)

To further verify the validity of our approach, the transmitted data at the central site was used to estimate two local terms, node S and node B. The corresponding CKL distances are depicted in the bottom row of Figure 5.3 (left: node S and right: node B). It is clear that the estimate of these probabilities is quite poor. This clearly demonstrates that our technique can be used to perform a biased sampling for discovering relationships between variables across sites.
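All of the comparisons in this section use the KL distance between distributions obtained from Bcntr and Bcoll. For discrete distributions it can be computed as in the following minimal sketch (the CKL distances apply the same idea to conditional probability tables); the epsilon smoothing is ours, added only to avoid log(0):

    import numpy as np

    def kl_distance(p, q, eps=1e-12):
        """D(p || q) = sum_x p(x) * log(p(x) / q(x)) for discrete distributions."""
        p = np.asarray(p, dtype=float).ravel()
        q = np.asarray(q, dtype=float).ravel()
        return float(np.sum(p * np.log((p + eps) / (q + eps))))

    # Example: the joint distributions of Bcntr and Bcoll over the 2^8 ASIA
    # configurations would be compared as kl_distance(p_cntr, p_coll).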

Figure 5.3: KL Distance between Conditional Probabilities for ASIA Model in CM1 (CKL distance vs. fraction of observations communicated; top row: cross terms E and D, bottom row: local terms S and B)

5.3.2 ALARM Network

This experiment illustrates the scalability of our approach with respect to the number of sites, features, and observations. In this experiment, we use a real-world BN


application called the ALARM network. The ALARM network has been developed for on-line monitoring of patients in intensive care units and was generously contributed to the community by Ingo Beinlich and his collaborators [BSCC89]. It is a successful application of BNs in the medical diagnosis area, and it is a widely used benchmark network for evaluating learning algorithms. The structure of the ALARM network is shown in Figure 5.4. The ALARM network is not a trivial BN: it has 37 nodes and 46 edges. The nodes are discrete but not necessarily binary. The node definitions of ALARM are: 1-Anaphylaxis, 2-Intubation, 3-KinkedTube, 4-Disconnect, 5-MinVolSet, 6-VentMach, 7-VentTube, 8-VentLung, 9-VentAlv, 10-ArtCO2, 11-TPR, 12-Hypovolemia, 13-LVFailure, 14-StrokeVolume, 15-InsuffAnesth, 16-PulmEmbolus, 17-Shunt, 18-FiO2, 19-PVSat, 20-SaO2, 21-Catechol, 22-HR, 23-CO, 24-BP, 25-LVEDVolume, 26-CVP, 27-ErrCauter, 28-ErrLowOutput, 29-ExpCO2, 30-HRBP, 31-HREKG, 32-HRSat, 33-History, 34-MinVol, 35-PAP, 36-PCWP, 37-Press. In order to test the scalability of our approach with respect to the number of nodes and observations, a dataset with 15000 samples was generated. The 37 nodes were split into 4 sites as follows: site 1: {3, 4, 5, 6, 7, 8, 15}, site 2: {2, 9, 10, 18, 19, 29, 34, 37}, site 3: {16, 17, 20, 21, 22, 27, 30, 31, 32, 35}, and site 4: {1, 11, 12, 13, 14, 23, 24, 25, 26, 28, 33, 36}. Note that there are 13 cross links. We assumed that the structure of the Bayesian network was given, and tested our approach for estimating the conditional probabilities.

Figure 5.4: ALARM Network

The KL distance between the conditional probabilities estimated from our collective BN and from a BN obtained using a centralized approach was computed. In particular, we illustrate the results for the conditional probabilities at two different nodes, 20 and 21, both of which are cross terms. Figure 5.5 (left) depicts the CKL distance of node 20 between the two estimates. Figure 5.5 (right) depicts a similar CKL distance for node 21. Clearly, even with a small data communication, the estimates of the conditional probabilities of the cross terms, based on our collective approach, are quite close to those obtained by the centralized approach. Note that node 21 has 4 parents, one of them being local (in the same site as node 21) with the other three being in different sites. Also, the conditional probability table of node 21 has 54 parameters, corresponding to the possible configurations of node 21 and its four parents. Consequently, learning the parameters of this node is a non-trivial task. Our experiments clearly demonstrate that our technique can be used to perform a biased sampling for discovering relationships between variables across sites. This simulation also illustrates the fact that the proposed approach scales well with respect to the number of nodes and samples. Next, we test the scalability with respect to the number of sites. As the number of sites increases, there are more cross terms and cross edges. This not only means that there are more nodes whose conditional probabilities need to be learnt at the central site, but it also affects the likelihood computation at the local sites, which in turn affects the sample selection step. We use four different splitting cases and two nodes to test this.

Figure 5.5: KL Distance between Conditional Probabilities for ALARM Network (nodes 20 and 21; CKL distance vs. fraction of observations communicated)

As Table 5.4 shows, node 20 and node 21 are always set to be cross terms. We increase the number of sites and make the learning problem more difficult. The experimental results are shown in Figure 5.6. Clearly, increasing the number of sites does make the collective learning more difficult, and the performance with a smaller number of sites is better than that with a larger number of sites. However, our approach converges rapidly even for a large number of sites, and the performance for different numbers of sites is similar after a relatively small portion of the samples has been transmitted (from Figure 5.6, about 35% of the samples). This clearly illustrates that our approach scales well with respect to the number of sites.

Number of sites   Parents of n20               Parents of n21
2 sites           n16 in local, n2 in other    n11, n20 in local; n15, n29 in other
3 sites           n16 in local, n2 in other    n15, n20 in local; n11, n29 in other
4 sites           n16 in local, n2 in other    n20 in local; n11, n15, n29 in other
5 sites           n16 in local, n2 in other    n20 in local; n11, n15, n29 in other

Table 5.4: Splitting cases

Figure 5.6: KL Distance Between Conditional Probabilities for ALARM Network for Different Splitting Cases (cross terms, nodes 20 and 21; CKL distance vs. fraction of observations communicated for 2, 3, 4, and 5 sites)

5.4 Parameter Learning: CM2

We provide two sets of experimental results to illustrate the accuracy and efficiency of the proposed CM2 algorithm. The first experiment is based on the distributed ASIA model; the results show that the performance of CM2 is very close to that of CM1. The second experiment is based on the ALARM network and is designed to test the scalability of the collective method CM2.
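The efficiency gain of CM2 comes from restricting the per-sample likelihood computation at a local site to the variables in its cross set. The following minimal sketch (the data structures and names are ours) shows such a restricted computation; CM1 would instead sum over every variable at the site.

    import math

    def restricted_loglik(sample, cpts, parents, cross_set):
        """Log-likelihood contribution of only the cross-set variables of one site.

        sample:    dict mapping variable name -> observed state
        cpts:      dict mapping variable name -> {(parent_states, state): probability}
        parents:   dict mapping variable name -> tuple of parent names
        cross_set: variables of this site that are cross variables or parents of one
        """
        total = 0.0
        for v in cross_set:
            pa_states = tuple(sample[p] for p in parents[v])
            total += math.log(cpts[v][(pa_states, sample[v])])   # assumes nonzero entries
        return total

Since the cross set is usually much smaller than the full variable set of a site, both the scoring cost and the amount of transmitted data drop, which is where the speedups reported below come from.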

5.4.1 ASIA Model

In this experiment, the ASIA model is used to test the CM2 algorithm. The distributed ASIA model is shown in Figure 5.1. There are two sites, and E and D are the cross variables. The dataset has 6000 samples. By changing the threshold separating high- from low-likelihood samples in the sample selection step, we can change the size of the set Dcoll, which is the subset of data transmitted to the central site. Since we use a random sampling method in CM2 for samples in D3c, the CKL distances depend on the specific run. Therefore, we repeated each experiment 20 times and computed the average CKL distance. Algorithm CM1 was applied to the same dataset and the result of CM1 is used for comparison. The CKL distances for nodes E and D for the BNs BCM1 and BCM2 were computed (both with respect to Bcntr). Figure 5.7 depicts the results graphically as a function of the fraction |Dcoll|/|D| of


data transmitted. For our experiment, |D| = 6000. It is clear from Figure 5.7 that the performance of CM2 is almost identical to that of CM1. The local computation times for the two sites are depicted in Table 5.5. Clearly, CM2 achieves a significant computational savings (speedup factor in excess of 400) over CM1 for almost identical performance.

Site #   CM1   CM2      Speedup factor: CM1/CM2
Site 1   5.7   0.012    475
Site 2   3.9   0.0094   415

Table 5.5: Comparison of local computation time (in seconds) for CM1 and CM2 in the ASIA model

Figure 5.7: ASIA Model: CKL Distance for Cross Variables in CM2 (nodes E and D; CM1 vs. CM2 as a function of |Dcoll|/|D|)

5.4.2 ALARM Network

This experiment illustrates the scalability of CM2 with respect to the number of features and observations. The distributed ALARM network was used in this experiment. In order to test the scalability of our approach with respect to the number of nodes and observations, a dataset with 15000 samples was generated. The 37 nodes were split into 3 sites as follows: site 1: {2, 3, 4, 5, 6, 7, 8, 9, 10, 18, 19, 29, 34, 37}, site 2: {15, 16, 17, 20, 21, 22, 27, 30, 31, 32, 35}, and site 3: {1, 11, 12, 13, 14, 23, 24, 25, 26, 28, 33, 36}. Four cross variables, 20, 21, 23, and 30, were used to test the performance of CM2. As before, for each threshold, the program was run 20 times and the mean and variance of the CKL distances were computed. We applied CM1 to the same dataset and obtained the corresponding CKL distances for these four variables. In this experiment, we also compared the performance of CM2 to a completely random sample selection method. This method randomly selects a subset of samples from the entire dataset; there is no importance sampling in the random selection procedure. It can be considered a naive approach and is used here mainly for comparison purposes. Figure 5.8 depicts the comparison of CKL distances for the CM1, CM2, and random sampling methods. All the CKL distances are computed with respect to Bcntr. These results show that: (a) CM2 scales well with the number of samples


and features; (b) the performance of CM2 and CM1 is comparable, and both are better than the random method. This clearly shows that the importance sampling used in the collective methods is better than random sampling. Figure 5.9 depicts the variance of the CKL distance for CM2 and random sampling over 20 different runs. Here the random method was run 20 times for a fixed number |Drand| of samples transmitted. Since most of the samples in Dcoll are selected in a deterministic way (those samples in D3 with likelihood below a threshold), the performance of CM2 varies less than that of the random method. This is verified in Figure 5.9. Finally, Table 5.6 shows the computation time for CM1 and CM2. Clearly, we see an almost 900-fold decrease in computation time.

Site #   CM1     CM2     Speedup factor: CM1/CM2
Site 1   44.12   0.034   1298
Site 2   55.46   0.063   880
Site 3   44.97   0.04    1124

Table 5.6: Comparison of local computation time (in seconds) for CM1 and CM2 in the ALARM network

Figure 5.8: CKL of CM1, CM2, and Random method (nodes 20, 21, 23, and 30; CKL distance vs. |Dcoll|/|D|)

Figure 5.9: Variance of CM2 and Random method (variance of the CKL distance over 20 runs for nodes 20, 21, 23, and 30)

5.5 Structure Learning

Two sets of experimental results are provided to evaluate the performance of the proposed collective BN structure learning algorithm. The first experiment is based on the ASIA model. The results for the ASIA model show that the collective method can


learn the correct structure. The second experiment is based on the ALARM network and is designed to test the scalability of the collective method.

5.5.1 ASIA Model

This experiment illustrates the ability of the proposed collective learning approach to correctly obtain the structure of the BN. We test the local learning, cross learning, and combination steps separately. First, our experiment was performed on a dataset with 6000 samples that was generated from the ASIA model. The node ordering is {A, T, S, L, E, X, B, D}. We generated two distributed models to test our algorithm. In Figure 5.10 (a), there are two cross links, L → E and E → D, and two cross nodes, {E, D}. The local learning detected the correct structure for all local nodes and the links T → E and B → D. An extra local link L → D was also detected, because of the hidden path L → E → D. After transmitting only 15% of the samples, the two cross links were detected by cross learning. In the combination step, we assemble all local BNs together and add the cross links. Finally, we found that D is a cross node and there exists a path L → E → D. This means that the link L → D found during local learning could be an extra local link. We found that this link was clearly not supported by our cross learning result. Therefore, we removed it from the final BN structure. Clearly, after transmitting about 15% of the observations, we can obtain


the correct BN structure. Figure 5.10 (b) gives a distributed BN model with three cross links: T → E, S → L, and S → B. In this case, local learning gave a hidden parent extra link L → B. Experimental results show that we can get all cross links and remove the extra links with 15% of the samples transmitted. Sometimes, local links that seem to be due to the hidden variable phenomenon may actually exist in the real BN structure. This scenario is illustrated in the BNs of Figures 5.10 (c) and (d). Here the links L → D and L → B are actual links which the combination step may consider as extra links. We need to make sure that they are not eliminated during the final combination step when we remove extra local links. This was tested in the following experiment. We applied the collective learning algorithms to the distributed BN models in Figures 5.10 (c) and (d). All local links were correctly identified during local learning and all cross links were learnt in the cross learning step. During the combination step, when we checked the cross learning result, we found that the links L → D and L → B were supported by the cross learning. So we did not remove these links. This experiment illustrates that our collective learning method will not remove the correct local links.
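A minimal sketch of the combination rule used in these experiments is given below. The data structures, the (partial) example site assignment, and the exact form of the support test are ours; the dissertation's actual criterion is the cross-learning result described in Chapter 4.

    def combine(local_edges, cross_edges, cross_nodes, supported_by_cross_learning):
        """Union of local and cross edges, dropping unsupported 'extra' local links.

        local_edges, cross_edges: sets of (parent, child) tuples
        cross_nodes: nodes that have a parent at another site
        supported_by_cross_learning: local links confirmed by the cross-learning step
        """
        edges = set(local_edges) | set(cross_edges)
        for (u, v) in set(local_edges):
            # Candidate extra links point into the cross set: the child is a cross node.
            if v in cross_nodes and (u, v) not in supported_by_cross_learning:
                edges.discard((u, v))
        return edges

    # Partial example in the spirit of Figure 5.10 (a): the locally learnt link
    # ('L', 'D') is dropped because D is a cross node and cross learning does not
    # support L -> D, while the supported links T -> E and B -> D are kept.
    local_edges = {('A', 'T'), ('T', 'E'), ('B', 'D'), ('L', 'D')}
    cross_edges = {('L', 'E'), ('E', 'D')}
    print(sorted(combine(local_edges, cross_edges, {'E', 'D'},
                         supported_by_cross_learning={('T', 'E'), ('B', 'D')})))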


Figure 5.10: Distributed ASIA Model Structure Learning

5.5.2 ALARM Network

The accuracy and efficiency of the collective structure learning method for a bigger and more complicated distributed BN model were tested in this experiment. The structure of the distributed ALARM network is shown in Figure 5.11. The 37 nodes were split into 3 sites as follows: site A: 13 nodes, {2 3 4 5 6 7 8 9 10 18 19 34 37}; site B: 12 nodes, {13 15 16 17 20 21 22 27 30 31 32 35}; site C: 12 nodes, {1 11 12 14 23 24 25 26 28 29 33 36}. There are 9 cross nodes (14, 17, 20, 21, 23, 25, 29, 30, 33) and 11 cross links (2 → 17, 8 → 29, 10 → 21, 10 → 29, 11 → 21, 13 → 14, 13 → 25, 13 → 33, 19 → 20, 22 → 23, 28 → 30) in this distributed ALARM model. In our experiment, local learning detected the correct structure of the local nodes and gave local links for the cross nodes. There were 6 extra local links introduced by local learning. Link 11 → 23 is a hidden path extra link due to the path 11 → 21 → 22 → 23. Links 14 → 25, 14 → 33, and 25 → 33 are hidden parent extra links. Link

12 → 33 is also a hidden parent extra link, because node 25 and node 33 have a hidden parent 13 and there exists the link 12 → 33. The extra link 23 → 29 was a combination of the hidden path and hidden parent problems: the hidden path 10 → 21 → 22 → 23 makes 10 a hidden parent of 23 and 29. Note that all these extra links are inside the cross set and the children of the extra links are cross nodes. In the cross learning step, we obtained all 11 cross links correctly. Table 5.7 gives the cross learning result for |Dcoll| = 2748. All real cross links were detected. Three false cross links were removed since they were weak links. In the combination step, we can remove all extra local links by checking whether they are supported by cross learning. After about 20% of the total samples are transmitted to the central site, we obtain exactly the same BN structure as the original ALARM network. This experiment clearly shows that our collective learning algorithm can learn the correct structure with a small amount of data transmission, for a complicated distributed BN model and a large dataset. Next, we tested the scalability of our approach with respect to the number of sites. As the number of sites increases, there are more cross links. The local learning will introduce more extra links, and the interaction of nodes inside the cross set will cause a very complicated local BN structure. The experimental results are shown in Table 5.8. Although increasing the number of sites makes the structure learning more difficult, the collective method can still get the correct structure with a small portion of the samples transmitted. For the four-site case in Table 5.8, there are 15 cross links. Considering

that the total number of links is 46, it is remarkable that the collective method can detect all cross links and remove 8 extra local links.

Figure 5.11: ALARM Network Structure Learning Experiment: 3 Sites

Cross link   r          Strong link
2 → 12       0.000625   False
2 → 17       0.597367   True
8 → 29       0.480873   True
9 → 13       0.030429   False
10 → 21      0.370837   True
10 → 29      0.624849   True
11 → 15      0.006789   False
11 → 21      0.079446   True
13 → 14      0.107932   True
13 → 25      0.227451   True
13 → 33      0.676715   True
19 → 20      0.459421   True
22 → 23      0.306649   True
28 → 30      0.210296   True

Table 5.7: Cross learning result

Number of sites: 2
  Site information: site A: 1 2 3 4 5 6 7 8 9 10 11 18 19 25 26 29 33 34 36 37; site B: 12 13 14 15 16 17 20 21 22 23 24 27 28 30 31 32 35
  Number of cross links: 8; number of extra links: 3; |Dcoll|/|D| when we can get the correct structure: 15.6%

Number of sites: 3
  Site information: site A: 2 3 4 5 6 7 8 9 10 18 19 34 37; site B: 13 15 16 17 20 21 22 27 30 31 32 35; site C: 1 11 12 14 23 24 25 26 28 29 33 36
  Number of cross links: 11; number of extra links: 6; |Dcoll|/|D| when we can get the correct structure: 18.3%

Number of sites: 4
  Site information: site A: 3 4 5 6 7 8 37; site B: 2 9 10 18 19 34; site C: 13 15 16 17 20 21 22 27 30 31 32 35; site D: 1 11 12 14 23 24 25 26 28 29 33 36
  Number of cross links: 15; number of extra links: 8; |Dcoll|/|D| when we can get the correct structure: 25.4%

Table 5.8: Scalability experiment results

5.6 Summary

In this chapter, we presented several experiments to evaluate the learning algorithms described in Chapters 3 and 4. These results strongly support the proposed learning algorithms. From these results, we can conclude: (1) For parameter learning, both CM1 and CM2 can learn the parameters accurately with a small portion of the data transmitted to a single site. They scale well with the number of features and the number of samples. (2) The performance of CM2 is comparable to that of CM1. However, CM2 requires much less local computation than CM1, with speedup factors of 500-1000 in our experiments. It is therefore suitable for real-time applications. (3) The collective BN structure learning method learns the same structure as that obtained from a centralized approach, even with a small amount of data communication, for a fairly complicated distributed BN model.


Chapter 6

Applications

Two real-world distributed Bayesian network learning applications are provided in this chapter. These applications are in the areas of scientific/engineering data mining and web log analysis. In a real-world application, we should first get information about the customer requirements and decide on the goal, such as descriptive learning, predictive learning, or both. Then we preprocess the data, which includes data extraction, cleaning, and transformation based on the customer requirements. After that, we choose an appropriate learning algorithm. In this chapter, the customer requirement is to find models and patterns from distributed heterogeneous databases, and a Bayesian network is used as the descriptive model. In these applications, a lot of effort is involved in the data preparation step. We need to find the best data preprocessing and transformation methods so that the collective learning algorithms can learn useful models

from the dataset. In fact, in some applications, data preparation can be crucial for a successful data mining application and can represent up to 80% of the total workload. The collective learning algorithms described in Chapters 3 and 4 have been used in both applications.

6.1 Application: NASA DAO and NOAA AVHRR Pathfinder Datasets

In this Earth science distributed data mining application, we use two datasets: the NASA DAO monthly mean subset and the NOAA AVHRR Pathfinder product.

6.1.1 Description of the Datasets

The data model in these two datasets is a multidimensional time series, as shown in Figure 6.1. The dimensions are (time, longitude, latitude, features). Each spatial grid point contains many features.

Figure 6.1: Earth Science Data model (a grid of (longitude, latitude) points, each carrying features (f1, f2, ..., fn), over the time range January 1983 to December 1992)

The NASA Data Assimilation Office (DAO) provides comprehensive and dynamically consistent datasets that represent the best estimates of the state of the atmosphere at that time. The current product, GEOS-1, uses meteorological observations and an atmospheric model. The dataset we use is a subset of the DAO monthly mean data set. The DAO monthly mean data set, in turn, is based on the DAO's full multi-year assimilation. The DAO monthly mean has 180 grid points in the longitude direction from west to east, with the first grid point at 180W and a grid spacing of 2 degrees. There are 91 grid points in the latitude direction from north to south, with the first grid point at 90N and a grid spacing of 2.0 degrees. The description of the DAO monthly subset is as follows:

1. Temporal Coverage: March 1980 - November 1993
2. Temporal Resolution: All gridded values are monthly means
3. Spatial Coverage: Global
4. Horizontal Resolution: 2 degree x 2 degree, grid point data (180 x 91 values per level, proceeding west to east and then north to south)

The dataset we use from NOAA is a product of NOAA AVHRR Pathfinder. The description of the dataset is as follows:

1. Temporal Coverage: July 1981 - November 2000
2. Temporal Resolution: All gridded values are monthly means
3. Spatial Coverage: Global
4. Horizontal Resolution: 1 degree x 1 degree, grid point data (360 x 180 values per level, proceeding west to east and then north to south)

6.1.2 Preprocessing

Feature Selection

There are 26 features in DAO and 9 features in NOAA. We try to utilize as many features as possible. Some features contain a lot of missing values. One possibility is to use an interpolation technique, such as nearest-neighbor averaging, to handle this problem. However, some features have missing values at certain grid points because those features do not exist at that grid point. For example, some features from the NOAA dataset are only valid over the ocean region. Although other features have values at such a grid point, these missing-value features make the whole record at that grid point useless, since we try to build a model that represents the relationship among all variables. So we decided to drop the variables containing many missing values. We also dropped some multi-layer features and very deterministic features (those that show little variability). After dropping these features, we were left with 15 DAO and 7 NOAA features. These features are listed in Tables 6.1 and 6.2.

Index  Feature  Description                                   Units
1      Cldfrc   2-dimensional total cloud fraction            Unitless
2      Evaps    Surface evaporation                           mm/day
3      Olr      Outgoing longwave radiation                   W/m**2
4      Osr      Outgoing shortwave radiation                  W/m**2
5      Pbl      Planetary boundary layer depth                HPa
6      preacc   Total precipitation                           mm/day
7      qint     Precipitable water                            g/cm**2
8      radlwg   Net upward longwave radiation at ground       W/m**2
9      radswg   Net downward shortwave radiation at ground    W/m**2
10     t2m      Temperature at 2 meters                       K
11     tg       Ground temperature                            K
12     ustar    Surface stress velocity                       m/s
13     vintuq   Vertically averaged uwnd*sphu                 (m/s)(g/kg)
14     vintvq   Vertically averaged vwnd*sphu                 (m/s)(g/kg)
15     winds    Surface wind speed                            m/s

Table 6.1: NASA DAO features

Index  Feature      Description
16     asfts        Absorbed Solar Flux total/day
17     olrcs day    Outgoing Long Wave Radiation clear/day
18     olrcs night  Outgoing Long Wave Radiation clear/night
19     olrts day    Outgoing Long Wave Radiation total/day
20     olrts night  Outgoing Long Wave Radiation total/night
21     tcf day      Total Fractional Cloud Coverage day
22     tcf night    Total Fractional Cloud Coverage night

Table 6.2: NOAA features

Coordination

The next preprocessing step is to coordinate the distributed datasets. It is used to link an observation across the different sites. Since DAO and NOAA have different grid formats, we re-grid the NOAA data into the DAO format. The temporal coverage of the merged dataset is January 1983 - December 1992, and the spatial coverage is global. Using the mapping key (time, longitude, latitude), we get a distributed database.
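A minimal sketch of this coordination step is given below, assuming the data are held as tables with columns time, lon, and lat plus numeric feature columns. The column names, the snap-to-grid rule, and the averaging of NOAA cells that land on the same DAO grid point are our assumptions.

    import pandas as pd

    def to_dao_grid(df, lon_col="lon", lat_col="lat"):
        # Snap 1x1-degree NOAA coordinates to the 2-degree DAO grid spacing.
        out = df.copy()
        out[lon_col] = (out[lon_col] / 2).round() * 2
        out[lat_col] = (out[lat_col] / 2).round() * 2
        return out

    def coordinate(dao_df, noaa_df):
        noaa_on_dao = to_dao_grid(noaa_df)
        # Average NOAA cells that fall onto the same DAO grid point, then link the
        # two sites' observations with the key (time, longitude, latitude).
        noaa_on_dao = noaa_on_dao.groupby(["time", "lon", "lat"], as_index=False).mean()
        return dao_df.merge(noaa_on_dao, on=["time", "lon", "lat"], how="inner")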

Clustering

In general, the global dataset does not have a homogeneous pattern. It is impossible to use one BN to represent the climate model of the entire globe over these ten years. Clustering is used to segment the global dataset into small regions in which relatively homogeneous patterns exist. We use the k-means clustering algorithm. In our experiment, we set the number of clusters k to 5 or more, and the results show that the data in a region of the Pacific Ocean is always in the same cluster. This region extends approximately from (-170, -60) (longitude, latitude) to (-90, 0). So we extract the data in this region to build a subset. The next step is to aggregate the data for the same month together. That is, we extract all January data for the years 1983-1992 and put them into one dataset. The reason for this is that data from the same month tend to have similar models (climate behavior is periodic over time). This is a sort of clustering in the temporal domain.
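The spatial segmentation can be sketched as follows; the placeholder data and the scikit-learn call are ours, used only to illustrate the step:

    import numpy as np
    from sklearn.cluster import KMeans

    # Each row would be the (standardized) feature vector of one (longitude, latitude)
    # grid point; here random placeholder data of the DAO grid size is used.
    rng = np.random.default_rng(0)
    grid_features = rng.normal(size=(180 * 91, 22))
    labels = KMeans(n_clusters=5, random_state=0, n_init=10).fit_predict(grid_features)
    # Grid points sharing a label form a relatively homogeneous region, such as the
    # Pacific block from (-170, -60) to (-90, 0) reported in the text.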


Z-score

The Z-score is a standard technique in statistics to transform a random variable into one with zero mean and unit variance, i.e.,

    x_z = (x − μ) / σ,    (6.1)

where x is the random variable, μ and σ are its mean and standard deviation, and x_z is the Z-score.

Quantization

This step quantizes the continuous feature values into discrete values. We use the histogram to quantize the values. If the histogram of x_z has a Gaussian-like shape, we quantize it into 3 levels: {0-low, 1-average, 2-high}. If x_z resembles a uniform distribution or has two modes, it is quantized into 2 levels: {0-low, 1-high}. Note that we do not choose more than 3 quantization levels. The reasons are: (1) too many quantization levels lead to large CPTs, which makes the BN very complex and hard to learn; (2) many entries in the CPTs would be zero, because the dataset size is relatively small compared to the size of the CPTs. The quantization levels of the features used are [3, 3, 3, 2, 3, 2, 2, 3, 2, 3, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 3, 3]. Figure 6.2 shows the histograms of the raw, Z-score, and quantized values of f8 and f20.
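A minimal sketch of the Z-score and quantization steps for one feature column is given below; the cut points used for the 2- and 3-level quantization are our assumption, since the text specifies only the number of levels:

    import numpy as np

    def zscore(x):
        """Equation (6.1): transform to zero mean and unit variance."""
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / x.std()

    def quantize(xz, levels):
        """Histogram-style quantization into 2 or 3 levels (cut points assumed)."""
        edges = [-0.5, 0.5] if levels == 3 else [0.0]   # low/average/high or low/high
        return np.digitize(xz, edges)

    # Example for one feature column:
    raw = np.random.default_rng(0).normal(loc=100.0, scale=15.0, size=1000)
    discrete = quantize(zscore(raw), levels=3)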


Figure 6.2: Histogram of f8 and f20 (raw, Z-score, and quantized values)

After the above preprocessing steps, we have 12 sets of distributed data. Each dataset corresponds to a collection of monthly data for the years 1983-1992 in a rectangular region from (-170, -60) to (-90, 0). Each database has 22 features (15 of them from DAO and 7 from NOAA) and these features are discrete. All samples are complete, with no missing values.

6.1.3 Distributed BN Learning

We compare Bcoll with Bcntr to evaluate the performance of the collective method. We use the structure difference to describe the similarity between Bcoll and Bcntr. It is defined as the sum of the number of missing links (links in Bcntr but not in Bcoll) and extra links (links in Bcoll but not in Bcntr). The March dataset was used in this application. It has 9130 samples. The node ordering used was [10 11 8 7 6 3 1 9 14 2 13 5 12 15 16 18 17 20 19 21 22 4]. The centralized

BN structure is shown in Figure 6.3. Bcntr is very complicated: it has 64 local links and 9 cross links. The cross links are 2 → 16, 3 → 16, 3 → 17, 3 → 18, 7 → 16, 10 → 17, 10 → 18, 11 → 16, and 11 → 20. The cross nodes are {16, 17, 18, 20}. In the local learning step, there are no extra links. In the cross learning step, when we transmit 35% of the samples, we get 7 correct cross links and no extra cross links; links 2 → 16 and 3 → 18 are missing. If we transmit 66% of the samples, we get all the correct cross links and no extra cross links. The collective learning result is shown in Figure 6.4. The fact that there are 9 cross links and many local links makes the distributed learning a very hard problem. The performance of collective learning is fairly good, given the complexity. This experiment again demonstrates the effectiveness of the collective method.

Figure 6.4: NASA DAO/NOAA Structure Learning (structure error vs. |Dcoll|/|D|)
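The structure difference plotted in Figure 6.4 can be computed as in the following minimal sketch (edge sets are represented as sets of (parent, child) pairs; the function name is ours):

    def structure_difference(edges_cntr, edges_coll):
        missing = edges_cntr - edges_coll    # links in Bcntr but not in Bcoll
        extra = edges_coll - edges_cntr      # links in Bcoll but not in Bcntr
        return len(missing) + len(extra)

    # Example: if the cross links 2 -> 16 and 3 -> 18 are missing and nothing extra
    # was added, the structure difference is 2 (cf. the 35% transmission point).
    print(structure_difference({(2, 16), (3, 18), (7, 16)}, {(7, 16)}))   # -> 2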

6.2 Distributed Web Log Mining

Our distributed web log mining system consists of three parts: preprocessing, classification, and Bayesian network learning. The input to the system is raw web log files. We use the NCSA httpd web log format (No., IP Address, User Id, Time, Method, URL, Protocol, Status, Size). A typical record in a web log file is as follows: 1 202.149.81.50 - [15/Jan/2001:17:08:42 -0800] "GET / irl/acoustics/acoustics.html HTTP/1.0" 200 3148. The output of the learning system is a Bayesian network which can be used to optimize web server design or dynamic link generation.

Figure 6.3: Bcntr of March Dataset

6.2.1 Preprocessing

The preprocessing part transforms the raw log records into a session form. It cleans the raw log file in order to eliminate less important information. First, it removes the records for certain types of image or script files, since these files are in general downloaded along with the html file. Then it removes the extra parts of the record, such as the User Id part, and re-formats the record to get a compact form. For example, we transform the previous raw log record as: 202.149.81.50 15/Jan/2001:17:08:42 / irl/acoustic.html


200 3148. Which parts should be kept depends on the application. After the cleaning process, the size of the records decreases drastically. The next step of preprocessing is user identification. We can consider one IP as one user, although this can be complicated by the existence of proxy servers or firewalls. We assume that the same IP corresponds to the same user, since in general a user has a unique IP in a network. The last step in preprocessing is session identification. A session is one transaction of a user. If the time between two records of the same user is less than some limit, we consider them to be in the same session. In this way we combine several records in a log file into one session. Note that all the preprocessing is done at the local sites. The sessions in each site which belong to the same user and occur around the same time are parts of a global session. This global session has all the information about the user's transaction. For example, if a user loads a page from site 1 and this page has some images in site 2,

then in each site, only part of the session is recorded. But we can get information about the entire session using the method described above. In this application, the raw web log file was obtained from the web server of the School of EECS at Washington State University, http://www.eecs.wsu.edu.
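A minimal sketch of the cleaning and session-identification steps for log lines in the format described above is given below; the regular expression, the skipped file extensions, and the 30-minute session gap are our assumptions.

    import re
    from datetime import datetime, timedelta

    LINE = re.compile(r'\d+ (\S+) \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d+) (\S+)')
    SKIP = (".gif", ".jpg", ".png", ".css", ".js")
    GAP = timedelta(minutes=30)

    def sessions(lines):
        by_user = {}
        for line in lines:
            m = LINE.match(line)
            if not m:
                continue
            ip, ts, method, url, status, size = m.groups()
            if url.lower().endswith(SKIP):
                continue                          # cleaning: drop image/script requests
            t = datetime.strptime(ts.split()[0], "%d/%b/%Y:%H:%M:%S")
            user = by_user.setdefault(ip, [])     # one IP is treated as one user
            if not user or t - user[-1][-1][0] > GAP:
                user.append([])                   # time gap exceeded: start a new session
            user[-1].append((t, url))
        return by_user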

6.2.2 Transformation

After preprocessing, we need to transform the individual sessions into feature form. We classify each resource (or file) on the server into a unique feature. This is done so that the number of features we need to handle is a small and meaningful number. If any resource corresponding to a particular feature is accessed during a session, we set the corresponding feature entry to one in that session. Otherwise it is set to zero. Each resulting session in feature form forms a sample for our BN learning. Assignment of resources to specific features is a classification problem. There are two approaches for this. First, we can start with a predetermined set of features (depending on the application); as resources are added to the web server, each resource file is assigned to one feature. This can be implemented by giving each resource a "tag." The drawback of this approach is that the designer's classification can be greatly different from that of a user. The other way is to learn the category for each resource from the user log data. There are many clustering algorithms to do this.


In this application, we categorize the resources (html, video, audio, etc.) requested from the server into different categories. For our example, based on the different resources on the EECS web server, we considered eight categories: E-EE Faculty, C-CS Faculty, L-Lab and Facilities, T-Contact Information, A-Admission Information, U-Course Information, H-EECS Home, and R-Research. These categories are our features. Finally, each feature value in a session is set to one or zero, depending on whether the user requested resources corresponding to that category. An 8-feature, binary dataset was thus obtained, which was used to learn a BN. Figure 6.5 illustrates this process schematically.
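The transformation of a session into an 8-feature binary sample can be sketched as follows; the URL-to-category rules are hypothetical and stand in for the resource "tags" described above:

    CATEGORIES = ["E", "C", "L", "T", "A", "U", "H", "R"]

    def category_of(url):
        # Hypothetical rules; in the application each resource carries a tag.
        if url.startswith("/courses"):
            return "U"
        if url == "/" or url.startswith("/index"):
            return "H"
        if url.startswith("/research"):
            return "R"
        return None                      # resources outside the eight categories are ignored

    def session_to_features(urls):
        hits = {category_of(u) for u in urls}
        return [1 if c in hits else 0 for c in CATEGORIES]

    print(session_to_features(["/", "/courses/ee321.html"]))   # -> [0, 0, 0, 0, 0, 1, 1, 0]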

6.2.3 Bayesian Network Learning

A central BN was first obtained using the whole dataset. Figure 6.6 depicts the structure of this centralized BN. We then split the features into two sets, corresponding to a scenario where the resources are split between two different web servers. Site A has features E, C, T, and U, and site B has features L, A, H, and R. We assumed that the BN structure was known, and estimated the parameters of the BN using our collective BN learning approach. Figure 6.7 shows the KL distance between Bcntr and Bcoll as a function of the fraction of observations communicated. Clearly, the parameters of the collective BN are close to those of the central BN even with a small fraction of data communication.

Figure 6.5: Schematic Illustrating Preprocessing and Mining of Web Log Data (per-site cleaning, user identification, and session identification; samples in feature form; local BNs; collective BN)

Figure 6.6: Bayesian Network Structure learnt from Web Log Data

Figure 6.7: KL Distance between Joint Probabilities in Distributed Web Log Mining Application (KL distance vs. fraction of observations communicated)

Chapter 7

Conclusion and Future Work

In this dissertation, we explored issues that arise when the dataset is distributed among several sites. Distributed data mining is an important area within the field of KDD. For the heterogeneous distributed scenario, we need to develop distributed data mining techniques that save communication overhead, offer better scalability, and require minimal communication of possibly secure data. We focus on distributed Bayesian network learning in this dissertation. The reason we choose the Bayesian network as the base model is that it is a very powerful and promising model to represent the independence/dependence relations among the variables in the problem domain, and it supports decision-making under uncertainty. There are many successful applications of Bayesian networks in expert systems, realistic medical domains, text data mining, and so on. Our work is a combination of distributed data mining techniques and

Bayesian network learning techniques. There are two kinds of learning problems in distributed BN learning: parameter learning and structure learning. The objective of our work is to obtain a BN Bcoll which is close to Bcntr using just a small portion of the samples. We try to transmit as little data as possible to a central site. The key idea in the proposed algorithms is problem decomposition. The collective BN learning algorithms decompose the BN learning problem into sub-problems in which the number of variables is smaller than in the original problem. Our methods use an important property of BNs: decomposability. This property is crucial to the proposed learning algorithms. The proposed collective learning algorithms all have four steps: local learning, sample selection, cross learning, and combination. The key step is to identify the samples that are most likely to be the evidence of coupling between local and non-local variables. We prove that low-likelihood samples at local sites are most likely to be the evidence for cross terms. For distributed structure learning, we proposed a collective learning algorithm. We first classify the variables in the problem domain into local variables and cross variables. The edges in the distributed BN are categorized into cross links and local links. By choosing a centralized structure learning algorithm with the decomposability property to do the local structure learning and cross learning, we prove that the collective method can identify the correct local structure of the local variables and the strong local links of the cross variables. Extra local links may be introduced due to the “hidden variable” problem

in the local learning step. However, these extra local links are inside the cross set and their child nodes are cross nodes. Sample selection chooses the low-likelihood samples at all sites and transmits them to the central site. The cross learning step can detect the structure of the cross variables. In the combination step, we use the results of local learning and cross learning to aggregate all local and cross links and to remove the extra local links. For parameter learning, we proposed two distributed learning algorithms: CM1 and CM2. They obtain the correct parameters for local variables in local learning. Cross learning can learn the parameters of cross variables with a high degree of accuracy using a small portion of the samples. The combination step combines the parameters of the local variables and those of the cross variables to get a collective BN. We analyze the sample complexity of CM1 and conclude that: (1) local learning can achieve the same performance as the centralized method for local variables; (2) Bcoll is close to Bcntr with a small portion of the samples transmitted to a central site. Experiments verify that the collective BN Bcoll is very close to Bcntr with about 10% of the samples transmitted. There are two ways to reduce the data transmission overhead: removing features and selecting only part of the samples. CM1 uses the second method and CM2 uses both. Using the notion of a cross set, CM2 chooses a subset of features at a local site to do the likelihood computation and data selection. In general, the number of variables in this subset is much smaller than the number of variables at the local site, so the local computation overhead can be reduced drastically. By transmitting only the features in the cross set, CM2 can reduce the

data transmission overhead. Thus it can be used in applications with real-time constraints. We now discuss some directions for future work and some problems that need to be addressed.

• Structure Learning: Even when the data is centralized, learning the structure of a BN is considerably more involved than estimating the parameters or probabilities associated with the network. In a distributed data scenario, the problem of obtaining the correct network structure is even more pronounced. Although the proposed structure learning algorithm works well on sparsely connected distributed models, it may not work well in the densely connected case. In that case, the result of local learning is not good enough to correctly identify the samples corresponding to cross terms. In general, such dense connectivity means that the different sites are very strongly correlated, and the problem may not be an appropriate distributed learning problem.

• Performance Bounds: Our approach to “selective sampling” of data that may be evidence of cross terms is reasonable based on the discussion in Chapter 4. The sample complexity of the parameter learning algorithm is provided in Section 4.3.2. The sample complexity analysis of structure learning is available only for the MDL metric in the centralized case. There are no published works on the sample

complexity of the K2 metric. If such papers appear in the near future, we may use them to obtain a sample complexity analysis of our collective structure learning method.

• Assumptions about the Data: As mentioned earlier, we assume the existence of a key that links observations across sites. Moreover, we consider a simple heterogeneous partition of the data, where the variable sets at different sites are non-overlapping. We also assume that our data is stationary (all data points come from the same distribution) and free of outliers. These are simplifying assumptions made to derive a reasonable algorithm for distributed Bayesian learning. Suitable learning strategies that would allow us to relax some of these assumptions would be an important area of research.

• Distributed Algorithm for Centralized Dataset: For a centralized learning problem with a large number of domain variables, we may use the proposed distributed learning algorithms by adapting them to the centralized case. The reason to use the proposed collective method is scalability: the existing centralized learning algorithms may not scale well to a large number of domain variables. Distributed learning algorithms are a promising way to solve this problem.

We believe the algorithms developed in this dissertation are important to the

successful application of Bayesian networks in a distributed heterogeneous scenario.


Appendix A

Notation

• B = (G, θ) - a BN
• Bcntr - BN learnt by the centralized learning method
• Bcoll - BN learnt by the distributed collective learning method
• CPT - Conditional Probability Table
• X - a variable
• x - state of variable X
• Va(X) - support set of variable X
• X_i - node i


• X - a set of variables
• x - state of a set of variables
• pa(X) - parents of variable X
• de(X) - set of descendants of X
• nd(X) - set of non-descendants of X
• LV - local variable; its parents are all in the same site
• CV - cross variable; some parents of this variable are at other sites
• CS(site_i) - cross set of site i: the set of variables in this site that are a CV or the parent of a CV in this site
• LS(site_i) - local set of site i: the set of variables that are not in the cross set of site i
• θ_ijk = P(x_i = k | pa(x_i) = j) - the probability that variable x_i is in state k when pa(x_i) is in state j
• θ_ij = P(x_i | pa(x_i) = j) - the distribution of variable x_i when pa(x_i) is in state j
• θ_i = P(x_i | pa(x_i)) - the CPT of variable x_i
• N_ijk - the number of samples in which x_i = k and pa(x_i) = j

• N_ij - Σ_k N_ijk

• D - dataset


Bibliography

[AIS93]

R. Agrawal, T. Imielinski, and A. Swami. Mining associations between sets of items in massive databases. In Proc. of the ACM SIGMOD Int’l Conference on Management of Data, pages 207–216, 1993.

[AKPB97] J. Aronis, V. Kulluri, F. Provost, and B. Buchanan. The WoRLD: Knowledge discovery and multiple distributed databases. In Proceedings of the Florida Artificial Intelligence Research Symposium (FLAIRS-97), pages 11–14, 1997. Also available as Technical Report ISL-96-6, Intelligent Systems Laboratory, Department of Computer Science, University of Pittsburgh. [AS94]

R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th Int’l Conference on Very Large Databases, 1994.

[ATW91]

N. Abe, J. Takeuchi, and M. Warmuth. Polynomial learnability of probabilistic concepts with respect to the Kullback-Leibler divergence. In Proceedings of the 1991 Workshop on Computational Learning Theory, pages 277–289, 1991.

[BDB95]

R. Bethea, B. Duran, and T. Boullion. Statistical Methods for Engineers and Scientists. Marcel Dekker Inc, 1995.

[Ber02]

Pavel Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002. http://www.accrue.com/ products/rp_cluster_review.pdf.

[BHK98]

J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In G. F. Cooper and S. Moral, editors, Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, 1998.


[BKRK97] J. Binder, D. Koller, S. Russel, and K. Kanazawa. Adaptive probabilistic networks with hidden variables. Machine Learning, 29:213–244, 1997. [BKS97]

E. Bauer, D. Koller, and Y. Singer. Update rules for parameter estimation in Bayesian networks. In D. Geiger and P. Shanoy, editors, Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, pages 3–13. Morgan Kaufmann, 1997.

[Bou94]

R. R. Bouckaert. Properties of Bayesian network learning algorithms. In R. Lopez de Mantaras and D. Poole, editors, Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, pages 102–109. Morgan Kaufmann, 1994.

[BP96]

D. Billsus and M. Pazzani. Revising user profiles: The search for interesting web sites. In Proceedings of the Third International Workshop on Multistrategy Learning. AAAI Press, 1996.

[BP97]

D. Billsus and M. Pazzani. Learning probabilistic user models. In Workshop notes of Machine Learning for User Modeling — Sixth International Conference on User Modeling, Chia Laguna, Sardinia, 1997.

[BS97]

R. Bhatnagar and S. Srinivasan. Pattern discovery in distributed databases. In Proceedings of the AAAI-97 Conference, pages 503–508, Providence, July 1997. AAAI Press.

[BSCC89] I. Beinlich, H. Suermondt, R. Chavez, and G. Cooper. The alarm monitoring system: A case study with two probabilistic inference techniques for belief networks. In Proceedings of the Second European Conference on Artificial Intelligence in Medical Care, pages 247–256. Springer-Verlag, 1989. [Bun91]

W. Buntine. Theory refinement on Bayesian networks. In B. D. D’Ambrosio and P. Smets amd P. P. Bonissone, editors, Proceedings of the Seventh Annual Conference on Uncertainty in Artificial Intelligence, pages 52–60. Morgan Kaufmann, 1991.

[CBL97]

J. Cheng, D. A. Bell, and W. Liu. Learning belief networks from data: An information theory based approach. In Proceedings of the Sixth ACM International Conference on Information and Knowledge Management, 1997.


[CH92]

G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347, 1992.

[CH97]

D. M. Chickering and D. Heckerman. Efficient approximation for the marginal likelihood of incomplete data given a Bayesian network. Machine Learning, 29:181–212, 1997.

[Cha91]

E. Charniak. Bayesian networks without tears. AI Magazine, 12:50–63, 1991.

[Chi96]

D. M. Chickering. Learning equivalence classes of Bayesian network structure. In E. Horvitz and F. Jensen, editors, Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, 1996.

[Chi02]

D. M. Chickering. Learning equivalence classes of Bayesian-Network structures. Journal of Machine Learning Research, 2:445–498, 2002.

[CS96]

P. Cheeseman and J. Stutz. Bayesian classification (autoclass): Theory and results. In U. Fayyad, G. P. Shapiro, P. Smyth, and R. S. Uthurasamy, editors, Advances in Knowledge Discovery and Data Mining. AAAI Press, 1996.

[CS99]

V. Crestana and N. Soparkar. Mining decentralized data repositories. Technical Report CSE-TR-385-99, University of Michigan, Ann Arbor, MI, 1999.

[CS02]

R. Chen and K. Sivakumar. A new algorithm for learning parameters of a Bayesian network from distributed data. In Proceedings of the 2002 IEEE International Conference on Data Mining, 2002.

[CS03]

R. Chen and K. Sivakumar. Collective Bayesian network structure learning for distributed heterogeneous databases. In To be submitted, 2003.

[CSK01a]

R. Chen, K. Sivakumar, and H. Kargupta. An approach to online Bayesian learning from multiple data streams. In Proceedings of Workshop on Mobile and Distributed Data Mining, PKDD ’01, 2001.

[CSK01b]

R. Chen, K. Sivakumar, and H. Kargupta. Distributed web mining using Bayesian networks from multiple data streams. In Proceedings of the IEEE Conference on Data Mining, pages 75–82, 2001.


[CSK02]

R. Chen, K. Sivakumar, and H. Kargupta. Collective mining of Bayesian networks from distributed heterogeneous data. Knowledge and Information Systems, (Accepted) 2002.

[CSK03a]

R. Chen, K. Sivakumar, and H. Kargupta. Bayesian network learning for NASA DAO/NOAA Pathfinder databases. In To be submitted, 2003.

[CSK03b]

R. Chen, K. Sivakumar, and H. Kargupta. Learning Bayesian Network structure from distributed data. In Proceedings of 2003 SIAM Conference on Data Mining, 2003.

[Das97]

S. Dasgupta. The sample complexity of learning fixed-structure Bayesian networks. Machine Learning, 29:165–180, 1997.

[Dun03]

M. Dunham. Data Mining Introductory and Advanced Topics. Pearson Education Inc, 2003.

[ET95]

K. J. Ezawa and Schuermann T. Fraud/uncollectable debt detection using Bayesian network based learning system: A rare binary outcome with mixed data structures. In P. Besnard and S. Hanks, editors, Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 157–166. Morgan Kaufmann, 1995.

[FG97]

N. Friedman and M. Goldszmidt. Sequential update of Bayesian network structure. In D. Geiger and P. Shanoy, editors, Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, 1997.

[FGG97]

N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning, 29:131–163, 1997.

[Fri97]

N. Friedman. Learning Bayesian networks in the presence of missing values and hidden variables. In D. Fisher, editor, Proceedings of the Fourteenth International Conference on Machine Learning. Morgan Kaufmann, 1997.

[Fri98]

N. Friedman. The Bayesian structural EM algorithm. In G. F. Cooper and S. Moral, editors, Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, 1998.

[FY96]

N. Friedman and Z. Yakhini. On the sample complexity of learning Bayesian networks. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence, 1996.

[GRS96]

W. Gilks, S. Richardson, and D. Spiegelhalter. Markov chain Monte Carlo in practice. Chapman and Hall, 1996.

[Hec98]

D. Heckerman. A tutorial on learning with Bayesian networks. In M. I. Jordan, editor, Proceedings of the NATO Advanced Study Institute on Learning in Graphical Models. Kluwer Academic Publishers, 1998.

[HG95]

D. Heckerman and D. Gieger. Learning Bayesian networks: A unification for discrete and Gaussian domains. In P. Besnard and S. Hanks, editors, Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 274–284. Morgan Kaufmann, 1995.

[HGC95]

D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197–243, 1995.

[HK01]

D. Hershberger and H. Kargupta. Distributed multivariate regression using wavelet-based collective data mining. Journal of Parallel and Distributed Computing, 61:372–400, 2001.

[HMC97]

D. Heckerman, C. Meek, and G. Cooper. A Bayesian approach to causal discovery. Technical Report MSR-TR-97-05, Microsoft Research, 1997.

[HMS01]

D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. The MIT Press, 2001.

[Jen96]

F. Jensen. An Introduction to Bayesian Networks. Springer, 1996.

[JK99]

E. Johnson and H. Kargupta. Collective, hierarchical clustering from distributed, heterogeneous data. In Lecture Notes in Computer Science, volume 1759, pages 221–244. Springer-Verlag, 1999.

[Kac82]

Kachigan. Statistical Analysis. Radius Press, 1982.

[KBL+ 04] H. Kargupta, R. Bhargava, K. Liu, M. Powers, P. Blair, and M. Klein. Vedas: A mobile and distributed data stream mining system for real-time vehicle monitoring. In Proceedings of SIAM Data Mining, 2004. [Ken97]

Y. Kenji. Distributed cooperative Bayesian learning strategies. In Proceedings of the Tenth Annual Conference on Computational Learning Theory, pages 250–262, Nashville, Tennessee, 1997. ACM Press.


[KHK+ 00] H. Kargupta, W. Huang, S. Krishnamrthy, H. Park, and S. Wang. Collective principal component analysis from distributed, heterogeneous data. In D. Zighed, J. Komorowski, and J. Zytkow, editors, Proceedings of the Principles of Data Mining and Knowledge Discovery Conference, volume 1910, pages 452–457, Berlin, September 2000. Springer. Lecture Notes in Computer Science. [KHSJ00a] H. Kargupta, W. Huang, K. Sivakumar, and E. Johnson. Distributed clustering using collective principle component analysis. In Proceedings of the ACM SIGKDD Workshop on Distributed and Parallel Knowledge Discovery in Databases, pages 8–19, August 2000. [KHSJ00b] H. Kargupta, W. Huang, K. Sivakumar, and E. Johnson. Distributed clustering using collective principal component analysis. Knowledge and Information Systems Journal, 3(4), 2000. [KN96]

R. King and M. Novak. Supporting information infrastructure for distributed, heterogeneous knowledge discovery. In Proceedings of SIGMOD 96 Workshop on Research Issues on Data Mining and Knowledge Discovery, Montreal, Canada, 1996. http://www.cs.colorado. edu/~sanctuary/Papers/datamining.ps.

[KP02]

H. Kargupta and B Park. Mining time-critical data streams from mobile devices using decision trees and their fourier spectrum. IEEE Transaction on Knowledge and Data Engineering (in press), 3(4), 2002.

[KPHJ00] H. Kargupta, B. Park, D. Hershberger, and E. Johnson. Collective data mining: A new perspective toward distributed data mining. In H. Kargupta and P. Chan, editors, Advances in Distributed and Parallel Knowledge Discovery, pages 133–184. AAAI/ MIT Press, Menlo Park, California, USA, 2000. [KPJ+ 01]

H. Kargupta, B. Park, E. Johnson, R. E. Sanseverino, L. D. Silvestre, and D. Hershberger. Distributed, collaborative data analysis from heterogeneous sites using a scalable evolutionary technique. (Accepted for publication) Journal of Applied Intelligence, 2001.

[Lau95]

S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19:191–201, 1995.

[LB94]

W. Lam and F. Bacchus. Learning Bayesian belief networks: An approach based on the MDL principle. Computational Intelligence, 10:262–293, 1994.

[LKR03]

K. Liu, H. Kargupta, and J. Ryan. Random projection and privacy preserving data mining from distributed multi-party data. Technical report, School of Computer Science and Electrical Engineering, UMBC, 2003.

[LPO00]

A. Lazarevic, D. Pokrajac, and Z. Obradovic. Distributed clustering and local regression for knowledge discovery in multiple spatial databases. In Proceedings of the 8th European Symposium on Artificial Neural Networks, Bruges, Belgium, April 2000.

[LS88]

S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems (with discussion). Journal of the Royal Statistical Society, Series B, 50:157–224, 1988.

[LS97]

W. Lam and A. M. Segre. Distributed data mining of probabilistic knowledge. In Proceedings of the 17th International Conference on Distributed Computing Systems, pages 178–185, Washington, 1997. IEEE Computer Society Press.

[Mit97]

T. Mitchell. Machine Learning. McGraw-Hill, 1997.

[MJ98]

M. Meila and M. I. Jordan. Estimating dependency structure as a hidden variable. In Advances in Neural Information Processing Systems (NIPS), 1998.

[MR94]

D. Madigan and A. Raftery. Model selection and accounting for model uncertainty in graphical models using Occam’s window. Journal of the American Statistical Association, 89:1535–1546, 1994.

[MSG00]

S. McClean, B. Scotney, and K. Greer. Clustering heterogeneous distributed databases. In Workshop on Distributed and Parallel Knowledge Discovery at KDD-2000, pages 20–29, Boston, 2000.

[PB95]

F. J. Provost and B. Buchanan. Inductive policy: The pragmatics of bias selection. Machine Learning, 20:35–61, 1995.

[PB97]

M. Pazzani and D. Billsus. Learning and revising user profiles: The identification of interesting web sites. Machine Learning, 27:313–331, 1997.

[Pea88]

J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.

[Pea93]

J. Pearl. Graphical models, causality and intervention. Statistical Science, 8:266–273, 1993.

[PMB96]

M. Pazzani, J. Muramatsu, and D. Billsus. Syskill & Webert: Identifying interesting web sites. In Proceedings of the National Conference on Artificial Intelligence, 1996.

[PO00]

S. Parthasarathy and M. Ogihara. Clustering distributed homogeneous datasets. In D. Zighed, J. Komorowski, and J. Zytkow, editors, Proceedings of the Principles of Data Mining and Knowledge Discovery Conference, pages 566–574, September 2000.

[PS00]

A. Prodromidis and S. Stolfo. Cost complexity-based pruning of ensemble classifiers. In Workshop on Distributed and Parallel Knowledge Discovery at KDD-2000, pages 30–40, Boston, 2000.

[PV91]

J. Pearl and T. Verma. A theory of inferred causation. In Proceedings of the Second International Conference on Principles of Knowledge Representation and Reasoning (KR'91), pages 441–452, 1991.

[SGS93]

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. Number 81 in Lecture Notes in Statistics. Springer-Verlag, 1993.

[Sin97]

M. Singh. Learning Bayesian networks from incomplete data. In Proceedings of the National Conference on Artificial Intelligence, pages 27–31. AAAI Press, 1997.

[SL90]

D. J. Spiegelhalter and S. L. Lauritzen. Sequential updating of conditional probabilities on directed graphical structures. Networks, 20:570–605, 1990.

[SP95]

M. Singh and G. M. Provan. A comparison of induction algorithms for selective and non-selective Bayesian classifiers. In A. Prieditis and S. Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 497–505. Morgan Kaufmann, 1995.

[SS00]

M. Sayal and P. Scheuermann. A distributed clustering algorithm for web-based access patterns. In Workshop on Distributed and Parallel Knowledge Discovery at KDD-2000, pages 41–48, Boston, 2000.

[Suz93]

J. Suzuki. A construction of Bayesian networks from databases based on an MDL scheme. In D. Heckerman and A. Mamdani, editors, Proceedings of the Ninth Conference on Uncertainty in Artificial Intelligence, pages 266–273. Morgan Kaufmann, 1993.

[TG00a]

K. Tumer and J. Ghosh. Robust order statistics based ensemble for distributed data mining. In Advances in Distributed and Parallel Knowledge Discovery, pages 185–210. AAAI/MIT Press, 2000.

[TG00b]

A. Turinsky and R. Grossman. A framework for finding distributed data mining strategies that are intermediate between centralized strategies and in-place strategies. In Workshop on Distributed and Parallel Knowledge Discovery at KDD-2000, pages 1–7, Boston, 2000.

[Thi95]

B. Thiesson. Accelerated quantification of Bayesian networks with incomplete data. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, pages 306–311. AAAI Press, 1995.

[TMCH98]

B. Thiesson, C. Meek, D. M. Chickering, and D. Heckerman. Learning mixtures of Bayesian networks. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, 1998.

[TSG92]

A. Thomas, D. Spiegelhalter, and W. Gilks. BUGS: A program to perform Bayesian inference using Gibbs sampling. In J. Bernardo, J. Berger, A. Dawid, and A. Smith, editors, Bayesian Statistics, pages 837–842. Oxford University Press, 1992.

[VP90]

T. Verma and J. Pearl. Equivalence and synthesis of causal models. In Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence, pages 220–227, 1990.

[Web]

Timo Koskela's decision tree page. http://www.hut.fi/~timoko/treeprogs.html/.

[ZR98]

G. Zweig and S. J. Russell. Speech recognition with dynamic Bayesian networks. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, 1998.
