Comparison of Three Parallel Implementations of an Induction Algorithm

Tohgoroh MATSUI, Nobuhiro INUZUKA, Hirohisa SEKI, and Hidenori ITOH

{tohgoroh, inuzuka, seki, itoh} [email protected]
Department of Intelligence and Computer Science, Nagoya Institute of Technology, Japan

Abstract

Recently, researchers have tried to apply ILP to KDD because ILP enlarges the applicability of Machine Learning to cover KDD and Data Mining: it enables them to learn from multiple relational tables. Many scientific discovery systems are motivated by the desire to deal with larger databases; however, the larger the databases, the more computational power we need. Parallel computing is a possible solution to this problem. This research aims to implement QUINLAN's Foil in parallel. Foil finds definitions of relations using other relations in background knowledge with a top-down approach. There are two approaches to designing parallel algorithms for inductive learning: the search space parallel approach and the data parallel approach. In ILP, data sets consist of training sets and background knowledge. We therefore examine three approaches, which partition the search space, the training set, and the background knowledge, respectively. We ran experiments on a FUJITSU AP3000 to compare these approaches. We present the experimental results and discuss which is the most efficient approach to parallelizing ILP systems.

1 Introduction

Much research in inductive learning concentrates on Inductive Logic Programming (ILP), which studies inductive learning within the framework of logic programming. ILP constructs logic programs that express target relations from their positive and negative examples together with background knowledge. ILP enlarges the applicability of Machine Learning to cover Knowledge Discovery in Databases (KDD) and Data Mining: it enables them to learn from multiple relational tables. Recently, researchers have tried to apply ILP to KDD (e.g. [1] [2] [3]). However, much of the research in inductive learning deals only with problems involving relatively small amounts of data. Many scientific discovery systems are motivated by the desire to deal with larger databases; however, the larger the databases, the more computational power we need. Parallel computing is a possible solution to this problem.

A parallel implementation of an ILP algorithm has been proposed recently. FURUKAWA's group has tried to implement MUGGLETON's Progol [4] in parallel using the bottom-up inference engine MGTP [5]. Since MGTP has already been implemented in parallel, the proposed parallel Progol should be realizable; its performance has not been reported yet, however.

This research aims to implement QUINLAN's Foil [6] in parallel. Foil and Progol are known as two of the most successful ILP systems. Foil finds definitions of relations using other relations in background knowledge with a top-down approach, and it uses efficient methods adapted from attribute-value learning systems. We introduce a general approach to parallelizing Data Mining algorithms and apply it to Foil.

Parallel approaches to scaling up AI systems have been pursued before. Parallel approaches to classification-rule learning or concept learning include partitioning the data set among processors and partitioning the search space among the available processors (e.g. [7] [8] [9] [10]). Search space parallel approaches have also been applied to an Abductive Reasoning system [11]. We also study these approaches to parallelize ILP.


Initialization:
  definition := null program
  remaining := all tuples belonging to target relation R

While remaining is not empty
  /* Grow a new clause */
  clause := "R(A, B, ...) :-"
  specialize_clause(clause, remaining)
  Remove from remaining the tuples of R covered by clause
  Add clause to definition

Figure 1: An outline of the Foil algorithm

specialize_clause(clause, remaining)
  T := remaining
  While clause covers tuples known not to belong to R
    /* Find appropriate literal(s) */
    Construct the candidate set ℒ for appropriate literal(s) L
    For each candidate
      Extend T to T′
      Evaluate T′
    Choose appropriate literal(s) L
    Add L to the body of clause
    T := T′

Figure 2: An outline of the specialize_clause procedure

2 Overview of FOIL

Figures 1 and 2 show outlines of the Foil algorithm and of its specialize_clause procedure, respectively. Foil requires positive and negative examples, labelled ⊕ and ⊖ respectively, of the form R(c1, c2, ..., cn), where the ci are constants, together with tuples of the same form belonging to the relations in the background knowledge. Foil starts with the left-hand side of a clause and specializes it by adding literals to the right-hand side, stopping when no ⊖ tuples are covered by the clause or when the encoding-length heuristic indicates that the clause is too complex. As new variables are introduced by the added literals, the size of the tuples in the training set increases, so that each tuple represents a possible binding for all the variables that appear in the partially developed clause.

According to [12], choosing gainful literals can be summarized as follows. Consider the partially developed clause

  R(V1, V2, ..., Vk) :- L1, L2, ..., L_{m-1}

containing variables V1, V2, ..., Vx. Each tuple in the training set T looks like ⟨c1, c2, ..., cx⟩ for some constants {cj}, and represents a ground instance of the variables in the clause. Now consider what happens when a literal Lm of the form

  P(V_{i1}, V_{i2}, ..., V_{ip})

is added to the right-hand side. If the literal contains one or more new variables, the arity of the new training set increases; let x′ denote the number of variables in the new clause. Then each tuple in the new training set T′ will be of the form ⟨d1, d2, ..., d_{x′}⟩ for constants {dj}, and will have the following properties:

• ⟨d1, d2, ..., dx⟩ is a tuple in T, and
• ⟨d_{i1}, d_{i2}, ..., d_{ip}⟩ is in the relation P.

That is, each tuple in T′ is an extension of one of the tuples in T, and the ground instance that it represents satisfies the literal. Every tuple in T thus gives rise to zero or more tuples in T′, with the ⊕ or ⊖ label of a tuple in T′ being copied from its ancestor tuple in T. Let T+ denote the number of ⊕ tuples in T, and T′+ the number in T′. The effect of adding a literal Lm can be assessed from an information perspective as follows. The information conveyed by the knowledge that a tuple in T has label ⊕ is given by

  I(T) = -log2( T+ / |T| )

and similarly for I(T′). If I(T′) is less than I(T), we have "gained" information by adding the literal Lm to the clause; if s of the ⊕ tuples in T have extensions in T′, the total information gained about the ⊕ tuples in T is

  gain(Lm) = s × ( I(T) - I(T′) ).

Foil explores the space of possible literals that might be added to a clause at each step, looking for the one with the greatest positive gain.
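
To make the extension and gain computation above concrete, here is a minimal Java sketch. The paper does not show the authors' own code, so the class and method names, the tuple representation, and the argument-position encoding below are assumptions made purely for illustration; it simply extends a labelled training set against the tuples of a candidate literal's relation P and computes gain(Lm) as defined above.

import java.util.ArrayList;
import java.util.List;

public class FoilGainSketch {

    /** A training-set tuple: a binding for the clause variables plus its label (true = ⊕, false = ⊖). */
    static class LabeledTuple {
        final List<String> binding;
        final boolean positive;
        LabeledTuple(List<String> binding, boolean positive) {
            this.binding = binding;
            this.positive = positive;
        }
    }

    /** I(T) = -log2(T+ / |T|); assumes the set contains at least one ⊕ tuple. */
    static double info(List<LabeledTuple> t) {
        long pos = t.stream().filter(x -> x.positive).count();
        return -Math.log((double) pos / t.size()) / Math.log(2);
    }

    /**
     * Extensions of a single tuple against the relation P of a literal P(V_i1, ..., V_ip).
     * argPos[k] is the position of the k-th argument in the current binding, or -1 for a new variable.
     */
    static List<LabeledTuple> extendOne(LabeledTuple tuple, List<List<String>> p, int[] argPos) {
        List<LabeledTuple> out = new ArrayList<>();
        for (List<String> pTuple : p) {
            boolean consistent = true;
            List<String> extended = new ArrayList<>(tuple.binding);
            for (int k = 0; k < argPos.length && consistent; k++) {
                if (argPos[k] >= 0) {
                    consistent = tuple.binding.get(argPos[k]).equals(pTuple.get(k));
                } else {
                    extended.add(pTuple.get(k));   // a new variable is bound to the constant from P
                }
            }
            if (consistent) {
                out.add(new LabeledTuple(extended, tuple.positive));  // label copied from the ancestor
            }
        }
        return out;
    }

    /** gain(Lm) = s × (I(T) - I(T')), where s counts the ⊕ tuples of T that have an extension in T'. */
    static double gain(List<LabeledTuple> t, List<List<String>> p, int[] argPos) {
        List<LabeledTuple> tPrime = new ArrayList<>();
        long s = 0;
        for (LabeledTuple tuple : t) {
            List<LabeledTuple> extensions = extendOne(tuple, p, argPos);
            if (!extensions.isEmpty() && tuple.positive) {
                s++;
            }
            tPrime.addAll(extensions);
        }
        if (s == 0 || tPrime.isEmpty()) {
            return 0.0;   // no ⊕ tuple survives the extension, so nothing is gained
        }
        return s * (info(t) - info(tPrime));
    }
}

Calling gain(...) for every candidate literal and keeping the maximum reproduces the selection step shown in Figure 2.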

3 Parallel implementations

In Data Mining algorithms, including ILP, the knowledge discovered in each step depends heavily on what has been discovered in previous steps. In contrast to a problem like matrix multiplication (in its standard form), it is impossible to partition the task without increasing the synchronization and communication between the parallel processors, and synchronization and communication usually cause poor performance. We therefore design the parallel algorithms carefully so that tasks are divided into independent sub-tasks.

The two main approaches to designing parallel algorithms for inductive learning are the search space parallel approach and the data parallel approach. In the search space parallel approach, tasks for different candidates of the target relation are assigned to different processors, while in the data parallel approach each processor deals with tasks concerning a different part of the data set. In ILP a data set consists of a training set and background knowledge. We therefore partition the search space, the training set, and the background knowledge, as follows. In this paper we concentrate on distributed memory architectures because of their improved scalability compared to shared memory architectures.

3.1 Preparation

Before we introduce the three approaches, we sort out the step that evaluates literals in the Foil algorithm, since it is the part common to all three approaches. Consider a literal Lm of the form

  P(V_{i1}, V_{i2}, ..., V_{ip})

to be added to the right-hand side of a partially developed clause C of the form

  R(V1, V2, ..., Vk) :- L1, L2, ..., L_{m-1}.

Here T, U, and ℒm denote the training set, the set of tuples belonging to the relation P, and the set of candidates for Lm, respectively. We can prepare U from the background knowledge using Lm, and compute a new training set T′ from T with U. Now write ℒm as {Lm1, Lm2, ..., Lml}. Let u_j denote the number of tuples belonging to the relation of the literal Lmj, and t the number of tuples in T. Let init, ext, and e_j be the time costs to construct ℒm, to extend a tuple, and to evaluate Lmj, respectively. Then Foil takes in total

  init + Σ_{j=1}^{l} [ t × u_j × ext + e_j ]

to choose appropriate literal(s) Lm from the l candidates serially.

We assume one computer acting as the boss and n other computers acting as followers; the boss lets the followers compute the divided tasks.
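
The implementations are written in Java with TCP/IP communication between the boss and the followers (Section 4), but the wire protocol itself is not described in the paper. The following is therefore a hypothetical, minimal sketch of one boss-follower round under that assumption: the boss sends a task description to each follower in turn and collects one reply from each. The class names, the port number, and the line-based protocol are illustrative choices, not the authors' design.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical boss side of one round: send a task to every follower, collect one reply each. */
public class BossRound {

    static List<String> broadcastAndCollect(List<String> followerHosts, int port, String task)
            throws Exception {
        List<String> replies = new ArrayList<>();
        for (String host : followerHosts) {           // followers are contacted one after another
            try (Socket socket = new Socket(host, port);
                 PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(socket.getInputStream()))) {
                out.println(task);                    // e.g. a serialized clause and training set
                replies.add(in.readLine());           // e.g. the follower's best literal and its gain
            }
        }
        return replies;
    }
}

/** Hypothetical follower: wait for a task, compute the assigned share, reply. */
class FollowerStub {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(9000)) {        // port chosen arbitrarily
            while (true) {
                try (Socket boss = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(boss.getInputStream()));
                     PrintWriter out = new PrintWriter(boss.getOutputStream(), true)) {
                    String task = in.readLine();
                    out.println("result for: " + task);             // placeholder for the real work
                }
            }
        }
    }
}

Contacting the followers one after another, as in this sketch, roughly corresponds to the per-follower communication terms (i × c and n × c) in the cost estimates of the following subsections.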

3.2 Search space parallel approach (SSP-Foil)

The main idea behind this approach is to divide Foil's search space among the processors. Each candidate for Lm is independent of the others, so we can evaluate a candidate without reference to any other candidate. We divide ℒm into sub candidate sets {ℒm^1, ℒm^2, ..., ℒm^n} having the following properties:

• ℒm^1 ∪ ℒm^2 ∪ ... ∪ ℒm^n = ℒm,
• ℒm^i ∩ ℒm^j = ∅ (1 ≤ i, j ≤ n, i ≠ j), where ∅ denotes the empty set, and
• the sizes of the sub candidate sets are all as equal as possible.

[Figure 3: A time chart of SSP-Foil, showing the communication between the boss and followers 1 to n and the phases of constructing candidates, extending the training set, and evaluating the l candidates.]

specialize_clause(clause, remaining)
  T := remaining
  While clause covers tuples known not to belong to R
    /* Find appropriate literal(s) */
    Send clause and T to followers
    Construct candidates ℒ for L
    Take ℒ^i from ℒ
    For each candidate in ℒ^i
      Extend T to T′
      Evaluate T′
    Choose appropriate literal(s) L^i from ℒ^i
    Send L^i and T′ to boss
    Choose appropriate literal(s) L
    Add L to the body of clause
    T := T′

Figure 4: An outline of the specialize_clause procedure in the SSP-Foil algorithm

The follower numbered i chooses the most gainful literal(s) in ℒm^i and reports them to the boss. The boss collects all the reported literals and compares them to choose the most gainful literal(s). Figure 3 shows a time chart of choosing appropriate literal(s). Let c and c′ denote the time cost of communicating from the boss to a follower and in the opposite direction, respectively. The cost of choosing appropriate literal(s) is then

  max_{1 ≤ i ≤ n} ( i × c + init + Σ_{j : Lmj ∈ ℒm^i} [ t × u_j × ext + e_j ] + c′ ).    (1)

Figure 4 shows an outline of the specialize_clause procedure in SSP-Foil. The follower numbered i carries out the underlined part, and the boss does the rest.
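
As an illustration of this division (a sketch only: the names are assumptions, and the round-robin assignment is simply one easy way to keep the sub-set sizes within one of each other), the helpers below split the candidate list among n followers and let the boss keep the best of the reported literals.

import java.util.ArrayList;
import java.util.List;

/** Illustrative helpers for the search space parallel approach (names are assumptions). */
public class SspSketch {

    /** Deal the candidates out round-robin so the n sub-sets differ in size by at most one. */
    static <T> List<List<T>> splitRoundRobin(List<T> candidates, int n) {
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            parts.add(new ArrayList<>());
        }
        for (int j = 0; j < candidates.size(); j++) {
            parts.get(j % n).add(candidates.get(j));
        }
        return parts;
    }

    /** A follower's report: its locally most gainful literal together with that literal's gain. */
    static class Report {
        final String literal;
        final double gain;
        Report(String literal, double gain) {
            this.literal = literal;
            this.gain = gain;
        }
    }

    /** Boss side: compare the followers' reports and keep the most gainful literal overall. */
    static Report chooseBest(List<Report> reports) {
        Report best = null;
        for (Report r : reports) {
            if (best == null || r.gain > best.gain) {
                best = r;
            }
        }
        return best;
    }
}

Here chooseBest corresponds to the final "Choose appropriate literal(s) L" step carried out by the boss in Figure 4.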

3.3 Training set parallel approach (TSP-Foil)

The first data partitioning approach divides the training set among the processors. Each tuple in T is independent of the other tuples in T, so we can extend a tuple in T without reference to any other tuple in T. We divide the training set T into sub training sets {T^1, T^2, ..., T^n} having the following properties:

• T^1 ∪ T^2 ∪ ... ∪ T^n = T,
• T^i ∩ T^j = ∅ (1 ≤ i, j ≤ n, i ≠ j), and
• the sizes of the sub training sets are all as equal as possible.

[Figure 5: A time chart of TSP-Foil, showing the communication between the boss and followers 1 to n and the phases of constructing candidates, extending the training set, and evaluating the l candidates.]

specialize_clause(clause, remaining)
  T := remaining
  While clause covers tuples known not to belong to R
    /* Find appropriate literal(s) */
    Send clause and T to followers
    Construct candidates for L
    Take T^i from T
    For each candidate
      Extend T^i to T^i′
      Send T^i′ to boss
    For each candidate
      Attach all T^i′'s to T′
      Evaluate T′
    Choose appropriate literal(s) L
    Add L to the body of clause
    T := T′

Figure 6: An outline of the specialize_clause procedure in the TSP-Foil algorithm

The follower numbered i computes T^i′ from T^i with U, and reports the result to the boss. The boss collects all these results and attaches them to get T′. Then it evaluates T′ and chooses the most gainful literal(s). Figure 5 shows a time chart of choosing appropriate literal(s). Since each follower holds at most ⌈t/n⌉ tuples in T^i, the cost of choosing appropriate literal(s) is

  n × c + init + Σ_{j=1}^{l} [ ⌈t/n⌉ × u_j × ext + c′ ] + e_l,    (2)

where ⌈x⌉ denotes the ceiling of x. Figure 6 shows an outline of the specialize_clause procedure in TSP-Foil. The follower numbered i carries out the underlined part, and the boss does the rest.
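
To make the TSP data flow concrete, here is a hedged sketch in which local threads stand in for the followers (the real system communicates over TCP instead, and all names here are illustrative): the training set is cut into n chunks of at most ⌈t/n⌉ tuples, each worker extends its chunk independently, and the boss attaches the partial results to form T′.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative sketch of the training set parallel data flow, with threads standing in for followers. */
public class TspSketch {

    /** Cut the training set into n contiguous chunks of at most ceil(t/n) tuples each. */
    static <T> List<List<T>> splitTrainingSet(List<T> t, int n) {
        int chunk = (t.size() + n - 1) / n;                  // ceil(t / n)
        List<List<T>> parts = new ArrayList<>();
        for (int start = 0; start < t.size(); start += chunk) {
            parts.add(t.subList(start, Math.min(start + chunk, t.size())));
        }
        return parts;
    }

    /** Each "follower" extends its chunk independently; the "boss" attaches the results to form T'. */
    static List<String> extendInParallel(List<List<String>> chunks) throws Exception {
        ExecutorService followers = Executors.newFixedThreadPool(chunks.size());
        List<Future<List<String>>> partial = new ArrayList<>();
        for (List<String> chunk : chunks) {
            partial.add(followers.submit((Callable<List<String>>) () -> extendChunk(chunk)));
        }
        List<String> tPrime = new ArrayList<>();
        for (Future<List<String>> f : partial) {
            tPrime.addAll(f.get());                          // boss attaches all T^i' in order
        }
        followers.shutdown();
        return tPrime;
    }

    /** Placeholder for the real extension of a sub training set against the relation U. */
    static List<String> extendChunk(List<String> chunk) {
        return new ArrayList<>(chunk);                       // the real code would join chunk with U
    }
}

The splitting and attaching steps correspond directly to "Take T^i from T" and "Attach all T^i′'s to T′" in Figure 6.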

3.4 Background knowledge parallel approach (BKP-Foil)

The other data partitioning approach divides the tuples belonging to the relations in the background knowledge. Each tuple in U is independent of the other tuples in U, so we can extend a tuple in T without reference to any other tuple in U. For each relation in the background knowledge, we divide U into sub tuple sets {U^1, U^2, ..., U^n} having the following properties:

• U^1 ∪ U^2 ∪ ... ∪ U^n = U,
• U^i ∩ U^j = ∅ (1 ≤ i, j ≤ n, i ≠ j), and
• the sizes of the sub tuple sets are all as equal as possible.

Initialization:
  definition := null program
  remaining := all tuples belonging to target relation R
  For each relation in background knowledge
    Take U^i from all tuples belonging to the relation

Figure 7: An outline of the initialization in the BKP-Foil algorithm

specialize_clause(clause, remaining)
  T := remaining
  While clause covers tuples known not to belong to R
    /* Find appropriate literal(s) */
    Send clause and T to followers
    Construct candidates for L
    For each candidate
      Extend T to T′^i with U^i
      Send T′^i to boss
    For each candidate
      Attach all T′^i's to T′
      Evaluate T′
    Choose appropriate literal(s) L
    Add L to the body of clause
    T := T′

Figure 8: An outline of the specialize_clause procedure in the BKP-Foil algorithm

The follower numbered i computes T′^i from T with U^i, and reports the result to the boss. The boss collects all these results and attaches them to get T′. Then it evaluates T′ and chooses the most gainful literal(s). The time chart is very similar to that of TSP-Foil shown in Figure 5. Since each follower holds at most ⌈u_j/n⌉ tuples of the relation for each candidate Lmj, the cost of choosing appropriate literal(s) is

  n × c + init + Σ_{j=1}^{l} [ t × ⌈u_j/n⌉ × ext + c′ ] + e_l.    (3)

Figures 7 and 8 show outlines of the initialization and of the specialize_clause procedure in BKP-Foil, respectively. The follower numbered i carries out the underlined part, and the boss does the rest.
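
As a purely illustrative aid, the sketch below evaluates simplified versions of the serial cost and of Equations (1)-(3) for given parameter values. The simplifications are assumptions made for this example only: a single representative evaluation cost e, and a round-robin assignment of candidates to followers for Equation (1). It is not part of the described system, but it makes it easy to see how the three approaches react when, for example, one background relation is much larger than the rest.

/** Rough evaluation of the cost estimates (1)-(3); all parameter values are hypothetical. */
public class CostModelSketch {

    /** Serial cost: init + sum over candidates of (t * u_j * ext + e). */
    static double serial(double init, double t, double[] u, double ext, double e) {
        double total = init;
        for (double uj : u) total += t * uj * ext + e;
        return total;
    }

    /** Equation (1): the slowest follower dominates; candidates are dealt out round-robin here. */
    static double ssp(double init, double t, double[] u, double ext, double e,
                      int n, double c, double cBack) {
        double worst = 0;
        for (int i = 0; i < n; i++) {
            double cost = (i + 1) * c + init + cBack;
            for (int j = i; j < u.length; j += n) cost += t * u[j] * ext + e;
            worst = Math.max(worst, cost);
        }
        return worst;
    }

    /** Equation (2): each follower extends at most ceil(t/n) tuples per candidate. */
    static double tsp(double init, double t, double[] u, double ext, double e,
                      int n, double c, double cBack) {
        double total = n * c + init + e;
        for (double uj : u) total += Math.ceil(t / n) * uj * ext + cBack;
        return total;
    }

    /** Equation (3): each follower holds at most ceil(u_j/n) tuples of each background relation. */
    static double bkp(double init, double t, double[] u, double ext, double e,
                      int n, double c, double cBack) {
        double total = n * c + init + e;
        for (double uj : u) total += t * Math.ceil(uj / n) * ext + cBack;
        return total;
    }
}

For instance, if one u_j dominates all the others, the term t × u_j × ext appears undivided in the SSP estimate, while the TSP and BKP estimates divide either t or u_j by n; this is exactly the imbalance discussed in Section 5.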

4 Experiments

To compare the three approaches, we implemented the three different algorithms in Java with TCP/IP communication. We experimented on a FUJITSU AP3000 with one processor as the boss and 1, 2, 3, 4, 5, 10, and 15 processor(s) as follower(s).

In a paper introducing his Induce system, MICHALSKI describes an artificial task of learning to predict whether a train is headed east or west [13]. Trains have different numbers of cars, and some cars carry more than one load. This time there is a plethora of relations, shown in Table 1, and there are 216 kinds of cars. If a train that has a short and closed car is eastbound, the eastbound relation can be defined as:

  eastbound(A) :- has_car(A, B), short(B), closed_top(B).

  eastbound(T)      train T is eastbound
  has_car(T, C)     car C is a car of train T
  short(C)          car C is short
  open_rect(C)      car C is shaped as an open rectangle
  ...               similar relations for five other shapes
  jagged_top(C)     car C has a jagged top
  ...               similar relations for four other tops
  open_top(C)       car C is open
  closed_top(C)     car C is closed
  1_item(C)         car C has one load item
  ...               similar relations for two or three load items
  2_wheels(C)       car C has two wheels
  3_wheels(C)       car C has three wheels
  double(C)         car C is double-sided

Table 1: A plethora of relations in the east-west challenge

We experimented with this relation as the target relation. We prepared 2000 trains, half of them as positive examples and the other half as negative examples. Figure 9 shows the runtime and speedup obtained for the different partitioning approaches.

[Figure 9: Runtime (in seconds) and speedup against the number of processors (1 to 15) for TSP-FOIL, BKP-FOIL, and SSP-FOIL; the speedup plot also shows the linear expectation.]

5 Discussion

In this problem, if some relations in the background knowledge have many more tuples than the others, the cost of extending the training set for those relations is much larger than for the others. SSP-Foil is therefore worse than the other approaches, because its total time cost is the maximum cost over all the followers, as expressed in Equation (1). The difference between TSP-Foil and BKP-Foil is very small, as Equations (2) and (3) also suggest, and as a result their performances do not differ much.

If we use only one processor as a follower, SSP-Foil is more than 40% better than the other approaches, which indicates that the cost of communication is very large; we estimate that communication accounts for more than 50% of the total time. This comes from the need for our systems to communicate the training set: the larger the training set, the larger the communication. In every refinement step the boss must send the training set to the followers and receive the extended tuples back, so the volume of communicated data does not shrink as the number of processors grows, while the computation assigned to each follower does. This is the reason why our systems cannot benefit from using more than 5 processors.

On the other hand, Foil and our systems need all the positive tuples belonging to the relations in the background knowledge. If the relations are given as a logic program and evaluated with a so-called intensional evaluation, we cannot divide the background knowledge efficiently. We therefore expect TSP-Foil to be generally better than BKP-Foil.

6 Conclusion

In this paper, we have described three different parallel implementations of an ILP system based on QUINLAN's Foil. We have proposed partitioning the search space, the training set, and the background knowledge at the inner loop of Foil. We have implemented these methods on a boss and followers and investigated their performance. In summary, for parallelizing an ILP system:




• The search space parallel approach is inferior to the two data parallel approaches, because the sizes of the divided tasks may not all be the same.

• The training set parallel approach is superior to the background knowledge parallel approach, because the background knowledge may not be given in an efficiently dividable form, as described above.

However, the experimental results highlighted the problems posed by the amount of communication. Part of our future work is to reduce the amount of communication.

References

[1] POMPE, KONONENKO and MAKSE. An application of ILP in a musical database: learning to compose the two-voice counterpoint. MLnet Sponsored Familiarization Workshop: Data Mining with Inductive Logic Programming (1996).

[2] DZEROSKI, et al. Applying ILP to Diterpene Structure Elucidation from 13C NMR Spectra. MLnet Sponsored Familiarization Workshop: Data Mining with Inductive Logic Programming (1996).

[3] LORENZO. Application of Clausal Discovery to Temporal Databases. MLnet Sponsored Familiarization Workshop: Data Mining with Inductive Logic Programming (1996).

[4] MUGGLETON. Inverse Entailment and Progol. New Generation Computing, Vol. 13, pp. 245-286 (1995).

[5] FUJITA, et al. A New Design and Implementation of PROGOL by Bottom-up Computation. Proc. of the 6th International Workshop on ILP (1996).

[6] QUINLAN. Learning Logical Definitions from Relations. Machine Learning, 5, pp. 239-266 (1990).

[7] DARLINGTON, et al. Parallel Induction Algorithms for Data Mining. Proc. of the Second International Symposium on Intelligent Data Analysis (IDA-97).

[8] PROVOST and HENNESSY. Scaling Up: Distributed Machine Learning with Cooperation. Proc. of the 13th National Conference on Artificial Intelligence (AAAI'96).

[9] CHAN and STOLFO. Sharing Learned Models among Remote Database Partitions by Local Meta-learning. Proc. of the International Conference on Knowledge Discovery and Data Mining (1996).

[10] GALAL, COOK and HOLDER. Exploiting Parallelism in Knowledge Discovery Systems to Improve Scalability. Proc. of the 31st Hawaii International Conference on System Sciences (1998).

[11] KATO, SEKI and ITOH. Parallel Cost-based Abductive Reasoning for Distributed Memory Systems. PRICAI'96, LNAI-1114, pp. 300-311, Springer-Verlag (1996).

[12] QUINLAN and CAMERON-JONES. FOIL: A Midterm Report. Proc. of the European Conference on Machine Learning, pp. 3-20, Springer-Verlag (1993).

[13] MICHALSKI. Pattern Recognition as Rule-guided Inductive Inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2, pp. 349-361 (1980).
