DAOS — Scalable And-Or Parallelism

Luís Fernando Castro¹, Vítor Santos Costa², Cláudio F.R. Geyer¹, Fernando Silva², Patrícia Kayser Vargas¹, and Manuel E. Correia²
¹ Universidade Federal do Rio Grande do Sul, Porto Alegre - RS, Brasil
{lfcastro,geyer,kayser}@inf.ufrgs.br
² Universidade do Porto, Porto, Portugal
{vsc,fds,mcc}@ncc.up.pt
Abstract. This paper presents DAOS, a model for the exploitation of and- and or-parallelism in logic programs. DAOS assumes a physically distributed memory environment and a logically shared address space. Exploiting both major forms of implicit parallelism should serve the broadest range of applications. Moreover, a model that targets a distributed memory environment provides scalability and can be implemented over a computer network. However, distributed implementations of logic programs have to deal with communication overhead and with the inherent complexity of distributed memory management. DAOS overcomes these problems through a distributed shared memory layer that provides single-writer, multiple-reader sharing for the main execution stacks, combined with explicit message passing for work distribution and management.

Keywords: Parallel Logic Programming, And/Or Model, Scheduling, Distributed Shared Memory
1 Introduction
Logic programs are amenable to the exploitation of two major forms of implicit parallelism: or- and and-parallelism. Or-parallelism (ORP) aims at exploring different alternatives to a goal in parallel and arises naturally in search problems. And-parallelism (ANDP) consists in the parallel execution of two or more goals that cooperate in determining the solutions to a query. One important form of and-parallelism is independent and-parallelism (IAP), where parallel goals do not share variables. This form of parallelism arises in divide-and-conquer algorithms. Another type is dependent and-parallelism (DAP), which allows goals to share variables. This is common in consumer-producer applications. Parallel logic programming systems (PLPs) should support both ANDP and ORP in order to serve the broadest range of applications and to become popular. In practice, exploiting just one form of implicit parallelism already requires sophisticated system design, and exploiting two distinct forms of parallelism is even harder. Shared memory PLPs have been the most successful and widespread so far. There are several examples of systems supporting full Prolog. Aurora [8] and Muse [1] are well-known ORP systems, while &-Prolog, DASWAM [13], and &-ACE [9] are IAP systems that have been used to parallelise sizeable applications. Andorra-I [12] is a further example that supports both determinate
and-parallelism and or-parallelism. Several distributed memory PLPs have also been proposed; some support pure IAP [15], others pure ORP [2]. The differences between shared and distributed memory machines have become less significant in the last few years due to distributed shared memory systems (DSMs), which give a shared memory abstraction. Work on PLPs for hardware DSMs has given interesting results. The Dorpp ORP system [14] achieved good performance on a DSM machine. More recently, Santos Costa et al. [10] analysed the Andorra-I system on a DASH-like simulated machine. Their numbers confirm that most read cache misses result from scheduling and from accessing the code area. Misses to the execution stacks (mostly eviction or cold misses) varied from 8% on an ORP application to 30% on an ANDP application. Only the ORP application has significant sharing misses for an execution stack, the choice-point stack (60%), because this stack is also used for scheduling.

We argue that this analysis suggests a new approach to distributed PLPs. First, most large data structures in these systems are built in the execution stacks; we should take the best advantage of caches to reduce network traffic. In contrast, the previous studies show high rates of sharing misses in scheduler data-structures, which suggests using explicit messages for scheduling. DAOS (Distributed And-Or in Scalable systems) is the first PLP model for distributed systems that supports these innovations. Its main contributions are:

– a binding data representation that is both simple to implement on PLPs and naturally adapts to DSM techniques, allowing the use of previously designed synchronisation and scheduling algorithms;
– combined DSM and message passing techniques in the same framework: DAOS innovates over shared memory PLPs by explicitly addressing the distribution problems inherent to scalable machines.

Next, some considerations about And/Or exploitation are presented. Then, the DAOS model is presented in Section 3. Section 4 analyses how workers can be implemented in DAOS. Finally, we draw some conclusions.
2 Exploiting And/Or Parallelism
ORP and IAP are arguably the two most interesting forms of parallelism available in Prolog programs. Several methods have been proposed to exploit And/Or parallelism in PLPs. This section presents the three main techniques for combining IAP with or-parallel search. Consider the following and-parallel query, where & represents a parallel conjunction:

?- a(X) & b(Y).

Figure 1 shows that both goals have several solutions. Besides running the goals a(X) and b(Y) in parallel, one needs to combine the solutions. One alternative is to use a special node that maintains pointers to each solution for each goal. Solutions to the conjunction are obtained by calculating the cross-product between the values for X and for Y. This approach is known as reuse and is presented in Figure 1(a).
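The toy sketch below, in C for concreteness, illustrates the reuse idea under the simplifying assumption that both goals have already produced their solution lists; a cross-product node then pairs them without re-executing either goal. The solution values mirror Figure 1 (X in {a, b}, Y in {c, f}); everything else is illustrative.

```c
#include <stdio.h>

int main(void) {
    /* solutions computed once, in parallel, for each goal */
    const char *xs[] = { "a", "b" };   /* solutions for a(X) */
    const char *ys[] = { "c", "f" };   /* solutions for b(Y) */

    /* the cross-product node enumerates every (X, Y) pair,
       so neither goal is ever recomputed */
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            printf("X = %s, Y = %s\n", xs[i], ys[j]);
    return 0;
}
```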
Recomputation-based models are based on having an Or search tree, such that and-parallel goals correspond to building bits of the search tree in advance. Thus, when one starts a(X) and b(Y) in parallel, the solution to b(Y) is a continuation branch for a(X), and is thus associated with a specific solution for a(X), as shown in Figure 1(b). These models are named recomputation-based because, to exploit several alternatives for a(X), the solution for b(Y) has to be recomputed for each alternative. Recomputation avoids cross-product node implementation overheads and simplifies support for full Prolog semantics. Reuse saves some effort in recomputing goals, but Prolog programmers usually try to reduce the search space anyway [6].
Fig. 1. (a) Reuse (b) Recomputation (c) C-Tree
The C-tree is shown in Figure 1(c). In the C-tree [6], whenever a worker picks an alternative from another process, it creates a copy of all parallel conjunctions found along the way and restarts the work to the right. Traditionally, the C-tree has been associated with a notion of team, i.e., a group of workers. Normally, IAP work is exploited inside a team and ORP work is exploited between teams. DAOS leaves it to the system designer to decide whether or not to use teams. Teams have the advantage of simplifying scheduling, allowing the (re)use of IAP schedulers within teams and of ORP schedulers between teams. This organisation also simplifies solution propagation in IAP, since multicast can be used inside a group.
3 DAOS: Distributed And-Or in Scalable Systems
DAOS aims at two goals: improving efficiency over traditional distributed systems, and preserving Prolog semantics. A fundamental point in DAOS is to establish which data areas should be private to a worker, and which ones should be virtually shared. There are two opposite approaches: (a) all stacks are private, as in distributed PLPs, or (b) all stacks are shared through a DSM layer. The
latter option seems interesting because we could reuse a previous implementation for shared memory systems. However, it may be inefficient due to the scheduling data structures. DAOS presents an intermediate solution: the major data-structure areas are logically shared, and the work management areas are private.

3.1 A Shared Address Space
How to implement the virtually shared address space is one of the key aspects of DAOS. This shared space must contain the major data structures used in Prolog, such as all Prolog variables and compound terms. Or-parallelism exploitation in a shared memory space is normally based on a binding array (BA), as in Aurora [8] and Andorra-I [12]. The original BA was designed for ORP and keeps a private slot for every variable in the current search-tree. All conditional bindings made by a worker are stored in these private slots, instead of being written on the shared space. Accesses to other memory areas are read-only. This gives an important single-writer, multiple-reader pattern for the shared memory. Unfortunately, the original BA design is not suitable for IAP, because the number of cells for each and-goal is not known beforehand, and the management of slots between workers running in and-parallel becomes highly complex [6]. In DAOS, we propose to use the Sparse Binding Array (SBA) [11] to manage bindings to shared variables. The SBA addresses the memory management problems of traditional BAs by shadowing the whole shared stacks, that is, every cell in a shared stack has a private “shadow” in a worker or team's SBA. The SBA was originally designed for workers organised in teams, where each team shares the same choice-points and thus the same SBA. This approach is not a good one for DAOS, since the SBA is write-intensive. We therefore propose a different solution: each worker has a private SBA, and SBAs are synchronised through the trail both when sharing ORP and when sharing IAP work.
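A minimal sketch of the shadowing scheme follows, assuming a flat address space in which the SBA mirrors the shared stacks at a fixed offset; all names (stack_base, sba_base, UNBOUND) are illustrative, not taken from an actual SBA implementation.

```c
typedef unsigned long term_t;

term_t *stack_base;  /* base of the virtually shared stack area (set at start-up) */
term_t *sba_base;    /* base of this worker's private SBA                         */

#define UNBOUND 0UL  /* hypothetical tag marking an unbound cell */

/* Every shared cell has a shadow at the same offset in the SBA. */
term_t *sba_shadow(term_t *cell) {
    return sba_base + (cell - stack_base);
}

/* Reading: a conditional binding lives in the private shadow; if
   the shadow is unbound, fall back to the shared cell itself. */
term_t sba_read(term_t *cell) {
    term_t v = *sba_shadow(cell);
    return (v != UNBOUND) ? v : *cell;
}

/* Binding: write only the shadow, never the shared cell, which
   preserves the single-writer, multiple-reader pattern. */
void sba_bind(term_t *cell, term_t value) {
    *sba_shadow(cell) = value;
}
```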
3.2 Sharing Work in DAOS
In this section we present the shared and private areas of the Prolog execution environment. We follow the Warren Abstract Machine (WAM) [16] organisation as found in most current PLPs, where each worker has a set of stacks: the Heap and the Environment stack support forward execution; the Control stack is used both for backtracking and for parallelism; the Trail supports backtracking; the Sparse Binding Array (SBA) provides access to private bindings; and, last, the Goal stack holds goals to be exploited in IAP.

The Forward Execution Stacks The two largest stacks store terms and logical variables. Both hold data structures that are virtually shared. The Environment stack stores environments, corresponding to the activation records of traditional imperative languages. The Global stack, or Heap, accommodates compound terms and the remaining logical variables. False sharing may happen in two situations [14]. First, workers can make previous work public and then continue execution on the same stack. In this case, their next updates to the stack might be sent to the sharing workers. Such sharing updates can be treated by relaxed consistency techniques [3], which ensure that new updates from the worker will only be sent at synchronisation points. A second source of false sharing is backtracking. In general, when all alternatives have been exploited, both global and environment stack space may be reclaimed and later reused to store new work. Updates to this space may then be sent to the workers which had originally shared the work, unless one uses an invalidate-based protocol.

The Control Stack This stack stores choice-points. A choice-point is a structure that includes pointers to the stacks and to the goal's arguments before the choice-point was created, plus pointers to the available alternatives. Some ORP systems also store scheduling information in choice-points, while IAP systems extend the Control stack to include parcall frames that describe the conjunctions to be executed in parallel. ORP systems have used the Control stack to manage the available work. For instance, in the chat-80 ORP-only benchmark running under Aurora, about a third of the sharing misses originated from the Control stack (the rest originated from the scheduler data structures). Although a similar study is not available for IAP systems, parcall-frames are expected to also be a source of sharing misses.

The Trail The Trail stack stores all conditional bindings. In a sequential execution this is required to undo bindings to variables when backtracking. In a BA-based system, the Trail is a fundamental data-structure, as it is used to synchronise bindings between different branches. Since conditional bindings may not be placed directly on the stacks, the alternative is to store these bindings in the Trail. When workers fetch work, they read the bindings from the Trail and store them in the SBA. So, the first operation to be performed in DAOS when sharing work is installing all conditional bindings in the SBA (a sketch of this step appears after the list below). This requires access to the corresponding Trail section. Deciding whether to keep the Trail under DSM or to use explicit distribution is a fundamental open question:

– All chunks of the Trail must be present in a worker before it can start work. This argues for sending the required trail segments immediately when we share choice-points or goals, and thus for a fully-distributed solution.
– The Trail tends to be a relatively small stack. After an initial delay, one may take advantage of the sharing facilities in a DSM system to actually have the Trail segments present before they are asked for.
– IAP programs tend to require much larger stacks than ORP programs, as they perform much less backtracking, but they also tend to perform fewer conditional bindings. Trail segments are thus expected to grow larger.

A final decision will depend on several factors, such as the efficiency of the DSM system and of the message passing implementation. In the IDAOS implementation, as discussed next, we have decided to initially follow a fully distributed implementation, because trail copying can be naturally integrated with Control-stack copying.
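The installation step mentioned above might look as follows. This is a hedged sketch, assuming a two-field trail entry (cell address plus value, matching the extended entry described in Section 4.1) and the sba_shadow mapping from the earlier SBA sketch; none of the names come from the IDAOS sources.

```c
#include <stddef.h>

typedef unsigned long term_t;

typedef struct {
    term_t *addr;   /* shared-stack cell that was conditionally bound */
    term_t  value;  /* binding to replay into the private SBA         */
} trail_entry_t;

/* Shared-cell -> private shadow, as in the SBA sketch above. */
extern term_t *sba_shadow(term_t *cell);

/* Replay a received trail section into the local SBA; the shared
   stacks themselves are never written. */
void install_bindings(const trail_entry_t *section, size_t n) {
    for (size_t i = 0; i < n; i++)
        *sba_shadow(section[i].addr) = section[i].value;
}
```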
4 IDAOS: A DAOS Implementation
We have so far discussed the DAOS model. We now concentrate on the design of our DAOS prototype, IDAOS (an Implementation of DAOS). In our design, each IDAOS processor, or worker, consists of three modules that are implemented as threads. The Engine is responsible for the execution of Prolog code. Most of the execution time will be spent in this module, which should therefore have performance close to good sequential implementations. We base our work on Diaz and Codognet's wamcc [4], which has performance close to the best Prolog implementations currently available. The Work-dispatcher module controls the exportation of both and- and or-work. This module and the Engine have exclusive access to the Control stack, Trail, and Goal stack. The Memory-manager module controls the major execution stacks, namely the Environment stack and the Heap, through a page-based software DSM.
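A minimal sketch of this three-thread worker organisation, assuming POSIX threads; the module entry points are hypothetical stubs, not IDAOS code.

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical module entry points; each would run its module's loop. */
static void *engine_main(void *arg)         { (void)arg; puts("engine");       return NULL; }
static void *dispatcher_main(void *arg)     { (void)arg; puts("dispatcher");   return NULL; }
static void *memory_manager_main(void *arg) { (void)arg; puts("mem-manager");  return NULL; }

/* One IDAOS worker: Engine, Work-dispatcher and Memory-manager run
   as threads of the same process, sharing its address space. */
int main(void) {
    pthread_t eng, disp, mem;
    pthread_create(&eng,  NULL, engine_main,         NULL);
    pthread_create(&disp, NULL, dispatcher_main,     NULL);
    pthread_create(&mem,  NULL, memory_manager_main, NULL);
    pthread_join(eng,  NULL);   /* the worker lives while the engine runs */
    pthread_join(disp, NULL);
    pthread_join(mem,  NULL);
    return 0;
}
```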
Fig. 2. Worker Organisation
IDAOS uses MPI to implement message passing, and the commercial software TreadMarks [3] for DSM. Having both message passing and DSM mechanisms creates an interesting problem: both TreadMarks and MPI want to initialise the distributed processes. One solution is to give process management to TreadMarks and use the dynamic process functionality of current MPI implementations (which will be standardised in MPI-2). Next, we present the main issues in the implementation of the IDAOS Engine; Section 4.2 then presents the implementation of the Trail and Control stacks.
4.1 And/Or Support in the Engine Module
As we have explained, the Engine thread implements an SBA-based abstract machine. It combines the use of the SBA [5] to deal with or-parallelism with mechanisms based on Hermenegildo's RAP-WAM [7] to deal with and-parallelism. We had to perform major changes to wamcc to support both IAP and ORP. Regarding ORP, the major changes in wamcc are as follows:
– Conditional bindings must be stored in and read from the SBA. We have found that this has an impact of 19% up to 38% on the execution time, depending on the benchmark, measured on a Pentium-II 333MHz PC with wamcc-2.2.
– In ORP the trail is used both to install and to deinstall bindings. Thus, each trail entry receives an extra field containing the new value to be installed. We found the overhead to vary from 2% up to 10% on the same machine.
– Last, choice-points must include new fields to support sharing or-work.

Supporting IAP requires an extensive set of new data-structures [7]: the parcall-frame (representing a parallel conjunction), a goal-slot (representing a sub-goal in the conjunction), a goal-stack, and markers (to perform stack management). The data structures must handle three forms of execution: forward execution; inside backtracking, that is, backtracking within the parallel conjunction; and outside backtracking, that is, backtracking caused by a goal in the continuation that failed. Last, the compiler must also support the sequential conjunction. Our design tries to follow closely Hermenegildo's, and more recent implementations such as Shen's DASWAM [13], Pontelli and Gupta's &-ACE [9], and Correia's SBA [5]. We next discuss in detail some issues specifically important for IDAOS.

First, in IDAOS there is no shared and-tree of goals rooted at the parcall-frame. Goals are instead sent to an external worker, which executes them in its own stacks. The parcall-frame therefore must contain information on which processors are executing a goal, not just direct links.

Second, on receiving a goal, the receiver will not have direct access to the parent's parcall-frame, as it is stored in the sender. This is a problem because traditional implementations of IAP use markers to mark the new space being allocated. These markers are linked to the goal slot, and from there to the parcall frame, resulting in an involved chain of pointers in shared memory¹. In IDAOS a new goal is started with a starting choice-point (SCP). The SCP marks the stacks allocated for this goal, and thus fulfils the task of the markers. The only alternative in the SCP is the code to be executed when backtracking out of the task. When completing the task, we install a final choice-point (FCP). The FCP stores the final values for the stacks, points to the SCP, and its alternative is the code to be executed when backtracking into the task. A further advantage of using choice-points is that all memory management is now performed through choice-points, simplifying integration with ORP. Memory allocation in this model will be based on segments, implementing the so-called cactus stack. The idea is that all segments, for ORP or for IAP, start with a choice-point and can be allocated and recovered using the same techniques.
¹ This could be a good argument for storing the Control stack in the DSM, but this stack is closely related to scheduling and, thus, accesses to it should be made explicit.
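To make the SCP/FCP mechanism concrete, here is an illustrative sketch; the choice-point layout is an assumption and omits most fields of a real wamcc choice-point (argument registers, program pointer, and so on).

```c
#include <stdlib.h>

typedef unsigned long term_t;

typedef struct choice_point {
    struct choice_point *prev;   /* previous choice-point in the chain  */
    struct choice_point *scp;    /* FCP only: link back to its SCP      */
    term_t *heap_top;            /* heap mark when this CP was pushed   */
    term_t *env_top;             /* environment-stack mark              */
    void   *alternative;         /* code run when backtracking hits it  */
} choice_point_t;

term_t *H, *E;                   /* current heap / environment tops,
                                    maintained by the engine            */
static choice_point_t *B;        /* most recent choice-point            */

static choice_point_t *push_cp(choice_point_t *scp, void *alt) {
    choice_point_t *cp = malloc(sizeof *cp);
    cp->prev = B; cp->scp = scp;
    cp->heap_top = H; cp->env_top = E;
    cp->alternative = alt;
    return B = cp;
}

/* SCP: marks the base of the segment allocated for a received
   and-goal; its only alternative is the code run when backtracking
   OUT of the task. */
choice_point_t *push_scp(void *backtrack_out) {
    return push_cp(NULL, backtrack_out);
}

/* FCP: pushed on goal completion; records the final stack tops, links
   back to the SCP, and its alternative is the code run when
   backtracking INTO the finished task. */
choice_point_t *push_fcp(choice_point_t *scp, void *backtrack_in) {
    return push_cp(scp, backtrack_in);
}
```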
4.2 Trail and Control Management
A key data-structure in IDAOS is the Trail. It is used both to propagate conditional bindings during the exportation of and- and or-work, and to return conditional bindings performed during the remote execution of and-goals:

– Whenever a team imports or-work, its workers have to move up or down in the search tree. This is performed by copying the trail from the exporter and installing the bindings in their SBAs.
– When a worker receives an and-goal to execute, it starts its execution in its own memory address space. Any conditional binding, that is, any binding to a variable created prior to the and-parallel conjunction, is stored in the SBA and trailed. At the end of the execution of the goal, the worker returns its trail to the exporter.
Fig. 3. Trail Segments

To support And/Or parallelism the trail needs to be segmented. The trail is physically divided into trail segments (TSs), each of which corresponds to a contiguous computation. In PLPs the Trail forms a tree: each TS is a child of the segment from which its worker got work. Each TS starts with a descriptor, followed by a sequence of bindings, and terminates with a special parent pointer that points to where the previous bindings are stored. Figure 3 shows a situation where trail segment TS0 corresponds to several choice-points generated in a row. TS1 and TS2 correspond to new segments that are rooted in TS0, but the computation for TS2 starts from an older choice-point than TS1 (we assume the Trail grows downwards). Each trail segment descriptor contains a pointer to the start of the segment, a direct pointer to the ancestor node, and a bitmap indicating which nodes already have this segment. Copying the trail implies starting from the segment of the goal or choice-point being exported and then following towards the root until a segment that has already been sent is found. MPI's buffering can be used to send all the TSs in a single message.
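A sketch of the descriptor and of the copy-towards-the-root loop follows. The field names, the 64-worker bitmap, and the one-message-per-segment MPI_Send are all illustrative choices; as the text notes, MPI's buffering could instead pack all segments into a single message.

```c
#include <mpi.h>
#include <stdint.h>

typedef unsigned long term_t;
typedef struct { term_t *addr; term_t value; } trail_entry_t;

/* Trail segment descriptor, after Figure 3: segment start, ancestor
   ("previous segment") pointer, and sharing bitmap. */
typedef struct trail_segment {
    struct trail_segment *parent;   /* ancestor segment in the tree     */
    trail_entry_t *start, *end;     /* the bindings of this segment     */
    uint64_t sharing;               /* bit w set: worker w has it
                                       (assumes at most 64 workers)     */
} trail_segment_t;

/* Send every segment from `ts` towards the root, stopping at the
   first one worker `dest` already holds; mark each one as sent. */
void send_trail(trail_segment_t *ts, int dest, MPI_Comm comm) {
    while (ts != NULL && !(ts->sharing & (1ULL << dest))) {
        int nbytes = (int)((ts->end - ts->start) * sizeof(trail_entry_t));
        MPI_Send(ts->start, nbytes, MPI_BYTE, dest, /*tag=*/0, comm);
        ts->sharing |= 1ULL << dest;
        ts = ts->parent;
    }
}
```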
Note that in this case we are implementing our own DSM mechanism, so the coherence problem must be handled. There are several solutions to this problem: keep the bitmaps, or the whole Trail, in a DSM area; use broadcast for trail segments; or simply ignore the problem and accept duplicated transmissions of TSs. Lack of space prevents us from discussing these issues here in detail, but in IDAOS we are using the ostrich algorithm: we ignore the problem, and performance evaluation will then tell us whether more sophisticated solutions are required.

Another important stack in IDAOS is the Control stack. Our principle in designing DAOS was to use scheduling to avoid unnecessary pressure on the DSM subsystem. To reduce sharing in this area, one solution would be to split the Control stack into a choice-point stack kept under the DSM and a control stack managed by the message-passing system. For simplicity, we favour maintaining the stack fully distributed: the owner of a parcall-frame or of a choice-point is the only worker that can directly access the data-structure, and accesses from other workers are performed through the communication protocol.
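Under the fully distributed design, a non-owner's access to a choice-point turns into a request/reply exchange with the owner. The sketch below illustrates the idea; the message tags, payloads, and the fetch_alternative operation are assumptions, not part of an actual IDAOS protocol.

```c
#include <mpi.h>

#define TAG_GET_ALT   10   /* request: next alternative of a choice-point */
#define TAG_ALT_REPLY 11   /* reply: the alternative, or 0 if exhausted   */

typedef struct { unsigned long cp_id; } alt_request_t;
typedef struct { unsigned long alt;   } alt_reply_t;

/* Non-owner side: ask the owning worker for the next alternative of
   choice-point cp_id instead of touching its Control stack via DSM. */
unsigned long fetch_alternative(int owner, unsigned long cp_id,
                                MPI_Comm comm) {
    alt_request_t req = { cp_id };
    alt_reply_t   rep;
    MPI_Send(&req, (int)sizeof req, MPI_BYTE, owner, TAG_GET_ALT, comm);
    MPI_Recv(&rep, (int)sizeof rep, MPI_BYTE, owner, TAG_ALT_REPLY,
             comm, MPI_STATUS_IGNORE);
    return rep.alt;
}
```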
5 Conclusions
We have proposed a scheme for Distributed And/Or execution in Scalable systems, DAOS. The model innovates by taking advantage of recent work in DSM technology to obtain efficient sharing, and by efficiently supporting both IAP and ORP in a distributed environment. We have found that the DSM mechanism is quite effective in simplifying our design, allowing us to focus on the issues that we believe have a major impact on performance.

Work on the IDAOS prototype is progressing at the Universidade Federal do Rio Grande do Sul (UFRGS), the Universidade Federal do Rio de Janeiro (UFRJ), and the Universidade do Porto (UP). Our target is a network of workstations connected by a fast network, such as Myrinet or Fast Ethernet. Both the UFRGS and the UP groups have access to such networks. The changes required to support IAP and ORP have already been added to the base system, wamcc, and work will next move on to experimenting with the distributed platform. We expect that, after implementing the message mechanism on top of MPI, most of the work will move on to scheduler design, as is traditional in parallel logic programming systems.
Acknowledgments

We would like to acknowledge Inês de Castro Dutra, Ricardo Bianchini, Gopal Gupta, Enrico Pontelli, Cristiano Costa, and Kish Shen for their contribution and influence. This work has been partially supported by the CNPq/ProTem-CC project Appelo and by funds granted to LIACC through the Programa de Financiamento Plurianual, Fundação para a Ciência e Tecnologia, and Programa PRAXIS.
References

[1] K. A. M. Ali and R. Karlsson. The Muse Or-parallel Prolog Model and its Performance. In Proceedings of the North American Conference on Logic Programming, pages 757–776. MIT Press, October 1990.
[2] J. Briat, M. Favre, C. Geyer, and J. Chassin. Scheduling of or-parallel Prolog on a scalable, reconfigurable, distributed-memory multiprocessor. In Proceedings of Parallel Architecture and Languages Europe. Springer-Verlag, 1991.
[3] C. Amza et al. TreadMarks: Shared memory computing on networks of workstations. IEEE Computer, 29(2):18–28, February 1996.
[4] P. Codognet and D. Diaz. wamcc: Compiling Prolog to C. In 12th International Conference on Logic Programming. The MIT Press, 1995.
[5] M. E. Correia, F. M. A. Silva, and V. Santos Costa. The SBA: Exploiting Orthogonality in OR-AND Parallel Systems. In Proceedings of the 1997 International Logic Programming Symposium, October 1997.
[6] G. Gupta, M. Hermenegildo, and V. Santos Costa. And-Or Parallel Prolog: A Recomputation-based Approach. New Generation Computing, 11(3–4):770–782, 1993.
[7] M. V. Hermenegildo. An Abstract Machine for Restricted And-Parallel Execution of Logic Programs. In E. Shapiro, editor, Third International Conference on Logic Programming, London, pages 25–39. Springer-Verlag, July 1986.
[8] E. Lusk, R. Butler, T. Disz, R. Olson, R. Overbeek, R. Stevens, D. H. D. Warren, A. Calderwood, P. Szeredi, S. Haridi, P. Brand, M. Carlsson, A. Ciepelewski, and B. Hausman. The Aurora or-parallel Prolog system. In International Conference on Fifth Generation Computer Systems 1988, pages 819–830. ICOT, Tokyo, Japan, November 1988.
[9] E. Pontelli, G. Gupta, M. Hermenegildo, M. Carro, and D. Tang. Efficient Implementation of And-Parallel Logic Programming Systems. Computer Languages, 22(2/3), 1996.
[10] V. Santos Costa, R. Bianchini, and I. C. Dutra. Parallel Logic Programming Systems on Scalable Multiprocessors. In Proceedings of the 2nd International Symposium on Parallel Symbolic Computation, PASCO'97, pages 58–67, July 1997.
[11] V. Santos Costa, M. E. Correia, and F. Silva. Performance of Sparse Binding Arrays for Or-Parallelism. In Proceedings of the VIII Brazilian Symposium on Computer Architecture and High Performance Processing (SBAC-PAD), August 1996.
[12] V. Santos Costa, D. H. D. Warren, and R. Yang. Andorra-I: A Parallel Prolog System that Transparently Exploits both And- and Or-Parallelism. In Third ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPOPP), pages 83–93. ACM Press, April 1991. SIGPLAN Notices 26(7), July 1991.
[13] K. Shen. Initial Results from the Parallel Implementation of DASWAM. In M. Maher, editor, Proceedings of the 1996 Joint International Conference and Symposium on Logic Programming. The MIT Press, 1996.
[14] F. M. A. Silva. An Implementation of Or-Parallel Prolog on a Distributed Shared Memory Architecture. PhD thesis, Dept. of Computer Science, University of Manchester, September 1993.
[15] A. R. Verden and H. Glaser. Independent And-Parallel Prolog for Distributed Memory Architectures. Technical report, Department of Electronics and Computer Science, University of Southampton, April 1990.
[16] D. H. D. Warren. An Abstract Prolog Instruction Set. Technical Note 309, SRI International, 1983.