Performance Study of a Parallel Shortest Path Algorithm

Michelle Hribar

Valerie E. Taylor

Northwestern University, EECS Department, 2145 Sheridan Road, Evanston, Illinois 60208-3118, {michelle,taylor}@eecs.nwu.edu

David E. Boyce

Urban Transportation Center, University of Illinois at Chicago, Chicago, Illinois 60607, [email protected]

Report No. CSE-96-002 January 22, 1996

Abstract

Traffic equilibrium analyses are generally very large and computationally intensive. Parallel processing provides the memory and computational power needed to solve the equilibrium problems in a reasonable amount of time. Because the shortest path solution is the major component of these equilibrium algorithms, we focus on developing an efficient parallel algorithm to solve this problem. We investigate in detail how three factors (computation, communication, and idle time) affect the performance of these algorithms. This analysis leads to methods of reducing the total execution time of a parallel shortest path algorithm.

1 Introduction

Traffic equilibrium analyses used in traffic network planning are generally very large and computationally intensive. Parallel processing provides the memory and computational power needed to solve the equilibrium problems in a reasonable amount of time. This paper provides an in-depth study of the parallel execution time for fixed-size problems using a fixed number of processors. Based on this study, we then present techniques for reducing the parallel execution time. In contrast, previous studies have focused primarily on parallel speedup. The ultimate goal of using parallel machines for traffic equilibrium applications is to reduce the execution time so that it is possible to conduct analyses in a shorter time. Increases in single-processor speeds are significant, but insufficient for the computing requirements of traffic problems. For example, solving a dynamic equilibrium problem consisting of 18 time intervals (corresponding to 3 hours of traffic flow) for a 9,701-node traffic network requires 55 hours (or 2.3 days) to complete on one 60 MHz Convex C3880 processor [5]. Using this same processor for 10 hours of traffic flow would require 183.3 hours (or 7.6 days, approximately a week). Hence parallel processing is necessary for this class of applications.

Generally, traffic equilibrium problems are solved using the Frank-Wolfe algorithm. The most computationally intense step of this algorithm is the solution of multiple shortest path trees. For example, using one processor of the Intel Paragon, the shortest path solution step accounts for 67% of the total execution time for a 9,701-node traffic network with 20 source nodes. The percentage is dependent on the number of source nodes in the problem; an increase in source nodes results in an increase in the percentage. For the same network with 200 source nodes, solving for the shortest paths required 95% of the total execution time. Therefore, we focus on the parallel execution of the shortest path algorithm.

Solving for the shortest path trees in traffic applications typically involves using a single-source labeling algorithm for each source. In these algorithms, each node has a label that represents the distance from the source to that node. During the execution of the algorithm, the node labels are updated iteratively; at the conclusion of the algorithm, each node's label is the value of the shortest distance from the source to that node. The execution time of these algorithms is dependent upon the number of node updates; the larger the number of node updates, the longer the execution time.

Further, the number of node updates is dependent on the order in which nodes are updated; a different order can result in a different number of updates for each node.

For the parallel implementation, we partition the traffic network into P subnetworks (for P processors) and assign each subnetwork to a processor. Distributing the network among the processors takes advantage of the aggregate memory of the parallel system. Each processor's subnetwork has two types of nodes: interior and boundary nodes. Boundary nodes are those nodes that are connected by arcs to nodes in other subnetworks on other processors. Therefore, when boundary nodes' labels are updated, the new labels must be communicated to other processors. For a parallel implementation of a shortest path algorithm, the number of node updates is dependent on the order in which other processors complete execution on their subnetworks and the order in which a processor accepts messages or data from other processors, as well as the order in which nodes in the assigned subnetwork are updated. Therefore, the execution time is dependent on the partitioning of the network as well as on the parallel algorithm itself.

In this paper, we analyze the factors that significantly affect the parallel execution time for parallel shortest path problems. These factors, which arise in most parallel applications, consist of the following: (1) computation time, (2) communication time, and (3) idle time. The computation time is determined by the load assigned to a given processor. The communication time is determined by the frequency of sending messages to other processors and the latency, or time to communicate a message; the latency is determined by the message size. The idle time results from a synchronization step in the algorithm. Load imbalance causes processors to be idle waiting to synchronize with other processors; processors cannot continue execution until this synchronization completes. Efficient parallel shortest path algorithms require the following characteristics: a balanced computational load assigned to each processor, and minimal communication and synchronization time. This paper focuses on achieving these characteristics for the parallel shortest path application.

In most commercially available machines, such as the Intel Paragon used in this study, the communication time per byte of data is approximately three orders of magnitude larger than the computation time. Given this large difference between the communication and computation time, the initial parallel shortest path algorithm attempts to minimize the frequency with which a processor sends a message and the size of the messages. The experimental results indicate that our initial scheme had indeed minimized communication, but the processors were idle approximately 50% of the time. For example, for a 9,701-node network analyzed on 4 Intel Paragon processors, the total time for the shortest path algorithm was 81.1 seconds, of which the processor was idle for 41.9 seconds. To reduce this idle time, we increased the frequency at which messages were sent. Further, we conducted experiments to heuristically determine the best frequency. The results illustrate that sending messages more frequently yields at least a 42% reduction in total execution time compared to the initial scheme for all the networks tested. For example, for the same 9,701-node network, the total execution time was reduced from 81.1 seconds to 36 seconds. Hence, our method increases the communication time slightly, but reduces the synchronization time substantially. The result is an overall reduction in parallel execution time.

The remainder of the paper is organized as follows: Section 2 gives background information about traffic equilibrium and shortest path algorithms; Section 3 discusses previous work on parallel shortest paths; Section 4 describes the parallel shortest path algorithm; and Section 5 analyzes the performance of this algorithm. We summarize in Section 6.

2 Background

2.1 Network Terminology

Traffic networks are directed, weighted networks with n nodes and m arcs. The set of nodes, N = {1, 2, ..., n}, generally represents intersections. Arc (i, j) from node i to node j represents the road segment connecting the corresponding intersections. Additionally, nodes and arcs may be added to the model network to represent turning movements at intersections. Each arc has a travel time c_ij that is modeled by a delay function that depends on the traffic flow on the arc. In traffic networks, travel times are non-negative. The set of outgoing arcs from node i is A[i]. Traffic networks typically have a set of z nodes designated as source nodes. These nodes are the sources and destinations of all trips in the network. Generally, traffic networks are sparse; that is, the number of arcs is O(n) instead of O(n^2). In other words, each node is connected to a constant number of other nodes instead of O(n) other nodes. For example, a traffic network in the North and Northwest suburbs of Chicago [3] has 9,701 nodes and 22,918 arcs, giving each node an average out-degree of 2.4.
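
Because each node touches only a constant number of arcs, such a network is naturally stored as an adjacency list, so that scanning A[i] costs time proportional to node i's out-degree rather than to n. Below is a minimal sketch in Python; the toy arcs and costs are illustrative, not drawn from the Chicago network.

import random  # only used if you want to generate larger test networks

# Sparse directed network stored as an adjacency list:
# adj[i] is the list A[i] of (j, c_ij) pairs for arcs leaving node i.
n = 5
adj = [[] for _ in range(n)]

def add_arc(i, j, cost):
    """Add directed arc (i, j) with non-negative travel time c_ij."""
    adj[i].append((j, cost))

# A toy network; real traffic networks have O(n) arcs in total.
for i, j, c in [(0, 1, 4.0), (0, 2, 2.5), (1, 3, 1.0),
                (2, 3, 3.0), (3, 4, 2.0)]:
    add_arc(i, j, c)

avg_out_degree = sum(len(a) for a in adj) / n  # 1.0 for this toy network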

2.2 Traffic Equilibrium

Transportation scientists assume that traffic flows tend toward an equilibrium state. As mentioned previously, theoretically predicting this state allows engineers to analyze and plan traffic systems. There are several definitions of equilibrium, as well as static and dynamic models for each type of equilibrium. This study focuses on static user equilibrium. At user equilibrium, the travel times of used paths between two nodes are equal, and no unused path has a travel time that is less than that of a used path. Thus, a single user cannot unilaterally improve his/her travel time by changing paths [21]. The problem formulation for the user equilibrium problem is stated in Figure 1. One algorithm commonly used to solve for this equilibrium is the Frank-Wolfe algorithm (convex combination method) [21], given in Figure 2. Steps 1, 3, and 4 of the algorithm require O(m) computation time, while Step 5 requires a constant amount of time. Step 2 requires solving a shortest path tree for each of the z origins in the network. The amount of time required by the shortest path solution depends on the particular algorithm used, but it is significantly greater than O(m), and will be discussed in the following subsection. As stated previously, the shortest path solution accounts for a large percentage of the total execution time of the Frank-Wolfe algorithm. For this reason, the shortest path algorithm is the focus of this research.
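
The loop structure of the algorithm in Figure 2 can be sketched as follows. The travel cost, assignment, line search, and convergence routines are passed in as parameters because their details are problem-specific; this is our illustrative skeleton, not the paper's code.

def frank_wolfe(f, travel_cost, all_or_nothing, step_size, converged):
    """Skeleton of the Frank-Wolfe (convex combinations) method.

    f               -- current arc flow vector (list of floats)
    travel_cost     -- maps flows to arc travel costs tc_l(f_l)
    all_or_nothing  -- assigns all demand to current shortest paths and
                       returns the direction vector y (Step 2, the
                       dominant shortest-path cost)
    step_size       -- line search returning alpha in [0, 1] (Step 3)
    converged       -- convergence criterion (Step 5)
    """
    while True:
        tc = travel_cost(f)                 # Step 1: update arc costs
        y = all_or_nothing(tc)              # Step 2: shortest path trees
        alpha = step_size(f, y)             # Step 3: optimal step size
        f = [fl + alpha * (yl - fl)         # Step 4: move toward y
             for fl, yl in zip(f, y)]
        if converged(f):                    # Step 5: test and repeat
            return f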

2.3 Shortest Path

Multiple-source shortest paths on sparse traffic networks are generally solved by running a single-source, iterative labeling algorithm multiple times. These algorithms find the shortest distance from one node, the source, to every other node in the network; these shortest paths form a shortest path tree in the traffic network. In the labeling algorithm, each node i has a distance label, dl[i], which is iteratively updated until the conclusion of the algorithm, at which point it represents the shortest distance from the source to node i. Figure 3 gives a general labeling algorithm. The list in the algorithm contains nodes whose distance labels are possibly not the shortest. During each iteration, a node is removed from the list and its adjacent nodes' labels are updated. How these nodes are removed from the list determines whether the algorithm is label-setting or label-correcting.

Label-setting algorithms always remove from the list the node with the smallest distance label. Because the arc costs are non-negative, once a node is removed from the list, it will never re-enter. Label-correcting algorithms, on the other hand, do not require that the removed node have the smallest distance label. Therefore, a node may enter and re-enter the list multiple times. As mentioned previously, the data structure used to store the list affects the number of times each node is updated and enters the list. At worst, the running time of the best label-correcting algorithm is O(nm); however, in practice, this worst-case time is rarely experienced. A previous study [13] determined that label-correcting algorithms, in particular Pallottino's two-queue algorithm [17], were best for traffic networks. This algorithm has a worst-case running time of O(mn). We will use this serial algorithm as the basis for the parallel algorithm; a sketch is given below.
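
As a concrete illustration of the two-queue scheme (a sketch of Pallottino's algorithm as we understand it, not the authors' implementation): nodes entering the list for the first time are appended to a low-priority queue, while nodes re-entering after having been scanned are appended to a high-priority queue that is always served first.

from collections import deque
import math

def two_queue_shortest_path(adj, n, s):
    """Pallottino's two-queue label-correcting algorithm (a sketch).

    adj[i] is a list of (j, c_ij) pairs; s is the source node.
    Returns distance labels dl and the predecessor array pred,
    which together encode the shortest path tree rooted at s.
    """
    NEW, IN_LIST, SCANNED = 0, 1, 2
    dl = [math.inf] * n
    pred = [-1] * n
    status = [NEW] * n
    dl[s] = 0.0
    q1, q2 = deque(), deque([s])     # q1: re-entries, q2: first entries
    status[s] = IN_LIST
    while q1 or q2:
        i = q1.popleft() if q1 else q2.popleft()   # drain q1 first
        status[i] = SCANNED
        for j, cij in adj[i]:
            if dl[j] > dl[i] + cij:          # label can be improved
                dl[j] = dl[i] + cij
                pred[j] = i
                if status[j] == SCANNED:     # re-entering: high priority
                    q1.append(j)
                elif status[j] == NEW:       # first entry: low priority
                    q2.append(j)
                status[j] = IN_LIST          # already queued otherwise
    return dl, pred

Calling two_queue_shortest_path(adj, n, s) on an adjacency list like the one sketched in Section 2.1 returns the labels and predecessors for source s.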

3 Previous Work

The three general approaches to solving for multiple-source shortest paths on distributed-memory, message-passing machines are (1) an all-pairs matrix-based algorithm [14], (2) network replication [15], and (3) multiple solution of a single-source shortest path algorithm using a distributed network. The first approach is a parallelization of the serial matrix-based all-pairs algorithm. This approach is not appropriate for traffic networks because of their sparsity. The second approach, network replication, assigns a replicated copy of the entire traffic network to each processor, which then solves the shortest path tree for a subset of sources. This shortest path calculation requires no interprocessor communication, but the subsequent demand assignment in the traffic equilibrium algorithm requires all-to-all communication to update the arc flows. The drawback to this approach is its limited scalability. First, the algorithm does not take advantage of the large aggregate memory of a distributed-memory machine; instead, the size of the traffic network is limited by the amount of local memory available on a single processor. Secondly, the number of processors is limited to z, the number of sources. Finally, the global communication required by the demand update increases as the number of processors increases. The final approach, solving a single-source algorithm multiple times in parallel, is the focus of this research.

There has been much work done in the area of parallel single-source shortest paths, particularly in proving worst-case bounds on execution time [2, 6, 7, 16, 19].

These studies show that the theoretical worst-case communication bounds are the same as the worst-case execution times of the serial algorithms; therefore, efficient parallel execution would not be possible in this case. However, experimental studies have shown that this worst-case behavior is not experienced in practice and that speedup is possible [1, 4, 11, 20, 23].

Adamson and Tick [1] and Träff [23] use a distributed shared memory machine to solve label-setting algorithms, and partition the network among processors so that each processor works on its own subnetwork. They observe that label-setting algorithms have little parallelism, since only the one node with the minimum distance label is extracted from the queue at a time. Romeijn and Smith [20] attempt to decrease the communication requirements of a label-setting algorithm by not solving for the exact shortest path tree on a distributed network. However, not solving for the exact shortest paths may affect the convergence of the traffic equilibrium algorithm. In contrast, we focus on label-correcting algorithms, which have been shown to be appropriate for traffic applications [8, 9, 10, 13, 18]. These algorithms are suitable for parallel execution since the processors access their own local queues simultaneously. Bertsekas et al. [4] provide an experimental comparison of several label-correcting algorithms on a shared memory machine with 8 processors. Some of the algorithms achieved superlinear speedup, while others achieved no speedup at all; no detailed study of the performance was given. Habbal et al. [11] present a decomposition algorithm for a distributed network. Habbal et al. achieve speedup with the algorithm, but acknowledge that the performance is dependent on the decomposition of the network. In contrast to these studies, which focus on speedup, our study focuses on the analysis of the execution time and on methods for decreasing the parallel execution time of the slowest processor. The goal of our work is to develop an efficient parallel algorithm.

4 Parallel Shortest Path Algorithm

As mentioned previously, we implement a parallel version of the label-correcting two-queue algorithm. In our implementation, the traffic network is partitioned and distributed among the processors, thus taking advantage of the large aggregate memory of the parallel machine. Since the traffic equilibrium algorithm performs operations on links, each processor owns unique links instead of nodes. This means that multiple processors share nodes; these nodes are called boundary nodes.

The other nodes are called interior nodes. Figure 4 shows the partitioning of one network into two subnetworks and the resulting boundary nodes that are shared by both partitions.

The parallel shortest path algorithm solves for multiple-source shortest paths similarly to the serial algorithm. Each processor solves the portion of the shortest path tree on its subnetwork for each source. Unlike the serial algorithm, where single-source shortest path trees are solved one at a time, the parallel algorithm solves for multiple-source shortest path trees on the assigned subnetwork simultaneously. Since each processor only has information about its local subnetwork, communication must occur between processors in order to compute the shortest path trees. Only distance label information about the boundary nodes needs to be communicated, since any path to an interior node that contains nodes on other processors must include a boundary node. Similar to the serial algorithm, termination occurs when all processors' queues are empty. This condition is detected in the algorithm by a synchronization step, where the processors determine whether any other processor has a non-empty queue after processing its messages. In short, the three main steps of the parallel shortest path algorithm are (1) shortest path computation, (2) boundary node communication, and (3) detection of the termination condition (synchronization). Figure 5 gives the details of the parallel algorithm. Since the algorithm is solving for the shortest path trees for multiple sources simultaneously, each processor must store a queue for each source j, denoted as queue_j. Each processor first solves and stores the local shortest path tree(s) for all sources until its local queues are empty. As mentioned previously, the local shortest path solution is performed using the two-queue label-correcting algorithm. Next, it sends and receives label update information about its boundary nodes. Finally, the termination condition is evaluated by the processors synchronizing to determine if any processor has a non-empty queue. A sequential sketch of this structure is given below.
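
The structure of one iteration can be sketched sequentially as follows. The processor interface used here (solve_local_tree, changed_boundary_labels, inbox, queue, dl) is hypothetical shorthand for the real message-passing code, which this sketch only simulates.

def parallel_iteration(procs, z):
    """One iteration of the parallel algorithm, simulated sequentially.

    procs is a list of P simulated processor objects; z is the number
    of sources. Returns True while any processor still has work.
    """
    for p in procs:                        # (1) local shortest path step
        for i in range(z):                 # all sources, one iteration
            p.solve_local_tree(i)          # two-queue solve until queue_i empty
    for p in procs:                        # (2) boundary communication
        for q, src, node, label in p.changed_boundary_labels():
            q.inbox.append((src, node, label))  # aggregated label updates
    for p in procs:                        # incorporate received labels
        for src, node, label in p.inbox:
            if label < p.dl[src][node]:    # shorter path found elsewhere
                p.dl[src][node] = label
                p.queue[src].append(node)  # re-enter the local queue
        p.inbox.clear()
    return any(p.queue[i] for p in procs   # (3) termination check
               for i in range(z))          #     (the synchronization step)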

5 Parallel Performance

Since the goal of our study is to develop an efficient parallel algorithm, we analyze the effects of computation, communication, and idle time (due to synchronization) on the performance of the previously described parallel shortest path algorithm. We use three traffic networks in our study: Waltham, New York City [4], and the Chicago ADVANCE Test Area [3]. The Waltham network has 26,347 nodes and 64,708 links; the New York City network has 4,795 nodes and 16,458 links; and the ADVANCE network has 9,701 nodes and 22,918 links.

In order to provide a very detailed analysis of the overall execution time, we solve the shortest path problem for 20 sources on four processors of the Intel Paragon. The concepts, however, extend to any number of processors. All the network partitions were generated using simulated annealing [22], with the exception of the random partitions, which were generated using the random option of the Chaco graph partitioning software package [12]. Simulated annealing is a heuristic that finds partitions minimizing a given objective function. In our case, the objective function is a weighted sum of the square of the number of boundary nodes and the square of the number of links in each partition, as written out below.
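
With B_p the number of boundary nodes and L_p the number of links in partition p, the objective has the form below; the weights w_1 and w_2 are our notation for the weighting, and their values are tuning parameters not given here.

\[
F = \sum_{p=1}^{P} \left( w_1 B_p^2 + w_2 L_p^2 \right)
\]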

5.1 Computation

As mentioned previously, efficient parallel algorithms balance the computational work load among the processors. Typically, the computational load for a parallel algorithm is dependent on the size of the local data partition. However, because of the non-deterministic behavior of label-correcting shortest path algorithms, it is difficult to use a partition's characteristics to predict its computational requirements. We examine two network characteristics, the number of links and the number of source nodes, as means of predicting work load.

First, we attempted to balance the workload by balancing the number of links. We used simulated annealing to generate two partitions for each network: one where the number of links was balanced among the processors and another where it was not. The execution time and load imbalance are given in Figure 6. The load imbalance is equal to the ratio of the maximum number of links in any partition to the average number of links in a partition, minus one (written out below). The experimental data show that balancing the number of links does not necessarily result in a significant performance gain. In fact, the imbalanced partitions all required less execution time, especially for the Waltham network. Therefore, balancing the links among the processors does not guarantee better performance.
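
In symbols, with L_p the number of links assigned to processor p of P processors, the load imbalance reported in Figure 6 is:

\[
\mathrm{LI} = \frac{\max_{p} L_p}{\frac{1}{P}\sum_{p=1}^{P} L_p} - 1
\]

Thus LI = 0 indicates perfect balance, while LI = 0.55 means the largest partition holds 55% more links than the average partition.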

Similarly, we examined partitions with a balanced and an imbalanced number of source nodes. For example, for the New York City network, one partition had 5 sources per processor; the second had 5, 8, 7, and 0 sources per processor. The balanced partition required 57.8 seconds for completion, while the imbalanced partition required 54.18 seconds. Hence, balancing the sources among the processors also does not guarantee better performance.

5.2 Communication

Communication must be performed when a boundary node's label changes; therefore, the amount of communication is affected by the number of boundary nodes in a subnetwork. If the number is large, then there is more node information to be communicated. This results in larger messages and greater communication time. Figure 7 presents the difference in the execution times of partitions with large and small numbers of boundary nodes. The partitions with large numbers of boundary nodes (more than 19% of the total nodes) were generated using Chaco [12], and the partitions with fewer boundary nodes (less than 7% of the total nodes) were generated using simulated annealing [22]. The percentage of boundary nodes for each partition is given along the x axis of the graph. As the figure shows, the overall execution time is dramatically less for the networks with relatively few boundary nodes. In the case of the Waltham network, the execution time is over 99% less for the connected subnetwork partitions than for the disconnected random partitions.

Furthermore, the amount of communication performed depends on the frequency with which the processors send updates. When a boundary node label changes, it must be sent to the neighboring processor(s) that share the node. Recall that this label can change frequently during the course of a shortest path solution for a single source. Therefore, to reduce the amount of communication, we do not communicate any boundary node information until the processor's queue for that source is empty. Furthermore, we can reduce the communication time even more by delaying communication until the queues for all the sources are empty during an iteration. This initial approach is what we have used so far to execute the algorithm. This message aggregation reduces the total communication cost, but leads to an increase in total execution time, as will be discussed in a following subsection. A sketch of the aggregation step is given below.
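
In this sketch, boundary updates are buffered while the queues drain and then sent as one message per neighboring processor; the function and parameter names are illustrative, not taken from the implementation.

def flush_boundary_updates(changed, neighbors_of, send):
    """Aggregate boundary label updates into one message per neighbor.

    changed       -- dict mapping (source, boundary_node) -> new label,
                     accumulated while the local queues drain
    neighbors_of  -- maps a boundary node to the processors sharing it
    send          -- transport function, e.g. send(proc, message)
    """
    outbox = {}                        # one buffer per neighboring processor
    for (src, node), label in changed.items():
        for q in neighbors_of(node):
            outbox.setdefault(q, []).append((src, node, label))
    for q, message in outbox.items():  # a single aggregated send per neighbor
        send(q, message)
    changed.clear()                    # labels communicated; reset the buffer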

5.3 Idle Time

The final factor affecting performance is idle time. Recall that the processors synchronize at the end of each iteration. As shown in Figure 7, the synchronization time is the greatest of all three steps of the algorithm for the slowest processor. This synchronization time is composed of both the time to synchronize the processors and the time a processor waits idle for the other processors to reach the synchronization step.

For four processors, the amount of time required to perform the ghigh operation used to synchronize four processors on the Intel Paragon is 0.0001 seconds. With the exception of the random partitions, all of the problems tested required 100 or fewer iterations. This results in less than 0.01 seconds spent on the synchronization operation itself; therefore, the idle time constitutes the majority of the synchronization time.

Because the communication is aggregated, a processor solves for the local shortest path trees for all sources whose queues are not empty during a single iteration. When the processors synchronize at the end of each iteration, all processors must wait for the slowest processor(s) to complete before they can continue execution. Each processor's computation, communication, and synchronization times for the ADVANCE network are given in Figure 8. While some processors' computation time is greater than others', all processors are idle for over 50% of their total time. Moreover, the busiest processor is not the same processor from iteration to iteration. That is, the idle time does not result from an overall load imbalance among the processors; instead, it results from an imbalance of work between synchronization steps. This is demonstrated in Figure 9, which shows the distribution of execution time on each processor for the first five iterations of the algorithm for the ADVANCE network. During each iteration, only one or two of the processors are busy while the others are idle.

To compensate for the large imbalance of work between synchronizations, we propose increasing the frequency of synchronization. We accomplish this by redefining an iteration: instead of solving shortest path trees for all sources before communicating and synchronizing, we solve for only one source in the new iteration. While this approach increases the time for both the communication and synchronization operations, it allows processors that are waiting idle with empty queues to begin shortest path computation sooner than in the original approach with minimal synchronization and communication.

To further decrease the time processors are idle, we wish to balance the amount of time processors spend solving for shortest paths in between synchronization steps. Even with the increased frequency of synchronization, some processors may be busier than others during each new iteration. Therefore, instead of synchronizing at the end of each new iteration, we synchronize every x new iterations to try to balance the amount of work between synchronization steps.

Figure 10 shows the effect of synchronizing every x new iterations, where x is 1, 2, 5, 10, 15, 20, 25, 30, and 35 iterations. As shown in the graph, an x between 15 and 25 is optimal for all three networks. Using more frequent synchronization results in a better balance of work in between synchronization steps. Figure 11 shows the amount of time for each of the three factors for the ADVANCE network when synchronization occurs every 20 new iterations. Far less time is spent idle in this new approach than with the original minimal scheme shown in Figure 9. For most synchronization steps in the new approach, all processors spend the majority of their time computing. Finally, Figure 12 shows the savings in execution time when synchronization occurs more frequently than in the original minimal approach. For the frequent synchronization, the processors synchronized every 20 new iterations in all three networks. This frequent synchronization results in a decrease in execution time of 51% for the New York City network, 59% for the ADVANCE network, and 42% for the Waltham network. A sketch of the redefined iteration loop follows.
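
In this sketch of the redefined loop, one source's queue is drained per new iteration, and the processors exchange updates and synchronize every x iterations; global_any stands in for a global-OR reduction such as the ghigh operation mentioned above, and the processor interface is again hypothetical.

def run_with_sync_period(proc, x, exchange, global_any):
    """Solve one source per 'new iteration'; synchronize every x of them.

    proc.next_pending_source() returns a source with a non-empty local
    queue, or None; exchange() sends/receives aggregated boundary
    updates; global_any(flag) is a global-OR reduction across processors.
    """
    done = False
    while not done:
        solved = 0
        while solved < x:                    # up to x new iterations
            s = proc.next_pending_source()   # source with remaining work
            if s is None:                    # local queues drained early
                break
            proc.solve_local_tree(s)         # local two-queue solve
            solved += 1
        exchange()                           # aggregated boundary updates
        done = not global_any(proc.has_nonempty_queue())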

6 Summary

Traffic equilibrium algorithms, particularly their shortest path calculation step, are computationally intensive. To reduce the execution time, we investigate using parallel processors to solve the shortest path problem. Three factors that determine the execution time of a parallel algorithm are computation, communication, and idle time. Because of the non-deterministic behavior of parallel shortest path algorithms, it is difficult to predict their execution time and the time for each of the three factors. We observed, however, that the idle time had the greatest effect on the overall execution time.

We identified some techniques to reduce the parallel execution time. First, partitions with few boundary nodes (less than 7% of the total nodes) required less execution time than those with many boundary nodes (greater than 19% of the total nodes). Also, because idle time has the greatest effect on the performance of the algorithm, we wish to balance the amount of work in between synchronization steps so that no processors are busier than others. By synchronizing frequently, we were able to reduce the execution time of the algorithm by at least 42%.

In the future, we plan to further study ways to decrease the idle time. In particular, we wish to predict the amount of work a particular subnetwork will generate. This will lead to finding partitions with a good load balance.

7 Acknowledgements

Michelle Hribar is supported by a National Science Foundation Graduate Fellowship and the Department of Energy. Valerie Taylor is supported by an NSF Young Investigator award under Grant CCR-9215482. David Boyce is supported by the National Science Foundation through Grant DMS-9313113 to the National Institute of Statistical Sciences. The authors would like to acknowledge Stanislaw Berka for his help with the traffic networks and Jane Hagstrom for her suggestions and encouragement. We thank Robert Dial for his comments on the basic approach to the parallel shortest path algorithm. Finally, we acknowledge Bruce Holmer for the use of his simulated annealing software. This research was performed in part using the CACR parallel computer system operated by Caltech on behalf of the Concurrent Supercomputing Consortium. Access to this facility was provided by NSF.

References

[1] P. Adamson and E. Tick. Greedy partitioned algorithms for the shortest path problem. International Journal of Parallel Programming, 20:271-298, 1991.

[2] Baruch Awerbuch and Robert G. Gallager. Communication complexity of distributed shortest path algorithms. Technical Report LIDS-P-1473, MIT, 1985.

[3] S. Berka, D. E. Boyce, J. Raj, B. Ran, A. Tarko, and Y. Zhang. A large-scale route choice model with realistic link delay functions for generating highway travel times. Technical Report to Illinois Department of Transportation, 1994. Urban Transportation Center, University of Illinois, Chicago, IL.

[4] Dimitri Bertsekas, Francesca Guerriero, and Roberto Musmanno. Parallel asynchronous label-correcting methods for shortest paths. Journal of Optimization Theory and Applications, 89(1), 1996. To appear.

[5] D. E. Boyce, D. H. Lee, B. N. Janson, and S. Berka. Dynamic user-optimal route choice model of a large-scale traffic network, 1995. Presented at the 14th Pacific Regional Science Conference, Taipei, Taiwan, R.O.C.

[6] K. M. Chandy and J. Misra. Distributed computation on graphs: Shortest path algorithms. Communications of the ACM, pages 833-837, November 1982.

[7] Greg N. Frederickson. A distributed shortest path algorithm for a planar network. Information and Computation, 86:140-159, 1990.

[8] Giorgio Gallo and Stefano Pallottino. Shortest path methods in transportation models. In M. Florian, editor, Transportation Planning Models. Elsevier Science Publishers B.V., 1984.

[9] Giorgio Gallo and Stefano Pallottino. Shortest path methods: A unifying approach. Mathematical Programming Study, 26:38-64, 1986.

[10] Bruce Golden. Shortest-path algorithms: A comparison. Operations Research, 24(6):1164-1168, 1976.

[11] Mayiz Habbal, Haris Koutsopoulos, and Steven Lerman. A decomposition algorithm for the all-pairs shortest path problem on massively parallel computer architectures. Transportation Science, December 1994.

[12] Bruce Hendrickson and Robert Leland. The Chaco user's guide. Sandia National Laboratories, Albuquerque, 1993.

[13] Michelle Hribar, Valerie Taylor, and David Boyce. Choosing a shortest path algorithm. Technical Report CSE-95-004, Computer Science-Engineering, EECS Department, Northwestern University, 1995.

[14] J. Jenq and S. Sahni. All pairs shortest paths on a hypercube multiprocessor. In International Conference on Parallel Processing, pages 713-716, 1987.

[15] Vipin Kumar and Vineet Singh. Scalability of parallel algorithms for the all-pairs shortest-path problem. Journal of Parallel and Distributed Computing, 13:124-138, 1991.

[16] Richard C. Paige and Clyde P. Kruskal. Parallel algorithms for shortest path problems. In International Conference on Parallel Processing, pages 14-19, 1985.

[17] S. Pallottino. Adaptation de l'algorithme de D'Esopo-Pape pour la détermination de tous les chemins les plus courts: améliorations et simplifications. Technical Report 136, Centre de Recherche sur les Transports, Université de Montréal, 1979.

[18] U. Pape. Implementation and efficiency of Moore-algorithms for the shortest path problem. Mathematical Programming, 7:212-222, 1974.

[19] K. V. S. Ramarao and S. Venkatesan. On finding and updating shortest paths distributively. Journal of Algorithms, 13:235-257, 1992.

[20] H. Edwin Romeijn and Robert L. Smith. Parallel algorithms for solving aggregated shortest path problems. Technical report, The University of Michigan, September 1993.

[21] Y. Sheffi. Urban Transportation Networks: Equilibrium Analysis with Mathematical Programming Methods. Prentice Hall, 1985.

[22] Valerie E. Taylor, Bruce K. Holmer, Eric J. Schwabe, and Michelle R. Hribar. Balancing load versus decreasing communication: Exploring the tradeoffs. In Hawaii International Conference on System Sciences, 1996.

[23] Jesper Larsson Träff. An experimental comparison of two distributed single-source shortest path algorithms. Parallel Computing, 1995. To appear.

List of Figures

1 User-Equilibrium Problem Statement
2 Frank-Wolfe algorithm
3 General Labeling Algorithm
4 Network Partitioning and Boundary Nodes
5 Parallel Shortest Path Algorithm
6 Balanced vs. Imbalanced Sub-Networks
7 Execution Time for Large vs. Small Number of Boundary Nodes
8 Execution Time for ADVANCE Network
9 Execution Time for First Five Shortest Path Iterations
10 Frequency of Synchronization vs. Execution Time
11 Execution Time for First Five Synchronization Steps
12 Comparison of Frequent vs. Minimal Synchronization

f = (f_1, ..., f_m): the flow on each of the m arcs
tc = (tc_1, ..., tc_m): the travel cost functions
R_{ij}: the set of routes from origin i to destination j
R: the set of all routes from all origins to all destinations, {R_{11}, R_{12}, ..., R_{nn}}
T_{ij}: the flow from origin i to destination j
h_r: the flow on route r \in R
\delta_{lr}: 1 if arc l belongs to route r \in R, 0 otherwise

\[
\begin{aligned}
\text{minimize} \quad & \sum_{l=1}^{m} \int_{0}^{f_l} tc_l(x)\, dx \\
\text{subject to} \quad & \sum_{r \in R_{ij}} h_r = T_{ij} \quad \forall i, j \\
& h_r \ge 0 \quad \forall r \in R \\
& f_l = \sum_{r \in R} \delta_{lr} h_r, \quad l = 1, \ldots, m
\end{aligned}
\]

Figure 1: User-Equilibrium Problem Statement

1. Compute the current travel cost vector tc_l^k = tc_l(f_l^k) for l = 1, ..., m, where k is the number of the Frank-Wolfe iteration.
2. Find a new direction y_l^k using all-or-nothing assignment with the arc costs tc_l^k.
3. Determine the optimal step size \alpha^k by minimizing the objective function of the user-equilibrium problem with the costs tc^k. This is done using a gradient technique or bisection method.
4. Compute a new solution f_l^{k+1} = f_l^k + \alpha^k (y_l^k - f_l^k), l = 1, ..., m.
5. If the convergence criterion is satisfied, then f_l^{k+1} is the solution. Otherwise, k = k + 1 and return to Step 1.

Figure 2: Frank-Wolfe algorithm


General Labeling Algorithm

begin
    dl[s] = 0; pred[s] = 0;
    for i = 1..n, i != s do
        dl[i] = infinity
    end for
    list = {s};
    while list != empty
        remove i from list;
        for each edge (i, j) in A[i] do
            if dl[j] > dl[i] + c_ij then
                dl[j] = dl[i] + c_ij;
                pred[j] = i;
                if j not in list then
                    list = list + {j};
                end if
            end if
        end for
    end while
end

Figure 3: General Labeling Algorithm


[Figure 4: Network Partitioning and Boundary Nodes. Two panels, labeled "Network Partitioning" and "Boundary Nodes": a network partitioned into two subnetworks, and the resulting boundary nodes shared by both partitions.]


Parallel Shortest Path

for each source i contained in the local subnetwork, i = 1, ..., z do
    put source i in queue_i
end for
while any processor's queues are not empty
    for all sources i, i = 1, ..., z do
        solve and store the local shortest path tree for source i until queue_i is empty
    end for
    for each source i, i = 1, ..., z do
        for all changed labels of boundary nodes
            send the changed label and source i to the neighboring processor(s)
        end for
    end for
    receive all changed labels and sources j from neighbors
    put boundary nodes with received changed labels in queue_j
    synchronize to determine if any processor has a non-empty queue
end while

Figure 5: Parallel Shortest Path Algorithm


[Figure 6: Balanced vs. Imbalanced Sub-Networks. Bar chart of execution time (secs) for imbalanced vs. balanced link partitions of the NYC, ADVANCE (ADV.), and Waltham (Walt.) networks, with each partition's load imbalance (LI) annotated; the imbalanced Waltham partition ran in 73.8 secs versus 104.2 secs for the balanced one.]


[Figure 7: Execution Time for Large vs. Small Number of Boundary Nodes. Stacked bars of time (secs) broken into Sync., Comm., and Comp. components for NYC (28.5% boundary nodes), ADVANCE (7%), and Waltham (19%); one bar runs off the 2,500-second scale and is annotated 23,165 secs.]


[Figure 8: Execution Time for ADVANCE Network. Per-processor (Proc 0 through Proc 3) stacked bars of Sync. Time, Comm. Time, and Comp. Time, in seconds, on a 0 to 90 second scale.]


[Figure 9: Execution Time for First Five Shortest Path Iterations. For each iteration, per-processor (P0 through P3) stacked bars of Sync. Time, Comm. Time, and Comp. Time, in seconds.]


[Figure 10: Frequency of Synchronization vs. Execution Time. Total execution time (seconds) versus the number of new iterations between synchronization steps (1 through 35), one curve each for the Waltham, ADVANCE, and NYC networks.]

[Figure 11: Execution Time for First Five Synchronization Steps. For each synchronization step, per-processor (P0 through P3) stacked bars of Sync. Time, Comm. Time, and Comp. Time, in seconds, on a 0 to 16 second scale.]

[Figure 12: Comparison of Frequent vs. Minimal Synchronization. Bars of total execution time (secs) under Minimal Sync. and Freq. Sync. for the NYC, ADVANCE, and Waltham networks.]