Run-Time Support for the Automatic Parallelization of Java Programs

by

Bryan Chan

A thesis submitted in conformity with the requirements
for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto

Copyright © 2002 by Bryan Chan

Abstract

Run-Time Support for the Automatic Parallelization of Java Programs
Bryan Chan
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2002

We describe and evaluate a novel approach for the automatic parallelization of programs that use pointer-based dynamic data structures, written in Java. The approach exploits parallelism among methods by creating an asynchronous thread of execution for each method invocation in a program. At compile time, methods are analyzed to characterize the data they access, parameterized by their context. A description of these data accesses is transmitted to a run-time system during program execution. The run-time system utilizes this description to determine when a thread may execute, and to enforce dependences among threads.

The focus of this thesis is on the run-time system. More specifically, we describe the static representation of data accesses in a method and the framework used by the run-time system to detect and enforce inter-thread dependences. Experimental evaluation of an implementation of the run-time system on a 4-processor Sun multiprocessor indicates that close to ideal speedup can be obtained for a number of benchmarks. These results validate our approach as a viable one for automatic parallelization.


Dedication

For my late grandfather, Mr. Lau Kee-Ping


Acknowledgements

This work would not have been possible without the following people. First of all, I would like to thank my supervisor, Prof. Tarek Abdelrahman, for sparking my interest in compilers, for teaching me what research is, and for his motivation, guidance and unfailing patience throughout my course of study. My colleagues, Neil Brewster, Dion Lew and Sirish Pande, provided invaluable input and collaboration on many aspects of the zJava project. Prof. Duane Szafron kindly provided the source code to the 15-puzzle benchmark. I am also grateful to the compiler reading group; the weekly group meetings gave me an opportunity to learn about many interesting compiler and run-time technologies.

Most importantly, I would like to thank my parents for their loving support and encouragement. And I thank my friends, Miranda Leong, Albert Lai, Leo Lam, Gary Liu, Aaron Ng, and others, whose advice and company made even the most difficult times endurable.


Contents

1 Introduction
  1.1 System Overview
    1.1.1 Model of Parallelism
    1.1.2 System Architecture
  1.2 Thesis Contribution
  1.3 Thesis Organization

2 Summarizing Data Accesses
  2.1 The Symbolic Access Path Notation
    2.1.1 Classification of Access Paths
    2.1.2 Anchors
    2.1.3 Accesses to Array Data Structures
    2.1.4 Accesses to Recursive Data Structures
  2.2 Constructing Data Access Summaries
    2.2.1 Examples of Data Access Summaries
  2.3 Program Re-structuring
    2.3.1 Class Initialization
    2.3.2 The main Method
    2.3.3 Call Sites
    2.3.4 Uses of Shared Variables
    2.3.5 Exceptions
  2.4 Analysis Necessary for Computing Summaries

3 The Run-Time System
  3.1 Regions
  3.2 The Registry
  3.3 Registry Operations
    3.3.1 Region Creation
    3.3.2 Thread Creation
    3.3.3 Thread Ordering
    3.3.4 Thread Synchronization
    3.3.5 Thread Termination
  3.4 Deadlock Freedom
  3.5 Optimizations
    3.5.1 Selective Parallelization
    3.5.2 Profile-directed Parallelization
    3.5.3 Method Versioning
    3.5.4 Proxy Synchronization

4 Experimental Results
  4.1 Experimental Methodology
  4.2 Benchmarks
    4.2.1 ReadTest: A Synthetic Micro-benchmark
    4.2.2 Traversal: Measuring Overhead
    4.2.3 The Matrix Benchmark
    4.2.4 The Mergesort Benchmark
    4.2.5 The TSP Benchmark
    4.2.6 The 15-puzzle Benchmark

5 Related Work
  5.1 Array Access Summaries
  5.2 Access Paths
  5.3 Exploiting Method-level Parallelism
  5.4 Alias Analysis

6 Conclusions
  6.1 Summary of Contributions
  6.2 Future Work

A Notes on the Implementation
  A.1 The zJava Compiler Framework
  A.2 Transferring Context Via Reflection
  A.3 Constructors and Finalizers

B A Grammar of the Symbolic Access Path Notation

C Benchmarks Used
  C.1 ReadTest
    C.1.1 ReadTest.java
  C.2 Traversal
    C.2.1 Timer.java
    C.2.2 LinkedList.java
  C.3 Matrix
    C.3.1 Matrix.java
  C.4 Mergesort
    C.4.1 MergeSort.java
  C.5 TSP
    C.5.1 TSP.java
  C.6 15-puzzle
    C.6.1 CostLookupTable.java
    C.6.2 FifteenPuzzle.java
    C.6.3 FrontierNode.java
    C.6.4 InitialNode.java
    C.6.5 Label.java
    C.6.6 MasterNode.java
    C.6.7 MoveLookupTable.java
    C.6.8 PuzzleFactory.java
    C.6.9 SearchNode.java

List of Figures

1.1 A simple program and its execution time-line.
1.2 Overview of the zJava parallelization system.
2.1 An example program that illustrates symbolic access paths.
2.2 An example that illustrates anchors in symbolic access paths.
2.3 An example that illustrates symbolic access paths for the traversal of a recursive data structure.
2.4 A simple method midpoint.
2.5 Data access summary for the midpoint method.
2.6 An example that illustrates data access summaries.
2.7 Another example that illustrates data access summaries.
2.8 A third example that illustrates data access summaries.
2.9 Adding static initializers to allow the run-time system to initialize a class with parallelized methods.
2.10 Transformation of the main method.
2.11 Transformation of a call to method bar in method foo.
2.12 Transformation of uses of shared variables.
3.1 The zJava registry maintains sorted thread lists for shared regions.
3.2 A matrix multiplication example.
3.3 Data access summary for readMatrix.
3.4 Region nodes are added for all regions that the readMatrix method will access.
3.5 New thread nodes are created for the readMatrix thread; the nodes are inserted into the thread lists corresponding to regions accessed by the thread.
3.6 The registry is updated as the two multiply threads are created; the thread lists of shared regions keep the threads in serial execution order.
3.7 The zJava compiler inserts code at synchronization points to ensure that data dependences are not violated at run time.
4.1 The speedup of the benchmarks.
4.2 The effect of thread granularity on the performance of the run-time system.
4.3 The effect of the number of threads on the performance of the run-time system.
4.4 The run-time overhead of multi-threaded list traversal as a function of the size of the linked list and the amount of work done on each visited node.
4.5 The speedup of Matrix at 4 processors as a function of matrix size.
4.6 The speedup of Mergesort on 2M integers as a function of the unit size.
4.7 The speedup of the 15-puzzle application.
A.1 The zJava compiler framework.

List of Tables

4.1 Summary of the benchmarks and their sequential execution time.

Chapter 1

Introduction

Since the first parallel computers demonstrated their impact on scientific computing in the early 1960s, they have become both more powerful and more affordable. However, developing software that fully exploits their computational power remains a difficult task for the average programmer. It has become the "Holy Grail" of the parallel processing community to build a software tool that automatically transforms, or parallelizes, a sequential program into a parallel program that can make full use of a multiprocessor machine.

Automatic parallelization offers many benefits. Parallel machines become much easier to use and program for; a programmer merely needs to write a sequential program, and the compiler takes care of the details. The main difficulty in writing parallel programs is that they are error-prone. A compiler can help by ensuring that the parallelized program is correct, whereas the subtle interactions among concurrently running threads can often completely confuse a human programmer trying to debug the code. The latter scenario means more time and money must be spent in the least productive phase of the software development cycle. Furthermore, a program that has been parallelized automatically is also more portable than a manually written parallel program, because a compiler can be equipped with back-ends that have intimate knowledge of various target parallel hardware, and can perform hardware-specific parallelization and optimization. In contrast, a manually written parallel program must be adapted for each target platform.


Given the attractiveness of automatic parallelization, it is no surprise that parallelizing compilers have received much attention from the research community in the past two decades. Traditionally, the research has focused on scientific programs, which are typically written in FORTRAN [BENP93, BDE+96, HAA+96]. Many of these programs employ arrays with statically known bounds, and are abundant in loop-level parallelism, which allows the division of loop iterations for parallel execution. Regrettably, this focus has limited the widespread use of automatic parallelization in industry, where the majority of programs are written in C, C++, or more recently, Java [GJS97]. These modern programs are based on objects (or pointers), deal with dynamic data structures such as linked lists and trees, and often employ recursion. Consequently, existing parallelizing compilers that are developed for arrays and nested loops are helpless on programs with these modern features.

The goal of this work is to describe the design and implementation of a system for the automatic parallelization of Java programs. The novelty of the system stems from its combination of compile-time analyses and run-time support to extract and exploit parallelism in programs. On the one hand, our approach leads to increased run-time overhead by discovering parallelism at run time, but on the other hand, it facilitates automatic parallelization of pointer-based programs, where compile-time-only approaches have been limited by the imprecision of static pointer alias analyses.

One of the motivating factors for our use of Java is the popularity it has attained since its emergence in 1994. The language offers powerful object orientation, without the design faults or historical baggage that are inherent in languages like C++. It is also highly portable, thanks to the use of Java virtual machines (JVM) [LY99] that have been developed for various hardware platforms. Standard Java comes with a large, re-usable class library [CLK98, CLK99], which encourages rapid application development.


Most importantly, the language has built-in support for threads and concurrent execution (of which we make extensive use), making it an excellent foundation for the construction of an automatic parallelization system.

The system is intended to be used on shared memory multiprocessors with a small to moderate number of processors, which include most symmetric multiprocessing (SMP) systems today. The Java memory model assumes a flat, single-address-space shared memory for multi-threaded programs; the system may therefore not be effective on, for instance, distributed parallel systems or non-uniform memory access (NUMA) systems, without additional memory management facilities. The design could easily be extended to other shared memory targets like single-chip multiprocessors [HNO97], which are common in the digital signal processing arena, and newer processors that support simultaneous multi-threading (SMT) [MBH+02, TEL95]. Besides shared memory, these processors also offer cheap context switching, fine-grain synchronization primitives, dynamic partitioning of resources, as well as high hardware utilization, all of which help improve the performance of multi-threaded programs [Lo98, TLEL99].

1.1 System Overview

In this section, we give an overview of the zJava parallelization framework. We explain the model of parallelism extracted from a sequential Java program, and briefly describe the architecture of the system.

1.1.1 Model of Parallelism

The zJava system exploits method-level parallelism in a program. In other words, for each method invocation in the program, an independent thread is created to execute the method (a.k.a. subroutine, function, or procedure, in other programming languages) asynchronously. This thread may in turn create more threads simply by calling more methods.

float m, b, x, y;

void main(String[] args) {
    m2();
    /* ... */
    y = m * x + b;
}

void m2() {
    /* ... */
    m3();
    /* ... */
}

void m3() {
    /* ... */
    x = /* ... */;
}

[Execution time-line: main invokes m2, m2 invokes m3; main waits until m3 signals that x has been computed.]

Figure 1.1: A simple program and its execution time-line.

Two threads thus created may run concurrently if and only if there is no data dependence between the two methods that they are executing, taking into consideration the contexts of the method calls. Otherwise, the threads run one at a time, in the order in which the corresponding methods would have been dispatched in the original program. A thread terminates when it reaches the end of its method. The program terminates when all its threads have terminated.

To illustrate, consider the execution time-line for a simple program, as shown in Figure 1.1. When the program is executed, the JVM locates the main method (on start-up, the JVM expects to find within the program a method that matches the signature public static void main(String s[]); otherwise a NoSuchMethodError is thrown) and launches the primordial thread to execute the method. When main invokes m2, a new thread is created to execute that method, and the first thread continues to run in parallel with it. Similarly, when m2 invokes m3, a third thread is created, and the three threads can execute concurrently, thus (ideally) speeding up the program threefold. However, the three methods are not independent of one another. Indeed, m3 computes the value of the variable x, which is needed by main to compute the value of the variable y. Therefore, the thread running main must be suspended (unless m3 has already completed computing x), and it must wait until m3 has produced the data it requires. When the thread running m3 has completed the computation of x, it signals the waiting main thread, which can then proceed with its own computation.

Threads share data in two ways. First, as we have seen in the example above, threads may share data by accessing global variables that are kept in shared memory. Second, they may also pass object references (i.e., pointers to objects) between one another. Such object references are passed either as parameters to parallel method calls, or as return values from such calls. Object data is shared when two threads dereference pointers to the same object to access its field variables. In summary, two threads can share data if they access some data (global or otherwise) at the same address in shared memory.

The flow of data in a sequential program, as described above, results in data dependences among its methods. Synchronization of the corresponding threads is necessary to ensure that the parallelization of the program is correct. To achieve this, the zJava system preserves the sequential semantics of a program by enforcing serial execution order for threads that perform conflicting operations on the same data. In other words, threads that write to the same locations in shared memory must do so only in the same order in which they execute in the sequential program. Similarly, serial execution order is preserved between a thread that writes a location in shared memory and another thread that reads from the same location. Threads that access different data, or that merely read the same data, may execute concurrently.

1.1.2 System Architecture

The zJava parallelizer works by extracting parallelism from sequential Java programs, re-structuring the programs, and executing the resulting parallel programs on top of a standard JVM. A high-level overview of the parallelizer is shown in Figure 1.2.

[Diagram: at compile time, the zJava compiler transforms the source program (Foo.java) into the transformed program (Foo.class) and produces data access summaries for its methods, e.g., read { Foo.const $0{.m,.b} } write { $0{.x,.y} }; at run time, the transformed program executes on the JVM together with the zJava run-time system and its registry.]

Figure 1.2: Overview of the zJava parallelization system.

The parallelizer consists of two main components: a compile-time analysis/re-structuring pass and a run-time system. The compiler analyzes the input sequential program to determine how shared variables and objects are used by methods in the program. It captures this information in the form of symbolic access paths, and then collects these paths into a data access summary for each method. The compiler also re-structures the input program into a parallel threaded program; the re-structured program makes calls to routines in the run-time system that create, execute, and synchronize threads.

The run-time system receives the data access summaries for all methods in a class when the class is loaded by the JVM. As methods are invoked, threads are created to execute them. The threads run concurrently on multiple processors (if the underlying JVM is able to schedule Java threads to different physical processors), leading to performance gains. The data dependences among the threads are dynamically computed using the data access summaries of the methods, and stored in a set of data structures called the registry. The run-time system updates the registry when threads are created and destroyed, and when data become shared by multiple threads. The run-time system also provides synchronization devices that help the parallelized program ensure that program threads are executed in the correct order, according to the computed data dependences.

1.2 Thesis Contribution

The emphasis of this thesis is on the run-time support for the automatic parallelization of Java programs. Our work assumes the existence of a compiler that analyzes input programs and extracts parallelism from them, using algorithms that are well known in the literature. Indeed, the goal of our work is not to re-implement these compilation techniques, but to devise ways to apply results computed by the compiler at run time, in order to achieve automatic parallelization. This thesis makes three contributions in that area.

First, we have developed the data access summary notation, based on symbolic access paths [Deut94], which allows a compiler to represent, and communicate to the run-time system, method-level data access information of a program. The notation acts as an interface between the compiler and the run-time system, so that the compiler analyses and the run-time support can be designed with little coupling between the two.

Second, we have designed run-time structures that aid in the discovery of the parallelism inherent in a program using its data access summaries. The system exploits the parallelism by creating, ordering and synchronizing threads, which execute methods of the program concurrently.

Finally, we have implemented the run-time system as part of the zJava compiler framework [Lew01, Brew01], and have experimented with the implementation using representative applications. We show that our approach yields good performance improvements on programs that have sufficient thread granularity.

1.3 Thesis Organization

The thesis is organized as follows. Chapter 2 introduces data access summaries, and explains how they are used to capture the data used by methods in a program for use by the run-time system. Chapter 3 describes the zJava run-time system in detail, and shows how inter-thread data dependences are computed at run time, as well as how the threads are synchronized. Chapter 4 presents our experimental evaluation of the system. Chapter 5 discusses some related work. Finally, Chapter 6 summarizes our work, gives conclusions and indicates directions for future work.

Chapter 2

Summarizing Data Accesses

The zJava compiler associates every method in a program with a data access summary. This summary completely specifies how a thread executing the method can reach and use shared data, and how data allocated by the thread may become reachable and possibly be shared by other concurrently running threads.

In this chapter, we first describe the symbolic access path notation, which is used to record succinctly the data accessed in a method. (A complete grammar for this notation is given in Appendix B.) We then explain how data access summaries are formed from symbolic access paths. Finally, we give a number of examples to illustrate these concepts.

2.1 The Symbolic Access Path Notation

The zJava system records accesses to shared data using symbolic access paths, a symbolic notation for representing one or more access paths [LH88, HLW+92]. We give the definition of an access path below.

Definition 2.1.1 (Access path) A pair o.f consisting of a source object o and a sequence of field names, or selectors, f = f_1...f_n, where f_1 refers to, or selects, a reference-type field variable defined within o. Each successive selector f_i, 1 < i < n, selects a reference-type field variable defined within the object referenced by f_{i-1}. The last selector f_n selects a field variable of any (i.e., primitive or reference) type within the destination object referenced by f_{n-1}. If f_1 is a static field, then the access path has no source object, or o = null. If n = 1, then the destination object of the access path is the source object o itself, and f_1, the only selector in the path, selects a field variable of any type within o. If o = null, then the destination object is also null, and f_1 simply selects the static field itself.

1   class ListNode {
2       static ListNode head;
3
4       int data;
5       ListNode next;
6
7       static ListNode pop() {
8           ListNode popped = ListNode.head;
9           ListNode.head = popped.next;
10          popped.next = null;
11          return popped;
12      }
13  }

Figure 2.1: An example program that illustrates symbolic access paths.

We use the program in Figure 2.1 to illustrate the access path notation. The Java class ListNode describes the implementation of a simple global linked list of integers. Each element in the linked list is an instance of class ListNode, in which a data field holds an integer, and a next field contains a reference to the next element in the linked list. The class also declares a class field head that holds a reference to the first node of the global linked list. The method pop sets ListNode.head to the second node in the linked list, resets the next field of the original head node to null, and finally returns the original head node.

The write access to the static field (line 9 in the code) is trivially recorded with the access path "ListNode.head". Since ListNode.head is a static variable, the path contains no source object, and since the name of the static field is the only selector in the path, there is no destination object either. Rather than write "null.ListNode.head", the notation simply leaves the null source object blank. As the following sections will show, this does not lead to any ambiguity, as special identifiers are used for non-global source objects.

The second write access (line 10) is represented by the access path "ListNode.head.data". There is still no source object since the path begins with a static field. The reference-type static field points to the destination object, and the access path selects the integer-type field data within that object.

2.1.1 Classification of Access Paths

The zJava system classifies an access path into one of four groups, depending on its source object.

A global access path has a null source object, i.e., it starts with a static field variable that is sharable by multiple methods. Note that even if a static field is declared private, it is still visible to multiple methods in the same class. The static field variable is uniquely identifiable by its fully qualified class and field names, e.g., "ListNode.head". The path "ListNode.head.data" in Figure 2.1 is an example of a global access path.

A parametric access path is associated with a source object that has been passed to a method as an argument. Such an object is referenced by multiple aliases in different contexts. For example, the object referenced by System.out is referenced by the alias this within its member method println. Another example is the parameter to the method println, of type java.lang.String, which is likewise accessible to both the method itself and its caller. Hence, a parameter object is sharable by methods that have access to its aliases. Therefore, accesses to its fields, and to other objects reached through it, must be synchronized. Section 2.1.2 gives some examples of parametric access paths.

A local access path is associated (theoretically) with source objects that have been created in a method and are only ever assigned to local variables in that method. Since local variables are only visible to the method itself, objects referenced by them will never become shared by multiple threads, and so access paths that traverse through local objects (and their fields, and so on) need not be represented by our notation. However, local objects (or objects reached through them) can, and often do, escape the containing method (i.e., become accessible to methods other than the one containing the local variables) either as return values to the callers, or via assignment to non-local variables (static fields or fields of parameter objects) [CGS+99]. As a result, these escaping objects can become shared by multiple threads, and once they have escaped, accesses to them must be represented by symbolic access paths.

The two ways in which local objects can escape require different solutions in the zJava system. Objects that escape as return values are often assigned to local or temporary variables in the caller, but as we have stated, our notation does not represent local variables. Instead, the run-time system relies on a mechanism called the future to synchronize accesses to return values, without using data access summaries. Futures will be discussed in Section 3.3.5. On the other hand, a local object o can also escape a method m by getting assigned to a sharable field variable f. The field f can be a field of a parameter object, or a field of a global object, or f itself can be a static field. In all cases, o only becomes sharable after the assignment, so we only need to represent accesses to f in order to capture the fact that o may become shared.

In summary, for any given method, the zJava compiler only needs to generate symbolic access paths for accesses to (and/or via) global objects and parameter objects passed to the method.
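For instance, consider the following method (a hypothetical example, not taken from the thesis benchmarks), written in the style of the earlier figures. The local variable temp escapes when it is assigned to the static field ListNode.head; only the accesses involving ListNode.head therefore need to appear in the method's summary, while the accesses made through temp before the assignment involve no shared data and require no paths of their own.

class ListNode {
    static ListNode head;

    int data;
    ListNode next;

    // temp is local until the last statement, where it escapes through the
    // sharable static field ListNode.head.
    static void pushZero() {
        ListNode temp = new ListNode();
        temp.data = 0;
        temp.next = ListNode.head;   // read of the static field
        ListNode.head = temp;        // write of the static field; temp escapes
    }
}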

2.1.2 Anchors

Theoretically, parametric symbolic access paths for a method are generated in terms of the actual parameters of the method. In practice, however, the actual parameters of the method vary from one call site to another, and sometimes may not be known until run time. Moreover, the corresponding formal parameters are not uniquely identifiable within the program, and are not suitable as source objects of access paths. Consequently, it may be impossible to determine fully the source objects of a method's parametric access paths at compile time.

To address this problem, we introduce a special notation to represent the source object in a parametric access path, which we refer to as the anchor. The anchor "$n" denotes the n-th parameter of the method. Hence, "$1" denotes the first parameter and "$2" denotes the second parameter. The special anchor "$0" denotes the receiver object, passed to the method as the implicit parameter this. The actual source object of a symbolic access path can then be determined at run time using its anchor and the context of the method call. It should be noted that when the source of a symbolic access path is a static field, no anchor is required; a static field is unique within a program.

class ListNode {
    static ListNode head;

    int data;
    ListNode next;

    void copyNode(ListNode source) {
        this.data = source.data;
        this.next = source.next;
    }
}

Figure 2.2: An example that illustrates anchors in symbolic access paths.

The example shown in Figure 2.2 illustrates anchors in access paths using the earlier example of the ListNode class, but with the method copyNode. This method copies the contents of a given node (source) into the node on which the method is invoked, as shown in the figure. The symbolic access paths for this method are $0.data, $0.next, $1.data, and $1.next, corresponding to accesses to this.data, this.next, source.data, and source.next, respectively.

2.1.3 Accesses to Array Data Structures

The symbolic access path notation treats elements of an array as individual fields of an array object, ones that are referred to by indices rather than identifiers. A contiguous section of multiple array elements can also be handled as a single entity; the aggregate is denoted with a single access path, in which the name of the array is subscripted with two integers that give the (inclusive) boundaries of the array section being accessed. The first integer marks the element that starts the array section, and the second, larger, integer marks the element that ends the section. Furthermore, all elements of an array can be represented with an asterisk instead of an index range. For example, the access path "$1.children[0-9]" can be used to specify the first ten elements in the children array in lieu of the ten symbolic access paths "$1.children[0]", "$1.children[1]", ..., and "$1.children[9]". Similarly, the access path "$1.children[*]" represents all elements of the array. The same syntax extends to multidimensional arrays; e.g., "$1.children[0-9][0-9]" denotes a 100-element rectangular section in the now two-dimensional array.

A common idiom used in array-based programs is a subroutine that accepts as parameters an array A and two integers, m and n, and performs some operation on array elements A[m], A[m+1], ..., A[n]. To exploit this idiom, special anchors are used to identify m and n in a method that uses arrays. When an array symbolic access path is parsed at run time, anchors surrounded in brackets are assumed to represent integer arguments passed to the method, and are used to index into a given array. The array itself may also be represented with another anchor. For example, the path "$1.children[$2-$3]" represents a section of the array children in the first, reference-type parameter to the method. The start and end indices of the array section are the second and third parameters (which must be of integer types) to the method.
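As a concrete illustration of this idiom (the class, field, and method names below are hypothetical), a method that operates on a section of an array reached through its first parameter could have its reads summarized with a single parametric array path such as ($1.children[$2-$3], read), following the notation above.

class Tree {
    Tree[] children;

    // Reads children[m] through children[n] of the first parameter; under the
    // notation above, this could be summarized as ($1.children[$2-$3], read).
    static int countNonNull(Tree t, int m, int n) {
        int count = 0;
        for (int i = m; i <= n; i++) {
            if (t.children[i] != null) {
                count++;
            }
        }
        return count;
    }
}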

2.1.4 Accesses to Recursive Data Structures

In some cases, it may not be feasible to enumerate all possible symbolic access paths of a method. This is true when a method can potentially perform an infinite number of data accesses, via the two usual devices for code repetition: loops and recursion. For instance, a method may contain a loop that traverses a recursive data structure with an induction variable v. The number of times the loop iterates is determined by a condition that is evaluated at run time. During the execution of the method, v refers to multiple, successive elements of the recursive data structure. Hence a single access to v inside the loop gives rise to multiple (potentially an infinite number of) symbolic access paths.

To address the above problem, we use a single symbolic access path to represent such successive accesses to elements of a recursive data structure. The repeated portions of the access paths that result from the traversal are "factored out" and specially annotated to indicate the recursive nature of the traversal. This extracted component of the access path is referred to as the basis of the path. This form of the symbolic access path is a k-limited representation of the traversal of the data structure [Deut94].

The example shown in Figure 2.3 illustrates symbolic access paths for the traversal of a recursive data structure. The method initAllNodes traverses a linked list and assigns 0 to the data field of all nodes in the list. The accesses at lines 10 and 11 are repeatedly performed on successive nodes; hence, the following symbolic access paths are required: "ListNode.head.data", "ListNode.head.next.data", "ListNode.head.next.next.data", "ListNode.head.next.next.next.data", etc. The number of access paths cannot be determined at compile time because it is not possible to determine statically when the loop terminates. Instead, the access is represented by the following single symbolic access path: "ListNode.head(.next)*.data". The basis ".next" is followed by an asterisk to form a 0-limited representation of the traversal; that is to say, the traversal may follow the next field zero or more times.

1   class ListNode {
2       static ListNode head;
3
4       int data;
5       ListNode next;
6
7       void initAllNodes() {
8           ListNode curr = head;
9           while (curr != null) {
10              curr.data = 0;
11              curr = curr.next;
12          }
13      }
14  }

Figure 2.3: An example that illustrates symbolic access paths for the traversal of a recursive data structure.

In other situations, e.g., a do-while loop which iterates at least once, a plus sign may be used in lieu of the asterisk to form a 1-limited representation of the traversal. For example, the path "ListNode.head(.next)+" states that the next field of ListNode.head is definitely accessed, while subsequent nodes may not be traversed at all.
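As a hypothetical variant of initAllNodes (not taken from the thesis), the method below assumes a non-empty list and uses a do-while loop, so its body executes at least once; under the conventions above, the guaranteed traversal of the first next field could be recorded with a 1-limited path such as "ListNode.head(.next)+", while the writes to the data fields would still use the 0-limited form "ListNode.head(.next)*.data".

class ListNode {
    static ListNode head;

    int data;
    ListNode next;

    // Assumes the list is non-empty: the loop body runs at least once, so the
    // next field of the head node is definitely read, while deeper nodes may
    // or may not be visited.
    void initAllNodesAtLeastOnce() {
        ListNode curr = head;
        do {
            curr.data = 0;
            curr = curr.next;
        } while (curr != null);
    }
}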

2.2 Constructing Data Access Summaries

A data access summary specifies the read set and the write set of a method m, i.e., the sets of variables which m reads or writes, respectively. The summary consists of a list of entries. Each entry in the list is a (symbolic access path, access type) pair that specifies a shared variable v. The access type determines whether v belongs to the read set or the write set of m. If v belongs to both the read and write sets of m, then its access type is indicated as "write".

The data access summary of a method m includes the data access summaries of all methods called by m. In other words, if a method m' called by m writes to a variable v, then m is assumed to also write to v, even if it does not directly write to the variable in its own body. We refer to this as the inclusion property of data access summaries, and it is essential for proper synchronization, as will be discussed in Section 3.3.2.

Definition 2.2.1 (The inclusion property) For a given method m, its data access summary S_m must satisfy:

    (∪_{n ∈ callee(m)} S_n − local(m)) ⊂ S_m

where callee(m) is the set of methods called by m, and local(m) is the set of access paths whose source objects never escape m.

The Java code shown in Figure 2.4 is used to illustrate the construction of data access summaries. Instances of the Point class represent points in the Cartesian plane. The midpoint method computes the mid-point between the point represented by the receiver object and the point represented by the parameter p, and returns the result as a new Point object, complete with a unique serial identifier, which is incremented for every Point object allocated.

class Point {
    static int nextId = 0;

    int id;
    float x, y;

    Point midpoint(Point p) {
        Point q = new Point();
        Point r = new Point();
        r.x = (this.x + p.x) / 2;
        r.y = (this.y + p.y) / 2;
        r.id = allocateId();
        return r;
    }

    int allocateId() {
        nextId = nextId + 1;
        return nextId;
    }
}

Figure 2.4: A simple method midpoint.

    ($0.x, read)
    ($0.y, read)
    ($1.x, read)
    ($1.y, read)
    (Point.nextId, write)

Figure 2.5: Data access summary for the midpoint method.

The data access summary for the midpoint method is shown in Figure 2.5. The first two entries indicate that the fields x and y of the receiver object are read by the method. The next two entries indicate that the same fields of the parameter p are also read. The last entry is the access to the static variable nextId. It is included in the data access summary of midpoint, even though midpoint does not access nextId directly, because of the inclusion property described above: the method allocateId accesses nextId and is called by midpoint. The fully qualified name of the static variable is used as the anchor in the symbolic access path. It is marked as "write" since allocateId reads and then increments the value of the specified field.

Accesses to the variable q are not included in the data access summary, because it is local to the method, and the object it points to never escapes the method. Accesses to the variable r are also not included, because it is local to the method, and the object it points to only escapes as a return value.

2.2.1 Examples of Data Access Summaries

This section illustrates data access summaries for various kinds of pointer-based data structures, such as linked lists and trees.

Figure 2.6 shows a method m1 for the class ListNode. The method first tests the global variable ListNode.head for null, and then updates the data field of the object head. The method then writes the value of temp to the next field of the head node; in other words, the object temp escapes through the globally reachable variable ListNode.head.next.

class ListNode {
    static ListNode head;

    int data;
    ListNode next;

    void m1() {
        if (ListNode.head != null) {
            ListNode.head.data *= 2;
            ListNode temp = new ListNode();
            ListNode.head.next = temp;
        }
    }
}

Figure 2.6: An example that illustrates data access summaries.

The set of symbolic access paths that must be generated by the zJava compiler to describe the data accesses in this method is thus:

    ListNode.head
    ListNode.head.data
    ListNode.head.next

Each of these paths is then coupled with the type of access the method requires, to form the data access summary of the method:

    (ListNode.head, read)
    (ListNode.head.data, write)
    (ListNode.head.next, write)

Observe that the first access path is a common prefix of the other two paths, and that the method m1 only reads the variable ListNode.head. Because an intermediate field f (such as ListNode.head) is always assumed to be read by the method if the method is to access objects referenced by fields of f, it suffices to include only the latter two paths in the data access summary of m1, thereby obtaining a more compact data access summary:

    (ListNode.head.data, write)
    (ListNode.head.next, write)

Further, it is possible to use only one summary entry to summarize both (of the remaining) symbolic access paths, as they have the same common prefix and the same access type, i.e., both are write accesses. A special selector, which we refer to as a path set, selects multiple fields from a common object. (Our notion of a path set as a selector is different from the "symbolic path sets" defined in [Deut94], which are simply sets of symbolic access paths, much like our data access summaries without the access type information.) Using this feature, the data access summary of m1 can be factored into:

    (ListNode.head{.data,.next}, write)

As another example, the method m2 in Figure 2.7 is a "cloning" function for ListNode objects. If the parameter o contains null, then the method returns; otherwise, it copies the content of o into the object on which the method was invoked.

1   class ListNode {
2       static ListNode head;
3
4       int data;
5       ListNode next;
6
7       void m2(ListNode o) {
8           if (o == null) return;
9
10          data = o.data;
11          next = o.next;
12      }
13  }

Figure 2.7: Another example that illustrates data access summaries.

On line 8, the parameter o is read, but no access path needs to be generated for this access because, in Java, parameters are passed by value to an invoked method [LY99]; hence a reference-type parameter is private to the thread executing the method, although the parameter object that it references would be sharable. Line 10 assigns the value in o.data to this.data. More precisely, the parameter o is first dereferenced to reach the object o, then the value stored in its field data is read, and subsequently written into the corresponding field in the receiver object, referenced by the this pointer. Since the contents of a sharable object o are accessed, the following data access summary entries are generated:

    ($1.data, read)
    ($0.data, write)

Line 11 copies the value in o.next into this.next in a similar fashion, which generates another pair of data access summary entries:

    ($1.next, read)
    ($0.next, write)

After factoring out the common prefixes in the two pairs of entries, the data access summary for method m2 can be compressed into:

    ($1{.data,.next}, read)
    ($0{.data,.next}, write)

The example in Figure 2.8 illustrates how data access summaries are used in recursive methods. Unlike m2, m3 performs a naïve "deep copy" of the data structure; it will not terminate (or will cause a stack overflow) if the linked list happens to be circular.

1   class ListNode {
2       static ListNode head;
3
4       int data;
5       ListNode next;
6
7       void m3(ListNode o) {
8           data = o.data;
9           if (o.next != null) {
10              next = new ListNode();
11              next.m3(o.next);
12          }
13      }
14  }

Figure 2.8: A third example that illustrates data access summaries.

Line 8 is the same as in m2, so it generates the same data access summary entries:

    ($1.data, read)
    ($0.data, write)

The method then invokes itself on the newly constructed next node of the list referenced by the this pointer. Because of the inclusion property of data access summaries, the recursive call on line 11 implies that the data access summary of the calling instance of the method m3 must include that of the called instance of the same method, by resolving the formal parameters in the latter with respect to the context of the former, and storing both within the same physical data access summary. In other words, the compiler must determine that, within the called instance of the method, the this pointer and the parameter o refer to the objects this.next and o.next in the context of the calling instance, respectively, and must re-write the unified data access summary accordingly:

    ($1.data, read)
    ($0.data, write)
    ($1.next, read)
    ($0.next, write)
    ($1.next.data, read)
    ($0.next.data, write)
    ($1.next.next, read)
    ($0.next.next, write)
    ($1.next.next.data, read)
    ($0.next.next.data, write)
    etc.

The infinite number of entries in a data access summary is an indication of the possibility of infinite recursion of the method, which would be the case if the method m3 is called on a circular linked list. Generally, it is impossible to bound the size of such a data access summary at compile time, but the use of bases in the symbolic access paths can work around this limitation:

    ($1{.next}*.data, read)
    ($0{.next}+, write)
    ($0{.next}*.data, write)

Notice how the second and third entries in the data access summary above remain separate, even though they apparently refer to many identical fields. This is because the second entry describes a write access to each of the subsequent next fields in the linked list, while the third entry describes write accesses to the data fields, implying only read accesses to the intermediate next fields.

2.3 Program Re-structuring

Given a set of data access summaries extracted from methods in a program, the program must retrieve the summaries at run time, create threads from method invocations, and use the summaries to schedule those threads accordingly. This section describes the kind of program transformations that must be done at compile time to achieve this. (The actual processes of creating and synchronizing threads will be described in Chapter 3.)

2.3.1 Class Initialization

The zJava run-time system is designed to fetch data access summaries for the methods in a class when the class is resolved and loaded by the JVM. Without modifying the JVM itself or the system class library, a way to do this is to introduce static initializers into the class. Static initializers in Java are executed immediately after a class is loaded by the JVM, and before any instance of the class can be allocated, or any method of the class invoked.

As shown in Figure 2.9, for a method foo, we insert a static block to retrieve (or construct on demand) a data access summary for the method, and initialize it by adding symbolic access paths to it. The run-time system identifies the method by its owning class, the name of the method "foo", and an array of types indicating the signature of the method. The addition of this code ensures that the data access summaries of all methods are properly initialized before the methods can be invoked.

class X {
    // inserted code
    static {
        DataAccessSummary d = DataAccessSummary.forMethod(X.class, "foo",
                new Class[] { X.class });
        d.addWritePath("$1.f");
    }

    void foo(X x) {
        x.f = 0;
    }
}

Figure 2.9: Adding static initializers to allow the run-time system to initialize a class with parallelized methods.

2.3.2 The main Method

The main method of the program must also be re-structured. A call to an initialization routine of the run-time system is inserted at the beginning of the method. This routine will detect whether run-time parallelization is supported on the platform on which the program is executing. If the initialization routine determines that the platform schedules Java threads onto physical processors, and that the number of available processors is greater than one, then it sets up the run-time system with various default or user-specified parameters, and returns a boolean true. Otherwise, the routine returns false, and an inserted conditional branch following the call will invoke the original version of the main method (which has been renamed by prepending "seq_" to its name). This will cause the remainder of the program to execute in a sequential manner, avoiding the overhead of unnecessary synchronizations. While the duplication of methods may cause excessive code growth, this is unlikely since usually only a few methods in a program exhibit parallelism and need to be re-structured. Retaining the original version of the methods allows the program to achieve the best performance when run-time parallelization is not supported on a platform. Figure 2.10 illustrates this transformation.

// original version
public static void seq_main(String[] args) {
    seq_foo();
}

// parallelized version
public static void main(String[] args) {
    if (!RuntimeSystem.initialize(...)) {
        seq_main(args);
        return;
    }
    foo();
}

Figure 2.10: Transformation of the main method.

2.3.3 Call Sites

Code is inserted around call sites of parallelized methods. The inserted code calls thread creation routines provided by the run-time system, and (where applicable) allocates synchronized storage to hold the return values of concurrent functions.

Figure 2.11 illustrates this type of re-structuring. The original version of the method foo calls the method bar with an integer argument x and assigns the result of the call to the variable y (line 3). The variable y is later used in a call to another method quuz (line 5). After the re-structuring of bar, the call to bar in the body of foo is replaced with a call to _call0001.dispatch (line 13). The MethodCall object _call0001 uniquely represents this particular call site in foo. Its method dispatch creates a new thread to execute the method associated with _call0001, in this case bar (not shown in the figure). Parameters to this call are the integer argument x and a newly allocated Future object _fut0001, the synchronized storage for the return value from the call to bar, which is now being executed in parallel (see Section 3.3.5). The assignment of the return value to the variable y is postponed until the next use of y, namely the call to quuz (line 16). At that point the Future object is queried and synchronization is performed if necessary.

1   // original version
2   void foo(int x) {
3       int y = bar(x);
4       /* ... */
5       quuz(y);
6   }
7
8   // parallelized version
9   private static MethodCall _call0001;
10  void foo(int x) {
11      int y;
12      Future _fut0001 = new Future();
13      _call0001.dispatch(x, _fut0001);
14      /* ... */
15      y = _fut0001.getInt();
16      quuz(y);
17  }

Figure 2.11: Transformation of a call to method bar in method foo.

2.3.4 Uses of Shared Variables

The zJava compiler inserts a call to a synchronization routine provided by the run-time system, prior to the first use of each shared variable in a method. The synchronization will guarantee that the executing thread does not perform accesses to the variable that violate the serial execution order. It will also lock the variable for the exclusive use of the executing thread if the method writes to it, preventing other threads from performing illegal accesses. Conversely, immediately after the last use of a variable in a method, code to unlock the variable is inserted. This signals to the parent thread (as well as other younger "sibling" threads) that the variable is now available for their use, and allows any waiting thread to resume execution.

In addition, to prevent a parent thread from performing illegal accesses to the fields of an object after a reference to the object has been passed to a child thread (i.e., after a method call with the object as an argument), synchronization code is inserted into the body of the calling method before the next use of the variable. This allows the parent thread to continue executing as much as possible while preserving the serial execution order of the program.

Figure 2.12 shows an example in which these transformations occur. On line 4, the method foo assigns the value 1.0 to the instance variable x.f. The re-structured code (line 13) guards this write with a call to the lock acquisition routine. (The term "region" in the example describes a shared memory area; see Chapter 3 for a discussion of regions. The getRegion method shown in the example is used to dynamically look up the lock associated with a given variable.)

1   class X { // original version
2       float f;
3       void foo(X x) {
4           x.f = 1.0f;
5           bar(x);
6           x.f = 1.1f;
7       }
8   }
9
10  class X { // parallelized version
11      float f;
12      void foo(X x) {
13          Region.getRegion(x.f).acquire();
14          x.f = 1.0f;
15          bar(x);
16          Region.getRegion(x.f).acquire();
17          x.f = 1.1f;
18          Region.getRegion(x.f).release();
19      }
20  }

Figure 2.12: Transformation of uses of shared variables.

On line 5, the object x is passed to a call to another method bar. This method call is not parallelized in this example (line 15), but since it may contain parallelized calls to which x may again be passed, the run-time system does not guarantee that the method foo still has exclusive write access to the object x after the call. Therefore, the first use of x.f after the call (line 6) is synchronized by inserting another call to the lock acquisition routine (line 16). Finally, just after the last use of x.f by the method, the lock is released so that other concurrent methods may access x.f.

2.3.5 Exceptions

Java programs commonly “throw” exceptions to indicate that an error has occurred within a method that needs to be handled by an appropriate exception handler. If such


a handler is not present in the method, execution skips the rest of the method and jumps back to the method’s caller, in which the search for a handler continues. In other words, when an exception is thrown but not caught within the throwing method, control is immediately transferred to each calling method on the call stack, in succession, until a matching exception handler is found. Our implementation handles this in the following way. When a thread terminates, whether normally or due to an exception, the run-time system searches the registry for locks that are held by the thread, and removes them even if the thread did not call the lock release routine. This ensures that there are no dangling locks left after a thread has terminated. It is also possible to have the compiler re-structure the exception throwing code so that locks that have been acquired by the method are properly released prior to the jump back to the caller. This can be done by appending to the enclosing try block a finally block that calls the appropriate lock release routines. The statements within a finally block are always executed regardless of whether exceptions are thrown or not, or whether they are caught or not. If the throw statement that generates the exception is not enclosed within a try block, the compiler can create an artificial try block that has only the finally block and no exception handlers (i.e., no catch blocks).
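As a concrete illustration, the sketch below shows this transformation applied to a small method that uses the X class and the Region lock routines of Figure 2.12. The method body, the condition, and the exception are invented for the example; they are not taken from the zJava compiler's actual output.

    // Original method: the exception may propagate before the lock is released.
    void set(X x, float v) {
        Region.getRegion(x.f).acquire();
        if (v < 0.0f)
            throw new IllegalArgumentException("negative value");
        x.f = v;
        Region.getRegion(x.f).release();
    }

    // Re-structured method: an artificial try block with only a finally clause
    // guarantees the release on both the normal and the exceptional path.
    void set(X x, float v) {
        Region.getRegion(x.f).acquire();
        try {
            if (v < 0.0f)
                throw new IllegalArgumentException("negative value");
            x.f = v;
        } finally {
            Region.getRegion(x.f).release();
        }
    }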

2.4 Analysis Necessary for Computing Summaries

The present work focuses only on the zJava run-time system, and the data access summary notation has been developed only to interface the compiler and the run-time system. In this section, we provide some hints as to how the compiler can generate the summaries. There has been much research in escape analysis in recent years [PG92, Blan98]. It has been demonstrated to be effective in facilitating stack allocation of objects, and in removing unnecessary synchronization from Java programs [Blan99, CGS+99, BH99, WR99].


Escape analysis distinguishes between storage that is local to a method (i.e., local variables, objects that are assigned only to local variables, etc.) and storage that is accessible to multiple methods (i.e., global fields, fields of parameter objects, etc.). Objects assigned to the latter type of storage are said to escape the method. Such objects can continue to be used after the method that allocated them has terminated. In other words, an object escapes the method in which it is constructed if the lifetime of the object exceeds that of the method.

Using abstract interpretation, escape analysis identifies objects by their allocation sites in a method, and determines if each object is ever assigned to non-local storage, directly or transitively (i.e., assigned to a field of an object which is in turn assigned to non-local storage). Such objects are identified as escaping objects. Non-escaping objects can then be allocated on the stack, since they will never be accessible outside the method, and can be deallocated along with the method's stack frame when the method returns. Furthermore, synchronized operations on objects that do not escape the run method of a thread can be left unsynchronized, because such objects will never be accessible by two different threads.

The escape analysis can be modified to track not only objects, but also values of primitive types. The modified analysis determines which reads and writes “escape” a method m, or, in other words, which reads and writes are performed on non-local storage. At run time, such reads and writes are asynchronous but visible to other threads concurrent to the one that is executing m, and hence they need to be accounted for in the preliminary data access summary of m. The final data access summary of m must also account for data accesses that m performs transitively, and can be generated by combining the preliminary summary with the data access summaries of the methods that are called within m. The compiler can achieve this in an iterative interprocedural analysis. If the methods are sorted in reverse order of the calling sequence prior to analysis, convergence should be relatively quick.
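The following fragment is a hypothetical illustration (not taken from the benchmarks) of the distinction the modified analysis must draw: one allocation escapes the method through a field of this, while the other remains local.

    class Node {
        Node next;
        int val;
    }

    class IntList {
        Node head;

        void prepend(int v) {
            Node n = new Node();        // escapes: reachable through this.head below
            n.val = v;                  // this write is visible after the method
            n.next = head;              // returns, so it must be accounted for in
            head = n;                   // the summary (roughly, reachable from $0.head)

            Node scratch = new Node();  // does not escape: never assigned to
            scratch.val = v * 2;        // non-local storage, so its reads and
            use(scratch.val);           // writes need not be summarized
        }

        private void use(int x) { /* local computation only */ }
    }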

Chapter 3

The Run-Time System

This chapter describes the components of the run-time system and how they operate. The notion of regions, which are used to represent and control accesses to shared variables within the run-time system, is first defined. The main component of the run-time system, called the registry, is then introduced. Finally, the various operations performed by the run-time system, including the creation, synchronization, and termination of threads, are described.

3.1 Regions

The run-time system translates data access summaries into regions, which represent variables in the system and describe how they are shared. Each region corresponds to a memory area in the heap that is potentially accessed by multiple, possibly concurrent, threads. A region is usually a variable (of a reference type or a primitive type), but it can also be a section of an array, containing many individual elements. A region is identified by the address in memory of the variable it represents. Regions are not used to represent stack variables (parameters and local variables), which are private to a thread according to Java semantics [GJS97].

3.2 The Registry

During the execution of the program, the zJava run-time system maintains a set of data structures that we collectively refer to as the registry. The purpose of the registry is twofold: it is used to determine data dependences among methods in a running program, and it is also used to synchronize the threads created to execute, or dispatch, invocations of those methods.

The registry consists of a set of region nodes and a set of thread nodes. A region node rv corresponds to the region of a variable v, and represents the variable within the registry. A thread node nτ describes a thread tτ that has been created by the run-time system; it identifies the method mτ which tτ is executing or is going to execute, contains a reference to its parent thread, and maintains a list of region nodes representing data accessed by the thread.

The registry is structured as a table of region nodes that represent all shared variables in use by the program at a given point in time. Each region node points to a list of thread nodes representing threads that access the corresponding region during their lifetimes. A region node is also associated with a reader/writer lock, which ensures proper synchronization of concurrent threads. The lock is designed so that multiple reader threads (threads that access, but do not write to, a given region) can share the reader lock and execute concurrently, while writer threads (threads that do write to a given region) can only execute one at a time.

Thread nodes are created and queued into the thread list of a region node when a new thread is created to execute a method. The queueing of thread nodes is performed so that a thread list always remains sorted in the serial execution order of the methods associated with the threads. That is, the order in which thread nodes appear on a thread list is the same as the order in which their corresponding methods are executed by the sequential program. The mechanism for ensuring that threads are queued in this order will be described in Section 3.3.3.
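A minimal sketch of these structures is shown below. It is illustrative only: the class and field names are assumptions, the reader/writer lock state is omitted, and the actual zJava registry certainly differs in detail.

    import java.util.IdentityHashMap;
    import java.util.LinkedList;
    import java.util.Map;

    // Describes one thread's (read or write) access to one region.
    class ThreadNode {
        final Thread thread;      // the thread executing the method
        final boolean writer;     // true if the method possibly writes the region
        final ThreadNode parent;  // node of the parent thread (null for main)
        ThreadNode(Thread thread, boolean writer, ThreadNode parent) {
            this.thread = thread;
            this.writer = writer;
            this.parent = parent;
        }
    }

    // Represents one shared variable; its thread list is kept in the serial
    // execution order of the methods that access the variable.
    class RegionNode {
        final Object variable;   // identifies the region (object identity)
        final LinkedList<ThreadNode> threadList = new LinkedList<ThreadNode>();
        RegionNode(Object variable) { this.variable = variable; }
    }

    // The registry: a table of region nodes for all shared variables in use.
    class Registry {
        private final Map<Object, RegionNode> regions =
            new IdentityHashMap<Object, RegionNode>();

        // Look up the region node for a variable, creating it on first use.
        synchronized RegionNode regionFor(Object variable) {
            RegionNode r = regions.get(variable);
            if (r == null) {
                r = new RegionNode(variable);
                regions.put(variable, r);
            }
            return r;
        }
    }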


The registry is updated dynamically; new thread nodes and region nodes are created and inserted into the registry for newly created threads and newly allocated shared variables, respectively. The registry is unaware of the existence of a variable v until the creation of a thread that registers v as possibly shared, i.e., until v escapes from an existing thread to a new thread, at which point a region node rv is allocated in the registry. Thread nodes are deleted from the registry as threads terminate, and region nodes are deleted when their associated objects are garbage-collected. Figure 3.1 is a depiction of the zJava registry at a particular point during the execution of the program in Figure 1.1. The registry in this example contains four region nodes, one for each of the variables x, y, m and b. The registry also contains six thread nodes which serve to associate the four variables with three threads, t1 , t2 and t3 , that will access the variables. The threads execute methods main, m2 and m3, respectively. As shown in Figure 1.1, t1 has been running for a relatively long time, and it has created a concurrent thread, t2 , to execute method m2. Since t1 must eventually access all four variables, its thread node is present on all four thread lists. On the other hand, t2 only indirectly uses x, so its thread node only appears on the thread list of x. The execution of method m3 has just started, and according to serial execution order, it must write to x before m2 or main can legally use the variable. Consequently, the thread node of t3 appears ahead of (i.e., to the left of) the other two thread nodes on the thread list of region x. When the method m3 terminates, the thread node for t3 will be removed from the thread list. When the program eventually finishes execution, the thread lists for all regions will be empty. The following notation will be used for the remainder of this thesis. Variables are denoted by x, y, and z. Their corresponding regions are denoted by rx , ry , and rz respectively. Executing threads are referred to as tτ , tσ , and tγ . Their corresponding thread nodes are denoted by nτ , nσ , and nγ respectively, and they execute methods mτ , mσ , and mγ respectively.


Region nodes and their per-region thread lists (thread nodes in serial execution order):

    x : m3 (W) -> m2 (W) -> main (W)
    y : main (W)
    m : main (R)
    b : main (R)

Figure 3.1: The zJava registry maintains sorted thread lists for shared regions.

3.3 Registry Operations

3.3.1 Region Creation

When a class is loaded by the JVM, the zJava run-time system receives the data access summaries of all the methods of that class. These data access summaries are stored in a look-up table. When a thread tτ calls a method mσ, a child thread tσ is created. Compiler-inserted code in the parent thread tτ retrieves the data access summary for mσ from the summary look-up table, and creates regions from this summary. The creation of regions for a method mσ proceeds in three main steps.

In the first step, the run-time system identifies the real sources of the access paths for the method, and hence determines what objects are accessed by mσ. We refer to this step as source resolution. This step is accomplished by replacing the anchors in the access path with actual source objects available to the method at run time, i.e., objects passed as actual parameters to the method.

In the second step, each resolved access path is expanded into all its sub-paths. Hence, an access path o.f1...fn is expanded into n access paths, namely, o.f1, o.f1.f2,


o.f1.f2.f3, and so on, up to o.f1...fn. This expansion is necessary because the access to o.f1...fn requires accesses (reads) to the intermediary fields f1 through fn-1. Consequently, these intermediary fields must be represented in the registry by individual region nodes to ensure proper synchronization with other threads that may access them. We refer to this step as path expansion. Symbolic access paths for recursive data accesses are similarly expanded. For example, the symbolic access path $0.head{.next}* translates to this.head, this.head.next, this.head.next.next, etc. The expansion stops when the current next field contains null.

In the third and final step of region creation, the run-time system searches the registry for the region node rv of every variable v that was expanded from the resolved access path. If a region node does not already exist for a given variable, a new node is allocated for it and is added to the registry.

As an example, consider the program in Figure 3.2, in which the main method invokes the readMatrix method on the object a. The data access summary for readMatrix is shown in Figure 3.3. The summary includes $0.nCols and $0.data[*] by the inclusion property, since they are accessed by readRow, which is called by readMatrix. For each symbolic access path in this summary, the run-time system replaces its anchor with the actual reference-type argument to the method. Hence, $0.nRows is resolved to a.nRows, $0.nCols is resolved to a.nCols, etc. Since each of the symbolic access paths of the readMatrix method has only one field, path expansion for this example is straightforward. Each path translates to one region, and the corresponding region node is simply added to the registry. The resulting region nodes (with empty thread lists) are shown in Figure 3.4.

The creation of regions corresponding to array sections requires special handling. When the symbolic access path specifying an array section A is processed, the run-time system checks the registry to see if elements of A are already included in another existing region. If this is not the case, the new region for A is created as usual. However, if there is an existing array section B that overlaps A, the overlapping sub-section B' = B ∩ A is treated separately as a new region. The array sections A and B are modified to exclude B'. The thread list of B' would be the union of the thread lists of both A and B. Thus, the run-time system ensures that no two array section regions in the registry include the same array elements.


class Matrix {
    Reader     input;           // input file
    double[][] data;            // data in matrix
    int        nRows, nCols;    // size of matrix

    void readRow(int r, Reader in) {
        for (int c = 0; c < nCols; c++) {
            data[r][c] = parseDouble(in);
        }
    }

    void readMatrix() {
        for (int i = 0; i < nRows; i++) {
            readRow(i, input);
        }
    }

    double[] multRow(int r, Matrix m) {
        double[] d = new double[m.nCols];
        for (int i = 0; i < m.nCols; i++) {
            d[i] = 0;
            for (int j = 0; j < m.nRows; j++) {
                d[i] += data[r][j] * m.data[j][i];
            }
        }
        return d;
    }

    Matrix multiply(Matrix m) {
        Matrix product = new Matrix();
        for (int i = 0; i < nRows; i++) {
            product.setRow(i, multRow(i, m));
        }
        return product;
    }

    public static void main(String[] args) {
        Matrix a, b, c, d, e;
        // initialize input, a, b, and d
        a.readMatrix();          // first fork point
        c = a.multiply(b);       // second fork point
        e = a.multiply(d);       // third fork point
    }
}

Figure 3.2: A matrix multiplication example.


($0.nCols, read)
($0.nRows, read)
($0.input, read)
($0.data[*], write)

Figure 3.3: Data access summary for readMatrix.

    a.nRows   : (empty thread list)
    input     : (empty thread list)
    a.nCols   : (empty thread list)
    a.data[*] : (empty thread list)

Figure 3.4: Region nodes are added for all regions that the readMatrix method will access.
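The two expansion rules described above can be sketched as follows. The class and method names are illustrative only and are not part of the zJava run-time interface.

    import java.util.ArrayList;
    import java.util.List;

    class RegionCreation {
        // Path expansion: a resolved access path o.f1.f2...fn yields one region
        // per prefix, because reaching o.f1...fn also reads every intermediary
        // field along the way.
        static List<String> expandPrefixes(String source, String... fields) {
            List<String> paths = new ArrayList<String>();
            StringBuilder p = new StringBuilder(source);
            for (String f : fields) {
                p.append('.').append(f);
                paths.add(p.toString());
            }
            return paths;
        }

        // A recursive path such as $0.head{.next}* is unrolled at run time by
        // following the recursive field from the resolved source object until
        // a null link is reached; each node visited becomes its own region.
        static List<Object> unrollRecursive(Node head) {
            List<Object> regions = new ArrayList<Object>();
            for (Node n = head; n != null; n = n.next) {
                regions.add(n);
            }
            return regions;
        }

        static class Node { Node next; }

        public static void main(String[] args) {
            // Prints [o.f1, o.f1.f2, o.f1.f2.f3]
            System.out.println(expandPrefixes("o", "f1", "f2", "f3"));
        }
    }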

3.3.2 Thread Creation

A thread tτ creates a child thread tσ when tτ invokes a method mσ in its body. However, before the child thread can begin to execute, regions accessed by the child thread must be determined, and thread nodes for the child thread must be allocated and inserted into the thread lists of the corresponding region nodes. These steps are performed by the run-time system; the code of mτ is re-structured to make the appropriate calls to the zJava run-time library. The actual parameters to mσ are passed to the constructor of the thread tσ , which then forwards them on to mσ itself. This marshalling process is described in Appendix A.2.
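The sketch below suggests what a compiler-generated dispatch for the call site of Figure 2.11 might look like. The names and the placeholder callee are assumptions; the registry bookkeeping and the Future that holds the return value (Section 3.3.5) are elided.

    // Simplified stand-in for a compiler-generated MethodCall object: one
    // instance per call site.  dispatch() marshals the actual parameter into
    // a child thread and starts it.
    class MethodCall {
        void dispatch(final int x) {
            // Thread nodes for the child are queued into the registry here
            // (Sections 3.3.1 and 3.3.3) before the child starts.
            Thread child = new Thread(new Runnable() {
                public void run() {
                    int result = bar(x);    // execute the original callee
                    // store 'result' in the call's Future (see Section 3.3.5),
                    // then remove this thread's nodes from the registry
                }
            });
            child.start();
        }

        private static int bar(int x) {     // placeholder callee
            return x * x;
        }
    }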


When the parent thread tτ creates the child thread tσ, the data access summary of mσ is resolved and expanded (as described in the previous section) into a set of variables V used by mσ. Corresponding region nodes are allocated for variables in V if necessary. A thread node nσ for tσ is then inserted into the thread list of each of those region nodes. Once the insertion of the thread nodes is complete, the parent thread tτ continues executing mτ. The child thread tσ can proceed to execute when it acquires the locks for its regions, as will be described in Section 3.3.4.

The parent thread tτ must wait for all threads that precede it in a thread list of a region rv and write v, v ∈ V, to complete their writes before it can insert the thread node nσ of the child thread tσ. This is necessary to ensure that nσ is indeed queued into the correct thread list. Consider, for example, an access o.f1.f2 by tσ, which translates into two regions, one representing the variable o.f1 and the other representing the variable o.f1.f2. If a thread tγ that precedes tτ writes to o.f1, and tτ does not wait for tγ to complete, an incorrect value (with respect to sequential execution) of o.f1 will be used to evaluate o.f1.f2 and hence, to identify the corresponding region node. Consequently, nσ will be inserted into the thread list of a wrong region node. By waiting for tγ to complete its write to o.f1, the correct value of o.f1 is used, and nσ is inserted into the thread list of the correct region node.

The inclusion property described earlier in Section 2.2 ensures that, if mσ is called by mτ, a thread node of tτ exists on the thread list of every rv, v ∈ V, even if tτ does not access v directly. Hence, it will always be possible to find nτ in the thread list of every region accessed by tσ, and hence determine where nσ should be inserted.

3.3.3 Thread Ordering

Thread nodes in any given thread list are kept sorted in the serial execution order of the associated methods to preserve program correctness. Furthermore, the order of thread nodes is consistent among all thread lists in the registry. That is, if the thread node nσ


of a thread tσ precedes the thread node nτ of another thread tτ on one thread list, then nσ must precede nτ on all thread lists. This serial execution order of thread nodes can be achieved by inserting the thread nodes of a newly created thread immediately preceding those of its parent thread on any thread list.

Let tσ and tτ be two threads that execute methods mσ and mτ, respectively. We say that tσ > tτ if mσ should access a variable v before mτ when the two methods are invoked by the sequential program. Due to the inclusion property of data access summaries, for a given thread tσ in the thread list of a region rv, all ancestor threads of tσ must also have thread nodes in the same thread list. Suppose tσ is the most recent child thread of its parent tτ, and that both access rv. Since, in a sequential program, a called method must terminate before the calling method continues, once tσ has been created, tσ must finish accessing rv before tτ accesses rv again, even though tτ may have been using the variable before tσ was created. Thus, tσ > tτ, and the thread node nσ must precede nτ in the thread list of rv. Suppose tτ creates a new thread tγ after tσ. Then tσ > tγ, and nσ must precede nγ in any thread list they share. At the same time, tγ > tτ, for the same reason that tσ > tτ, as described above. Hence, the thread node nγ must precede nτ in any thread list. Consequently, the thread node nγ must be inserted between nσ and nτ. In general, the thread nodes of a new thread must be inserted between those of its parent and those of its next older sibling (which also access the same data). This can only be achieved by inserting the thread nodes of a thread immediately preceding those of its parent.

Consider the matrix multiplication program shown earlier in Figure 3.2. The methods readMatrix and multiply operate on a row-by-row basis; the former calls the readRow method nRows times to fill the matrix with data, and the latter calls the multRow method nRows times to multiply the matrix with the other matrix m. Calls to readMatrix and multiply will result in multiple threads. The call sites of these methods are the fork points of the threads.
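Using the simplified node classes sketched in Section 3.2, the insertion rule amounts to a single list operation; the helper below is illustrative only.

    import java.util.LinkedList;

    class ThreadOrdering {
        // Insert the node of a newly created thread into a region's thread list
        // immediately before the node of its parent thread.  By the inclusion
        // property, the parent's node is already on every thread list the child
        // needs, so the child ends up behind its older siblings and ahead of
        // its parent, preserving the serial execution order.
        static void insertChild(LinkedList<ThreadNode> threadList,
                                ThreadNode child, ThreadNode parent) {
            synchronized (threadList) {
                int i = threadList.indexOf(parent);
                if (i < 0) {
                    // Defensive fallback; with correct summaries this cannot happen.
                    threadList.addLast(child);
                } else {
                    threadList.add(i, child);
                }
            }
        }
    }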


Region nodes and their per-region thread lists (thread nodes in serial execution order):

    a.nRows   : readMatrix (R) -> main (W)
    input     : readMatrix (R) -> main (W)
    a.nCols   : readMatrix (R) -> main (W)
    a.data[*] : readMatrix (W) -> main (W)

Figure 3.5: New thread nodes are created for the readMatrix thread; the nodes are inserted into the thread lists corresponding to regions accessed by the thread.

At the first fork point (see the comments in Figure 3.2), a new thread is created for the call to readMatrix, region nodes are allocated for all regions that the invoked method will access, and the registry is modified as shown in Figure 3.5. It should be noted that as a new thread node is inserted into a thread list, thread nodes for all its ancestors are also inserted, if they are not already on the list. This is consistent with the requirement that the data access summary of a method include those of its callees. Thus, in Figure 3.5, a thread node for the main thread is automatically inserted after every node for the readMatrix thread.

At the second fork point, the method multiply is called, which calculates the product of the matrices a and b. This causes a second thread to be created (labelled multiply1 in Figure 3.6). Region nodes are allocated for b.nRows, b.nCols, and b.data[*], which are in the read set of the new thread, but for which no region node exists. A thread node for multiply1 is then inserted into each of the thread lists of these regions. A thread node is also inserted into the thread lists of the existing region nodes for a.nRows and a.data[*] because multiply1 reads the corresponding variables. The new thread node is placed immediately preceding that of its parent main, and thus behind that of its sibling thread running readMatrix. This is consistent with the serial execution order, in which the method readMatrix must complete before the call to multiply occurs. The new thread eventually returns a reference to the product matrix, which is assigned to the variable c; however, return values are not synchronized with data access summaries, and so the thread node for multiply1 does not need to be inserted into the thread list of the region node for c.


Region nodes and their per-region thread lists (thread nodes in serial execution order):

    a.nRows   : readMatrix (R) -> multiply1 (R) -> multiply2 (R) -> main (W)
    input     : readMatrix (R) -> main (W)
    a.nCols   : readMatrix (R) -> main (W)
    a.data[*] : readMatrix (W) -> multiply1 (R) -> multiply2 (R) -> main (W)
    b.nRows   : multiply1 (R) -> main (W)
    b.nCols   : multiply1 (R) -> main (W)
    b.data[*] : multiply1 (R) -> main (W)
    c         : main (W)
    d.nRows   : multiply2 (R) -> main (W)
    d.nCols   : multiply2 (R) -> main (W)
    d.data[*] : multiply2 (R) -> main (W)
    e         : main (W)

Figure 3.6: The registry is updated as the two multiply threads are created; the thread lists of shared regions keep the threads in serial execution order.

Similarly, the registry is updated for the third forked thread, which runs multiply a second time, but on the matrices a and d, producing the matrix e. Region nodes are allocated for the fields of d, which are accessed by the new thread, labelled multiply2 . The new thread node is inserted into the thread list of the regions that it uses, including those of a.nRows and a.data[*], where the run-time system again preserves the serial execution order by placing the thread nodes of the new thread immediately preceding those of its parent main. The resulting registry is shown in Figure 3.6.

3.3.4 Thread Synchronization

A reader thread is defined as a thread that accesses a region but never writes to it. In contrast, a writer thread is a thread that accesses a region and possibly writes to it. These definitions are flow-insensitive; even if a writer thread does not actually write to a region at run time, perhaps due to branching from a conditional statement, or the throwing of an exception, it may not be possible to predict such behaviour at compile time. Consequently, the compiler must be conservative and classify such a thread (which executes a method that possibly writes to a variable) as a writer of the variable, and generate synchronization code accordingly. As described earlier, every region node in the registry is associated with a reader/writer lock. Each reader thread must acquire a reader lock on the region node rx prior to accessing the region x. The reader lock allows concurrent reads of the same region by multiple threads, as explained below. In contrast, a writer thread must acquire a writer lock, which is an exclusive lock, i.e., acquiring the lock blocks the execution of all other threads that access the same region. If a thread is unable to acquire a (reader or writer) lock, its execution must block until it can. A reader thread tτ is immediately granted the reader lock of a region rx if its thread node nτ is the first in the thread list of rx , or if all preceding threads on the thread list are also readers. Otherwise tτ is blocked and waits for a preceding thread to signal it. When tτ acquires the reader lock on rx , it checks if the successor of nτ belongs to another waiting reader thread. If it does, tτ signals the waiting reader thread and grants it a reader lock as well. A writer thread tσ is granted a writer lock only if its thread node nσ is the first node in the thread list of rx . This means that all other accesses to the variable x must complete before tσ may perform the write. This preserves data dependences and ensures correctness of execution. When the writer thread terminates (or otherwise relinquishes its writer lock), it signals the next thread (if any) on the thread list of rx , and grants it the lock it has been waiting for. This scheme allows multiple threads that

Chapter 3. The Run-Time System

42

read the same variable to execute concurrently, yet prevents any conflicting accesses from occurring at the same time.

This synchronization scheme is illustrated using the example in Figure 3.2. Suppose the program is re-structured to create a thread for every call to the readRow and multRow methods. The thread executing the i-th call to the readRow method, which modifies row i of the matrix a, must terminate before either of the two threads running the multiply method makes the i-th call to the multRow method to perform multiplication on the same row. Otherwise, the threads executing the multRow method would use values from an undefined row in their calculations, and yield incorrect results. However, once the entire matrix a has been read in, both multiply threads, and their child threads running multRow, can proceed with their multiplications concurrently. None of the threads modifies the shared data array contained in the matrix a; in other words, they are all readers of the shared array. Hence there are no conflicting data accesses among them, and they may execute concurrently.

To facilitate inter-thread synchronization, the zJava compiler inserts region-based synchronization primitives into the bodies of methods. Figure 3.7 shows the definition of the readRow and the multRow methods after such a transformation. The synchronization routines are provided by the run-time system in normal Java classes.
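A sketch of how such a region lock could grant access based on the thread list is shown below. It reuses the simplified ThreadNode class from the registry sketch in Section 3.2 and is not the actual zJava implementation; in particular, thread nodes are assumed to have been queued into the list at thread-creation time (Section 3.3.2).

    import java.util.LinkedList;

    class RegionLock {
        private final LinkedList<ThreadNode> threadList = new LinkedList<ThreadNode>();

        // Block until the calling thread's node may legally access the region.
        synchronized void acquire(ThreadNode me) throws InterruptedException {
            while (!mayProceed(me)) {
                wait();                   // signalled by release() of a predecessor
            }
        }

        // Readers proceed when every preceding node is also a reader;
        // writers proceed only when they are first in the thread list.
        private boolean mayProceed(ThreadNode me) {
            for (ThreadNode n : threadList) {
                if (n == me) return true;
                if (me.writer || n.writer) return false;
            }
            return false;                 // node not registered: should not happen
        }

        // Called when the thread finishes its accesses (or terminates).
        synchronized void release(ThreadNode me) {
            threadList.remove(me);
            notifyAll();                  // let waiting successors re-check
        }
    }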

3.3.5 Thread Termination

A thread terminates when the method the thread is executing returns. The thread stores any return value of the method in a future [FW78, CS90], waits for its child threads to terminate, and finally removes its own thread nodes from the thread lists of all region nodes in the registry. At that point, the registry holds no more references to the thread, and the thread can terminate itself.

The future is a synchronization device for handling return values from child threads. If the parent thread attempts to get the value of the future before it has been set by the child thread, the parent thread is blocked until the child thread returns.


void readRow(int r, Reader in) {
    Region.getRegion(data[r]).acquire();
    for (int c = 0; c < nCols; c++) {
        data[r][c] = parseDouble(in);
    }
    Region.getRegion(data[r]).release();
}

double[] multRow(int r, Matrix m) {
    double[] d = new double[m.nCols];
    Region.getRegion(data[r]).acquire();
    for (int i = 0; i < m.nCols; i++) {
        d[i] = 0;
        for (int j = 0; j < m.nRows; j++) {
            d[i] += data[r][j] * m.data[j][i];
        }
    }
    Region.getRegion(data[r]).release();
    return d;
}

Figure 3.7: The zJava compiler inserts code at synchronization points to ensure that data dependences are not violated at run time.

This maximizes the concurrency between the parent thread and the child thread in those cases where the return value from the child thread is not used immediately (or at all) by the parent thread.
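A minimal future for an int return value might look like the sketch below. The actual zJava Future class presumably supports other types as well, but the getInt method matches the usage shown in Figure 2.11.

    // Synchronized holder for a return value produced by a child thread.
    class Future {
        private int value;
        private boolean ready = false;

        // Called by the child thread when the method returns.
        synchronized void setInt(int v) {
            value = v;
            ready = true;
            notifyAll();
        }

        // Called by the parent at the first use of the return value; blocks
        // until the child has supplied it.
        synchronized int getInt() {
            boolean interrupted = false;
            while (!ready) {
                try {
                    wait();
                } catch (InterruptedException e) {
                    interrupted = true;   // keep waiting for the value
                }
            }
            if (interrupted) {
                Thread.currentThread().interrupt();
            }
            return value;
        }
    }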

3.4 Deadlock Freedom

Deadlock is an error state in which a parallel program cannot continue because its threads are preventing one another from using resources they need. Deadlock is one of the major problems that the writer of parallel programs must overcome. A parallel program can be deemed correct only if it cannot lead to a deadlock when executed. Therefore, we must prove that the way in which the zJava system orders and synchronizes threads eliminates the possibility of deadlock.


Definition 3.4.1 (Deadlock) A state of a multi-threading system in which a waiting thread never resumes execution because the resources it has requested never become available to it. A deadlock can arise only if the following four conditions hold simultaneously in the system [CES71]:

Mutual exclusion: At least one resource is held in a non-sharable state, i.e., only one thread in the system may access the resource.

Wait for: A thread is holding a resource while waiting for its turn to access other resources being held by other threads.

No preemption: A thread must voluntarily release a resource for the latter to become available to other threads.

Circular wait: Two (or more) threads are waiting for resources held by each other (or one another, in a cycle).

Clearly, with the threading model of the zJava system, the first three conditions hold. First, if a thread t writes to a region R in a given block of code P during its lifetime, then t must have exclusive access to R while it is executing within (the critical region) P. Interleaving these accesses with accesses to R by other threads will result in an incorrect execution of the program. Second, since the zJava system synchronizes threads by inserting lock acquisition calls into the body of each thread, the threading model does have the “wait for” property. Third, each thread must also make lock release calls (or terminate itself) to make the regions it was accessing available to other threads, which means that the “no preemption” condition holds. However, the fourth condition, “circular wait”, does not hold in the zJava system. Intuitively, this is the case because the run-time system orders threads according to the serial execution order of their corresponding methods, which cannot lead to circular dependencies among the threads. Nevertheless, we proceed to prove this by contradiction.


Let tσ and tτ be two threads that are executing methods mσ and mτ, respectively, and let nσ and nτ be their corresponding thread nodes. Furthermore, let mσ and mτ both write to a variable v, and let tτ be waiting for tσ, which holds a lock on the region rv. Thus, either tτ is an ancestor of tσ, which needs to complete before yielding the lock to tτ, or tτ shares a common ancestor with tσ, and mσ is called before mτ in serial execution order. In either case, nσ must be in front of nτ in the thread list of rv, by construction of the thread list (see Section 3.3.3 for a full explanation).

Assume, for the sake of contradiction, that circular wait exists, i.e., there is a set of threads {t0, t1, ..., tn} (with corresponding thread nodes {n0, n1, ..., nn}) such that t0 is waiting for t1, t1 is waiting for t2, and so on, and tn is waiting for t0. From the above discussion, we know that since t0 is waiting for t1, n1 must precede n0 in the thread list of the region on which t1 holds a lock. In fact, the same is true for any variable that both threads access; n1 must precede n0 in all thread lists that they share. Similarly, since t1 is waiting for t2, n2 must precede n1 in all thread lists, and due to the transitivity of the ordering, it must also precede n0 in all thread lists. It follows that nn must precede nn-1, and hence n0 as well, in all thread lists. However, the “circular wait” assumption states that tn is waiting for t0, which means that n0 should precede nn in all thread lists. This results in an impossible, self-contradicting configuration of the thread lists; thus circular wait is impossible within the zJava system. Consequently, the serial execution order imposes a total order on the threads which prevents circular waiting, and the threading model in the zJava system is deadlock-free.

3.5 Optimizations

Clearly, if the run-time system indiscriminately transforms every method call into a concurrent thread, it will not only incur a large run-time overhead, but will also put extra strain on system resources due to excessive contention among the large number


of threads. Productive automatic parallelization can only be attained by extracting the largest possible amount of parallelism from a relatively small number of threads. This section briefly describes some techniques that may be employed to maximize the gain from parallelization. These techniques aim to better utilize the run-time framework to make intelligent parallelization choices, and also to remove the run-time overhead as much as possible. Most of these techniques apply during compile-time analyses, but we have employed some of them in our experimental evaluation (see Chapter 4).

3.5.1 Selective Parallelization

Dynamic, “Just-In-Time” compilers often perform selective compilation, in which only methods that are likely to be executed frequently (i.e., “hot spots”) are JIT-compiled to native code. Similarly, with selective parallelization, the parallelizing compiler only performs automatic parallelization on methods that are likely to have good parallelism and high granularity; for example, long-running methods with self-contained computations, or methods with loops and recursion. This increases the likelihood that our automatic parallelization will give good results over a mixture of methods, some of which have small granularity.

3.5.2 Profile-directed Parallelization

A recent trend in compiler research has been the use of feedback information, gained from automatically instrumenting the source program and profiling the resulting executable, to aid in subsequent recompilations of the same source program. This approach is used both in static compilers and in dynamic (JIT) compilers [BCF+99, CL99]. The zJava run-time system can benefit from a similar scheme, where timings of invocations of a given method can be collected to determine whether it is profitable to create concurrent threads for subsequent invocations of the same method. This is particularly helpful in long-running applications in which the amount of parallelism may increase or


decrease depending on the program state. Rather than being locked into the decisions made by the compiler, the run-time system can adjust itself according to the profile of the executing program. In the zJava system, we have implemented rudimentary support for profiling parallelized runs; it collects information such as the average execution time of threads and the number of times that a thread has been forked at a particular call site. However, when the evaluation of the system was conducted, this feature was incomplete and was not used in our experiments.

3.5.3 Method Versioning

“Method versioning” is specialization applied to methods. Essentially, the compiler generates different versions of a method, each with different behaviour regarding thread creation and lock acquisition. The compiler (statically) or the run-time system (dynamically) then selects the variant most suitable for each call site of the method, depending on the context of the call. This optimization is most useful for methods that perform a large amount of thread creation and/or lock acquisition. By compiling versions that remove some or all of that overhead, and invoking them at selected call sites where the modified semantics remain legal, the run-time system can exploit methods with large granularity and good parallelism which would otherwise incur too much overhead to be profitable.

3.5.4 Proxy Synchronization

Proxy synchronization allows a parent thread to perform synchronization for a child thread. This is most useful in methods that repeatedly invoke the same methods, where the callees are known not to allow shared variables to escape. In such cases, a caller acquires the lock on a shared variable once, and the callees it invokes (executing in concurrent threads) can access the shared variable without performing synchronization of their own. As a result, the amount of synchronization that would have been done by the child threads is vastly reduced.
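The sketch below illustrates the idea on the readMatrix method of Figure 3.2. It is a hypothetical restructuring, not the actual output of the zJava compiler; readRow_noSync and waitForChildren are invented names, the latter standing for whatever mechanism the run-time uses to wait for the forked row threads.

    // Parent acquires the lock on the shared array once on behalf of its
    // children (proxy synchronization); the children run a lock-free variant.
    void readMatrix() {
        Region.getRegion(data).acquire();    // single acquisition for all rows
        for (int i = 0; i < nRows; i++) {
            readRow_noSync(i, input);        // each call is forked as a child thread
        }
        waitForChildren();                   // hypothetical: wait for all row threads
        Region.getRegion(data).release();    // one release instead of nRows releases
    }

    // Specialized version produced by method versioning: no synchronization.
    void readRow_noSync(int r, Reader in) {
        for (int c = 0; c < nCols; c++) {
            data[r][c] = parseDouble(in);
        }
    }

Note that the lock-free readRow_noSync is itself an instance of method versioning (Section 3.5.3).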

Chapter 4

Experimental Results

We implemented the zJava run-time system in Java. It comprises 49 Java classes with a total of 10,532 lines of code, including JavaDoc comments. The run-time system is implemented as a user-level library and, as a result, requires no changes to the JVM. This makes the run-time system portable: it can be used on any multiprocessor on which a JVM exists.1

We used Sun Microsystems’ Java 2 Software Development Kit, Standard Edition, version 1.2.2, both to develop the run-time system and to experiment with it. The bytecode compiler provided in this version of the Java 2 SDK is conservative, and performs very little optimization on the code. The performance of the system was measured on a Sun Microsystems Enterprise 450 server equipped with four 400 MHz UltraSPARC II CPUs and two gigabytes of main memory. In this chapter, we report the performance of the run-time system using six benchmarks.

1. The run-time system can even be used on uniprocessor machines, or on JVMs that only support user-level “green” threads which are not scheduled onto physical processors; in such cases the program will still run, only more slowly due to the unnecessary overhead.


4.1 Experimental Methodology

At the time when the experiments were performed, the zJava compiler framework was in early development, and it was not sufficiently functional to implement the actual analyses and compile-time transformations required to parallelize a program. In order to experiment with the run-time system, we computed the data access summaries for each benchmark manually. Unfortunately, this constraint imposed an upper bound on the size and complexity of the programs with which the system was tested.

As the first step, the candidate methods for parallelization in a benchmark were identified. The selective parallelization technique discussed in Section 3.5.1 was applied to eliminate candidates, e.g., accessor methods, that did not contain loops or recursion, and hence were not likely to improve performance when parallelized. Each candidate was then analyzed, manually, by closely mimicking the steps of the hypothetical compiler, in order to generate a list of read and write accesses performed by the method. The process was iterated multiple times until all accesses to array or recursive data structures had been recorded. These accesses were divided into read and write sets; each set was sorted, and symbolic access paths representing the accesses were written. The paths formed the data access summary for a single method.

After the data access summaries had been acquired for all candidates in a program, the program was manually re-structured to embed the data access summaries into the program itself, and to insert region/thread creation and synchronization code. The transformation is described in detail in Section 2.3. In some cases, optimizations were applied to the re-structured code, namely method versioning and proxy synchronization (see Sections 3.5.3 and 3.5.4, respectively). Finally, the modified program was compiled into bytecode with javac, the Java 2 SDK bytecode compiler. The resulting class files were executed in the usual manner, and the execution times of the benchmarks were measured.


Benchmark Name            Sequential Execution Time (seconds)
ReadTest (100, 800)        80.0
Traversal (100, 100K)      15.0
Matrix (1000 × 1000)      188.7
TSP (14 cities)           566.0
Mergesort (2M, 32K)         6.4
15-puzzle                  53.0

Table 4.1: Summary of the benchmarks and their sequential execution time.

4.2 Benchmarks

Table 4.1 summarizes the benchmarks. While some of these benchmarks use arrays as the main data structures, their code uses references and dynamically creates its objects. Traditional array-based parallelizing compiler technology would fail to extract parallelism in these benchmarks, even though they may contain parallelizable loops that use arrays. The performance of the benchmarks is reported using speedup, which is defined as the ratio of the execution time of the original, non-restructured sequential program to the execution time of the parallel program, for a given number of processors. The speedup is ideal when it equals the number of processors used. The speedup of the benchmarks (where applicable) is shown in Figure 4.1. The figure indicates that while the benchmarks do not achieve ideal speedup, scalable performance is achieved: the speedup of each benchmark increases with increasing numbers of processors. The main sources of speedup degradation are run time overhead and lack of parallelism in some benchmarks. There are three main sources of run time overhead. First, a parent thread incurs overhead to create a child thread. The parent thread must resolve and expand the data access summary of the child thread to determine its regions. Further, the parent thread must insert thread nodes of the child thread onto the thread lists of the corresponding region nodes. In the process, the parent thread may block, waiting for preceding writers


on a thread list to finish. Also, the parent must create the child thread itself. Second, the registry is a shared data structure in itself, and it must be accessed atomically. Consequently, overhead is incurred at run time to co-ordinate registry access by the running threads. Third, a child thread uses the registry in order to determine when it can execute, which introduces some run time overhead. There are other sources of overhead, such as that incurred at class load time to transmit data access summaries to the run-time system, but they occur only during initialization, and can be assumed negligible. It is difficult to non-intrusively profile the run-time system to obtain a breakdown of the contribution of each of these sources of overhead. Hence, we describe them collectively as run time overhead for creating, executing, and synchronizing threads. The overhead varies from one benchmark to another depending on the number of threads created and the nature of the benchmark. We consider the performance of each benchmark in some detail in the following sections.

Figure 4.1: The speedup of the benchmarks. [Plot: speedup versus number of processors (1 to 4) for TSP, Matrix Multiply (1000x1000), MergeSort (2M, 32K), and 15-Puzzle, with the ideal speedup shown for reference.]

4.2.1 ReadTest: A Synthetic Micro-benchmark

The ReadTest program is a micro-benchmark that aims to measure the performance improvement that can be achieved by executing multiple reads (of the same data item)


in concurrent threads. The benchmark consists of a loop that creates m threads, each of which reads (but does not write) only one object and performs n computations with it.2 Hence the benchmark is a good indicator of the overhead of parallelizing a program without thread blocking. Its performance provides an upper bound on the speedups that can be obtained by the system.

The speedup of ReadTest at 4 processors is shown in Figure 4.2 as a function of thread granularity (i.e., different values of n), when the number of threads is equal to 100. The granularity of each thread is expressed in milliseconds. The figure indicates that the performance of the benchmark is poor when the granularity of a thread is very small (less than 1 millisecond); indeed, the parallel version actually runs more slowly than the sequential one. This is because at such small granularity, the costs of creating, executing, and synchronizing threads outweigh the benefits gained from parallelism. However, speedup improves quickly once the granularity of a thread exceeds 50 milliseconds, and for granularities greater than 100 milliseconds the speedup is close to ideal (3.6 at 4 processors).

The effect of the number of threads in the ReadTest benchmark (i.e., the value of m) on its speedup is shown in Figure 4.3. The speedup of the ReadTest benchmark when the granularity of each thread is 800 milliseconds is shown for different numbers of threads in the system. The figure indicates that when the number of threads is very small (less than 10), the speedup of the benchmark is low. This is because at such a small number of threads, parallelism is limited. When the number of threads is increased (to about 100), close to ideal speedup can be achieved. However, when the number of threads is large (about 1000), speedup degrades slightly. This is due to the contention resulting from sharing the registry and from the blocking of parent threads as they create child threads.

2. Specifically, each of the m threads calls Math.sin n times on the value of a shared Double object.


Figure 4.2: The effect of thread granularity on the performance of the run-time system. [Plot: speedup at 4 processors versus thread granularity, 0.1 to 1000 milliseconds.]

Figure 4.3: The effect of the number of threads on the performance of the run-time system. [Plot: speedup at 4 processors versus number of threads, 1 to 1000.]

4.2.2 Traversal: Measuring Overhead

The micro-benchmark Traversal was used to measure the costs of registry operations and thread synchronizations. The core method in Traversal traverses a simple linked list of nodes by tail recursion. During the traversal, the method spends a given amount of time performing work on each node it visits and, upon completing the work on a node, calls itself on the succeeding node. The program terminates when it reaches the end of the linked list, and there are no more nodes to process.

The data access summary of the recursive traversal method was extracted manually, and the program was re-structured so that one thread is spawned to carry out the synthetic workload on every node in the linked list. Since the algorithm affords no parallelism (the sequential program must finish its work with one node before proceeding to the next), each thread must synchronize with its predecessor, i.e., a thread may not actually start working on a node until the thread working on the preceding node has finished its work. As a result, the parallel version of the program actually runs more slowly than the sequential version of the program.

For a linked list of n nodes, the difference d(n) in the execution times of the two versions is the overhead introduced by automatic parallelization. This overhead is due to source resolution and path expansion, among other things. The cost incurred by the resolution of m symbolic access paths for all n invocations of the method is O(mn), while the cost incurred by path expansion is O(mn²). Other hidden costs include context switching overhead, object allocation overhead, the overhead of using Java synchronization primitives (such as Object.wait()), and so on, which are platform-dependent and cannot be measured accurately.

We experimented with a number of scenarios involving different numbers of nodes and varying amounts of workload per node. Figure 4.4 shows the percentage overhead (defined as the ratio of the overhead to the execution time of the original sequential program) of parallelized list traversal for some scenarios. The overhead is expensive (close to 90% of the execution time) when the list is short and the synthetic workload is light. When both


the number of nodes and the workload are increased a hundred times, the overhead drops below 10%. If the work for each node were independent of the others, the program could be parallelized to process multiple nodes concurrently, and a performance gain could be expected in spite of the 10% overhead. For best results, the run-time system would have to ensure that the parallel algorithm is only invoked if the number of nodes and the amount of work per node are significant; in cases like this, a profiling approach (see Section 3.5.2) could prove beneficial.

Figure 4.4: The run-time overhead of multi-threaded list traversal as a function of the size of the linked list and the amount of work done on each visited node. [Plot: percentage overhead versus list size (10 to 1000 nodes) for per-node workloads of 1.58 ms, 16.86 ms, and 172.82 ms.]

4.2.3 The Matrix Benchmark

The Matrix benchmark program is the matrix multiplication application described earlier in this thesis. It is a slightly modified version of the one shown in Figure 3.2. The speedup of Matrix for a 1000 × 1000 matrix for different numbers of processors was shown earlier in Figure 4.1. Although the speedup of Matrix is less than ideal (3.6 at 4 processors), the speedup increases linearly with the number of processors. The speedup of Matrix as a function of the matrix size is shown in Figure 4.5. The program creates one thread to compute each row of the product matrix. The speedup is


shown for various matrix sizes, and hence, for varying numbers of threads and granularities of each thread. For example, for a 1000 × 1000 matrix, 1000 threads are created, each of which performs a million additions and multiplications. The speedup for the largest matrix size is 3.6 at 4 processors.

Figure 4.5: The speedup of Matrix at 4 processors as a function of matrix size. [Plot: speedup at 4 processors versus thread granularity in milliseconds, with points labelled by matrix size (200, 250, 300, 500, 700, 1000, 1200).]

4.2.4 The Mergesort Benchmark

The Mergesort benchmark implements a 2-way merge sort algorithm. It divides the input into two “units”, recursively sorts each unit, and finally merges the two units into a single sorted unit. When the size of a unit drops below a user-defined threshold, the sorting method provided by the Java class library3 is invoked to sort the unit. In the zJava parallelized program, a thread is used to sort a unit whose size drops below the threshold. Hence, the unit size essentially determines the number of threads that will be created to perform the sort, as well as the granularity of each thread. The benchmark is used to sort 2M (2^21) randomly generated integers. The speedup of the benchmark with a unit size of 32768 (2^15) integers was shown earlier in Figure 4.1.

3. java.util.Arrays.sort() implements a tuned quick sort that offers n log(n) performance on most data sets [CLK99].


Figure 4.6: The speedup of Mergesort on 2M integers as a function of the unit size. [Plot: speedup at 4 processors versus unit size, 2048 to 32768 integers.]

The speedup is low, but it is close to what is attainable from the divide-and-conquer parallelism that merge sort offers [LER92]. That is, the main source of speedup degradation in this benchmark is lack of parallelism. The effect of the unit size, and hence of the number of threads and their granularity, on the speedup of the benchmark is shown in Figure 4.6. For small unit sizes, there are more units, and hence, more threads. This leads to finer-grained parallelism, but also to increased thread creation and synchronization overhead. Conversely, for large unit sizes, there are fewer units, and hence, fewer threads. This leads to limited, coarser-grained parallelism, but less synchronization overhead. Since we experimented with only 4 processors, very high degrees of parallelism are not needed, and the performance of the benchmark improves as the unit size increases.

4.2.5 The TSP Benchmark

The TSP benchmark program uses a branch-and-bound algorithm to compute an optimal tour for a travelling salesman problem. Our implementation of the TSP benchmark is similar to the one in [Huyn96]; essentially the three top-most levels of the search tree are


flattened into a three-dimensional array, and individual threads are created to expand a sub-tree from each element of the array. The cost of the best solution discovered at any time during the search is used to prune the tree. The speedup of the benchmark was shown earlier in Figure 4.1. The figure indicates that scalable speedup is achievable for this benchmark; indeed, the speedup is close to ideal.

4.2.6 The 15-puzzle Benchmark

The 15-puzzle benchmark solves the 15-puzzle problem4 using an iterative deepening A* (IDA*) depth-first search algorithm. The benchmark was written as a parallel program by Hui et al. [HMSS00]. We reverted it back into a sequential program, which we then automatically parallelized using the zJava system. We report on the performance of the automatically parallelized program, and also compare it to the performance of the original parallel program of Hui et al. [HMSS00].

The benchmark solves a 15-puzzle problem instance by dynamically building and examining a search tree. The IDA* algorithm builds a search tree of a given depth, and searches for a solution to the problem using depth-first search. If a solution is not found, the depth of the tree is increased, and the search is re-started. The original parallel program expands the game tree up to a threshold level and then creates a separate thread to further explore each of the sub-trees rooted at this level.

We ran the benchmark using different configurations of the 15-puzzle, and obtained the average speedup as the number of processors was increased. In some cases, we observed speedup anomalies where the parallelized program completed more than four times faster than the original version. The anomalies occurred because, by chance, the parallelized version found a solution to the puzzle without needing to process the same number of nodes in the search tree. Such anomalous cases would pose unfair comparisons of the two versions of the program, and were discarded from the computation of the averages.

4. The 15-puzzle is a game in which 15 tiles numbered 1 to 15 are placed in random order within a 4 × 4 grid, leaving one tile space empty. The player is required to arrange the tiles in ascending order of their numbers by repeatedly sliding tiles, one at a time, into the empty space.

Figure 4.7: The speedup of the 15-puzzle application. [Plot: speedup versus number of processors (1 to 4) for the original explicitly parallel version (Explicit) and the zJava parallelized version, with the ideal speedup shown for reference.]

The speedup of the zJava parallelized 15-puzzle application is shown in Figure 4.7. The figure indicates that although the resulting speedup is less than the ideal speedup, the speedup of the application increases with the number of processors. Moreover, as the figure also indicates, the speedup of the zJava parallelized program is close to that of the original explicitly parallelized program. That is, the zJava system automatically exploited parallelism as well as the manually parallelized program did.

Chapter 5

Related Work

Our work draws inspiration from, and improves on, three areas of research: compile-time characterization of data access patterns (for both arrays and pointer-based data structures), alias analysis, and automatic parallelization systems. In the following sections, we discuss some related work in these areas, and highlight differences between our approach and existing techniques.

5.1 Array Access Summaries

There has been a large amount of literature on summarizing array accesses. For instance, Triolet et al. [TIF86] use systems of linear inequalities to represent array regions. Callahan and Kennedy [CK88] introduce regular section descriptors (RSDs), which can summarize array accesses performed in a FORTRAN subroutine. An RSD is a pair (A, θ) where A is an array symbol and θ is a vector of array subscripts, each of which may be a constant or a simple arithmetic expression. RSDs form the basis of a number of interprocedural array analyses [TA94, BDE+96, Huyn96] for the purpose of parallelizing FORTRAN and C programs.

Generally, these approaches are limited to FORTRAN, and cannot be used with Java. Java arrays are one-dimensional; a “multi-dimensional” array is essentially an array of


references to other one-dimensional arrays. It is also possible to do things with Java arrays that cannot be done in FORTRAN programs; for example, a single assignment can replace an entire row of a two-dimensional array with another one-dimensional array. The symbolic access path notation used in the zJava system supports Java's arrays-as-objects paradigm naturally: fine-grain accesses to individual array elements can be represented as field accesses. In addition, we have extended the notation with a syntax similar to that of RSDs to summarize coarse-grain accesses to sections of rectangular arrays. Compared to systems of inequalities, our design is less concise but works sufficiently well, and it does not require solving systems of inequalities, which is expensive and impractical in a run-time environment.

Cheng and Hwu [CH00] use summary transfer functions to describe the data accessed by the functions in a program. These summary transfer functions (also known simply as "summaries") play a role similar to that of our data access summaries: they store information about functions (or methods, in our case) that have already been compiled, so that the information can later be retrieved and propagated to calling functions. Many implementors of interprocedural analyses use similar structures to reduce or avoid the cost of whole-program analysis; examples include side-effect summaries [CK88], connection graphs [GH98, CGS+99], and region lists [Huyn96]. In the zJava system, data access summaries serve two purposes. During the compile-time analyses, the summaries allow information about any method to propagate to its callers. When the re-structured program is executed, the summaries in effect propagate the same information from the compiler to the run-time system.

A number of run-time parallelization algorithms [KS88, LZ93, MP87, SMC91, ZY87] perform dependence analysis at run time for loops that use arrays, in order to exploit parallelism where compile-time analyses fail to do so. In general, these schemes work in two stages, an inspector and an executor. The inspector observes program behaviour and determines the dependences among array accesses performed across


iterations of a loop. The executor uses this information to execute the iterations in parallel while preserving the dependences. The two stages are tightly coupled in most approaches, making the results of inspecting a loop unusable in the next invocation of the loop. Chen et al. [COTY94] devised an algorithm that reuses the inspector's results across invocations of the same loop by storing them in a so-called "ticket table"; the executor then uses the table to attempt to parallelize subsequent invocations of the loop. In contrast, our work targets non-loop parallelism in Java programs, which exhibit much more dynamic behaviour and use dynamic data structures.

5.2 Access Paths

More recently, the use of access paths as an analysis device has proliferated [Deut94, GH96, GH98, CH00]. An access path (see Definition 2.1.1) uses a series of field names to encode how a program traverses an aggregate data structure to access a variable embedded within it [LH88, HLW+92]. The access path notation has seen numerous refinements and extensions in pointer analysis research. For example, Deutsch [Deut94] develops the symbolic access path, which he uses for interprocedural may-alias analysis of C programs. He introduces the use of bases, which allow the characterization of accesses to recursive data structures, and he gives algorithms to normalize symbolic access paths and to factor paths for recursive types into their bases. Likewise, Cheng and Hwu [CH00] use extended access paths in a modular (i.e., not whole-program) interprocedural pointer analysis; their notation supports many C-specific features, such as type casting and unions. Ghiya and Hendren [GH98] do not use the access path notation explicitly in their connection graph-based pointer analysis, but they define anchors to represent the objects that have been passed to a C function as its parameters.


In the zJava system, we base our symbolic access path notation on Deutsch's design, and augment it with new syntax to handle non-global source objects and array sections. Unlike previous work, which limits access paths strictly to compile-time analyses, our use of access paths is novel in that the paths are used not at compile time but at run time, to determine how threads may be created and synchronized.
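To make this concrete, consider a small, hypothetical example; the class, method, and field names below are ours (they do not come from the zJava benchmarks), and the summary paths are written in the notation whose grammar appears in Appendix B, assuming that the parametric anchor $1 denotes the method's first parameter.

    // A singly linked list node and a method that traverses it.
    class Node {
        int data;
        Node next;
    }

    class ListUtil {
        // Reads the data field of every node reachable from head.
        static boolean contains(Node head, int key) {
            for (Node n = head; n != null; n = n.next) {
                if (n.data == key) {
                    return true;
                }
            }
            return false;
        }
    }

A data access summary for contains would then record read accesses such as $1.(.next)* and $1.(.next)*.data. When the re-structured program reaches a call to contains, the run-time system resolves these paths against the actual argument object, so that the thread created for the call can be synchronized with any earlier, still-executing thread whose summary overlaps these paths.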

5.3 Exploiting Method-level Parallelism

The zJava system exploits parallelism at the method level, instead of at the loop level traditionally exploited by parallelizing compilers. There have been a number of systems that exploit non-loop parallelism, commonly referred to as task-level parallelism. (Girkar and Polychronopoulos [GP95] define a "task" to be "a program statement of arbitrary granularity." The term "task-level parallelism" has been used to refer variously to instruction-level parallelism [Jeff85, LMPC98] and statement-level parallelism [GP95, LR91], as well as to subroutine-level parallelism. In our work, we adopt the term "method-level parallelism" to avoid confusion.)

Abdelrahman and Huynh [AH96, Huyn96] implement pTask, a system that automatically parallelizes function calls in array-based C programs. Our system builds on theirs, but extends their work to address pointer-based programs that employ dynamic data structures. Lam and Rinard [LR91] describe Jade, a C-like language that allows the parallelization of C programs at various levels of granularity. Programmers of Jade must extend their programs with special constructs to identify parallelizable tasks, and must provide a specification of the side effects that each task performs on shared data. Another similar approach is Fx, a compiler based on High Performance FORTRAN (HPF) designed and implemented by Gross, O'Hallaron and Subhlok [GOS94]. Rather than introducing extra features into the language, Fx uses compiler directives to annotate the input/output parameters of each task, and to map tasks onto processor nodes. It also relies on the


programmer to add these compiler directives to the source program. In comparison, we extract accesses to shared data from the program, represent them with data access summaries, and automatically exploit parallelism at run time, without the need for special program directives or manual annotations.

Rugina and Rinard [RR99] employ an expensive (flow-sensitive, context-sensitive and interprocedural) analysis to exploit parallelism in array-based programs that implement divide-and-conquer algorithms and use pointer arithmetic to access arrays. In contrast, our system is designed to exploit parallelism in general-purpose Java programs (which have pointers but no pointer arithmetic). We do not obtain precise alias information at compile time, but instead defer data dependence computation until run time. Although this incurs run-time overhead, it allows the run-time system to obtain accurate alias information and to discover parallelism easily, without resorting to complex analyses.

5.4 Alias Analysis

Static automatic parallelization of pointer-based programs has been limited, mostly by the lack of precise pointer alias analyses, which parallelizing compilers use to determine whether any dependence exists between two sections of code. Precise, static determination of aliasing has been shown to be undecidable [Land92, Rama94], and there has been a large amount of research on approximation heuristics and algorithms [Hind01]; see [HP01] for a comparison of several kinds of pointer analyses. These approaches suffer from a number of shortcomings when applied to parallelization. Most importantly, while they give excellent results locally, when used in an interprocedural context they are necessarily conservative, and they do not yield the precision required to determine method-level parallelism.

In contrast, we rely on the well-studied and relatively simple escape analysis, which has been implemented successfully for Java [Blan99, CGS+99, WR99], as the basis of the compiler analysis that generates data access summaries. By resolving the data access


summary for each invoked method at run time, the zJava system avoids the complexity and imprecision of pointer alias analysis, and has the advantage of being able to discover aliases, and hence parallelism, precisely.
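As a minimal illustration of this precision issue, consider two calls whose arguments may or may not refer to the same object. The class and method names below are ours, not from the zJava benchmarks.

    class Counter {
        int value;

        void bump() {
            value++;        // writes this.value
        }
    }

    class Driver {
        static void work(Counter c) {
            c.bump();
        }

        public static void main(String[] args) {
            Counter a = new Counter();
            // Whether b aliases a depends on the program's input.
            Counter b = (args.length > 0) ? a : new Counter();

            work(a);
            work(b);
        }
    }

A purely static alias analysis must conservatively assume that the two calls to work may touch the same Counter object and therefore serialize them. By instead resolving the data access summaries of the two calls against the actual argument objects at run time, the zJava run-time system can execute the calls concurrently whenever a and b are distinct.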

Chapter 6 Conclusions

6.1 Summary of Contributions

We presented and evaluated a novel approach for the automatic parallelization of programs, written in Java, that use pointer-based data structures. The approach is novel in that it combines compile-time analysis with run-time support to extract parallelism among the methods of a sequential Java program; each method invocation in the program creates an asynchronous thread of execution. To ensure correct program execution, each method is analyzed at compile time to determine the data it will access. A description of these data accesses, called a data access summary and parameterized by the context of the method, is transmitted to a run-time system during program execution. The run-time system uses the summaries of the various methods to determine dependences among the threads created from them, and to enforce the sequential semantics of the resulting parallelized execution.

In this thesis, we described the symbolic access path notation, which allows data access summaries to record concisely the data accesses performed by a method. We also explained the design and implementation of the run-time component of the system, and presented the results of its experimental evaluation.


Experimental results indicate that although the system incurs some run-time overhead, close to ideal speedup is obtained for a number of benchmarks. Indeed, the speedup of all the benchmarks scales: it increases as the number of available processors increases. Hence, the benefits of exploiting parallelism in this manner outweigh the penalty of the run-time overhead, which indicates that our approach is a viable one for the automatic parallelization of such programs.

6.2 Future Work

The current implementation of the zJava system has a number of limitations, both in functionality and in performance. These issues should be addressed for the zJava system to become more generally applicable. Most importantly, the present work covers only the run-time support for automatic parallelization; for the parallelization to be truly automatic, a number of compile-time analyses and program transformations must be implemented within the zJava compiler framework.

The run-time system currently assumes that the input program is sequential and uses only one thread throughout. It cannot handle a heterogeneous program that contains a mixture of zJava-generated threads and Java threads implemented by the user. Unfortunately, much existing Java code is multi-threaded, and thus cannot be used with the current implementation of the zJava system. Ideally, the compile-time analyses within the zJava compiler would identify those parts of a program where Java threads are used, and either avoid analyzing and transforming them, or re-structure the creation of Java threads into method calls, to which automatic parallelization can be applied.

Another assumption made by the system is that some special methods, namely constructors and finalizers, are not parallelizable (see Appendix A.3). Often these special methods do not contain a lot of code, and must execute before (after) any other method may proceed (has completed); thus, in most cases, there is no major loss of performance.


Nevertheless, these methods (especially constructors) sometimes exhibit a significant amount of parallelism, and the assumptions will have to be revised to improve the performance of such methods.

The symbolic access path notation also assumes that multi-dimensional arrays are rectangular, i.e., that the number of columns is uniform across all rows of an array. The notation can support other forms of arrays, e.g., triangular arrays, but each row of a non-rectangular array must be treated as a separate sub-array. In addition, locks on array sections exhibit poor scalability; fortunately, for methods of sufficient granularity, the impact on performance is small. Despite our effort to minimize the sizes of data access summaries, concurrent accesses to large recursive data structures typically lead to a large number of registry updates and considerable synchronization overhead, which may outweigh any benefit of the parallelization. Profile-directed re-compilation may be one solution to this problem.

Finally, we would like to extend our approach to support JNI (Java Native Interface) methods. JNI enables Java programs to link in and call native code, usually written in C, and also allows the native code to call Java methods and manipulate Java objects. Many Java programs that communicate and interact with external software (operating systems, graphical user interfaces, database management systems, etc.) rely on JNI methods to do so. For such methods, the C compiler could perform the necessary analysis, generate data access summaries, and re-structure the code accordingly. The run-time system could then load the summaries as the JNI methods are linked in at run time, and treat calls to native methods as fork points. Thus, it would be possible to exploit parallelism even in native methods.

Bibliography

[AH96]

Abdelrahman, Tarek S., and Sum Huynh. Exploiting task-level parallelism using pTask. In International Conference on Parallel and Distributed Processing Techniques and Applications (1996), pp. 252–263.

[BCF+ 99]

Burke, Michael G., Jong-Deok Choi, Stephen Fink, David Grove, Michael Hind, Vivek Sarkar, Mauricio J. Serrano, V. C. Sreedhar, Harini Srinivasan, and John Whaley. The Jalapeño dynamic optimizing compiler for Java. In Proceedings of the ACM 1999 Java Grande Conference (San Francisco, CA, USA, June 1999), Association for Computing Machinery, pp. 129–141.

[BDE+ 96]

Blume, W., R. Doallo, R. Eigenmann, J. Grout, J. Hoeflinger, T. Lawrence, J. Lee, D. Padua, Y. Paek, B. Pottenger, L. Rauchwerger, and P. Tu. Parallel programming with Polaris. IEEE Computer, 29(12):78–82, Dec. 1996.

[BENP93]

Banerjee, Utpal, Rudolf Eigenmann, Alexandru Nicolau, and David A. Padua. Automatic program parallelization. Proceedings of the IEEE, 81(2):211–243, Feb. 1993.

[BH99]

Bogda, Jeff, and Urs Hölzle. Removing unnecessary synchronization in Java. In [OOPSLA99], pp. 35–46.

[Blan98]

Blanchet, Bruno. Escape analysis: Correctness proof, implementation and experimental results. In [POPL98], pp. 25–37.

[Blan99]

Blanchet, Bruno. Escape analysis for object oriented languages: Application to Java. In [OOPSLA99], pp. 20–34.

[Brew01]

Brewster, Neil. A high-level intermediate representation for Java. Master’s thesis, University of Toronto, 2001.

[CES71]

Coffman, Jr., E. G., M. J. Elphick, and A. Shoshani. System deadlocks. Computer Surveys, 3(2):67–78, June 1971.

[CGS+ 99]

Choi, Jong-Deok, Manish Gupta, Mauricio Serrano, Vugranam C. Sreedhar, and Sam Midkiff. Escape analysis for Java. In [OOPSLA99], pp. 1–19.


[CH00]

Cheng, Ben-Chung, and Wen-mei W. Hwu. Modular interprocedural pointer analysis using access paths: Design, implementation, and evaluation. In Proceedings of the ACM SIGPLAN ’00 Conference on Programming Language Design and Implementation (Vancouver, BC, Canada, June 2000), Association for Computing Machinery, pp. 57–69.

[CK88]

Callahan, David, and Ken Kennedy. Analysis of interprocedural side effects in a parallel programming environment. Journal of Parallel and Distributed Computing, 5:517–550, 1988.

[CL99]

Cohn, Robert, and P. Geoffrey Lowney. Feedback directed optimization in Compaq’s compilation tools for Alpha. In Proceedings of the Second Workshop on Feedback-Directed Optimization (Haifa, Israel, Nov. 1999).

[CLK98]

Chan, Patrick, Rosanna Lee, and Douglas Kramer. The Java Class Libraries Second Edition, Volume 1. Addison Wesley, 1998.

[CLK99]

Chan, Patrick, Rosanna Lee, and Douglas Kramer. The Java Class Libraries Second Edition, Volume 1 (Supplement for the Java 2 Platform Standard Edition, v1.2). Addison Wesley, 1999.

[COTY94]

Chen, Ding-Kai, David A. Oesterreich, Josep Torrellas, and Pen-Chung Yew. An efficient algorithm for the run-time parallelization of DOACROSS loops. In Proceedings of the 1994 Conference on Supercomputing (Washington, D.C., USA, 1994), pp. 518–527.

[CS90]

Callahan, David, and Burton Smith. A future-based parallel language for a general-purpose highly-parallel computer. In Languages and Compilers for Parallel Computing, D. Gelernter, A. Nicolau, and D. Padua, Eds., Research Monographs in Parallel and Distributed Computing. MIT Press, Cambridge, MA, USA, 1990, pp. 95–113.

[Deut94]

Deutsch, Alain. Interprocedural may-alias analysis for pointers: Beyond k-limiting. In Proceedings of the ACM SIGPLAN’94 Conference on Programming Language Design and Implementation (Orlando, FL, USA, June 1994), Association for Computing Machinery, pp. 230–241.

[FW78]

Friedman, Daniel P., and David S. Wise. Aspects of applicative programming for parallel processing. IEEE Transactions on Computers, 27(4):289–296, Apr. 1978.

[GH96]

Ghiya, Rakesh, and Laurie J. Hendren. Is it a tree, a DAG, or a cyclic graph? A shape analysis for heap-directed pointers in C. In Conference Record of POPL’96: The 23rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (St. Petersburg Beach, FL, USA, Jan. 1996), Association for Computing Machinery, pp. 1–15.


[GH98]

Ghiya, Rakesh, and Laurie J. Hendren. Putting pointer analysis to work. In [POPL98], pp. 121–133.

[GJS97]

Gosling, James, Bill Joy, and Guy L. Steele Jr. The Java Language Specification. Addison Wesley, 1997.

[GOS94]

Gross, Thomas, David R. O’Hallaron, and Jaspal Subhlok. Task parallelism in a High Performance Fortran framework. IEEE Parallel and Distributed Technology: Systems and Applications, 2(3):16–26, 1994.

[GP95]

Girkar, Milind, and Constantine D. Polychronopoulos. Extracting task-level parallelism. ACM Transactions on Programming Languages and Systems, 17(4):600–634, 1995.

[HAA+ 96]

Hall, M., J. Anderson, S. Amarasinghe, B. Murphy, S. Liao, E. Bugnion, and M. Lam. Maximizing multiprocessor performance with the SUIF compiler. IEEE Computer, 29(12):84–89, Dec. 1996.

[Hind01]

Hind, Michael. Pointer analysis: haven’t we solved this problem yet? In Proceedings of the 2001 ACM SIGPLAN-SIGSOFT Workshop on Program Analysis For Software Tools and Engineering (Snowbird, UT, USA, June 2001), pp. 54–61.

[HLW+ 92]

Hogg, John, Doug Lea, Alan Wills, Dennis de Champeaux, and Richard Holt. The Geneva Convention on the treatment of object aliasing. OOPS Messenger, 3(2):11–16, 1992.

[HMSS00]

Hui, W., S. MacDonald, J. Schaeffer, and D. Szafron. Visualizing object and method granularity for program parallelization. In Proceedings of Parallel and Distributed Computing and Systems (PDCS 2000) (Nov. 2000), pp. 286–291.

[HNO97]

Hammond, Lance, Basem A. Nayfeh, and Kunle Olukotun. A single-chip multiprocessor. IEEE Computer, 30(9):79–86, Sept. 1997.

[HP01]

Hind, Michael, and Anthony Pioli. Evaluating the effectiveness of pointer alias analyses. Science of Computer Programming, 39(1):31–55, Jan. 2001.

[Huyn96]

Huynh, Sum. Exploiting task-level parallelism automatically using pTask. Master’s thesis, University of Toronto, 1996.

[Jeff85]

Jefferson, David R. Virtual time. ACM Transactions on Programming Languages and Systems, 7(3):404–425, July 1985.

[KS88]

Krothapalli, V. Prasad, and P. Sadayappan. An approach to synchronization of parallel computing. In Proceedings of the 1988 International Conference on Supercomputing (June 1988), pp. 573–581.


[Land92]

Landi, William. Undecidability of static analysis. ACM Letters on Programming Languages and Systems (LOPLAS), 1(4):323–337, 1992.

[LER92]

Lewis, Ted G., and Hesham El-Rewini. Introduction to Parallel Computing. Prentice Hall, Englewood Cliffs, NJ, 1992.

[Lew01]

Lew, Dion. BCIR: A framework for the representation and manipulation of Java bytecode. Master’s thesis, University of Toronto, 2001.

[LH88]

Larus, James R., and Paul N. Hilfinger. Detecting conflicts between structure accesses. In Proceedings of the ACM SIGPLAN’88 Conference on Programming Language Design and Implementation (Atlanta, GA, USA, June 1988), Association for Computing Machinery, pp. 21–34.

[LMPC98]

Littin, Richard H., J. A. David McWha, Murray W. Pearson, and John G. Cleary. Block based execution and task level parallelism. In The 3rd Australasian Computer Architecture Conference (ACAC’98) (Perth, Australia, Feb. 1998), pp. 57–66.

[Lo98]

Lo, Jack Lee-jay. Exploiting Thread-Level Parallelism on Simultaneous Multithreaded Processors. PhD thesis, University of Washington, 1998.

[LR91]

Lam, Monica S., and Martin C. Rinard. Coarse-grain parallel programming in Jade. In Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Williamsburg, VA, USA, Apr. 1991), Association for Computing Machinery, pp. 94–105.

[LY99]

Lindholm, Tim, and Frank Yellin. The Java Virtual Machine Specification. Addison Wesley, 1999.

[LZ93]

Leung, Shun-Tak, and John Zahorjan. Improving the performance of runtime parallelization. In Proceedings of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (San Diego, CA, USA, May 1993), pp. 83–91.

[MBH+ 02]

Marr, Deborah T., Frank Binns, David L. Hill, Glenn Hinton, David A. Koufaty, J. Alan Miller, and Michael Upton. Hyperthreading technology architecture and microarchitecture. Intel Technology Journal, 6(1):4–15, Feb. 2002.

[MP87]

Midkiff, Samuel P., and D. A. Padua. Compiler algorithms for synchronization. IEEE Transactions on Computers, 36(12):1485–1495, Dec. 1987.

[OOPSLA99] Association for Computing Machinery. Proceedings of the 1999 ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (Denver, CO, USA, Oct. 1999).


[PG92]

Park, Young Gil, and Benjamin Goldberg. Escape analysis on lists. In Proceedings of the ACM SIGPLAN’92 Conference on Programming Language Design and Implementation (San Francisco, CA, USA, June 1992), Association for Computing Machinery, pp. 116–127.

[POPL98]

Association for Computing Machinery. Proceedings of the 25th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (San Diego, CA, USA, Jan. 1998).

[Rama94]

Ramalingam, G. The undecidability of aliasing. ACM Transactions on Programming Languages and Systems, 16(5):1467–1471, 1994.

[RR99]

Rugina, Radu, and Martin Rinard. Automatic parallelization of divide and conquer algorithms. In Proceedings of the seventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Atlanta, GA, USA, May 1999), Association for Computing Machinery, pp. 72–83.

[SMC91]

Saltz, Joel H., Ravi Mirchandaney, and Kay Crowley. Run-time parallelization and scheduling of loops. IEEE Transactions on Computers, 40(5):603–611, May 1991.

[TA94]

Tomko, Karen A., and Santosh G. Abraham. Partitioning regular applications for cache-coherent multiprocessors. Tech. Rep. CSE-TR-20694, Department of Electrical Engineering and Computer Science, University of Michigan, March 1994.

[TEL95]

Tullsen, Dean, Susan Eggers, and Henry Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (June 1995), pp. 392–403.

[TIF86]

Triolet, Rémi, François Irigoin, and Paul Feautrier. Direct parallelization of call statements. In ACM SIGPLAN Symposium on Compiler Construction (Palo Alto, CA, USA, July 1986), Association for Computing Machinery, pp. 176–185.

[TLEL99]

Tullsen, Dean, Jack Lo, Susan Eggers, and Henry Levy. Supporting fine-grain synchronization on a simultaneous multithreaded processor. In Proceedings of the 5th International Symposium on High Performance Computer Architecture (Jan. 1999), pp. 54–58.

[WR99]

Whaley, John, and Martin Rinard. Compositional pointer and escape analysis for Java programs. In [OOPSLA99], pp. 187–206.

[ZY87]

Zhu, Chuan-Qi, and Pen-Chung Yew. A scheme to enforce data dependence on large multiprocessor systems. IEEE Transactions on Software Engineering, 13(6):726–739, June 1987.

Appendix A Notes on the Implementation

A.1 The zJava Compiler Framework

The zJava compiler framework is designed to be modular and flexible. As shown in Figure A.1, the compiler is based on two different intermediate representations. The high-level intermediate representation (HLIR) [Brew01] facilitates classical analyses, optimizations and parallelization. The low-level bytecode intermediate representation (BCIR) [Lew01] permits peephole optimization of the bytecode and manipulation of the class file itself. A parser translates the source code of a Java program into HLIR, which can then be fed to a number of analysis and transformation passes; the compiler framework allows analyses and transformations to be easily "plugged in." The HLIR can be written out as source again, or be encoded into BCIR, which can be further optimized. Bytecode can then be generated from the BCIR and written to class files. The BCIR of a Java program can also be obtained by decompiling its class files directly. Apart from optimizing the intermediate representation and re-generating the class files, the BCIR can also be translated back into HLIR for output in source form; in this case, the zJava compiler acts as a decompiler.


[Figure A.1: The zJava compiler framework. The diagram shows the Parser, High-level Optimizer, Parallelizer, Code Generator, Bytecode Manipulator, Class Reader, Class Writer, and Class File Modifier, connecting .java source files, the HLIR, the BCIR, and .class files.]

Automatic parallelization is designed to be implemented as a collection of analysis and transformation plug-ins for the zJava framework.

A.2 Transferring Context Via Reflection

The zJava run-time system employs reflection, a powerful API ("application programming interface") for meta-programming in Java (meta-programming can be defined loosely as the writing of programs that manipulate themselves, without complete knowledge of the program), to dynamically discover information about the fields, methods and constructors of any class in the parallelized program, and to operate on the actual objects in the program through the reflections, i.e., the Java objects that represent the reflected entities.

For a given method call, the re-structured code passes the reflection of the called method, as well as the actual parameters of the original method call, to the thread creation routine as parameters. Since Java does not allow variable-length parameter lists, the actual parameters of a method call can only be passed to a child thread by marshalling them


into a MethodCallContext object, which is essentially a linked list that can store values of both object and primitive types. The MethodCallContext object is then passed to the thread constructor, along with the reflection of the method. When the child thread runs, it demarshals the parameters from the MethodCallContext object and invokes the reflected method with the individual parameters.
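As a rough sketch of this mechanism, a child thread might perform the reflective invocation as shown below. The helper class and its field names are illustrative only; the actual zJava thread class also interacts with the registry and differs in its details.

    import java.lang.reflect.Method;

    // Illustrative only: a child thread that receives the reflection of the
    // called method and its demarshalled parameters, and then invokes it.
    class ChildCallSketch implements Runnable {
        private final Method method;     // reflection of the original method
        private final Object receiver;   // target object (null for static methods)
        private final Object[] args;     // parameters demarshalled from the context

        ChildCallSketch(Method method, Object receiver, Object[] args) {
            this.method = method;
            this.receiver = receiver;
            this.args = args;
        }

        public void run() {
            try {
                method.invoke(receiver, args);  // perform the original call
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }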

A.3 Constructors and Finalizers

Constructors and finalizers are not considered “normal” methods in the zJava system. Unlike most (non-static) methods that can be invoked on a known object, at the time when a constructor is called, it is not given a reference to a known object on which to act. Rather, a new object is allocated on the heap, and the constructor proceeds to initialize the new object into a consistent state. The caller of a constructor acquires a reference to the complete object only after the constructor returns. Since zJava synchronizes running threads by determining what data is accessed by each thread, it is unable to create and manage a thread that executes a constructor which accesses as-yet-unknown objects. On the other hand, a finalizer in Java should only be invoked by the JVM, after an object declaring the finalizer has been garbage-collected. By definition, no user thread holds a reference to an object that has become garbage. Parallelizing finalizers (by running them in separate user-level threads) would violate the semantics of finalizers; not to mention that it would also require modifications to the underlying JVM. For these reasons, the zJava system does not attempt to exploit parallelism, if any, within constructors and finalizers. However, constructors are still analyzed at compile time to discover what data is accessed within them; while constructors cannot be run in concurrent threads, their data access information is used to compute the data access summaries of their callers, as required by the inclusion property of data access summaries.

Appendix B A Grammar of the Symbolic Access Path Notation

SymbolicAccessPath : Anchor '.' SubPath
Anchor : ParametricAnchor | StaticAnchor
ParametricAnchor : '$' Number
StaticAnchor : a fully qualified Java class name
Number : ['0'-'9'] ['0'-'9']*
SubPath : Element SubPath | Element
Element : ArraySection | Field | Basis | SubPathSet
ArraySection : '[' IndexRange ']'
IndexRange : '*' | Index '-' Index
Index : Number | ParametricAnchor
Field : '.' a field name
Basis : '(' SubPath ')' Repeat
Repeat : '+' | '*'
SubPathSet : '{' SubPath ',' SubPathList '}'
SubPathList : SubPath ',' SubPath | SubPath
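For illustration, the following strings can be derived from the grammar above. The field names (next, data, left, right) are hypothetical and serve only to show the constructs, and we assume that a parametric anchor $n refers to the n-th argument of the summarized method.

    $1.(.next)*           objects reachable from the first parameter through next fields
    $1.(.next)*.data      the data field of each such object
    $2.[0-15]             elements 0 through 15 of the array bound to the second parameter
    $2.[*]                every element of that array
    $1.{.left,.right}     the left and right fields of the first parameter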


Appendix C Benchmarks Used

This appendix includes the original, un-parallelized source code of the benchmarks used in this thesis. In the case of 15-puzzle, the original source code had already been parallelized manually. The class Timer, included in the Traversal benchmark, is actually used by several other benchmarks for timing purposes.

C.1 ReadTest

C.1.1 ReadTest.java

import java.lang.reflect.*;
import zJava.runtime.*;

class ReadTest {
    static long startTime;
    static double data = Math.random();

    public static void main(String[] args) {
        startTime = System.currentTimeMillis();
        int nReaders = 0;
        int nLoops = 0;
        try {
            nReaders = Integer.parseInt(args[0]);
            nLoops = Integer.parseInt(args[1]);
        } catch (NumberFormatException e) {
            System.err.println(helpMessage);
            System.exit(1);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.err.println(helpMessage);
            System.exit(1);
        }

        for (int i = 0; i < nReaders; i++) {
            doRead(nLoops);
        }

        System.err.println((System.currentTimeMillis() - startTime) / 1000.0);
    }

    static double doRead(int n) {
        double localData = 0;
        while (n-- > 0) {
            localData = Math.sin(ReadTest.data);
        }
        return localData;
    }

    static String helpMessage = "ReadTest nReaders nLoops\n\n"
        + "nReaders and nLoops must be integers.";
}

C.2 Traversal

C.2.1 Timer.java

public class Timer {
    private long startTime = -1;
    private long stopTime = -1;
    private boolean running = false;

    public Timer start() {
        startTime = System.currentTimeMillis();
        running = true;
        return this;
    }

    public Timer stop() {
        stopTime = System.currentTimeMillis();
        running = false;
        return this;
    }

    public long getElapsedTime() {
        if (startTime == -1)
            return 0;

        if (running)
            return System.currentTimeMillis() - startTime;
        else
            return stopTime - startTime;
    }

    public Timer reset() {
        startTime = stopTime = -1;
        running = false;
        return this;
    }
}


C.2.2 LinkedList.java

class LinkedList {
    static int nLoops;

    private int data = 0;
    private LinkedList tail = null;

    LinkedList(int data) { this.data = data; }

    void setData(int i) { data = i; }
    void setNext(LinkedList list) { tail = list; }
    int getData() { return data; }
    LinkedList getNext() { return tail; }

    boolean linearSearch(int key) {
        int n = nLoops;
        double localData;
        while (n-- > 0) {
            localData = Math.sin(nLoops % (2 * Math.PI));
        }
        if (data == key) return true;
        if (tail != null) return tail.linearSearch(key);
        return false;
    }

    public static void main(String[] args) {
        int size = Integer.parseInt(args[0]);
        nLoops = Integer.parseInt(args[1]);

        LinkedList head = new LinkedList(0);
        LinkedList curr = head;
        for (int i = 1; i < size; i++) {
            curr.setNext(new LinkedList(i));
            curr = curr.getNext();
        }

        Timer timer = new Timer().start();
        System.out.print(head.linearSearch(size) + " ");
        System.out.println(timer.getElapsedTime());
    }
}

C.3 Matrix

C.3.1 Matrix.java

class Matrix {
    double[][] data;        // data in matrix
    int nRows, nCols;       // size of matrix

    Matrix(int r, int c) {
        this.nRows = r;
        this.nCols = c;
        this.data = null;
    }

    void setRow(int r, double[] row) {
        if (this.data == null) {
            this.data = new double[this.nRows][];
        }
        this.data[r] = row;
    }

    void fillRow(int r) {
        double[] d = new double[this.nCols];
        for (int c = 0; c < this.nCols; c++) {
            d[c] = c;   // d[c] = Math.random();
        }
        this.setRow(r, d);
    }

    void fillMatrix() {
        for (int i = 0; i < this.nRows; i++)
            this.fillRow(i);
    }

    double[] multRow(int r, Matrix m) {
        double[] d = new double[m.nCols];
        for (int i = 0; i < m.nRows; i++) {
            for (int j = 0; j < m.nCols; j++) {
                d[j] += this.data[r][i] * m.data[i][j];
            }
        }
        return d;
    }

    Matrix multiply(Matrix m) {
        Matrix product = new Matrix(this.nRows, m.nCols);
        for (int i = 0; i < this.nRows; i++) {
            product.setRow(i, multRow(i, m));
        }
        return product;
    }

    static final String helpMessage = "usage: Matrix size";

    public static void main(String[] args) {
        long startTime = System.currentTimeMillis();
        int n = 0;
        try {
            n = Integer.parseInt(args[0]);
        } catch (NumberFormatException e) {
            System.err.println(helpMessage);
            System.exit(1);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.err.println(helpMessage);
            System.exit(1);
        }

        Matrix a, b, c;
        a = new Matrix(n, n);
        b = new Matrix(n, n);
        a.fillMatrix();
        b.fillMatrix();
        c = a.multiply(b);

        System.err.println((System.currentTimeMillis() - startTime) / 1000.0);
    }
}


C.4 Mergesort

C.4.1 MergeSort.java

public class MergeSort {
    private static Timer timer = new Timer().start();
    private static boolean verbose = false;
    private static double[] data = null;
    private static int unitSize = 1024;

    public static void main(String[] args) {
        for (int i = 0; i < args.length; i++) {
            if (args[i].startsWith("-v")) {
                MergeSort.verbose = true;
            } else if (args[i].startsWith("-u")) {
                try {
                    unitSize = Integer.parseInt(args[++i]);
                } catch (NumberFormatException ex) {
                    exitWithUsage();
                }
            } else {
                try {
                    generateRandomData(Integer.parseInt(args[i]));
                } catch (NumberFormatException ex) {
                    exitWithUsage();
                }
            }
        }
        if (data == null) exitWithUsage();

        System.out.println("generated " + data.length + " random numbers, "
            + "time = " + timer.getElapsedTime() / 1000.0);
        if (verbose) {
            System.out.println("before sorting:");
            printData();
        }

        sort(0, data.length);

        if (verbose) {
            System.out.println("after sorting:");
            printData();
        }
        System.out.println("... done sorting, time = "
            + timer.getElapsedTime() / 1000.0);
    }

    public static void generateRandomData(int size) {
        data = new double[size];
        for (int i = 0; i < size; i++)
            data[i] = Math.random();
    }

    public static void sort(int p, int r) {
        if (r - p > unitSize) {
            int q = (p + r) / 2;
            sort(p, q);
            sort(q, r);
            merge(p, q, r);
        } else {
            java.util.Arrays.sort(data, p, r);
        }
    }

    public static void merge(int p, int q, int r) {
        double[] tmp = new double[q - p];
        System.arraycopy(data, p, tmp, 0, q - p);
        int m = 0, n = q;

        for (int i = p; i < r; i++) {
            if (m >= tmp.length) {
                data[i] = data[n++];
            } else if (n >= r) {
                data[i] = tmp[m++];
            } else {
                if (tmp[m] < data[n])
                    data[i] = tmp[m++];
                else
                    data[i] = data[n++];
            }
        }
    }

    public static void swap(int p, int q) {
        double tmp = data[p];
        data[p] = data[q];
        data[q] = tmp;
    }

    public static void printData() {
        for (int i = 0; i < data.length; i++)
            System.out.println(data[i]);
    }

    public static void exitWithUsage() {
        System.err.println("usage: java MergeSort [-v] [-u unitSize] size");
        System.exit(-1);
    }
}

C.5 TSP

C.5.1 TSP.java

import j a v a . i o . * ; public c l a s s TSP { private s t a t i c boolean debug = f a l s e ; private s t a t i c Timer t i m e r = new Timer ( ) ; public s t a t i c int private s t a t i c i n t [ ] [ ] private s t a t i c Path [ ] [ ]

numCities = 0; d i s t a n c e s = null ; bestPaths = null ;

public s t a t i c void main ( S t r i n g [ ] a r g s ) throws IOException { Str i ng f i l enam e = null ;


try { f o r ( i n t i = 0 ; i < a r g s . l e n g t h ; i ++) { i f ( a r g s [ i ] . s t a r t s W i t h ( ”−d” ) ) { debug = true ; } e l s e i f ( a r g s [ i ] . s t a r t s W i t h ( ”−f ” ) ) { f i l e n a m e = a r g s [++ i ] ; } else { n u m C i ti es = I n t e g e r . p a r s e I n t ( a r g s [ i ] ) ; } } } catch ( E x c e p t i o n ex ) { e x i tW i t h U s a g e ( ) ; } i f ( n u m C i t i e s = = 0 | | f i l e n a m e == n u l l ) e x i tW i t h U s a g e ( ) ; System . out . p r i n t l n ( ” i n i t i a l i z i n g from f i l e ” + f i l e n a m e + ” . . . ” ) ; i n i t i a l i z e ( filename ) ; System . out . p r i n t l n ( ” s o l v i n g . . . ” ) ; timer . s t a r t ( ) ; solve () ; timer . stop ( ) ; printResult () ; System . out . p r i n t l n ( ” time = ” + t i m e r . getElapsedT ime ( ) / 1 0 0 0 . 0 ) ; } /*

* I n i t i a l i z e s t h e TSP s o l v e r . This method f i r s t r e a d s t h e i n t e r −c i t y * d i s t a n c e s from a f i l e . The f i l e s t a r t s w i t h an i n t e g e r N on a l i n e , * which i n d i c a t e s t h e number o f c i t i e s i n t h e problem . The n e x t * (N(N+1)/2) l i n e s i n t h e f i l e s p e c i f y t h e d i s t a n c e s b e t we e n t h e c i t i e s . * The i n p u t dat a i s t he n u se d t o i n i t i a l i z e t h e d i s t a n c e s ar r ay and t h e * pat h t r e e . */

private s t a t i c void i n i t i a l i z e ( S t r i n g f i l e n a m e ) throws IOException { B u f f e r e d R e a d e r b u f = new B u f f e r e d R e a d e r (new F i l e R e a d e r ( f i l e n a m e ) ) ; debugln ( numCities + ” c i t i e s . . . ” ) ; d i s t a n c e s = new i n t [ n u m C i t i e s ] [ n u m C i t i e s ] ; // b u i l d t r i a n g u l a r d i s t a n c e s ar r ay f o r ( i n t i = 1 ; i < n u m C i t i e s ; i ++) { f o r ( i n t j = 0 ; j < i ; j ++) { d i s t a n c e s [ i ] [ j ] = I n t e g e r . par s eInt ( buf . readLine ( ) ) ; debug ( ” ” + d i s t a n c e s [ i ] [ j ] ) ; } debugln ( ”” ) ; } buf . c l o s e ( ) ; // i n i t i a l i z e t h e sub−t r e e s b e s t P a t h s = new Path [ n u m C i t i e s ] [ n u m C i t i e s ] ; f o r ( i n t i = 0 ; i < n u m C i t i e s ; i ++) f o r ( i n t j = 0 ; j < n u m C i t i e s ; j ++) b e s t P a t h s [ i ] [ j ] = new Path ( n u m C i t i e s ) ; } /*

* Returns t h e d i s t a n c e b e t we e n t h e two g i v e n c i t i e s . ( This i s * n e c e s s a r y b e c a u s e t h e d i s t a n c e s ar r ay i s t r i a n g u l a r . ) */

84

85

Appendix C. Benchmarks Used public s t a t i c i n t g e t D i s t a n c e ( i n t c1 , i n t c2 ) { return ( c1 > c2 ? d i s t a n c e s [ c1 ] [ c2 ] : d i s t a n c e s [ c2 ] [ c1 ] ) ; } /*

* Having r e ad t h e i n t e r −c i t y d i s t a n c e s from a f i l e , s o l v e s t h e TSP w i t h * t h e branch−and−bound a l g o r i t h m , s t a r t i n g from c i t y # 0 . ( Changing t h e * o r i g i n c i t y a p p a r e n t l y r e q u i r e s pe r m u t i n g t h e dat a i n t h e i n p u t f i l e . ) * The pat h t r e e i s expanded s e q u e n t i a l l y ( i n a l o o p ) t o t h e s e c on d l e v e l * ( f i x i n g t h e s e c on d and t h e t h i r d c i t i e s ) , t he n t h e sub−t r e e s from t h a t * l e v e l on can be expanded i n p a r a l l e l . */

public s t a t i c void s o l v e ( ) { f o r ( i n t c2 = 1 ; c2 < n u m C i t i e s ; c2 ++) { f o r ( i n t c3 = 1 ; c3 < n u m C i t i e s ; c3 ++) { i f ( c2 ! = c3 ) { f o r ( i n t s k i p = 0 ; s k i p < n u m C i t i e s − 3 ; s k i p ++) t r a v e r s e ( c2 , c3 , s k i p ) ; } } } } /*

* Pr e par e s a sub−t r e e f o r r e c u r s i v e e x pan s i on by f i l l i n g i n t h e f i r s t * f o u r s t o p s o f t h e pat h r e p r e s e n t e d by t h e sub−t r e e , and c a l c u l a t i n g * t h e r u n n i n g t o t a l o f t h e pat h l e n g t h . */

private s t a t i c void t r a v e r s e ( i n t c2 , i n t c3 , i n t s k i p ) { T r a v e r s a l t r a v e r s a l = new T r a v e r s a l ( n u m C i t i e s ) ; traversal traversal traversal traversal traversal

. v isit (0 , 0); . v i s i t ( 1 , c2 ) ; . v i s i t ( 2 , c3 ) ; . v i s i t ( 3 , t r a v e r s a l . findNextCity ( 0 , skip ) ) ; . branchAndBound ( b e s t P a t h s [ c2 ] [ c3 ] , 4 ) ;

} private s t a t i c void p r i n t R e s u l t ( ) { f o r ( i n t i = 0 ; i < n u m C i t i e s ; i ++) { f o r ( i n t j = 0 ; j < n u m C i t i e s ; j ++) { i f ( ! bestPaths [ i ] [ j ] . isIncomplete ( ) ) System . out . p r i n t l n ( b e s t P a t h s [ i ] [ j ] ) ; } } } private s t a t i c void e x i tW i t h U s a g e ( ) { System . e r r . p r i n t l n ( ” u s a g e : j a v a TSP [ − debug ] − f i l e System . e x i t ( − 1); } public s t a t i c void debug ( S t r i n g s ) { i f ( TSP . debug ) System . out . p r i n t ( s ) ; }

}

public s t a t i c void d e b u g l n ( S t r i n g s ) { i f ( TSP . debug ) System . out . p r i n t l n ( s ) ; }

c l a s s Path { protected i n t [ ] path ; private int t o t a l D i s t a n c e = −1; public Path ( i n t n u m C i t i e s ) {

filename nCities ” ) ;

Appendix C. Benchmarks Used

path = new i n t [ n u m C i t i e s ] ; f o r ( i n t i = 0 ; i < path . l e n g t h ; i ++) path [ i ] = − 1 ;

}

public public public public

int void void void

g e t T o t a l D i s t a n c e ( ) { return t o t a l D i s t a n c e ; } s etTotal D i s tance ( int d i s t a n c e ) { t o t a l D i s t a n c e = d i s t a n c e ; } i n c r e m e n t T o t a l D i s t a n c e ( i n t d e l t a ) { t o t a l D i s t a n c e += d e l t a ; } d e c r e m e n t T o t a l D i s t a n c e ( i n t d e l t a ) { t o t a l D i s t a n c e −= d e l t a ; }

public boolean i s I n c o m p l e t e ( ) { f o r ( i n t i = 0 ; i < path . l e n g t h ; i ++) i f ( path [ i ] == −1) return true ; return f a l s e ; } public void update ( Path p ) { this . total Di s tance = p . total Di s tance ; System . a r r a y c o p y ( p . path , 0 , t h i s . path , 0 , p . path . l e n g t h ) ; } public S t r i n g t o S t r i n g ( ) { S t r i n g B u f f e r sb = new S t r i n g B u f f e r ( ) ; sb . append ( ” ( ” ) . append ( t o t a l D i s t a n c e ) . append ( ” ) ” ) ; int i = 0; while ( i < path . l e n g t h ) { sb . append ( path [ i ] ) ; i f ( path [ i ] == −1) break ; sb . append ( ”−>” ) ; ++i ; } i f ( i == path . l e n g t h ) sb . append ( path [ 0 ] ) ; return sb . t o S t r i n g ( ) ; }

}

c l a s s T r a v e r s a l extends Path { private boolean [ ] i s V i s i t e d ; public T r a v e r s a l ( i n t n u m C i t i e s ) { super ( n u m C i t i e s ) ; i s V i s i t e d = new boolean [ n u m C i t i e s ] ; f o r ( i n t i = 0 ; i < i s V i s i t e d . l e n g t h ; i ++) isV isited [ i ] = false ; } /*

* Marks a c i t y as v i s i t e d and adds i t t o t h e t r a v e r s a l a t t h e s p e c i f i e d * location , returns the t o t a l distance . */

public void v i s i t ( i n t hop , i n t c i t y ) { i s V i s i t e d [ c i t y ] = true ; path [ hop ] = city ;

i f ( hop ! = 0 ) i n c r e m e n t T o t a l D i s t a n c e (TSP . g e t D i s t a n c e ( path [ hop − 1 ] , c i t y ) ) ;

} /*

* R e v e r s e s t h e e f f e c t s o f a c a l l t o v i s i t ( ) w i t h t h e same par am e t e r s . */

86

Appendix C. Benchmarks Used public void u n v i s i t ( i n t hop , i n t c i t y ) { is Vi sited [ city ] = false ; path [ hop ] = −1; i f ( hop ! = 0 ) d e c r e m e n t T o t a l D i s t a n c e (TSP . g e t D i s t a n c e ( path [ hop − 1 ] , c i t y ) ) ;

} /*

* Returns a c i t y t h a t has n ot been v i s i t e d i n t h i s T r a v e r s a l , b u t s k i p s * t h e s p e c i f i e d number o f c a n d i d a t e s . */

public i n t f i n d N e x t C i t y ( i n t prev , i n t s k i p ) { i n t c i t y = ( p r e v + 1 ) % TSP . n u m C i t i e s ;

f o r ( i n t i = 0 ; i < TSP . n u m C i t i e s ; i ++) { if (! isVisited [ city ]) { i f ( s k i p −− == 0) return c i t y ; } c i t y = ( c i t y + 1 ) % TSP . n u m C i ti es ; } throw new RuntimeException ( ” e r r o r s e l e c t i n g n e x t c i t y ” ) ; } /*

* R e c u r s i v e l y e x pan ds a sub−t r e e , s t a r t i n g from t h e l e v e l s p e c i f i e d by * t h e parameter hop , and d i s c a r d s a l l p a t h s whose l e n g t h s e x c e e d t h e * c u r r e n t bound . */

public void branchAndBound ( Path bestPath , i n t hop ) { i n t r e m a i n i n g = TSP . n u m C i t i e s − hop ; i n t p r e v = path [ hop − 1 ] ;

// each o f t h e r e m ai n i n g c i t i e s may be v i s i t e d n e x t f o r ( i n t i = 0 ; i < r e m a i n i n g ; i ++) { i n t n ex t = f i n d N e x t C i t y ( prev , i ) ; i n t b e s t = b es tP a th . g e t T o t a l D i s t a n c e ( ) ; v i s i t ( hop , n ex t ) ; i f ( best == −1 || getTotalDistance () < best ) { i f ( remaining > 1) { t h i s . branchAndBound ( bestPath , hop + 1 ) ; } else { i n t r e t u r n T r i p D i s t a n c e = TSP . g e t D i s t a n c e ( next , path [ 0 ] ) ; incrementTotalDistance ( returnTripDistance ) ; i f ( best == −1 || getTotalDistance () < best ) { b es tP a th . update ( t h i s ) ; // TSP . d e b u g l n ( t h i s . t o S t r i n g ( ) + ” ok ” ) ; } else { // TSP . d e b u g l n ( t h i s . t o S t r i n g ( ) + ” d i s c a r d e d ” ) ; }

} } else { // TSP . d e b u g l n ( t h i s . t o S t r i n g ( ) + ” d i s c a r d e d ” ) ; }

// u n v i s i t t h e c i t y so we can t r y a n o t h e r pe r m u t at i on u n v i s i t ( hop , n ex t ) ; }

}

}

87

Appendix C. Benchmarks Used

C.6 C.6.1

15-puzzle CostLookupTable.java

import j a v a . l a n g . * ; public c l a s s CostLookupTable { public s t a t i c void i n i t i a l i z e ( ) { int s i z e = F i f teenP uzz l e . getBoardSize ( ) ; i n t r o w S i z e = F i f t e e n P u z z l e . getRowSize ( ) ; int index ; CostLookupTable . c o s t T a b l e = new i n t [ s i z e ] [ s i z e ] [ s i z e ] ; f o r ( i n t t i l e = 1 ; t i l e < s i z e ;++ t i l e ) { f o r ( i n t s o u r c e = 0 ; s o u r c e < s i z e ;++ s o u r c e ) { for ( int des t = 0; d e s t < MoveLookupTable . l o o k u p N u m b e r O f P o s i ti o n s ( s o u r c e ) ; ++d e s t ) { i n d e x = MoveLookupTable . l o o k u p P o s i t i o n ( s o u r c e , d e s t ) ; CostLookupTable . c o s t T a b l e [ t i l e ] [ s o u r c e ] [ i n d e x ] = Math . abs ( ( t i l e % r o w S i z e ) − ( i n d e x % r o w S i z e )) − Math . abs ( ( t i l e % r o w S i z e ) − ( s o u r c e % r o w S i z e )) + Math . abs ( ( t i l e / r o w S i z e ) − ( i n d e x / r o w S i z e )) − Math . abs ( ( t i l e / r o w S i z e ) − ( s o u r c e / r o w S i z e ) ) ; } /* f o r */ } /* f o r */ } /* f o r */ } /* i n i t i a l i z e */ public s t a t i c i n t l o o k u p C o s t I n c r em en t ( i n t t i l e , i n t newblank , i n t b l a n k ) { return ( CostLookupTable . c o s t T a b l e [ t i l e ] [ newblank ] [ b l a n k ] ) ; } / * l ook u pC os t In c r e m e n t * / protected s t a t i c i n t [ ] [ ] [ ] } / * CostLookupTable * /

C.6.2

costTable ;

FifteenPuzzle.java

import j a v a . l a n g . * ; import j a v a . i o . IOException ; public c l a s s F i f t e e n P u z z l e { s t a t i c native i n t n a ti v eT h r ea d S et C o n cu r r en c y ( i n t c o n c u r r e n c y ) ; static { System . l o a d L i b r a r y ( ” PuzzleThread ” ) ; n a ti v eT h r ea d S e tC o n c u r r e n c y ( 4 ) ; } public s t a t i c void main ( S t r i n g [ ] a r g v ) { i n t masterDepth = 0 ; int p a r a l l e l S t a r t = 1; i f ( argv . l en gth ! = 2 ) { F i f te e n P u z z l e . usage ( ) ; System . e x i t ( − 1 ) ; } /* i f */

88

Appendix C. Benchmarks Used

89

try { masterDepth = I n t e g e r . p a r s e I n t ( a r g v [ 0 ] ) ; p a r a l l e l S t a r t = I n t e g e r . p ar s eIn t ( argv [ 1 ] ) ; } catch ( NumberFormatException n f e ) { F i f te e n P u z z l e . usage ( ) ; System . e x i t ( − 1 ) ; } /* t r y */ F i f t e e n P u z z l e . s e a r c h P o s i t i o n s ( masterDepth , p a r a l l e l S t a r t ) ; } / * main * / protected s t a t i c void s e a r c h P o s i t i o n s ( i n t masterDepth , i n t p a r a l l e l S t a r t ) { PuzzleFactory f a c t o r y = null ; InitialNode initialNode ; try { f a c t o r y = new P u z z l e F a c t o r y ( ) ; } catch ( IOException i o e ) { i oe . printStackTrace ( ) ; System . e x i t ( − 1 ) ; } /* t r y */ // Precompute needed t a b l e s . MoveLookupTable . i n i t i a l i z e ( ) ; CostLookupTable . i n i t i a l i z e ( ) ; f o r ( i n t i = 1;;++ i ) { i n i t i a l N o d e = f actor y . getNextPuzzle ( ) ; i f ( i n i t i a l N o d e == n u l l ) { break ; } /* i f */ System . out . p r i n t l n ( ” Problem Number ” + i ) ; System . out . p r i n t l n ( ” Board : ” ) ; i n i t i a l N o d e . printBoard ( ) ; i n i t i a l N o d e . cr ea teM a s ter G r a p h ( masterDepth ) ; // i n i t i a l N o d e . pr i n t G r aph ( ) ; initialNode . iterativeDeepening ( paralle lSta rt ) ; } /* f o r */ } / * main * / public s t a t i c i n t g e t B o a r d S i z e ( ) { return ( F i f t e e n P u z z l e . s i z e * F i f t e e n P u z z l e . s i z e ) ; } / * getNumberOfPieces * / public s t a t i c i n t getRowSize ( ) { return ( F i f t e e n P u z z l e . s i z e ) ; } / * g e t R owS i z e * / // 4 by 4 b oar d . protected s t a t i c f i n a l i n t

size = 4 ;

protected s t a t i c void u s a g e ( ) { System . e r r . p r i n t l n ( ” Usage : j a v a F i f t e e n P u z z l e depthOfMasterNodes p a r a l l e l S t a r t ” ) ; } /* u s a g e */ } /* F i f t e e n P u z z l e */

C.6.3

FrontierNode.java

import j a v a . l a n g . * ; public c l a s s F r o n ti er N o d e extends SearchNode implements Runnable

Appendix C. Benchmarks Used

90

{ public s t a t i c boolean done = f a l s e ; protected s t a t i c ThreadGroup threadGroup = new ThreadGroup ( ” Threads ” ) ; protected s t a t i c boolean t h r e a d s E x i s t = f a l s e ; protected i n t g ; protected i n t h ; protected i n t t h r e s h o l d ; protected i n t o l d B l a n k ; protected i n t c o u n t ; public i n t nodeCount ; public F r o n ti er N o d e ( i n t [ ] p o s i t i o n , i n t b l a n k ) { super ( p o s i t i o n , b l a n k ) ; t h i s . co u n t = 0 ; } / * Fr on t i e r N ode * / public s t a t i c i n t busy ( ) { /* r e t u r n t r u e w h i l e a t l e a s t one t h r e a d i s a c t i v e and f a l s e when t h e y ar e a l l done . */ i f ( ! F r o n ti er N o d e . t h r e a d s E x i s t ) return 0 ; Thread . cu r r en tT h r ea d ( ) . y i e l d ( ) ; // t r y { Thread . s l e e p ( 1 0 0 0 ) ; // } // c a t c h ( E x c e p t i o n e ) { } // System . ou t . p r i n t l n (”# t h r e a d s : ” + threadGroup . a c t i v e C o u n t ( ) ) ; return ( threadGroup . a c t i v e C o u n t ( ) ) ; } protected boolean s e a r c h ( i n t g , i n t h , i n t t h r e s h o l d , i n t oldBlank , i n t p a r a l l e l S t a r t ) { Thread t h r e a d ;

//

// //

i f ( F r o n ti er N o d e . done ) return true ; t h i s . co u n t = t h i s . co u n t + 1 ; System . ou t . p r i n t l n (” Search number ” + t h i s . c ou n t + ” i n node : ” + t h i s . g e t B l a n k ( ) ) ; this . g = g ; this . h = h ; this . threshold = threshold ; this . oldBlank = oldBlank ; i f ( ( c ou n t >= p a r a l l e l S t a r t ) && ( threadGroup . a c t i v e C o u n t ( ) < 8 ) ) { i f ( c ou n t >= p a r a l l e l S t a r t ) { // i f ( p a r a l l e l S t a r t ! = 0 ) { i f ( false ) { t h r e a d = new Thread ( F r o n ti er N o d e . threadGroup , t h i s ) ; thread . s t a r t ( ) ; F r o n ti er N o d e . t h r e a d s E x i s t = true ; } else t h i s . run ( ) ; return F r o n ti er N o d e . done ; } /* s e a r c h */ public void run ( ) { t h i s . nodeCount = 0 ; this . f r o n t i e r S e a r c h ( this . g , this . h , this . threshold , t h i s . oldBlank , t h i s . g etB l a n k ( ) , this . copyPosition ( ) ) ; t h i s . updateNodeCount ( ) ; } / * run * / protected synchronized void updateNodeCount ( ) {

Appendix C. Benchmarks Used

SearchNode . nodeCount += t h i s . nodeCount ; } protected void f r o n t i e r S e a r c h ( i n t g , i n t h , i n t t h r e s h o l d , i n t oldBlank , i n t blank , i n t [ ] p o s i t i o n ) { int index ; i n t newBlank ; int t i l e ; i n t newh ; i n t noMoves = MoveLookupTable . l o o k u p N u m b e r O f P o s i ti o n s ( b l a n k ) ; f o r ( i n d e x = 0 ; i n d e x < noMoves;++ i n d e x ) { newBlank = MoveLookupTable . l o o k u p P o s i t i o n ( blank , i n d e x ) ; i f ( newBlank ! = o l d B l a n k ) { ++t h i s . nodeCount ; t i l e = p o s i t i o n [ newBlank ] ; p o s i t i o n [ b l a n k ] = p o s i t i o n [ newBlank ] ; newh = h + CostLookupTable . l o o k u p C o s tIn cr e m en t ( t i l e , newBlank , b l a n k ) ; i f ( g + 1 + newh = t h i s . manhattan Dist a n c e ( ) ) { throw new I l l e g a l A r g u m e n t E x c e p t i o n ( ” Depth o f m a s ter graph ” + ” to o l a r g e . ” ) ; } /* i f */ SearchNode . createNewGraph ( ) ; L a b e l l a b e l s = new L a b e l ( masterDepth + 1 ) ; t h i s . co n s tr u ctM a s ter G r a p h ( 0 , masterDepth , t h i s . g etB l a n k ( ) , − 1 ,

91

Appendix C. Benchmarks Used

} / * c r e at e M as t e r G r aph * /

92

labels ) ;

public void i t e r a t i v e D e e p e n i n g ( i n t p a r a l l e l S t a r t ) { i n t manhattan ; long s ta r tT i m e ; double searchTime ; boolean s o l v e d = f a l s e ; double t o t a l N o d e s = 0 . 0 ; double nodesForDepth [ ] = new double [ 1 0 0 ] ; int thr es hol d ; manhattan = t h i s . manhattanD ista nce ( ) ; s ta r tT i m e = System . c u r r e n t T i m e M i l l i s ( ) ; f o r ( t h r e s h o l d = manhattan ; ! s o l v e d && ! F r o n ti er N o d e . done ; t h r e s h o l d + = 2 ) { // R e s e t v i s i t e d v e r t i c e s . SearchNode . getGraph ( ) . r e s e t ( ) ; SearchNode . nodeCount = 0 . 0 ; s o l v e d = t h i s . s e a r c h ( 0 , manhattan , t h r e s h o l d , − 1 , p a r a l l e l S t a r t ) ; // i n t add = 0 ; // i n t c ou n t = −1; // i n t was t e = 0 ; // w h i l e ( c ou n t ! = 0 ) { // c oun t = Fr on t i e r N ode . b usy ( ) ; // i f ( c ou n t < 8 ) // add = 1 ; // i f ( add == 1 ) // was t e += ( 8 − c ou n t ) ; // } // System . ou t . p r i n t l n ( ” Waste i s ” + was t e + ” avg ” + was t e / 8 ) ; while ( F r o n ti er N o d e . busy ( ) ! = 0 ) ; t o t a l N o d e s += SearchNode . nodeCount ; nodesForDepth [ t h r e s h o l d ] = SearchNode . nodeCount ; searchTime = ( System . c u r r e n t T i m e M i l l i s ( ) − s ta r tT i m e ) / 1 0 0 0 . 0 ; System . out . p r i n t l n ( t h r e s h o l d + ” : ” + t o t a l N o d e s + ” time : ” + searchTime ) ; } /* f o r */ ++t o t a l N o d e s ; ++nodesForDepth [ t h r e s h o l d − 2 ] ; searchTime = ( System . c u r r e n t T i m e M i l l i s ( ) − s ta r tT i m e ) / 1 0 0 0 . 0 ; System . out . p r i n t l n ( ”h = ” + manhattan ) ; System . out . p r i n t l n ( ”Time i s ” + searchTime + ” : ” + ( ( i n t ) ( t o t a l N o d e s / searchTime ) ) + ” nodes p e r s e c o n d . ” ) ; System . out . p r i n t l n ( ” g = ” + ( t h r e s h o l d − 2 ) + ” , ” + t o t a l N o d e s + ” nodes ” ) ; f o r ( i n t i = manhattan ; i < t h r e s h o l d ; i + = 2 ) { System . out . p r i n t ( nodesForDepth [ i ] + ” ” ) ; } /* f o r */ System . out . p r i n t l n ( ” ” ) ; } /* i t e r a t i v e D e e p e n i n g */ protected i n t manhattanD ista n ce ( ) { int value = 0 ; int l ength = F i f teenP uzz l e . getBoardSize ( ) ; i n t r o w S i z e = F i f t e e n P u z z l e . getRowSize ( ) ; int t i l e ; f o r ( i n t i = 0 ; i < l e n g t h ;++ i ) { t i l e = this . getPositionValue ( i ) ; if ( ti l e != 0) { v a l u e += (Math . abs ( ( i % r o w S i z e ) − ( t i l e % r o w S i z e )) + Math . abs ( ( i / r o w S i z e ) − ( t i l e / r o w S i z e ) ) ) ;



            } /* if */
        } /* for */
        return (value);
    } /* manhattanDistance */
} /* InitialNode */

C.6.5  Label.java

import java.lang.*;

public class Label {
    public Label(int numberLabels) {
        this.labels = new Integer[numberLabels];
        for (int i = 0; i < numberLabels; ++i) {
            this.labels[i] = new Integer(i);
        } /* for */
    } /* Label */

    public Integer getLabel(int label) {
        return (this.labels[label]);
    } /* getLabel */

    protected Integer[] labels;
} /* Label */

C.6.6  MasterNode.java

import java.lang.*;
import structure.*;

public class MasterNode extends SearchNode {
    public MasterNode(int[] position, int blank) {
        super(position, blank);
    } /* MasterNode */

    protected boolean search(int g, int h, int threshold, int oldBlank,
                             int parallelStart) {
        int blank = this.getBlank();
        boolean solved = false;
        Graph graph = SearchNode.getGraph();
        int newh;
        int newBlank;
        //++SearchNode.nodeCount;
        graph.visit(this);
        Iterator neighbours = graph.neighbors(this);
        while (neighbours.hasMoreElements()) {
            SearchNode s = (SearchNode) neighbours.nextElement();
            if (!graph.isVisited(s)) {
                ++SearchNode.nodeCount;
                newBlank = s.getBlank();
                newh = h + CostLookupTable.lookupCostIncrement(
                           this.getPositionValue(newBlank), newBlank, blank);
                if (g + 1 + newh ...

C.6.7  Operator.java

        ... rowSize - 1) {
            this.pos[this.num] = blankIndex - rowSize;
            ++this.num;
        } /* if */
        if (blankIndex % rowSize > 0) {
            this.pos[this.num] = blankIndex - 1;
            ++this.num;
        } /* if */
        if (blankIndex % rowSize < rowSize - 1) {
            this.pos[this.num] = blankIndex + 1;
            ++this.num;
        } /* if */
        if (blankIndex < FifteenPuzzle.getBoardSize() - rowSize) {
            this.pos[this.num] = blankIndex + rowSize;
            ++this.num;
        } /* if */
    } /* Operator */

    public final int getPosition(int index) {
        if (index >= this.num) {
            throw new IndexOutOfBoundsException(index + " of " + this.num);



        } /* if */
        return (this.pos[index]);
    } /* getPos */

    public final int getNumPositions() {
        return (this.num);
    } /* getNumPositions */

    public final void print() {
        System.out.println("Number of positions: " + this.getNumPositions());
        System.out.print("Positions: ");
        for (int i = 0; i < this.num; ++i) {
            System.out.print(" " + this.getPosition(i));
        } /* for */
        System.out.print("\n");
    } /* print */

    protected int num;
    protected int[] pos;
} /* Operator */

C.6.8  PuzzleFactory.java

import java.lang.*;
import java.io.*;

public class PuzzleFactory {
    public PuzzleFactory(String fileName) throws IOException {
        this.file = new BufferedReader(new FileReader(fileName));
    } /* PuzzleFactory */

    public PuzzleFactory() throws IOException {
        this("15 puzz");
    } /* PuzzleFactory */

    // Returns null node when input file is exhausted.
    public InitialNode getNextPuzzle() {
        int[] position = new int[FifteenPuzzle.getBoardSize()];
        int blankIndex;
        InitialNode node;

        try {
            String inputLine = this.file.readLine();
            blankIndex = this.parsePosition(position, inputLine);
            if (blankIndex != -1) {
                node = new InitialNode(position, blankIndex);
            } else {
                node = null;
                if (inputLine != null) {
                    System.err.println("Error parsing position:\n" + inputLine);
                } /* if */
            } /* if */
        } catch (IOException ioe) {
            ioe.printStackTrace();
            System.err.println("Continuing...");
            node = null;
        } /* try */


        return (node);
    } /* getNextPuzzle */

    // Returns index of the blank (the 0 in the input line).
    protected int parsePosition(int[] position, String textLine) {
        int blankIndex = -1;
        int size = position.length;
        String substring = "";
        int startIndex = 0;
        int endIndex;
        int i;
        String inputLine;

        if (textLine == null) {
            return (-1);
        } /* if */
        inputLine = textLine.trim();
        endIndex = inputLine.indexOf(SPACE);
        if (endIndex != -1) {
            substring = inputLine.substring(startIndex, endIndex);
            position[0] = Integer.parseInt(substring);
            if (position[0] == 0) {
                blankIndex = 0;
            } /* for */
            // If the end index ever comes back empty, then we have an
            // error since this loop is only reading the intermediate
            // values.
            for (i = 1; i < size - 1; ++i) {
                startIndex = endIndex + 1;
                endIndex = inputLine.indexOf(SPACE, startIndex);
                if (endIndex == -1) {
                    System.err.println("Encountered premature end.");
                    return (-1);
                } /* if */
                substring = inputLine.substring(startIndex, endIndex);
                position[i] = Integer.parseInt(substring);
                if (position[i] == 0) {
                    blankIndex = i;
                } /* if */
            } /* for */
            startIndex = endIndex + 1;
            endIndex = inputLine.length();
            substring = inputLine.substring(startIndex, endIndex);
            position[i] = Integer.parseInt(substring);
            if (position[i] == 0) {
                blankIndex = i;
            } /* if */
        } /* if */
        // If we have successfully read all of the data, return the
        // index of the blank.  Otherwise, return the error status.
        return (blankIndex);
    } /* parsePosition */

    protected BufferedReader file;
    protected static int SPACE = 32;
} /* PuzzleFactory */

C.6.9  SearchNode.java

import java.lang.*;
import structure.*;



public abstract class SearchNode {
    public SearchNode(int[] position, int blank) {
        this.position = position;
        this.blank = blank;
    } /* SearchNode */

    public void printNode() {
        int size = this.position.length;
        for (int i = 0; i < size - 1; ++i) {
            System.out.print(this.position[i] + " ");
        } /* for */
        System.out.println(this.position[size - 1]);
    } /* printNode */

    public void printBoard() {
        int rowSize = FifteenPuzzle.getRowSize();
        int i = 0;
        for (int j = 0; j < rowSize; ++j) {
            for (int k = 0; k < rowSize; ++k) {
                if (this.position[i] < 10) {
                    System.out.print(" ");
                } /* if */
                System.out.print(" " + this.position[i]);
                ++i;
            } /* for */
            System.out.println("");
        } /* for */
    } /* printBoard */

    public void printGraph() {
        this.printGraph(0);
    } /* printGraph */

    protected abstract boolean search(int g, int h, int threshold,
                                      int oldBlank, int parallelStart);

    protected void printGraph(int depth) {
        Graph g = SearchNode.getGraph();
        // Iterator vertices = g.elements();
        // for (; vertices.hasMoreElements(); ) {
        //     SearchNode s = (SearchNode) vertices.nextElement();
        //     Iterator neighbourhood = g.neighbours(s);
        //
        // } /* for */
        if (!g.isVisited(this)) {
            System.out.println("Node at depth " + depth + " with "
                               + g.degree(this) + " neighbours.");
            this.printBoard();
            g.visit(this);
            Iterator neighbourhood = g.neighbors(this);
            for (; neighbourhood.hasMoreElements(); ) {
                SearchNode s = (SearchNode) neighbourhood.nextElement();
                s.printGraph(depth + 1);
            } /* for */
        } /* if */
    } /* printGraph */

    protected final int getPositionValue(int index) {
        return (this.position[index]);


    } /* getPositionValue */

    protected final int getBlank() {
        return (this.blank);
    } /* getBlank */

    protected final int[] copyPosition() {
        int[] copy = new int[this.position.length];
        System.arraycopy(this.position, 0, copy, 0, this.position.length);
        return (copy);
    } /* copyPosition */

    protected int[] position;
    protected int blank;

    protected static Graph getGraph() {
        return (SearchNode.graph);
    } /* getGraph */

    protected static void createNewGraph() {
        SearchNode.graph = new GraphListUndirected();
    } /* createNewGraph */

    protected static Graph graph = new GraphListUndirected();
    protected static double nodeCount = 0.0;
} /* SearchNode */

