Lock Reservation: Java Locks Can Mostly Do Without Atomic Operations

Kiyokuni Kawachiya, Akira Koseki, Tamiya Onodera

IBM Research, Tokyo Research Laboratory
1623-14, Shimotsuruma, Yamato, Kanagawa 242-8502, Japan
{kawatiya,akoseki,tonodera}@jp.ibm.com

ABSTRACT

Because of the built-in support for multi-threaded programming, Java programs perform many lock operations. Although the overhead has been significantly reduced in recent virtual machines, one or more atomic operations are required for acquiring and releasing an object’s lock even in the fastest cases. This paper presents a novel algorithm called lock reservation. It exploits the thread locality of Java locks, the observation that the locking sequence of a Java lock contains a very long repetition of a specific thread. The algorithm allows locks to be reserved for threads. When a thread attempts to acquire a lock, it can do without any atomic operation if the lock is reserved for the thread. Otherwise, it cancels the reservation and falls back to a conventional locking algorithm. We have evaluated an implementation of lock reservation in IBM’s production virtual machine and compiler. The results show that it achieved performance improvements of up to 53% in real Java programs.

Categories and Subject Descriptors
D.3.4 [Programming Languages]: Processors - optimization

General Terms
Languages, Algorithms, Performance, Measurement, Experimentation

Keywords
Java, synchronization, monitor, lock, reservation, thread locality, atomic operation

1. INTRODUCTION

One important characteristic of the Java programming language [17] is its built-in support for multi-threaded programming. For synchronization between independently executing threads, Java adopts semantics based on monitors [11, 18], and has monitors associated with objects. The language constructs for synchronization are synchronized methods and blocks. When a thread executes a synchronized method against an object or a synchronized block with an object, the thread acquires the object’s lock before the execution and releases the lock after the execution. Thus, at most one thread can execute the synchronized method or the synchronized block.

Because of the built-in support for multi-threaded programming, libraries in Java tend to be designed to be thread-safe, containing many methods declared as synchronized. As a result, Java applications perform a significant number of lock operations. It was reported that 19% of the total execution time was wasted by thread synchronization in an early version of the Java virtual machine [4].

Many techniques have since been proposed for optimizing locks in Java, which can be divided into two categories: runtime techniques and compile-time techniques. The former attempt to make lock operations cheaper [2, 6, 13, 34], while the latter attempt to eliminate lock operations [3, 9, 10, 12, 38, 44].

Almost all the runtime techniques follow the principle of optimizing common cases. They exploit the observation that Java locks are normally not contended, and optimize the uncontended cases. These techniques allow a lock to be acquired and released with only a few machine instructions in the absence of contention. However, the instruction sequence inevitably contains one or more compound atomic operations such as compare_and_swap. Considering that atomic operations are especially expensive in modern architectures, synchronization has not yet become sufficiently light, though the overhead has been significantly reduced.

This paper proposes a new runtime technique called lock reservation. It also follows the principle of optimizing common cases. The observation exploited is the biased distribution of lockers called thread locality. That is, for a given object, the lock tends to be dominantly acquired and released by a specific thread, which is obviously the case in single-threaded applications¹. The key idea is to allow a lock to be reserved for a thread. When a thread attempts to acquire an object’s lock, the acquisition is ultra-fast if the lock is reserved for the thread. In particular, it does not require any atomic operation. On

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. OOPSLA’02, November 4-8, 2002, Seattle, Washington, USA. Copyright 2002 ACM 1-58113-417-1/02/0011 ...$5.00.

¹ Java virtual machines may create internal helper threads, so Java programs can never be single-threaded in the strict sense.


Table 1: Benchmark programs

Program name       Description                                                                      Multithreaded?
SPECjvm98          Run each program three times in the application mode.
  _202_jess        Expert shell system solving a set of puzzles                                     No
  _201_compress    LZW compression and decompression                                                No
  _209_db          Perform database functions on memory resident database                           No
  _222_mpegaudio   Decompress MP3 audio files                                                       No
  _228_jack        Parser generator generating itself                                               No
  _213_javac       Java source-to-bytecode compiler from the JDK 1.0.2                              No
  _227_mtrt        Two-threaded ray tracer                                                          Yes
SPECjbb2000        Simulate the operations of a TPC-C like business logic, run for 8 warehouses.    Yes
Volano Server      Chat room simulator                                                              Yes
Volano Client      Chat client, creating 200 connections and sending 100 messages per connection.   Yes

Table 2: Exploitable thread locality of Java locks

Program name       Number of         Number of          Ratio of lock ops.
                   sync'd objects    lock operations    in 1st repetitions
SPECjvm98
  _202_jess                21278          14646978          99.993%
  _201_compress             2135             28895          97.211%
  _209_db                  66592         162117521          99.9998%
  _222_mpegaudio            1620             27168          98.108%
  _228_jack              1635497          38570415          99.998%
  _213_javac             1192734          47062772          99.974%
  _227_mtrt                 3020           3522926          99.557%
SPECjbb2000²             2077210         102282147          79.392%
Volano Server               7279           7244208          75.983%
Volano Client               4102          10419671          84.270%

Figure 1: General thread locality and exploitable thread locality. [Diagram: the locking sequences of two objects, from creation to garbage collection. Object 1 shows difficult-to-exploit locality; Object 2 shows exploitable locality. X denotes that thread X acquires the lock.]

the other hand, if the lock is reserved for another thread, the reservation must first be canceled, and the acquisition falls back to an existing algorithm.

As we see later, lock reservation can be built on any existing locking algorithm, as long as it uses a word or field in the object header and has one available bit. This bit is used for representing the reservation status. When the status bit is set, the meaning of the rest of the bits is defined by our lock reservation algorithm, while when the bit is not set, the meaning is defined by the underlying algorithm.

The rest of the paper is organized as follows. Section 2 shows the thread locality of locks in real Java programs. Section 3 describes the algorithm of lock reservation. Section 4 presents performance results, while Section 5 discusses the related work. Finally, Section 6 offers conclusions.

2. THREAD LOCALITY OF JAVA LOCKS

This section studies the thread locality of Java locks, which we exploit for reducing the synchronization overhead of Java programs. Thread locality of a lock is defined in terms of the locking sequence, the sequence of threads (in temporal order) that acquire the lock. The general form of thread locality is stated as follows.

For a given lock, if its locking sequence contains a very long repetition of a specific thread, the lock is said to exhibit thread locality, while the specific thread is said to be the dominant locker.

However, the general form of thread locality is not easy to exploit, since we consider adaptive optimization of locks rather than static optimization using off-line profiles. While the locking sequence of a lock is still being constructed, it is very hard for the runtime system to cheaply determine whether the lock exhibits thread locality or whether the current locker is the dominant locker.

Thus, a stronger form of thread locality is considered for exploitability, which is described as follows.

For a given lock, if the locking sequence starts with a very long repetition of a specific thread, the lock is said to show exploitable thread locality.

When the lock exhibits exploitable thread locality, the initial locker is the dominant locker. Figure 1 shows two objects, one with general but not exploitable locality, and the other with exploitable locality.

To investigate how many objects show exploitable thread locality in real programs, we gathered locking statistics using an instrumented version of the IBM Development Kit for Windows, Java Technology Edition, Version 1.3.1 [20]. We measured the Java programs listed in Table 1: the seven programs of the SPECjvm98 [40], the SPECjbb2000 [39] for eight warehouses, and the server and client programs of the Volano Mark [43]. Among these programs, _227_mtrt, SPECjbb2000, and the Volano Mark are multi-threaded programs. We ran these programs with the JIT compiler disabled, since some locks would otherwise be optimized away by compiler optimizations.

The focus in our measurements is the first repetition in the locking sequence of each lock. This is the beginning subsequence consisting only of the initial locker³. If the first repetition of a lock is very long, the lock shows exploitable thread locality. Table 2 presents the results⁴, including the

² The total number of locks for SPECjbb2000 varies depending on the execution speed.
³ The length of the first repetition may be one. Also, the initial locker may appear again after the first repetition.
⁴ The results shown here are for the complete execution of each program, including lock operations during the program startup and shutdown.


Figure 2: Lockword structure and semantics. [Diagram. Lockword structure: in the reserve mode the LRV bit is 1 and the lockword holds the tid and rcnt fields; in the base mode the LRV bit is 0 and the remaining bits are defined by the base lock. Lockword semantics in the reserve mode: (a) tid = A, rcnt = 0, LRV = 1: reserved for Thread A, but not held; (b) tid = A, rcnt > 0, LRV = 1: reserved for and held by Thread A; (c) tid = 0, rcnt = 0, LRV = 1: reserved anonymously (will be reserved by the initial locker).]

Figure 3: Lock state transitions. [Diagram. In the reserve mode, object creation yields the anonymously reserved state; acquire (initial synchronization) moves the lock to the state reserved for Thread A and acquired; further acquire and release operations move between the reserved, acquired, and recursively acquired states; unreserve moves any of these states to the base mode, which is handled by the base locking algorithm.]

total number of synchronized objects, the total number of lock operations, and the ratios of lock operations in the first repetitions. As shown in the table, the vast majority of lock operations are performed by the initial lockers. Even for multi-threaded programs, more than 75% of the lock operations were performed by the initial lockers in the first repetitions. Thus, we can draw the conclusion that a significant number of objects exhibit exploitable thread locality. Notice that the ratios in the last column are not 1.0 even for single-threaded programs, since the virtual machine creates system threads for internal tasks such as finalization. We also note that the initial locker of an object is not necessarily the creator of the object. This happens in the Volano Mark programs, where a single thread is dedicated to creating objects and passing them to worker threads that actually use the objects.
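The first-repetition measurement can be made concrete with a small sketch. It is an illustration only: the function names and the representation of a locking sequence as an array of integer thread IDs are assumptions made here, not part of the instrumented virtual machine used for Table 2.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the first repetition of one lock's locking sequence: the
 * leading run of entries equal to the initial locker. seq[i] is the
 * (hypothetical) integer ID of the thread that performed the i-th
 * acquisition. Returns 0 for an empty sequence. */
size_t first_repetition_length(const int *seq, size_t n)
{
    assert(seq != NULL || n == 0);
    size_t len = 0;
    while (len < n && seq[len] == seq[0])
        len++;
    return len;
}

/* Fraction of this lock's operations that fall in the first repetition. */
double first_repetition_ratio(const int *seq, size_t n)
{
    if (n == 0)
        return 0.0;
    return (double)first_repetition_length(seq, n) / (double)n;
}
```

For the sequence 1, 1, 1, 2, 1 the first repetition has length three, even though thread 1 appears again later; only the leading run counts.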

3. LOCK RESERVATION

This section presents a new locking algorithm called lock reservation. It exploits the observation that Java locks show thread locality, as discussed in the previous section. The key idea is to reserve locks for threads. When a thread attempts to acquire an object’s lock, one of the following actions is taken:

1. If the object’s lock is reserved for the thread, the runtime system allows the thread to acquire the lock with a few instructions involving no atomic operation.
2. If the object’s lock is reserved for another thread, the runtime system cancels the reservation, and falls back to a conventional algorithm for further processing.
3. If the object’s lock is not reserved, the runtime system uses a conventional algorithm.

Our algorithm can be built on any existing locking algorithm, as long as it uses a lockword⁵, a word in the object header for locking, and allows one bit to be available in the lockword. The bit is used for representing the lock reservation status, and is hence named the LRV bit. When the LRV bit is set, the lockword is in the reserve mode, and the structure is defined by our algorithm. When the bit is not set, the lockword is in the base mode, and the structure is defined by the underlying algorithm that the runtime system falls back to after canceling the reservation.

3.1 Lockword Structure

Figure 2 shows the structure of the lockword. When the LRV bit is set, the lockword is in the reserve mode, and is further divided into the thread identifier (tid) field and the recursion count (rcnt) field. The former field contains an identifier of the owner thread, for which the lock is reserved, while the latter field keeps the lock recursion level. When the rcnt field is zero, the lock is reserved but not held by any thread (Figure 2(a)). When the field is non-zero, the lock is held by the owner thread (Figure 2(b)). As we will see later, the owner thread can acquire the lock by simply incrementing the rcnt field, with no atomic operation.

The rcnt field is also intended for recursive locking, which is fairly common in Java. The owner thread acquires the lock recursively by simply incrementing the rcnt field, in just the same manner as it initially acquires the lock. We must maintain the recursion count of a lock since Java does not allow a thread to release a lock more times than it acquires the lock. The virtual machine must detect such an illegal state and raise an instance of IllegalMonitorStateException.

When an object is created, the lock is anonymously reserved. That is, the lockword is in the reserve mode, but not reserved for or held by any particular thread (Figure 2(c)). This is because the thread for which the lock should be reserved is normally not known at the time of creation. In general, a reservation policy determines when and for which thread a lock is reserved. Since we base our algorithm on the exploitable thread locality described in the previous section, we use the initial-locker policy. That is, when an object is locked for the first time by a thread, we reserve the object’s lock for that thread.

When the reservation is canceled, the LRV bit is reset, and the lockword is put in the base mode. The structure is then completely defined by the base algorithm. As we will see later, canceling a reservation is the most challenging part of our algorithm, requiring the owner thread to be suspended. The cancellation replaces the lockword in the reserve mode with the corresponding state in the base algorithm. Figure 3 depicts the state transitions of the lockword in our algorithm.

⁵ Actually, we don’t need the whole 32 bits of the word, and could put in the word other information unrelated to locking. However, for the sake of explanation, we assume that the whole word is used for locking.
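The reserve-mode lockword and the choice among the acquire actions can be sketched as follows. This is a minimal illustration under stated assumptions: the paper leaves the field widths open, so the 32-bit layout chosen here (LRV bit in bit 0, an 8-bit rcnt, the remaining bits for tid) and all names in the block are hypothetical. Thread ID 0 stands for an anonymous reservation.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, for illustration only:
 * bit 0 = LRV bit, bits 1..8 = rcnt, bits 9..31 = tid. */
#define LRV_BIT    0x1u
#define RCNT_SHIFT 1
#define RCNT_MAX   0xFFu
#define TID_SHIFT  9

enum acquire_action {
    ACT_FAST_PATH,      /* reserved for the caller: increment rcnt, no atomics */
    ACT_MAKE_SPECIFIC,  /* anonymously reserved: claim with compare_and_swap */
    ACT_UNRESERVE,      /* reserved for another thread, or rcnt is saturated */
    ACT_BASE            /* LRV bit clear: defer to the base algorithm */
};

/* Build a reserve-mode lockword from its fields. */
uint32_t make_reserved(uint32_t tid, uint32_t rcnt)
{
    assert(rcnt <= RCNT_MAX && tid < (1u << (32 - TID_SHIFT)));
    return (tid << TID_SHIFT) | (rcnt << RCNT_SHIFT) | LRV_BIT;
}

uint32_t lw_tid(uint32_t w)  { return w >> TID_SHIFT; }
uint32_t lw_rcnt(uint32_t w) { return (w >> RCNT_SHIFT) & RCNT_MAX; }

/* Decide which action an acquiring thread takes for lockword w. */
enum acquire_action classify_acquire(uint32_t w, uint32_t my_tid)
{
    if ((w & LRV_BIT) == 0)      return ACT_BASE;
    if (lw_tid(w) == 0)          return ACT_MAKE_SPECIFIC;
    if (lw_tid(w) != my_tid)     return ACT_UNRESERVE;
    if (lw_rcnt(w) == RCNT_MAX)  return ACT_UNRESERVE;
    return ACT_FAST_PATH;
}
```

A freshly created object would carry `make_reserved(0, 0)` under this layout, so the first locker takes the `ACT_MAKE_SPECIFIC` path and later acquisitions by the same thread take `ACT_FAST_PATH`.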


3.2 Algorithm

Figure 4 shows the algorithm of lock reservation⁶. A thread attempting to acquire an object’s lock calls the acquire() function, where it reads the lockword, and performs four checks to see if it is not in a special state (lines 21–24). If it passes all the checks, the lock is in the most common state where the thread owns the lock’s reservation. It completes the lock acquisition by simply incrementing the rcnt field (line 28). Similarly, a thread attempting to release an object’s lock calls the release() function, where it first reads the lockword, and performs three checks to see if it is not in a special state (lines 52–54). When it passes all the checks, the function finishes the lock release by simply decrementing the rcnt field (line 58). Thus, it only takes a few non-atomic instructions to acquire and release a lock in the most common case when the thread owns the reservation.

There are three special cases in the acquire() function. First, when the lock is anonymously reserved (line 22), the function attempts to make it specifically reserved by using compare_and_swap (line 33). Second, when the lock is reserved for another thread (line 23), the thread calls the unreserve() function to cancel the reservation (line 37), and falls back to the base algorithm. This second special case also results when the thread owns the reservation but the recursion count has reached the maximum value (line 24). Third, when the lockword is not in the reserve mode (line 21), the thread executes the corresponding function of the base algorithm (line 40).

There is only one legal special case in the release() function. That is, when the lockword is not in the reserve mode (line 52), the function invokes the corresponding function in the base algorithm (line 65). The Java specification [17] requires that, when a thread attempts to release a lock, the thread actually holds the lock. Otherwise the runtime system must raise an instance of IllegalMonitorStateException. The checks in lines 53 and 54 detect the illegal state in the reserve mode.

We now explain cancellation of a reservation, the most complicated part of our algorithm, for which the unreserve() function is responsible. Basically, a thread calls the function when the thread attempts to acquire a lock which is reserved for another thread⁷. The calling thread atomically replaces the lockword in the reserve mode with the equivalent state in the base algorithm. In doing so, it first suspends the owner thread (line 74), modifies the lockword using the atomic operation (line 80), and resumes the suspended thread (line 90).

Special care must be taken when the owner thread is in the middle of the acquire() or release() functions, more specifically, when it is in one of the unsafe regions, which lie between the read and the write of the lockword in the acquire() (lines 18–28) and release() functions (lines 49–58). To avoid a data race condition, the unreserve() function obtains the execution context of the suspended thread (line 83) to see whether the thread is in one of the unsafe regions. If it is in an unsafe region, the function modifies the program counter with the address of the corresponding retry point (line 17 or 48). Notice that each unsafe region was carefully made restartable by preventing any side effects from occurring.

Finally, after a lock’s reservation is canceled, our algorithm does not return the lock to the reserve mode. An algorithm supporting repeated reservation would become too complicated, and it might also result in more cancellations and degrade performance. In addition, the investigations in the previous section show that most lock operations can be performed in the reserve mode even without repeated reservation.

3.3 Correctness

We now discuss the correctness of our algorithm. As we have shown, a thread does not have to execute any atomic operation in acquiring and releasing a lock when it owns the reservation. In other words, the owner thread can read-modify-write the lockword without atomic operations. Thus, when a different thread attempts to change the lockword between the read and the write, special care must be taken to prevent the modification from being lost. The lock state would otherwise become inconsistent.

When a thread does not own a lock’s reservation, our algorithm requires the thread to call the unreserve() function, where the thread without the reservation modifies the lockword after suspending the owner thread. When the latter thread is suspended in the middle of an unsafe region, it is forced to restart the unsafe region, detecting that it no longer has the reservation. This prevents the thread from continuing the execution based on the no-longer-valid assumption that the thread still owns the reservation.

The owner thread may have already completed its computation and ceased to exist when another thread attempts to cancel a reservation. Although unreserve() must also handle this case properly, there is no risk of a data race condition involving the owner thread.

More than one thread may simultaneously try to make an anonymous reservation specific (line 33) or try to convert the lockword in the reserve mode to the base mode (line 80). However, it is guaranteed that only one thread succeeds since atomic operations are used in both cases.

Once the reservation is canceled, the lockword will never be reserved again. Thus, after the cancellation, our algorithm behaves in exactly the same manner as the base algorithm, and the correctness is ensured by the correctness of the base algorithm.
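The lost-modification hazard described above can be reproduced deterministically in a single-threaded sketch that plays out one fixed interleaving by hand. It is illustrative only: the struct mirrors the fields of the lockword, the base mode is modeled merely as a cleared LRV bit, and every function name is an assumption made for this example rather than part of the algorithm's code.

```c
#include <assert.h>
#include <stdbool.h>

struct lockword { unsigned tid; unsigned rcnt; bool reserve; };

/* First half of the owner's non-atomic acquire: read the lockword. */
struct lockword owner_read(const struct lockword *lw) { return *lw; }

/* Second half: write back the incremented copy with a plain store,
 * overwriting whatever the lockword holds at that moment. */
void owner_write_incremented(struct lockword *lw, struct lockword seen)
{
    assert(seen.reserve);   /* the fast path is only taken in reserve mode */
    seen.rcnt++;
    *lw = seen;
}

/* Stand-in for the cancellation performed by another thread. */
void cancel_reservation(struct lockword *lw)
{
    lw->reserve = false;
}

/* Interleaving: the owner reads, another thread cancels the reservation
 * while the owner is suspended, then the owner continues.
 * restart == false: the owner resumes from its stale read.
 * restart == true:  the owner is sent back to the retry point and rereads.
 * Returns true if the cancellation was lost. */
bool cancellation_lost(bool restart)
{
    struct lockword lw = { .tid = 1, .rcnt = 0, .reserve = true };
    struct lockword seen = owner_read(&lw);
    cancel_reservation(&lw);
    if (restart)
        seen = owner_read(&lw);
    if (seen.reserve)
        owner_write_incremented(&lw, seen);
    return lw.reserve;
}
```

Without the restart the stale store brings the lockword back into the reserve mode, which is exactly the inconsistency that restarting the unsafe region prevents.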

3.4 Discussion

This section considers the performance characteristics of lock reservation, discusses in detail how to determine whether a thread has been suspended in the middle of an unsafe region and how to cancel reservations, and explains multiprocessor issues.

⁶ For readability, the code shown here is slightly different from the actual code. For instance, the condition checks in the beginning of the acquire() and release() functions are merged into two checks in the actual code. Also, the base_acquire() and base_release() functions are tightly coupled with the acquire() and release() functions, respectively.
⁷ The unreserve() function is also called when the rcnt is about to overflow or when the wait() method is called.

Performance Characteristics

Our algorithm is strongly expected to reduce the synchronization overhead when the reservation succeeds, since the owner thread can acquire and release the lock by simply


 1 : // Lockword structure in each object header
 2 : struct Object {
 3 :   :
 4 :   struct lockword {            // [tid:rcnt:R]
 5 :     unsigned int tid     : N;  // Thread ID of the owner thread.
 6 :     unsigned int rcnt    : M;  // Recursion count. Non-zero denotes that the lock is acquired.
 7 :     unsigned int reserve : 1;  // LRV bit. One denotes that the lock is reserved.
 8 :   } lockword;
 9 :   :
10 : };
11 :
12 :
13 : int acquire(struct Object *obj) {
14 :   struct lockword l1, l2;
15 :   int myTID = thread_id();
16 :
17 : retry_acquire:
18 :   l1 = obj->lockword;                                     // read the lockword ------------------(1)
19 :                                                           //                                      A
20 :   // check special cases                                  //                                      |
21 :   if (l1.reserve == 0)     goto base_acquire;               // [xxxxxx:0] not reserved              |
22 :   if (l1.tid == 0)         goto make_specific;              // [0:0:1] anonymously reserved         | unsafe
23 :   if (l1.tid != myTID)     goto unreserve_and_base_acquire; // [other:xxx:1] reserved for another thread | region
24 :   if (l1.rcnt == RCNT_MAX) goto unreserve_and_base_acquire; // [myTID:max:1] rcnt reached the maximum |
25 :                                                           //                                      |
26 :   // reserved for me, and rcnt does not reach the maximum //                                      |
27 :   l2 = l1; l2.rcnt++;                                     // [myTID:rcnt:1] -> [myTID:rcnt+1:1]   V
28 :   obj->lockword = l2;                                     // write the lockword -----------------(2)
29 :   return SUCCESS;
30 :
31 : make_specific:
32 :   l2 = l1; l2.tid = myTID; l2.rcnt = 1;                   // [0:0:1] -> [myTID:1:1]
33 :   if (compare_and_swap(&obj->lockword, l1, l2) != SUCCESS) goto retry_acquire;
34 :   return SUCCESS;
35 :
36 : unreserve_and_base_acquire:
37 :   unreserve(obj, l1.tid, myTID);                          // [xxx:xxx:1] -> [xxxxxx:0]
38 :
39 : base_acquire:
40 :   return base_acquire(obj);                               // if not reserved, call the function for the base mode
41 : }
42 :
43 :
44 : int release(struct Object *obj) {
45 :   struct lockword l1, l2;
46 :   int myTID = thread_id();
47 :
48 : retry_release:
49 :   l1 = obj->lockword;                                     // read the lockword ------------------(1)
50 :                                                           //                                      A
51 :   // check special cases                                  //                                      |
52 :   if (l1.reserve == 0) goto base_release;                 // [xxxxxx:0] not reserved              |
53 :   if (l1.tid != myTID) goto illegal_state;                // [other:xxx:1] reserved for another thread | unsafe
54 :   if (l1.rcnt == 0)    goto illegal_state;                // [myTID:0:1] rcnt is zero             | region
55 :                                                           //                                      |
56 :   // reserved for and held by me                          //                                      |
57 :   l2 = l1; l2.rcnt--;                                     // [myTID:rcnt:1] -> [myTID:rcnt-1:1]   V
58 :   obj->lockword = l2;                                     // write the lockword -----------------(2)
59 :   return SUCCESS;
60 :
61 : illegal_state:
62 :   return IllegalMonitorStateException;
63 :
64 : base_release:
65 :   return base_release(obj);                               // if not reserved, call the function for the base mode
66 : }
67 :
68 :
69 : void unreserve(struct Object *obj, int ownerTID, int myTID) {
70 :   struct lockword l1, l2;
71 :   struct Context context;
72 :
73 :   if (ownerTID == myTID) ownerTID = 0;                    // don’t suspend myself
74 :   thread_suspend(ownerTID);                               // no-op when the target thread does not exist
75 :
76 : retry_unreserve:
77 :   l1 = obj->lockword;
78 :   if (l1.reserve == 0) goto already_unreserved;           // already unreserved by someone
79 :   l2 = base_equivalent_lockword(l1);                      // create the equivalent lock state in the base mode
80 :   if (compare_and_swap(&obj->lockword, l1, l2) != SUCCESS) goto retry_unreserve;  // [xxx:xxx:1] -> [xxxxxx:0]
81 :
82 :   // modify the owner thread’s context if it is in an unsafe region
83 :   if (thread_get_context(ownerTID, &context) == SUCCESS) {
84 :     if (in_unsafe_region(context.pc)) {                   // check if (1) < next PC