Using Squids to Address Forwarding Pointer Aliasing

J.P. Grossman, Jeremy Brown, Andrew Huang, Tom Knight

Project Aries Technical Memo ARIES-TM-04
Artificial Intelligence Laboratory, Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology, Cambridge, MA, USA

August 16, 2002

Abstract

Forwarding pointers allow safe and efficient data migration. However, they also introduce a new source of aliasing and, as a result, can have a serious impact on program performance. In this paper we introduce short quasi-unique ID's (squids), a simple hardware mechanism for capability architectures that mitigates the problems associated with aliasing. In the common case, squids can be used to prove that there is no aliasing, and therefore to avoid any overhead. When pointers to different objects are compared, the probability of having to perform expensive dereferencing operations to check for aliasing is exponentially small in the number of bits used to implement squids. Benchmark programs on a simulated architecture show that squids can, in extreme cases, reduce execution time by more than a factor of four.

1 Introduction

Forwarding pointers are a conceptually simple mechanism that allows references to one memory location to be transparently forwarded to another. Known variously as "invisible pointers" [Greenblatt74], "forwarding addresses" [Baker78] and "memory forwarding" [Luk99], they are relatively easy to implement in hardware, and they are a valuable tool for safe data compaction ([Moon84], [Luk99]) and object migration [Jul88]. Why, then, have they been incorporated into so few architectures?

One reason is that forwarding pointers have traditionally been perceived as having limited utility. Their original intent was fairly specific to LISP garbage collection, but many methods of garbage collection exist which do not make use of or benefit from forwarding pointers [Plainfossé95]; consequently, even some LISP architectures, such as SPUR [Taylor86], do not implement forwarding pointers. Furthermore, the vast majority of processors developed in the past decade have been designed with C code in mind, so there has been little reason to support forwarding pointers.

[Figure 1: Forwarding pointer indirection creates aliasing: two different pointers, P1 (an indirect pointer to D through a forwarding pointer) and P2 (a direct pointer to D), can point to the same piece of data (D).]

More recently, the increasing prevalence of the Java programming language has prompted interest in mechanisms for accelerating the Java virtual machine, including direct silicon implementation [Tremblay99]. Since the Java specification includes a garbage collected memory model [Gosling96], architectures designed for Java can benefit from forwarding pointers, which allow efficient incremental garbage collection ([Baker78], [Moon84]). Additionally, [Luk99] shows that using forwarding pointers to perform safe data relocation can result in significant performance gains on arbitrary programs written in C, speeding up some applications by more than a factor of two. Finally, in a distributed shared memory machine, data migration can improve performance by collocating data with the threads that require it, and forwarding pointers provide a safe and efficient mechanism for object migration [Jul88]. Thus, there is growing motivation to include hardware support for forwarding pointers in novel architectures.

A second and perhaps more significant reason that forwarding pointers have received little attention from hardware designers is that they create a new set of aliasing problems. On an architecture that supports forwarding pointers, the hardware and programmer can no longer assume that different pointers point to different words in memory (Figure 1). In [Luk99] two specific problems are identified. First, direct pointer comparison is no longer a safe operation; some mechanism must be provided for determining the final addresses of the pointers. Second, seemingly independent memory operations may no longer be reordered in out-of-order machines.

In this paper we show how to address these problems using squids (Short Quasi-Unique ID's). Squids are suitable for architectures that support guarded pointers [Carter94], a form of capabilities [Fabry74] that eliminates table lookups by storing all permission and segment bits within the pointers themselves. The central idea is to assign to every object a short random tag, stored in pointers to the object, which is similar in role to a unique identifier but is not necessarily unique. In the common case squids allow pointer comparison and memory operation reordering to proceed with no overhead. Only in rare cases is it necessary to degrade performance to ensure correctness. Thus, squids allow an architecture to support forwarding pointers with minimal average run-time overhead.

The following section describes squids in detail and provides theoretical arguments for their performance benefits. In Section 3 we describe the simulated capability architecture on which the benchmark programs are run. Section 4 presents the benchmarks and discusses the results of our experiments. In Section 5 we examine mechanisms which complement or provide alternatives to squids, and we describe a way to implement squids without capabilities. Finally, in Section 6 we summarize our findings and conclude.

2 Squids

Forwarding pointer aliasing is an instance of the more general challenge of determining object identity in the presence of multiple and/or changing names, a problem which has been studied explicitly [Setrag86]. A natural solution which has appeared time and again is the use of system-wide unique object ID's (e.g. [Dally85], [Setrag86], [Moss90], [Day93], [Plainfossé95]). UID's completely solve the aliasing problem, but they have two disadvantages:

The use of ID’s to reference objects requires an expensive translation each time an object is referenced to obtain its virtual address.

ii. Quite a few bits are required to ensure that there are enough ID's for all objects and that globally unique ID's can be easily generated in a distributed computing environment. In a large system, at least sixty-four bits would likely be required in order to avoid any expensive garbage collection of ID's and to allow each processor to allocate ID's independently.

Despite these disadvantages, the use of ID's remains appealing as a way of solving the aliasing problem, and it is tempting to look for a practical and efficient mechanism based on ID's. This is the motivation for squids.

2.1 Quasi-Unique ID's

We begin by noting that the expensive translations (i) are unnecessary in a guarded pointer architecture that includes object ID's as part of the pointer format. In this case we have the best of both worlds: object references make use of the address, so no translation is required, while pointer comparisons and memory operation reordering are based on ID's, which solves the aliasing problem. However, this still leaves us with disadvantage (ii), which implies that the pointer format must be quite large. We solve this problem by dropping the restriction that the ID's be unique.

Instead of long unique ID's, we use short quasi-unique ID's (squids). At first this seems to defeat the purpose of having ID's, but we make the following observation: while squids cannot be used to determine that two pointers reference the same object, they can in most cases be used to determine that two pointers reference different objects. If we randomly generate an n-bit squid every time an object is allocated, then the probability that pointers to two distinct objects cannot be distinguished by their squids is 2^-n (for example, 1/256 with n = 8).

2.2 Pointer Comparisons

We can efficiently compare two pointers by comparing both their addresses and their squids. Assume for the moment that all pointers point to object heads; we will relax this restriction in Section 2.4. If the addresses are the same, then the pointers point to the same object. If the squids are different, then they point to different objects. In the case that the addresses are different but the squids are the same, we trap to a software routine which performs the expensive dereferences necessary to determine whether or not the final addresses are equal.

We can argue that this last case is rare by observing that it occurs in only two circumstances: either the pointers reference different objects which happen to have the same squid, or the pointers reference the same object through different levels of indirection. The former occurs with probability 2^-n. The latter is application dependent, but we note that (1) applications tend to compare pointers to different objects more frequently than they compare pointers to the same object, and (2) the results of the simulations in [Luk99] indicate that it may be reasonable to expect the majority of pointers to migrated data to be updated, so that two pointers to the same object will usually have the same level of indirection.
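
As a concrete illustration, the comparison rule reduces to the following C sketch; the decomposed operands and the three-way result type are our own illustrative names, not the hardware interface:

    #include <stdint.h>

    /* Outcome of the fast comparison of two pointers to object heads. */
    typedef enum { PTR_EQUAL, PTR_NOT_EQUAL, PTR_UNKNOWN } ptr_cmp;

    /* PTR_UNKNOWN is the rare slow path: trap to software and chase the
     * forwarding chains to obtain the final addresses. */
    ptr_cmp fast_compare(uint64_t addr1, uint8_t squid1,
                         uint64_t addr2, uint8_t squid2)
    {
        if (addr1 == addr2)
            return PTR_EQUAL;       /* same address: certainly the same object */
        if (squid1 != squid2)
            return PTR_NOT_EQUAL;   /* different squids: certainly different objects */
        return PTR_UNKNOWN;         /* addresses differ, squids match */
    }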

2.3 Reordering Memory Operations

In a similar manner, the hardware can use squids to decide whether or not it is possible to reorder memory operations. Assuming still that pointers point to object heads, memory addresses being read/written must be specified as address + offset. If the squids are different, it is safe to reorder. If the squids are the same but the offsets are different, it is again safe to reorder. If the squids and offsets are the same but the addresses are different, the hardware assumes that the operations cannot be reordered. It is not necessary to explicitly check for aliasing, since preserving order guarantees conservative but correct execution. Only simple comparisons are required, and the probability of failing to reorder references to different objects is 2^-n.
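
The test amounts to a few comparisons. A minimal C sketch, with an assumed request record (the field names are ours, not the hardware's):

    #include <stdbool.h>
    #include <stdint.h>

    /* A memory request decomposed as in Section 2.3: the object's squid,
     * the word offset within the object, and whether the request writes. */
    typedef struct {
        uint8_t  squid;
        uint64_t offset;
        bool     is_write;
    } mem_request;

    /* Conservative reordering test: true means the two requests provably
     * touch different words, so the hardware may reorder them. */
    bool may_reorder(const mem_request *a, const mem_request *b)
    {
        if (!a->is_write && !b->is_write)
            return true;   /* two reads always commute */
        if (a->squid != b->squid)
            return true;   /* different squids: different objects for certain */
        if (a->offset != b->offset)
            return true;   /* at worst the same object, but different words */
        return false;      /* possible alias: preserve program order */
    }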

2.4 Pointers to Object Interiors

The preceding discussion assumed that pointers always point to object heads. However, in many cases it is desirable to support pointers to object interiors. For example, it is common practice in C programs to create an array of data structures and then work with pointers to individual structures within the array. This can cause problems for squids. Consider the worst case, in which a program allocates a single massive object containing as sub-objects all the data structures it will ever need. A single squid is created for this super-object, and it will appear in every pointer used by the program. It then becomes a common occurrence for two pointers to different sub-objects to have the same squid. As a result, virtually every pointer comparison will result in a trap, and no memory references will be reordered. The program will still execute correctly, but the advantages of squids will be lost.

We can solve this problem by incorporating a mechanism for extracting the base address of an object from pointers to the object's interior. This allows the hardware to determine when two pointers point to different words of the same object; when the pointers are decomposed as base + offset, the hardware will see that the base addresses are the same but the offsets are different. We note that in a capability architecture that supports pointer arithmetic, such a mechanism must already exist to prevent user programs from subtracting from a pointer an offset larger than the pointer's current offset within the object. Furthermore, it is possible to implement this mechanism with an efficient pointer representation and very little hardware. For example, the guarded pointer format in [Carter94] adds six segment length bits to the pointer representation, and base addresses may be computed with a single right-shift.
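
For instance, with power-of-two aligned segments in the style of [Carter94], the decomposition could look like the following sketch (the names and segment layout are our own illustration):

    #include <stdint.h>

    /* Decompose an interior pointer, assuming its 6-bit segment length
     * field L says the object lives in a 2^L-word aligned segment. */
    typedef struct { uint64_t base, offset; } decomposed;

    decomposed decompose(uint64_t addr, unsigned L)
    {
        decomposed d;
        d.offset = addr & ((1ULL << L) - 1);  /* low L bits: offset into the object */
        d.base   = addr - d.offset;           /* clearing them yields the base */
        return d;
    }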

3 Simulation Environment

To test the effectiveness of squids, several benchmark programs were run using a cycle-accurate simulation of a guarded pointer architecture which is under development. In this section we briefly describe the features of this architecture which are relevant to squids.

3.1 Overview

The architecture targets an embedded-memory fabrication process and consists of a large number of processor-memory nodes connected by a fast network to form a distributed shared-memory machine. Each node contains a network interface, data memory, instruction memory, and a 128-bit VLIW processor. The processor design is kept simple and does not include out-of-order execution, register renaming, branch prediction or speculative execution.

Associated with each 128-bit word in the system is a tag bit, not accessible to user programs, which indicates whether the word contains data or a pointer. 128-bit pointers are broken down into 64 address bits and 64 capability bits. The capability bits include segment information (a scheme similar to that in [Carter94] is used), permission bits, and up to 8 squid bits. Pointer arithmetic is supported.
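
In C, this pointer format might be modeled as follows; the text does not pin down bit positions within the capability half, so the squid's location below is an assumption:

    #include <stdint.h>

    /* Model of the 128-bit tagged pointer: 64 address bits plus 64
     * capability bits (segment info, permissions, up to 8 squid bits). */
    typedef struct {
        uint64_t address;
        uint64_t capability;
    } guarded_pointer;

    /* Assumed layout: squid in the low 8 bits of the capability half. */
    static inline uint8_t pointer_squid(guarded_pointer p)
    {
        return (uint8_t)(p.capability & 0xFF);
    }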

3.2 Implementation of Squids

A special alloc instruction is used to allocate a memory segment of a specified size. The instruction creates a new pointer with a randomly generated squid. Two instructions are provided for pointer equality testing (peq and pneq). These instructions compare two pointers as described in Sections 2.2 and 2.4. In the case that equality cannot be established without checking for aliasing, a software trap is taken and the trap handler performs the necessary dereferencing operations.

One alternative to randomly generating squids is to generate them sequentially. While this would likely work well in most cases, there may be instances in which it does not. For example, if a program generates 2^n linked lists simultaneously, adding nodes in round-robin fashion, then all nodes within a given list will have the same squid. One could perhaps argue that such occurrences are rare, but randomly generating the squids avoids the problem altogether.
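
A sketch of the two allocation policies (the hardware's actual random source is not specified in the text; rand() stands in for it here):

    #include <stdint.h>
    #include <stdlib.h>

    #define SQUID_BITS 8u

    /* Sequential policy: the k-th allocation gets squid k mod 2^n, so 2^n
     * lists built round-robin would give every node of a list one squid. */
    static uint32_t next_squid;
    uint8_t sequential_squid(void)
    {
        return (uint8_t)(next_squid++ & ((1u << SQUID_BITS) - 1));
    }

    /* Random policy (used by alloc): collisions still occur with
     * probability 2^-n, but never systematically. */
    uint8_t random_squid(void)
    {
        return (uint8_t)(rand() & ((1u << SQUID_BITS) - 1));
    }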

3.3 Memory System

Processors are tightly coupled with memory, and there are no data caches. In the absence of contention for a bank of memory, local memory accesses take six cycles to complete. Forwarding pointers are implemented by associating a trap bit with each 128-bit word in data memory. This bit is accessible only to programs operating in privileged mode. When a user program attempts to access a memory word with the trap bit set, an event is generated. The kernel handles the event by simply re-issuing the memory operation using the forwarding address.

All memory operations are explicitly split-phase: computation may proceed after a memory request has been sent without waiting for the reply, and up to eight memory requests may be outstanding at any given time. The memory system makes no guarantees regarding the order in which memory requests complete, so even though the processor does not support out-of-order execution, the problem of memory operation reordering must still be addressed. In particular, two requests may not be outstanding at the same time if their addresses cannot be disambiguated and at least one of them modifies the contents of memory. Thus, before a request is issued, the architecture checks it against the currently outstanding requests, using squids as described in Section 2.3. If no conflict is found, the request is issued immediately to the memory system. Otherwise, the request is delayed until the conflicting operation completes.
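
This issue check can be sketched in C as follows, reusing the request record and reordering test from the sketch in Section 2.3; the eight-entry window matches the architecture's limit on outstanding requests:

    #include <stdbool.h>
    #include <stdint.h>

    /* Same request record and test as the Section 2.3 sketch. */
    typedef struct { uint8_t squid; uint64_t offset; bool is_write; } mem_request;
    bool may_reorder(const mem_request *a, const mem_request *b);

    /* A new request may issue only if it can safely be reordered with
     * every request still in flight (at most eight in this architecture). */
    bool can_issue(const mem_request *req,
                   const mem_request *outstanding, int n_outstanding)
    {
        for (int i = 0; i < n_outstanding; i++)
            if (!may_reorder(req, &outstanding[i]))
                return false;  /* conflict: delay until the older one completes */
        return true;
    }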

3.4 Running the Benchmarks

The overhead of forwarding pointer dereferencing is potentially quite large, especially if there is a deep chain of forwarding pointers, a remote memory reference is required, or, in the worst case, a word in the forwarding chain resides in a page that has been sent to disk. For the purposes of evaluation, we wish to minimize this overhead before applying squids in order to avoid exaggerating their effectiveness. Accordingly, all benchmark programs are run as a single thread on a single node and fit into main memory.

4 Experimental Results

In this section we present our selection of benchmark programs and the results of running the programs with 0-8 squid bits.

4.1 Benchmark Programs

Table 1 lists the benchmark programs used to evaluate squids. The first five (list, 2cycle, kruskal, fibsort, sparsemat) involve pointer comparisons, and the primary overhead is traps taken to determine final addresses. The last three (vector, filter, listrev) involve rapid loads/stores, and the primary overhead is memory stalls when addresses cannot be disambiguated.

    Benchmark   Description                                          Parameters
    list        Add/delete objects to/from a linked list            32 objects in list, 1024 delete/add iterations
    2cycle      Detect 2-cycles in a directed graph                 1024 vertices, 10,240 edges
    kruskal     Kruskal's minimal spanning tree algorithm           512 vertices, 2048 edges
    fibsort     Fibonacci heap sort                                 4096 keys
    sparsemat   Sparse matrix multiplication                        32x32 matrix with 64 entries; 5 iterations of B = A * (B + A)
    vector      a_i = x_i * y_i;  b_i = x_i + y_i;  c_i = x_i - y_i 20,000 terms
    filter      y_i = 0.25*x_(i-1) + 0.5*x_i + 0.25*x_(i+1)         200,000 terms
    listrev     Reverse the pointers in a linked list               30,000 nodes

Table 1: Benchmark programs

We note that while many (arguably most) programs do not make use of pointer comparisons, there are some important data structures for which pointer comparisons are frequently used. One common example is graphs, in which pointer comparisons can be used to determine whether two vertices are the same. Both 2cycle and kruskal are graph algorithms. 2cycle uses a brute-force approach to detect 2-cycles in a directed graph by having each vertex look for itself in the connection lists of the vertices it is connected to. kruskal uses Kruskal's minimum spanning tree algorithm [CLR90], in which edges are chosen greedily without creating cycles. To avoid cycles, a representative vertex is maintained for every connected sub-tree during construction, and an edge may be selected only if the vertices it connects have different representatives.

Another data structure which makes use of pointer comparisons is the cyclically linked list. Figure 2 gives C code for iterating over all elements of a non-empty cyclically linked list; a pointer comparison is used as the termination condition. This differs from a linear linked list, in which termination is determined by comparing a pointer to NULL. Both fibsort and sparsemat make use of cyclically linked lists. fibsort sorts a set of integers using Fibonacci heaps [Fredman87]; in a Fibonacci heap the children of each node are stored in a cyclically linked list. sparsemat uses a sparse matrix representation in which both the rows and columns are kept in cyclically linked lists.

    Node *p = first;
    do {
        ...
        p = p->next;
    } while (p != first);

Figure 2: Iterating over a cyclically linked list

In the list benchmark, 32 objects are stored in both an array and a linked list. On each iteration, an object is randomly selected from the array and located in the linked list using pointer comparisons. The object is deleted, and a new object is created and added to the array and the linked list. Every 64 iterations the list is linearized [Clark76]; this is one of the locality optimizations performed in [Luk99]. This benchmark is somewhat contrived; it was constructed to provide an example in which squids are unable to reduce overhead asymptotically to zero. Because the pointers in the array are not updated, comparisons made subsequent to a linearization will be between pointers having different levels of forwarding indirection. In particular, an object will be found by comparing two pointers to the same object with different levels of indirection. This is one of the two cases in which squids fail. In Section 2.2 it was argued that this may be a rare case in general; however, it is a common occurrence in the list benchmark.

4.2 Simulation Results

Figure 3 shows the results of running all eight benchmark programs on the simulated architecture with the number of squid bits varying from 0 to 8. Execution time is broken down into program cycles, trap handler cycles, and memory stalls. A cycle counts as a memory stall when the hardware is unable to disambiguate different addresses and as a result a memory operation is blocked. Table 2 lists the total speedup of the benchmarks over their execution time with zero squid bits.

As expected, in most cases the overhead due to traps and memory stalls drops exponentially to zero as the number of squid bits increases. The three exceptions are list, vector and filter. In vector and filter the lack of a smooth exponential curve is simply due to the small number of distinct objects (five in vector, two in filter), so in both cases the overhead steps down to zero once all objects have distinct squids. In list the overhead drops exponentially to a non-zero amount. As explained in Section 4.1, this is because squids cannot be used to compare two pointers to the same object with different levels of indirection.

We note that squids are most effective in programs that compare pointers within the inner loop. This includes list, 2cycle, fibsort and sparsemat, where the speedup with eight squid bits ranges from 1.72 for fibsort to 4.75 for 2cycle. In kruskal, by contrast, the inner loop follows a chain of pointers to find a vertex's representative; only once the representatives for two vertices have been found is a pointer comparison performed. As a result, the improvement in performance due to squids is barely noticeable.

Squids are also helpful, albeit to a lesser extent, in the three memory-intensive benchmarks. With eight squid bits the speedup ranges from 1.12 in filter to 1.30 in vector. Note that in each of these benchmarks, as the number of squid bits is increased, the decrease in overall execution time is slightly less than the decrease in the number of memory stalls. This is because programs are allowed to continue issuing arithmetic instructions while a memory request is stalled, so in some cases there is overlap between memory stalls and program execution. Overlap cycles have been graphed as memory stall cycles; this does not affect the results.

    Squid bits:    1     2     3     4     5     6     7     8
    list         1.57  2.22  2.76  3.16  3.41  3.55  3.62  3.65
    2cycle       1.66  2.47  3.27  3.91  4.32  4.56  4.68  4.75
    kruskal      1.01  1.02  1.02  1.02  1.02  1.03  1.03  1.03
    fibsort      1.26  1.46  1.58  1.65  1.69  1.70  1.71  1.72
    sparsemat    1.62  2.24  2.83  3.41  3.57  3.92  3.98  4.00
    vector       1.18  1.30  1.30  1.30  1.30  1.30  1.30  1.30
    filter       1.00  1.12  1.12  1.12  1.12  1.12  1.12  1.12
    listrev      1.09  1.14  1.17  1.18  1.19  1.20  1.20  1.20

Table 2: Speedup over execution time with zero squid bits

4.3 Extension to Other Architectures

The design space of capability architectures is quite large, so we must ask to what extent our results generalize to other architectures. In particular, the execution time of the trap handler (roughly 48 cycles in our simulations) may be significantly smaller in an architecture with data caches or extra registers available to the trap handler. However, we note that:

1. Any mechanism that speeds up the trap handler but is not specific to traps (e.g. data caches, out-of-order execution) will most likely reduce program execution time comparably.

2. Hardware improvements reduce the cost but not the number of traps. Squids reduce the number of traps exponentially, independent of the architecture.

The number of memory operations that the hardware fails to reorder or issue simultaneously is architecture dependent, since it is affected by such factors as memory latency and the size of the instruction reorder buffer (if there is one). If the overhead due to memory stalls in memory-intensive applications is negligible to begin with, then the architecture will not benefit from adding squids to the memory controller logic. On the other hand, if the overhead is noticeable, then squids will reduce it exponentially.

[Figure 3: Simulation results. For each benchmark (list, 2cycle, kruskal, fibsort, sparsemat, vector, filter, listrev) the horizontal axis indicates the number of squid bits used (0-8) and the vertical axis gives the execution time in millions of cycles. Execution time is broken down into program cycles, trap handler cycles and memory stall cycles.]

5 Alternate Approaches

We now discuss a number of approaches to the problem of pointer disambiguation that differ from the techniques presented thus far.

5.1 Generation Counters

We can associate with each pointer an m-bit saturating generation counter which indicates the number of times that the object has been migrated. If two pointers being compared have the same generation counter (and it has not saturated), then the hardware can simply compare the address fields directly. Using a single generation bit deals with the common case of objects that are never migrated; this completely eliminates aliasing overhead for applications that choose not to make use of forwarding pointers. Using two generation bits handles the additional case in which objects are migrated exactly once, as a compaction operation after the program has initialized its data (this is one of the techniques used in [Luk99]). Again, overhead is completely eliminated in this case.

More generally, generation counters are effective in programs for which (1) objects are migrated a small number of times, and (2) at all times most working pointers to a given object have the same level of indirection. They lose their effectiveness in programs for which either of these statements is false.
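
For example, with m = 2 the fast path is a check like the following sketch (taking the all-ones value to mean "saturated" is an encoding we are assuming):

    #include <stdbool.h>
    #include <stdint.h>

    #define GEN_SATURATED 3u  /* all-ones value of a 2-bit saturating counter */

    /* If both pointers record the same, unsaturated migration count for
     * the object, the address fields can be compared directly with no
     * dereferencing. */
    bool direct_compare_ok(uint8_t gen1, uint8_t gen2)
    {
        return gen1 == gen2 && gen1 != GEN_SATURATED;
    }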

5.2 Software Comparisons

Instead of relying on hardware to ensure the correctness of pointer comparisons, the compiler can insert code to explicitly determine the final addresses and then compare them directly, as in [Luk99]. Figure 4 shows the code that must be inserted; for each pointer a copy is created, and the copy is replaced with the final address by repeatedly checking for the presence of a forwarding pointer in the memory word being addressed. The outer loop is required in systems that support concurrent garbage collection or object migration, to avoid a race condition in which an object is migrated while the final addresses are being computed. In a complex superscalar processor, the cost of this code may be only a few cycles (and a few registers) if the memory words being addressed are present in the data cache. The overhead will be much larger if a cache miss occurs while either of the final addresses is being computed.

    temp1 = ptr1;
    temp2 = ptr2;
    do {
        flag = 0;  /* reset each pass so the loop exits once temp2 is stable */
        while (check_forwarding_bit(temp1))
            temp1 = unforwarded_read(temp1);
        while (check_forwarding_bit(temp2)) {
            temp2 = unforwarded_read(temp2);
            flag = 1;  /* temp2 advanced: re-check temp1 */
        }
    } while (flag);
    compare(temp1, temp2);

Figure 4: Using software only to ensure the correctness of pointer comparisons, the compiler must insert the above code wherever two pointers are compared.

Making use of hardware traps, and placing this code in a trap handler rather than inlining it at every pointer comparison, has the advantages of reducing code size and eliminating overhead when the hardware is able to disambiguate the pointers. On the other hand, overhead is increased when a trap is taken, due to the need to initialize the trap handler and clean up when it has finished. In our simulations, we found that 25% of the trap cycles were used to perform the actual comparisons. Thus, using software comparisons would give roughly the same performance as hardware comparisons with two squid bits.

The cost of software comparisons can be reduced (but not eliminated) by incorporating squids, as shown in Figure 5. This combined approach features both an exponential reduction in overhead and fast inline comparisons, at the expense of increased code size and register requirements.

    temp = ptr1 ^ ptr2;
    if (temp & SQUID_MASK) {
        /* squids differ: the pointers reference different objects */
    } else {
        /* squids match: fall back to the comparison code of Figure 4 */
    }

Figure 5: Using squids in conjunction with software comparison.

5.3 Data Dependence Speculation

In [Luk99], the problem of memory operation reordering is addressed using data dependence speculation ([Moshovos97], [Chrysos98]). This technique allows loads to execute speculatively before an earlier store whose address is not yet known. In order to support forwarding pointers, the speculation mechanism must be altered so that it compares final addresses rather than the addresses initially generated by the instruction stream. This in turn requires that the mechanism be informed each time a memory request is forwarded. The details of how this is accomplished would depend on whether forwarding is implemented directly in hardware or in software via exceptions.


Data dependence speculation does not allow stores to execute before earlier loads/stores, but this is unlikely to cause problems as a store does not produce data which is needed for program execution. A greater concern is the failure to reorder atomic read and modify memory operations, such as those supported by the Tera [Alverson90] or the Cray T3E [Scott96].

5.4 Squids without Capabilities

It is possible to implement squids without capabilities by taking the upper n bits of a pointer to be that pointer's squid. This has the effect of subdividing the virtual address space into 2^n domains. When an object is allocated, it is randomly assigned to one of the domains; object migration is then restricted to within a domain in order to preserve squids. Using a large number of domains introduces fragmentation problems and makes data compaction difficult since, for example, objects from different domains cannot be placed in the same page. However, as seen in Section 4, noticeable performance improvements are achieved with only one or two squid bits (two or four domains).

Alternatively, the hardware can cooperate to avoid the problems associated with multiple domains by simply ignoring the upper n address bits. In this case the architecture begins to resemble a capability machine, since the pointer contains both an address and some additional information. The pointers are unprotected, though, so user programs must take care not to mutate the squid bits or perform pointer arithmetic that causes a pointer to address a different object. Additionally, because the pointer contains no segment information, arrays of objects are a problem, as discussed in Section 2.4.
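
Concretely, on a 64-bit address space the squid is read straight out of the address; a sketch, where the choice of n = 2 bits (four domains) is just an example:

    #include <stdint.h>

    #define SQUID_BITS 2u  /* e.g. four domains */

    /* Without capabilities, a pointer's squid is simply its top n address
     * bits; objects must stay within their domain when migrated. */
    static inline uint64_t address_squid(uint64_t addr)
    {
        return addr >> (64u - SQUID_BITS);
    }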

6 Conclusion

Forwarding pointers facilitate safe data compaction, object migration, and efficient garbage collection. However, before they can be incorporated into new architectures it is necessary to address the aliasing problems that arise. We have presented squids, a mechanism that allows the hardware, with high probability, to disambiguate pointers in the presence of aliasing without performing expensive dereferencing operations. Our experimental results show that squids provide significant performance improvements on a restricted set of applications, speeding up some programs by over a factor of four.

The fact that the overhead associated with forwarding pointer support diminishes exponentially with the number of squid bits has two important consequences. First, very few squid bits are required to produce considerable performance improvements. Even a single squid bit provides noticeable speedups on the majority of the benchmarks, and as Figure 3 shows, most of the potential performance gains can be realized with four bits. Thus, squids remain appealing in architectures which have few capability bits to spare. Second, squids allow an architecture to tolerate slow traps and/or long memory latencies while determining final addresses. For most applications, three or four additional squid bits would compensate for an order of magnitude increase in the time required to execute the pointer comparison code of Figure 4.

In summary, we have seen that squids are able to reduce the overhead of forwarding pointer aliasing to an acceptable level. This is an important step towards the inclusion of hardware support for forwarding pointers in novel architectures.

7 Acknowledgements

This work was supported by DARPA/AFOSR contract number F306029810172.

References

[Alverson90] Robert Alverson, David Callahan, Daniel Cummings, Brian Koblenz, Allan Porterfield, Burton Smith, "The Tera Computer System", Proc. 1990 ACM International Conference on Supercomputing, June 1990.

[Baker78] Henry G. Baker, Jr., "List Processing in Real Time on a Serial Computer", Communications of the ACM, Vol. 21, No. 4, pp. 280-294, April 1978.

[Carter94] Nicholas P. Carter, Stephen W. Keckler, William J. Dally, "Hardware Support for Fast Capability-based Addressing", Proc. 6th International Conference on Architectural Support for Programming Languages and Operating Systems, 1994.

[Chrysos98] George Z. Chrysos, Joel S. Emer, "Memory Dependency Prediction using Store Sets", Proc. ISCA '98, pp. 142-153, June 1998.

[Clark76] D. W. Clark, "List Structure: Measurements, Algorithms and Encodings", Ph.D. thesis, Carnegie-Mellon University, August 1976.

[CLR90] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Introduction to Algorithms, MIT Press, Cambridge, MA, 1990, 1028pp.

[Dally85] William J. Dally, James T. Kajiya, "An Object Oriented Architecture", Proc. ISCA '85, pp. 154-161.

[Day93] Mark Day, Barbara Liskov, Umesh Maheshwari, Andrew C. Myers, "References to Remote Mobile Objects in Thor", ACM Letters on Programming Languages and Systems, Vol. 2, No. 1-4, March-Dec. 1993, pp. 115-126.

[Fabry74] R. S. Fabry, "Capability-Based Addressing", Communications of the ACM, Vol. 17, No. 7, pp. 403-412, July 1974.

[Fredman87] M. L. Fredman, R. E. Tarjan, "Fibonacci heaps and their uses in improved network optimization algorithms", JACM, Vol. 34, No. 3, July 1987, pp. 596-615.

[Gosling96] James Gosling, Bill Joy, Guy L. Steele Jr., The Java Language Specification, Addison-Wesley, Sept. 1996, 825pp.

[Greenblatt74] Richard Greenblatt, "The LISP Machine", Working Paper 79, M.I.T. Artificial Intelligence Laboratory, November 1974.

[Jul88] Eric Jul, Henry Levy, Norman Hutchinson, Andrew Black, "Fine-Grained Mobility in the Emerald System", ACM TOCS, Vol. 6, No. 1, Feb. 1988, pp. 109-133.

[Luk99] Chi-Keung Luk, Todd C. Mowry, "Memory Forwarding: Enabling Aggressive Layout Optimizations by Guaranteeing the Safety of Data Relocation", Proc. ISCA '99, pp. 88-99.

[Moon84] David A. Moon, "Garbage Collection in a Large Lisp System", Proc. 1984 ACM Conference on Lisp and Functional Programming, pp. 235-246.

[Moon85] David A. Moon, "Architecture of the Symbolics 3600", Proc. ISCA '85, pp. 76-83, 1985.

[Moss90] J. Eliot B. Moss, "Design of the Mneme Persistent Object Store", ACM Transactions on Information Systems, Vol. 8, No. 2, April 1990, pp. 103-139.

[Moshovos97] Andreas Moshovos, Scott E. Breach, T. N. Vijaykumar, Gurindar S. Sohi, "Dynamic Speculation and Synchronization of Data Dependences", Proc. ISCA '97, pp. 181-193, June 1997.

[Plainfossé95] David Plainfossé, Marc Shapiro, "A Survey of Distributed Garbage Collection Techniques", Proc. 1995 International Workshop on Memory Management, pp. 211-249.

[Scott96] Steven L. Scott, "Synchronization and Communication in the Cray T3E Multiprocessor", Proc. ASPLOS VII, Oct. 1996, pp. 26-36.

[Setrag86] Setrag N. Khoshafian, George P. Copeland, "Object Identity", Proc. 1986 ACM Conference on Object Oriented Programming Systems, Languages and Applications, pp. 406-416.

[Taylor86] George S. Taylor, Paul N. Hilfinger, James R. Larus, David A. Patterson, Benjamin G. Zorn, "Evaluation of the SPUR Lisp Architecture", Proc. ISCA '86, pp. 444-452, 1986.

[Tremblay99] Marc Tremblay, "An Architecture for the New Millenium", Proc. Hot Chips XI, Aug. 15-17, 1999.