Optimizing Message Lookup in Dynamic Object-Oriented Languages with Sparse Arrays

Kresten Krab Thorup
[email protected]
Keywords: message lookup, dynamic object-oriented languages

Abstract
Sparse arrays provide a new way to reduce the overhead of message lookup by offering constant-time access with moderate space consumption. On some architectures, such as the SPARC processor, the sparse array technique gives an approximate speedup of 50% over existing message lookup techniques. Sparse arrays are especially useful for compiled dynamic languages like Objective-C, which cannot take advantage of runtime compilation the way semi-compiled systems like Self and Smalltalk can; the technique could, however, easily be embedded in such systems to improve performance. This article describes the sparse array used as the data structure for dispatch tables in the GNU Objective-C runtime system.
1 Introduction

It is a general problem in dynamically typed object-oriented languages that the message lookup mechanism is a bottleneck. Implementations of such languages have used various caching techniques to reduce the overhead involved in message lookup. Sparse arrays provide a new way to reduce this overhead by offering constant-time lookup at a moderate cost in space.

The reason some advanced lookup technique is needed at all is that the code for a given (receiver type, message name) pair inherently cannot be located at compile time, since new methods and classes may be added to a program at runtime. This problem has prompted a large amount of research into optimizing method lookup. In principle, dynamic languages have used two different types of caching mechanisms: hash caching and inline caching. Neither technique inherently provides constant-time access; sparse arrays do.
Hash Caching

Hash caching is the traditional way of reducing message lookup overhead. A lookup cache maps (receiver type, message name) pairs to method implementations and holds the most recent or all used lookup results. The message send first consults the cache (a hash table), and if the probe fails, it does a normal (expensive) lookup through the superclass hierarchy. Finally, the result is inserted into the cache for later use. Lookup caches like this are very efficient at reducing lookup overhead. Berkeley Smalltalk, for example, would have been 37% slower without a cache [13].
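As a rough sketch of the idea (the cache size, hash function, and entry layout here are assumptions for illustration, and `full_lookup' is a hypothetical name for the slow path; actual runtimes differ in detail), a cached send could look like this in C:

#define CACHE_SIZE 1024                 /* assumed; power of two */

struct cache_entry { void *class, *sel, *imp; };
static struct cache_entry cache[CACHE_SIZE];

extern void* full_lookup(void* class, void* sel);  /* slow superclass search */

/* Sketch of a lookup cache: map (receiver class, selector) pairs to
   implementations; on a miss, fall back to the full lookup and
   remember the result for next time. */
void* cached_lookup(void* class, void* sel)
{
  unsigned long h = ((unsigned long)class ^ (unsigned long)sel)
                    & (CACHE_SIZE - 1);
  struct cache_entry* e = &cache[h];
  if (e->class != class || e->sel != sel)   /* probe failed */
    {
      e->class = class;
      e->sel = sel;
      e->imp = full_lookup(class, sel);     /* expensive lookup */
    }
  return e->imp;
}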
Figure 1: A sparse array with two buckets non-empty.

Existing Objective-C compilers (NeXT, Stepstone) use variants of hash caching to reduce the overhead of message lookup.
Inline Caching

Current implementations of e.g. Smalltalk and Self use inline caching to reduce lookup overhead. This technique consists of inserting the address of the method directly in place of the call to the messenger. Every method then checks whether the receiver is of the expected type, and if not, calls the normal message sending mechanism. This technique is described in [4]. A further extension of this technique, called PICs (polymorphic inline caches), takes advantage of runtime compilation to insert specialized messengers directly at the call point; it is described in [8]. Because Objective-C is a compiled language, it cannot take advantage of the inline caching techniques.
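To illustrate the receiver check, here is a hedged C sketch of what inline caching really does at the machine-code level; `Point_class', `class_of', and `generic_lookup' are hypothetical names, not part of any actual runtime:

typedef void* (*IMP)(void* self, void* sel);

extern void* Point_class;                        /* hypothetical class */
extern void* class_of(void* receiver);           /* hypothetical helper */
extern IMP generic_lookup(void* receiver, void* sel);

/* The call site is patched to call this method directly; its prologue
   verifies the guess and re-enters the messenger on a mismatch. */
void* Point_area(void* self, void* sel)
{
  if (class_of(self) != Point_class)       /* not the right receiver */
    return generic_lookup(self, sel)(self, sel);
  /* ... actual method body ... */
  return 0;
}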
2 Sparse Arrays
Introduction

Even with a lookup cache, a message send takes considerably longer than calling a function directly, because the cache must be probed for every message sent. Sparse arrays provide a way to reduce this overhead.

The sparse array is a simple data structure well known from various disciplines of computer science. It was invented to hold sparse data, i.e. data which spans a large domain but is sparsely defined. This is exactly the nature of message lookup: it maps (receiver type, message name) pairs to method implementations, and each receiver type implements only a small subset of all message names.

The basic structure of a sparse array is shown in figure 1. The array is split into a number of subranges, of which most are empty. The subranges are kept in small arrays called buckets, and the structure then consists of a table of such buckets. Due to the nature of the data (each class defines only a small subset of all method names), most of the indices will point to a special `empty bucket', whereas subranges which actually contain elements different from `the missing element' will be in separate buckets. The data in the sparse array will in this case be methods, whereas the indices are representations of method names.

The algorithms

The runtime establishes a unique mapping from method names to integers, so that these numbers may be used directly as indices into the sparse array. This mapping is represented by an indirection table, so that a table lookup does not have to hash the method name every time. The message lookup mechanism is then called with a receiver and an element of the indirection table as arguments. The value returned by the lookup function is a pointer to the function implementing the method.
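As an illustration of such a mapping (a simplified sketch: a linear search stands in for the hash table an actual runtime would use, and the function name is hypothetical):

#include <stdlib.h>
#include <string.h>

static const char** sel_names = NULL;   /* index -> name */
static int sel_count = 0;

/* Return the unique index for `name', registering it if new.
   Assumes `name' outlives the table (e.g. compiler-emitted strings). */
int sel_register_name(const char* name)
{
  int i;
  for (i = 0; i < sel_count; i++)
    if (strcmp(sel_names[i], name) == 0)
      return i;                                   /* already known */
  sel_names = realloc(sel_names, (sel_count + 1) * sizeof(*sel_names));
  sel_names[sel_count] = name;
  return sel_count++;                             /* fresh index */
}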
The most interesting point in using sparse arrays is the very efficient lookup function. Given an index i into the array, and the number of elements stored in a bucket, the correct bucket is located in the bucket table at position

\[ \mathit{bucket\_index} = \lfloor i / \mathit{bucket\_size} \rfloor , \]

whereas the correct entry of that bucket is located at position

\[ \mathit{element\_index} = i \bmod \mathit{bucket\_size} . \]
This is expressed in the following piece of C code, which also introduces the basic layout of the data structures involved:

struct sbucket {
  elem_t elems[BUCKET_SIZE];      /* table of elements kept */
  ...
};

struct sarray {
  struct sbucket** buckets;       /* table of buckets */
  struct sbucket* empty_bucket;   /* the special empty bucket */
  ...
};

/* return element stored at `index' of `array' */
elem_t sarray_get(struct sarray* array, int index)
{
  struct sbucket* bucket;
  bucket = array->buckets[index/BUCKET_SIZE];
  return bucket->elems[index%BUCKET_SIZE];
}
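For concreteness, a hypothetical messenger built on top of sarray_get could look as follows, assuming elem_t holds pointers to method implementations and that each class carries its dispatch table as a sparse array:

typedef void* (*IMP)();     /* pointer to a method implementation */

/* Hypothetical sketch: dispatch is two array indexings, with no
   hashing and no probe loop on the fast path. */
IMP msg_lookup(struct sarray* dtable, int sel_index)
{
  return (IMP)sarray_get(dtable, sel_index);
}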
Inserting elements into the array is only a little more complicated, since it involves allocating a new bucket if the bucket to be inserted into is currently empty.
Sparse arrays as dispatch tables

Since the data represented in the sparse array are methods, and since methods may be inherited, we can reduce the space used for dispatch tables even further by `inheriting' the subranges which are not changed (overridden) in a subclass. Figure 2 shows a simple example, where the dispatch table of one class reuses a bucket from the dispatch table of its superclass, and defines its own variant of another bucket.

The insertion algorithm is affected by this, since it must make sure it only changes the contents of the buckets it owns. The solution is to tag each bucket with its `owner' (a sparse array), and to copy the bucket before writing if the array is not its owner:

/* At position `index' of `array' insert `element' */
void sarray_at_put(struct sarray* array, int index, elem_t element)
{
  struct sbucket** bucket = &(array->buckets[index/BUCKET_SIZE]);
  if((*bucket) == array->empty_bucket
     || (*bucket)->owner != array)          /* check owner */
    {
      (*bucket) = copy_bucket(*bucket);     /* copy before writing */
      (*bucket)->owner = array;             /* set owner */
    }
  (*bucket)->elems[index%BUCKET_SIZE] = element;
}
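To see how the sharing gets set up in the first place, here is a minimal sketch of creating a subclass dispatch table that initially shares every bucket with its superclass; NUM_BUCKETS and the allocation details are assumptions, not the runtime's actual code. sarray_at_put above then copies a shared bucket only on the first write to it:

#include <stdlib.h>
#include <string.h>

struct sarray* sarray_lazy_copy(struct sarray* super)
{
  struct sarray* new = malloc(sizeof(struct sarray));
  new->empty_bucket = super->empty_bucket;
  new->buckets = malloc(NUM_BUCKETS * sizeof(struct sbucket*));
  memcpy(new->buckets, super->buckets,
         NUM_BUCKETS * sizeof(struct sbucket*));
  /* the shared buckets remain owned by `super' until overridden */
  return new;
}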
Figure 2: Sparse array `inheriting' another sparse array.

It is desirable to reuse as many buckets as possible, since this significantly reduces the memory usage of the dispatch tables in a complete system. One way to ensure better reuse of buckets is to make the bucket size small, since this increases the probability that a given bucket is not changed in the dispatch table of a subclass. However, reducing the size of buckets results in larger bucket tables. Another approach is to study how the method names are enumerated, and then choose a strategy which causes the most buckets to be reused. We have done extensive studies in this area, and they have shown that sorting the method names by the number of times the corresponding method is defined increases reusability, since methods that are often redefined are grouped together, leaving a lot of buckets which are never changed. Studying various combinations of enumeration techniques and bucket sizes, we found that the memory usage was still unacceptably high, and the sorting of method names is very expensive for large systems.

We have found that a much simpler approach is far more efficient. The solution is to use two levels of sparse arrays, as illustrated in figure 3. This allows us to use bucket sizes as small as 8. The enumeration technique used for method names is very simple: all messages are enumerated in turn by class, so that new method names defined within a class are assigned successive numbers. Given this numbering scheme, the methods defined in a given class will be grouped, e.g. one group for the methods defined by the given class, and one group for each superclass. This gives nice results, as a large number of first-level buckets will fall outside these ranges and thus be empty. Within each such group, some of the `inherited' methods will be overridden, but most will not. This allows for a fair amount of reuse within the second-level buckets. Extensive testing on real-scale examples shows that a single-indexed approach uses approximately 2 KB per class, whereas the double-indexed approach uses only 1 KB per class in a typical system. Besides, the double-indexed sparse array scales better for large numbers of method names and classes, as discussed in section 4 (Memory considerations).

So far we have been writing about `the empty bucket' and `missing elements'. These are not really empty, but rather filled with entries for a method corresponding to doesNotRecognize:, so that even if a method is not implemented, something meaningful is returned from the lookup function. This technique also speeds up the lookup mechanism considerably, since anything looked up in the table is legal.

Figure 3: Double indexed sparse array as used in the GNU Objective-C runtime system.

The actual implementation in the GNU Objective-C runtime uses the doubly indexed array with an outer-level index of 16 entries and a second-level index of 8 entries (these sizes were chosen on the basis of empirical tests of real-scale examples). It is not described in further detail here, since it is a trivial extension of the algorithms already described, which would only obscure the essence of this paper.

Another problem which must be addressed in a dynamic object-oriented programming language is caused by the ability to add new methods and classes at runtime, since this may extend the domain of method names, which in turn forces us to enlarge all dispatch tables in the system. This has not turned out to be a problem in the GNU Objective-C runtime, since it only involves expanding the outermost index. Given that each level-1 index addresses a subrange of 16 × 8 = 128 method names, we only have to perform this enlargement once for every 128 new method names introduced, and each such enlargement involves adding one entry to the outermost index.
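A sketch of the doubly indexed lookup follows; the struct and field names are illustrative, not the GNU runtime's exact declarations, and elem_t is as before:

#define BUCKET_SIZE 8
#define INDEX_SIZE  16
#define INDEX_CAPACITY (INDEX_SIZE * BUCKET_SIZE)   /* 128 selectors */

struct sbucket2 { elem_t elems[BUCKET_SIZE]; };
struct sindex   { struct sbucket2* buckets[INDEX_SIZE]; };
struct sarray2  { struct sindex** indices; };

/* Doubly indexed lookup: one extra indexing, still constant time. */
elem_t sarray2_get(struct sarray2* array, int index)
{
  struct sindex*   idx    = array->indices[index / INDEX_CAPACITY];
  struct sbucket2* bucket = idx->buckets[(index / BUCKET_SIZE) % INDEX_SIZE];
  return bucket->elems[index % BUCKET_SIZE];
}

With this layout, introducing new method names at runtime only requires growing the `indices' table by one entry for every 128 new selectors, exactly as described above.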
3 Performance results

Until now we have only been using one benchmark program for performance tests; we intend to port more benchmarks from the Self and Smalltalk benchmark suites. The benchmark program currently used is `RichardsBenchmark', which is part of the benchmark suites for Smalltalk and Self (available by anonymous ftp from self.stanford.edu in the directory /pub/benchmarks). The program emulates a small operating system, and contains a tight loop which schedules `processes' doing polymorphic lookup. This is considered a good test of the speed of the message lookup mechanism when there is a small number of different method names which are extensively redefined. The program calls the message send mechanism approximately a million times during a run.

For comparison, we have implemented a traditional hash-caching algorithm in the GNU Objective-C runtime. It is modeled very closely after the algorithm used in the NeXT Objective-C runtime.

Figure 4: Benchmark results (in seconds) from running `RichardsBenchmark' on a SPARC SLC with GNU Objective-C, using the two message lookup mechanisms: sparse arrays and hash caching.

Figure 4 shows the test results, in seconds, of running RichardsBenchmark on a Sun SLC (SPARC) with 20 MB of memory. The results show that the sparse array approach makes the total program approximately 50% faster than the hash-caching algorithm, where by `n% faster' we mean n = (slowest · 100)/fastest − 100.

While this is true for the SPARC architecture, the picture is a little different for the 68040 processor on which the NeXT Station is based. We have included the corresponding benchmark results from running the test program using the NeXT runtime. The NeXT implementation is hand-written assembler, which naturally makes their code slightly faster than our C implementation.

Figure 5: Benchmark results (in seconds) from running `RichardsBenchmark' on a NeXT Station with GNU Objective-C, using the two message lookup mechanisms, sparse arrays and hash caching, as well as using the NeXT Objective-C runtime.

Figure 5 shows the test results from running the test program on a NeXT. Since the hashing algorithm used for reference was written to be a very close copy of the NeXT algorithm, we conclude that even though their hashing is slightly faster, in general the sparse array technique is better than hash caching. For the NeXT case, the sparse array is approximately 6% faster than the hash cache, using the GNU Objective-C figures.

We have been studying why the performance gain differs so noticeably between the SPARC and the 68040. One reason could be the different compilers: though gcc is used on both platforms, the optimizations possible for each architecture are likely to differ. We believe the main reason is the very different instruction sets, and the different costs of the same instructions on the two targets. On the SPARC, branch instructions are very expensive, and they are avoided in the sparse array lookup, since it `always succeeds', whereas the caching algorithm may miss and enter a loop for probing. Both algorithms have to perform the same number of memory fetch instructions.
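As a check, applying the definition of `n% faster' to the raw timings in Appendix I reproduces both figures quoted above:

\[
n_{\text{SPARC}} = \frac{3725 \cdot 100}{2500} - 100 = 49 \approx 50 ,
\qquad
n_{\text{68040}} = \frac{3130 \cdot 100}{2940} - 100 \approx 6 .
\]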
Comparison with other languages

It is to some extent difficult to compare the performance of Objective-C message lookup with that of other dynamic languages like Smalltalk and Self, since those languages use runtime compilation to reduce message lookup overhead. It is also hard to compare the performance of Objective-C with static languages like C++ and Eiffel, since they have the full knowledge of all classes and methods at compile time needed to implement optimal lookup. Consequently, only the comparison with NeXT Objective-C is reliable, and tests relative to the other systems are not included.
4 Memory considerations

Since the sparse array technique will obviously use more memory than a hash-caching technique, we have based much of our evaluation on memory usage. The tests use the following data, primarily extracted from the NeXT libraries:

library     classes   method names   methods
appkit      76        1661           2629
dbkit       23        404            516
soundkit    14        213            260
collkit     27        155            293

Table 1: Test data used for the memory usage tests.

A special test program was written that builds all the dispatch tables for various combinations of the above libraries and prints statistics.

Figure 6: The memory usage of the sparse array in the test application, as a function of the product of the number of classes and the number of distinct selectors.

Figure 6 shows the memory usage as a function of the full domain of the tables (method names × classes). It illustrates the effect of the inheriting sparse array: it uses less memory per class/selector the more it already holds. It turns out that the data in figure 6 develops as a function of the form

\[ \text{memory usage} = k \sqrt{\text{method names} \times \text{classes}} . \]
Thus, the tables will grow more slowly, the larger the domain they cover.
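One way to see this is to divide by the number of classes. Writing S for the number of method names and C for the number of classes, the per-class cost is

\[
\frac{k\sqrt{S \cdot C}}{C} = k\sqrt{S/C} ,
\]

which falls as classes are added to a system with a given set of selectors: this is the per-class saving from bucket sharing, made visible in the formula.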
5 Conclusion

The sparse array approach to reducing method lookup overhead in dynamic object-oriented languages has proven to be very good. The examples included in this paper show that lookups are up to 50% faster than with comparable algorithms. Furthermore, sparse arrays are very memory efficient, using only approximately 1 KB of memory per class in typical systems, and their memory usage grows as the square root of the domain covered by the dispatch tables.
6 Acknowledgments

I would like to thank my department, the Department of Mathematics and Computer Science at Aalborg University, and the Free Software Foundation for enabling me to present this article. Thanks to Jens Henrik Badsbjerg for helping me out with the statistical estimates. Thanks should also go to Paul Burchard and Frank Jensen for reviewing the paper. Finally, I would like to thank the entire GNU community, especially Richard Stallman and Geoffrey S. Knauth, for allowing me to develop this as part of the GNU Objective-C system, and thus as part of gcc.
References

[1] Craig Chambers and David Ungar, Customization: Optimizing Compiler Technology for Self, a Dynamically-Typed Object-Oriented Programming Language. Proceedings of SIGPLAN '89, Portland, OR, 1989.

[2] Brad J. Cox, Object Oriented Programming: An Evolutionary Approach. Addison-Wesley, Reading, MA, 1986.

[3] Craig Chambers, The Design and Implementation of the Self Compiler, an Optimizing Compiler for Object-Oriented Programming Languages. Ph.D. Thesis, Stanford University, 1991.

[4] L. Peter Deutsch and Alan Schiffman, Efficient Implementation of the Smalltalk-80 System. Proceedings of the 11th Symposium on the Principles of Programming Languages, Salt Lake City, UT, 1984.

[5] R. Dixon, T. McKee, P. Schweitzer and M. Vaughan, A Fast Method Dispatcher for Compiled Languages With Multiple Inheritance. OOPSLA '89, New Orleans, LA, 1989.

[6] Margaret A. Ellis and Bjarne Stroustrup, The Annotated C++ Reference Manual. Addison-Wesley, Reading, MA, 1990.

[7] Adele Goldberg and David Robson, Smalltalk-80: The Language and Its Implementation. Addison-Wesley, Reading, MA, 1983.

[8] Urs Hölzle, Craig Chambers and David Ungar, Optimizing Dynamically-Typed Object-Oriented Languages With Polymorphic Inline Caches. ECOOP '91 Proceedings, Springer-Verlag, Lecture Notes in Computer Science 512, July 1991.

[9] Glenn Krasner, ed., Smalltalk-80: Bits of History and Words of Advice. Addison-Wesley, Reading, MA, 1983.

[10] Lewis J. Pinson and Richard S. Wiener, Objective-C: Object Oriented Programming Techniques. Addison-Wesley, 1991.

[11] Richard Stallman, Using and Porting GCC. Free Software Foundation, MA, 1988.

[12] Kresten Krab Thorup and Eric Herring, The GNU Objective-C Runtime. Free Software Foundation, (to appear).

[13] David Ungar and David Patterson, Berkeley Smalltalk: Who Knows Where the Time Goes? In [9].
Appendix I: Raw benchmark data

These numbers were generated by running the benchmark program `RichardsBenchmark' 20 times on lightly loaded machines and calculating the average values. The original data varied within a margin of 5%.

Architecture   Algorithm                       Time
68040          GNU runtime (sparse arrays)     2940 ms
68040          GNU runtime (hash cache)        3130 ms
68040          NeXT runtime (hash cache)       2750 ms
SPARC          GNU runtime (sparse arrays)     2500 ms
SPARC          GNU runtime (hash cache)        3725 ms