Online List Accessing Algorithms and Their Applications: Recent Empirical Evidence

Ran Bachrach and Ran El-Yaniv*

Abstract

This paper reports the results of an empirical test of the performances of a large set of online list accessing algorithms. The algorithms' access cost performances were tested with respect to request sequences generated from the Calgary Corpus. In addition to testing access costs within the traditional dynamic list accessing model, we tested all algorithms' relative performances as data compressors via the compression scheme of Bentley et al. Some of the results are quite surprising and stand in contrast to some competitive analysis theoretical results. For example, the randomized algorithms that were tested, all attaining competitive ratio less than 2, performed consistently inferior to quite a few deterministic algorithms that obtained the best performance. In many instances the best performance was obtained by deterministic algorithms that are either not competitive or not optimal.

1 Introduction

The list accessing problem has been studied for more than 30 years [15]. Indeed, plenty of interesting theoretical results have been proposed. Nevertheless, this topic of research suffers from a lack of experimental feedback that could lead to better and more realistic theory. In this paper we report the results of a quite extensive empirical study testing the access cost performance and compression performance of more than 40 list accessing algorithms. These include many well known as well as new algorithms.

In the rest of this introduction we describe the list accessing problem, the competitive ratio measure that has been used for many theoretical analyses, and then outline the contents of this extended abstract.

The list accessing problem. The list accessing problem revolves around the question of how to maintain and organize a dictionary as a linear list. In particular, in this problem an online algorithm maintains a set of items as an unsorted list. A sequence of requests for items is given and the algorithm must serve these requests in the order of their arrival. The requests are either for insertion of an item, access to an item, or deletion of an item. A request for an access or deletion of an item x positioned ith from the front costs the algorithm i, the number of comparisons it must make in order to locate x while starting a (sequential) search from the front. A request for insertion of an item to a list currently holding ℓ items costs the algorithm ℓ + 1. In order to expedite future requests, a list accessing algorithm may reorganize its list periodically while attempting to place items that are expected to be accessed frequently closer to the front. Reorganizations are made via a sequence of transpositions of consecutive items. After an item x is accessed (inserted), x may be moved free of charge to any position closer to the front. Such transpositions that bring x forward after an access (insertion) are called free. Any other transpositions cost 1 each and are called paid.

The importance of the list accessing problem arises from the fact that online list accessing algorithms are often used by practitioners. Although organizations of dictionaries as linear lists are often inefficient, there are various situations in which a linear list is the implementation of choice. For instance, when the dictionary is small (say, for organizing the list of identifiers maintained by a compiler, or for organizing collisions in a hash table), or when there is no space to implement efficient but space consuming data structures [7, 13]. Other uses of list accessing algorithms are made by algorithms for computing point maxima and convex hulls [6]. Finally, any list accessing algorithm can be used as the "heart" of a data compression algorithm [8, 2]. Due to its great relevancy the list accessing problem has been extensively studied since 1965 (see e.g. [15, 18, 7, 19, 16, 3]).

Competitive list accessing algorithms. Let ALG be any list accessing algorithm. For any sequence of requests σ we denote by ALG(σ) the total cost incurred by ALG to service σ. One recently popular performance measure of online algorithms is the competitive ratio [19], defined as follows. An algorithm ALG attains a competitive ratio c (equivalently, ALG is c-competitive) if there exists a constant α such that for all request sequences σ, ALG(σ) − c · OPT(σ) ≤ α, where OPT is an optimal offline list accessing algorithm. If ALG is randomized, it is c-competitive against an oblivious adversary if for every request sequence σ, E[ALG(σ)] − c · OPT(σ) ≤ α, where α is a constant independent of σ, and E[·] is the mathematical expectation taken with respect to the random choices made by ALG [9].

*Institute of Computer Science, The Hebrew University of Jerusalem. Email: {ranb,raniy}@math.huji.ac.il

The best known competitive algorithms. In their seminal paper Sleator and Tarjan [19] showed that the well-known algorithm MOVE-TO-FRONT (MTF), a deterministic algorithm that moves each requested item to the front, is 2-competitive. Raghavan and Karp (reported in [14]) proved a lower bound of 2 − 2/(ℓ + 1) on the competitive ratio of any deterministic algorithm maintaining a list of size ℓ. Irani [14] gave a matching upper bound for MTF, showing that MTF is a strictly optimal online algorithm judged from the perspective of competitive analysis. Albers [1] discovered another deterministic algorithm called TIMESTAMP and proved that it is 2-competitive. In [11] an infinite family of optimal (2-competitive) deterministic algorithms is presented. This family of algorithms includes TIMESTAMP and MTF as special cases.¹ As for randomized algorithms, so far the best known upper bound is obtained by an algorithm called COMB due to Albers, von Stengel and Werchner [3]. The bound obtained is 1.6. The best known randomized lower bound, of 1.5, is due to Teia [20].

Contents and paper organization. This abstract outlines the results of a quite extensive study of many list accessing algorithms. In Section 2 we describe all the algorithms tested. These include representatives of 18 different families and more than 40 algorithms in all. Whenever known we specify bounds on the competitive ratios of the algorithms described. We have performed two experiments. The first attempts to rank the various algorithms in terms of their access cost performance, and the second, in terms of their performance as data compressors. Section 3 gives the details of these experiments.
In Section 3.4 we briefly describe the results of other empirical studies relevant to access cost and compression. Sections 4 and 5 outline the main conclusions and some observations obtained from the first and second experiments, respectively. Finally, in Section 6 we draw our conclusions. To support our conclusions we have adjoined to this extended abstract six appendices that include details of some key concepts, tables, graphs and some raw data obtained in our experiments.

¹Any deterministic list accessing algorithm is termed here optimal if it attains a competitive ratio of 2. An algorithm which attains the lower bound of 2 − 2/(ℓ + 1) is termed strictly optimal.

2 The algorithms tested

In this section we describe the algorithms we tested. In the description of most algorithms we specify the algorithm's action only with respect to an ACCESS request. Unless otherwise specified it is implicitly assumed that an INSERT request for an item x places x at the back of the list and then x is treated as if it was accessed. With respect to each algorithm we also specify bounds on its competitive ratio whenever they are known. Throughout this paper ℓ denotes the total number of elements inserted into the list.

2.1 Deterministic algorithms. Algorithm MOVE-TO-FRONT (MTF) is one of the most well known and used algorithms. This algorithm attains an optimal competitive ratio of 2 − 2/(ℓ + 1) [19, 14].

Algorithm MTF: Upon an access for an item x, move x to the front.
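As a concrete illustration, the cost model and the MTF rule above can be simulated in a few lines of Python. This is our own sketch (the function name and the choice to start from an empty list are ours, following the experimental setup described later, not code from the paper):

```python
def mtf_cost(requests):
    """Serve `requests` on an initially empty list; return the total cost."""
    lst, total = [], 0
    for x in requests:
        if x in lst:
            i = lst.index(x)       # 0-based position of x
            total += i + 1         # access cost = 1-based position
            lst.pop(i)
            lst.insert(0, x)       # free move to the front
        else:
            total += len(lst) + 1  # inserting into a list of l items costs l + 1
            lst.insert(0, x)       # the inserted item is then treated as accessed
    return total
```

For example, `mtf_cost("abcabc")` charges 1 + 2 + 3 for the three insertions and 3 + 3 + 3 for the three accesses, a total of 15.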

Algorithm TRANSPOSE (TRANS) is perhaps the most "conservative" algorithm presented here and is an extreme opposite to MTF. The competitive ratio of TRANS is bounded below by 2ℓ/3 [19].

Algorithm TRANS: Upon an access to an item x, transpose x with the immediately preceding item.

Algorithm FREQUENCY-COUNT (FC) attempts to adapt its list to the empirical distribution of requests observed so far. A lower bound of (ℓ + 1)/2 on its competitive ratio is known [19].

Algorithm FC: Maintain a frequency counter for each item. Upon inserting an item, initialize its counter to 0. After accessing an item, increment its counter by one and then reorganize the list so that items on the list are ordered in non-increasing order of their frequencies.
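A sketch of FC under the same cost model, including a recency tie-break among items with equal counts (more recently requested items closer to the front); the helper names are ours:

```python
def fc_serve(requests):
    """FREQUENCY-COUNT: order the list by non-increasing frequency."""
    lst, total = [], 0   # lst holds the items, front first
    count, last = {}, {} # per-item frequency counter and last-request time
    for t, x in enumerate(requests):
        if x in lst:
            total += lst.index(x) + 1
        else:
            total += len(lst) + 1
            lst.append(x)
            count[x] = 0         # counter initialized to 0 on insertion
        count[x] += 1            # incremented when the item is (treated as) accessed
        last[x] = t
        # reorder: non-increasing count, ties broken by recency
        lst.sort(key=lambda y: (-count[y], -last[y]))
    return total
```

Re-sorting the whole list on every request is of course only for clarity; a real implementation would move the single accessed item.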

In our implementation we further require that if two items have the same frequency count then the item requested more recently is positioned in front of the item requested less recently.

The following algorithm, called MTF2, is a more "relaxed" version of MTF. MTF2 can be shown to be 2-competitive.

Algorithm MTF2: Upon the ith request for an item x: move x to the front if and only if i is even.

Algorithm MOVE-AHEAD(k) (MHD(k)) proposed by Rivest [18] is a simple compromise between the relative extremes of TRANS and MTF.

Algorithm MHD(k): Upon a request for an item x, move x forward k positions.

Thus algorithm MHD(1) is TRANS. We have tested the algorithms MHD(k) with k = 2, 8, 32, 128.

Algorithm MOVE-FRACTION(k) (MF(k)) proposed by Sleator and Tarjan is a slightly more sophisticated compromise between TRANS and MTF. For each k, algorithm MF(k) is 2k-competitive [19].

Algorithm MF(k): Upon a request for an item x currently at the ith position, move x ⌈i/k⌉ − 1 positions towards the front.
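The two compromise rules differ only in how far the requested item travels. A sketch of both reorganization steps on 0-based Python lists (helper names are ours; access-cost accounting is omitted):

```python
import math

def mhd_step(lst, x, k):
    """MOVE-AHEAD(k): advance x by k positions, stopping at the front."""
    i = lst.index(x)
    lst.insert(max(0, i - k), lst.pop(i))

def mf_step(lst, x, k):
    """MOVE-FRACTION(k): from 1-based position i, advance ceil(i/k) - 1 positions."""
    i = lst.index(x)                                  # 0-based; 1-based position is i + 1
    lst.insert(i - (math.ceil((i + 1) / k) - 1), lst.pop(i))
```

With k = 1, `mf_step` advances the item i − 1 positions, i.e. to the front, matching the observation that MF(1) is MTF; likewise `mhd_step` with k = 1 is TRANS.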

Thus, algorithm MF(1) is MTF. We have tested the algorithms MF(k) with k = 2, 8, 32, 128.

Algorithm TIMESTAMP (TS) due to Albers [1] is an optimal algorithm attaining a competitive ratio of 2.

Algorithm TS: Upon a request for an item x, insert x in front of the first (from the front of the list) item y that precedes x on the list and was requested at most once since the last request for x. Do nothing if there is no such item y, or if x has been requested for the first time.
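TS needs the request history. A straightforward (and deliberately inefficient) sketch of our reading of the rule, which rescans the history on every access; names are ours:

```python
def ts_serve(requests):
    """TIMESTAMP: serve `requests` starting from an empty list; return total cost."""
    lst, total, history = [], 0, []
    for x in requests:
        if x in lst:
            i = lst.index(x)
            total += i + 1
            last = len(history) - 1 - history[::-1].index(x)
            since = history[last + 1:]          # requests since the last request for x
            for j in range(i):                  # scan from the front of the list
                if since.count(lst[j]) <= 1:    # y requested at most once since then
                    lst.insert(j, lst.pop(i))   # insert x just in front of y
                    break
        else:
            total += len(lst) + 1               # first request: insert at the back
            lst.append(x)                       # and leave x in place
        history.append(x)
    return total
```

For instance, on the sequence a, b, a, b the final access to b moves it in front of a (a was requested only once since b's last request), for a total cost of 1 + 2 + 1 + 2 = 6.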

The following family of algorithms, called the PRI FAMILY, is a straightforward generalization of algorithm TS. Let m be a non-negative integer. Consider the following deterministic list accessing algorithm called PASS-RECENT-ITEM(m) (PRI(m)). It is known that for each m ≥ 2, PRI(m) is 3-competitive (MRI(m) is known to be 2-competitive; see below).

Algorithm PRI(m): Upon a request for item x, move x forward just in front of the first item z on the list that was requested at most m times since the last request for x. Do nothing if there is no such item z. If this is the first request for x, leave x in place (move x to the front).

Note that modulo the handling of first requests, algorithm PRI(1) is identical to algorithm TS. Notice that the limit element of this family, as m grows, is MTF. We have tested the algorithms PRI(m) with m = 0, 1, ..., 8 and with the insertion position being the last and the front. In the sequel each member of the PRI family, PRI(m), that inserts a new item at the first (resp. last) position is denoted PRI'(m) (resp. PRI(m)).

The following family of algorithms is quite similar to the PRI FAMILY but with a significant change. Let m be a non-negative integer. Consider the following deterministic list accessing algorithm called MOVE-TO-RECENT-ITEM(m) (or MRI(m) for short). For each m ≥ 1 algorithm MRI(m) is 2-competitive and is thus optimal. The competitive ratio of algorithm MRI(0) is bounded below by Ω(√ℓ) [11].

Algorithm MRI(m): Upon a request for item x, move x forward just after the last item z on the list that is in front of x and that was requested at least m + 1 times since the last request for x. If there is no such item z, move x to the front. If this is the first request for x, leave x in place (move x to the front).

Although not immediately apparent, algorithm MRI(1) can be shown to be equivalent to algorithm TS (modulo first requests for items). We have tested the algorithms MRI(m) with m = 0, 1, ..., 8 and with the insertion position being the last and the front. In the sequel each member of the MRI family, MRI(m), that inserts a new item at the first (resp. last) position is denoted MRI'(m) (resp. MRI(m)).

2.2 Randomized algorithms. Algorithm SPLIT (SPL) due to Irani [14] was the first randomized algorithm that was shown to attain

a competitive ratio lower than the deterministic lower bound of 2 − 2/(ℓ + 1). It is known that SPL is 31/16-competitive (= 1.9375).

Algorithm SPL: For each item x on the list maintain a pointer p(x) pointing to some item on the list. Initially set p(x) = x. Upon a request for an item x, with probability 1/2 move x to the front and with probability 1/2 insert x just in front of p(x). Then set p(x) to point to the first item on the list.

Algorithm BIT due to Reingold and Westbrook [16] attains a competitive ratio of 1.75 − 1.75/ℓ.

Algorithm BIT: For each item x on the list maintain a mod-2 counter b(x) initially set to either 0 or 1 randomly, independently and uniformly. Upon a request for an item x first complement b(x). Then if b(x) = 0 move x to the front.
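A sketch of BIT. For simplicity this version leaves a newly inserted item at the back rather than treating the insertion as an access, a small deviation from the paper's convention; the seeded generator is only to make the sketch reproducible:

```python
import random

def bit_serve(requests, rng=random.Random(0)):
    """BIT: move to front on every other access, with a random phase per item."""
    lst, total, b = [], 0, {}
    for x in requests:
        if x in lst:
            i = lst.index(x)
            total += i + 1
            b[x] ^= 1                 # complement the mod-2 counter
            if b[x] == 0:             # move to front on the marked phase
                lst.insert(0, lst.pop(i))
        else:
            total += len(lst) + 1
            lst.append(x)
            b[x] = rng.randint(0, 1)  # independent uniform random bit
    return total
```

The random initial bits are what defeat the adversary: it cannot know an item's current phase, so it cannot reliably force the expensive pattern that hurts MTF.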

The following algorithm, called RMTF, is very similar to BIT but somewhat surprisingly its worst case performance is inferior. It can be shown that its competitive ratio is not smaller than 2 [22].

Algorithm RMTF: Upon a request for an item x, with probability 1/2 move x to the front, and with probability 1/2 leave x in place.

The family of algorithms COUNTER(s, S) (CTR(s, S)) due to Reingold, Westbrook and Sleator [17] is a sophisticated generalization of algorithm BIT. Let s be a positive integer and S a nonempty subset of {0, 1, ..., s − 1}.

Algorithm CTR(s, S): For each item x on the list maintain a mod-s counter c(x), initially set randomly, independently and uniformly to a number in {0, 1, ..., s − 1}. Upon a request for an item x, decrement c(x) by 1 (mod s) and then if c(x) ∈ S move x to the front.

Thus, CTR(2, {0}) is BIT. Reingold et al. prove that CTR(7, {0, 2, 4}) is 85/49-competitive (≈ 1.735). From this family we have tested algorithm CTR(7, {0, 2, 4}).

The family of algorithms RANDOM-RESET(s, D) (RST(s, D)) due to Reingold et al. [17] is a variation on the COUNTER algorithms. Let s be a positive integer and D a probability distribution on the set S = {0, 1, ..., s − 1} such that for i ∈ S, D(i) is the probability of i.

Algorithm RST(s, D): For each item x on the list maintain a counter c(x), initially set randomly to a number i ∈ {0, 1, ..., s − 1} with probability D(i). Upon a request for an item x, decrement c(x) by 1. If c(x) = 0 then move x to the front and then randomly reset c(x) using D.
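The COUNTER rule can be sketched in the same style as BIT above. As before, this simplified version leaves newly inserted items at the back, and the seeded generator is only for reproducibility; with s = 2 and S = {0} it coincides with the BIT sketch:

```python
import random

def ctr_serve(requests, s, S, rng=random.Random(0)):
    """COUNTER(s, S): move to front when an item's mod-s counter lands in S."""
    lst, total, c = [], 0, {}
    for x in requests:
        if x in lst:
            i = lst.index(x)
            total += i + 1
            c[x] = (c[x] - 1) % s    # decrement the counter mod s
            if c[x] in S:            # move to front on the marked residues
                lst.insert(0, lst.pop(i))
        else:
            total += len(lst) + 1
            lst.append(x)
            c[x] = rng.randrange(s)  # independent uniform random initial counter
    return total
```

For example, `ctr_serve(seq, 7, {0, 2, 4})` is the variant the paper tested.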

The best RST(s, D) algorithm, in terms of the competitive ratio, is obtained with s = 3 and D such that D(2) = (√3 − 1)/2 and D(1) = (3 − √3)/2. The competitive ratio attained in this case is √3 ≈ 1.732. In this family we have only tested this algorithm.

Let p ∈ [0, 1]. The following family of algorithms, called TIMESTAMP(p) (TS(p)), due to Albers [1], is a kind of randomized combination of algorithm TS and MTF. For each p, TS(p) is proven to be max{2 − p, 1 + p(2 − p)}-competitive.

Algorithm TS(p): Upon a request for an item x, with probability p execute (i): move x to the front; and with probability 1 − p execute (ii): let y be the first item on the list such that either (a) y was not requested since the last request for x; or (b) y was requested exactly once since the last request for x and that request for y was served by the algorithm using step (ii). Insert x just in front of y. If there is no such y, or if this is the first request for x, leave x in place.

The optimal choice is easily shown to be p = (3 − √5)/2, in which case the competitive ratio attained is the golden ratio φ = (1 + √5)/2 ≈ 1.62. From this family we have tested this algorithm and denote it by RTS.

In [3], Albers, von Stengel and Werchner present the following algorithm called COMB, which is a probability mixture of BIT and TS. Algorithm COMB is shown to be 8/5-competitive. To date this bound is the best known for a list accessing algorithm.

Algorithm COMB: Before serving any request choose algorithm BIT with probability 4/5, and algorithm TS with probability 1/5. Serve the entire request sequence with the chosen algorithm.

Although for any data set the empirical performance of COMB can be determined from the performances of BIT and TS, we have tested COMB separately, as a control for the statistical significance of our tests of randomized algorithms.

2.3 Benchmark algorithms. In order to obtain some reference results we tested the performance of the following two "bad" algorithms. The first of the benchmark algorithms acts in a way that at the outset seems to be the worst possible.

Algorithm MTL: Upon a request for an item x, move x to the back of the list.

Although the model requires that we charge MTL for the paid exchanges it performs, we ignored these costs and only counted access costs. The second benchmark algorithm fixes a random permutation of the items on the list.

Algorithm RAND: Upon a request for insertion of an item x, insert x at a random position chosen independently and uniformly among the ℓ + 1 possible positions (where before the insertion there are ℓ items on the list).

3 The experiments performed

Each of the above algorithms was tested with respect to a relatively large data set called the Calgary Corpus. We performed two different experiments. The goal of the first experiment was to rank the access cost performance of the various algorithms within the traditional dynamic model. In the second experiment we applied each algorithm as a data compressor and tested the overall compression ratios obtained. We start with a description of this data set and how request sequences were generated from it; then we give the details of the experiments performed.

3.1 The data set. The Calgary Corpus is a collection of (mainly) text files that serves as a popular benchmark for testing the performance of (text) compression algorithms [5]. The corpus contains nine different types of files and consists of 17 files overall. In particular, this corpus contains books, papers, numeric data, a picture, programs and object files. We used each file in the corpus to generate two different request sequences. The first sequence was generated by parsing the file into "words" and the second, by reading the file as a sequence of bytes. With respect to the word parsing, a word is defined as the longest string of non-space characters. For some of the non-text files in the corpus (e.g. pic, geo) this parsing does not yield a meaningful request sequence and therefore we ignored the results corresponding to such sequences. Table 3 in Appendix B specifies for each file and for each of these parsings the length of the request sequence generated, the number of distinct requests and the ratio between these two quantities. The information in this table (in particular these ratios) can be used to identify less (and more) "significant" sequences. For example, the (word) sequences generated from obj1 and obj2 are of minor interest since on average almost every word in the request sequence is new (the ratios of total number of words to distinct words are 1.63 and 1.46, respectively). According to this measure we would expect the four word-level sequences corresponding to the files progp, progl, book2 and book1 to be of greater significance than the rest of the word-level sequences. In general, as can be seen in the table, the two kinds of request sequences (word- and byte-based) are considerably different, with the word-based sequences resulting in very long lists and relatively short sequences (compared to the list length). On the other hand, the byte-based sequences correspond to very short lists and very long sequences.

3.2 Experiment 1: access cost performance. In this experiment each algorithm started with an empty list and served request sequences that were generated from each of the Corpus' files. Our primary concern here was to rank the algorithms according to their total access costs and study the relationships between their competitive (theoretical) performances. To obtain statistically significant results each of the randomized algorithms (except for COMB) was executed 15 times per request sequence. The results for COMB were computed as a weighted average from the

costs obtained for TIMESTAMP and the (average) costs obtained for BIT (recall that COMB is a probability mixture of these two algorithms).

3.3 Experiment 2: compression performance. In this experiment we tested the performance of all algorithms as data compressors using the compression scheme of Bentley et al. [8]. Specifically, in this scheme, whenever a dictionary word is to be encoded, a binary encoding of its current position on the list is transmitted (using some variable length prefix code). After that the list accessing algorithm rearranges the list as if the word were accessed. If the word is not on the list, the word itself must be transmitted. In this case we encoded the word as a stream of bytes using a secondary byte-level list accessing compressor that starts with all 256 possible bytes on the list. Both word and byte compressors were implemented using the same algorithm. To restore the data the receiver performs the inverse operations. That is, the receiver can recover each word or byte whose location was transmitted. None of the algorithms was actually implemented to compress and restore data. Note that in order to do that, each of the word and byte lists must contain a special "escape" symbol that must be transmitted each time there is a transition between the word and byte lists and vice versa. In our experiment these extra charges were ignored, so the results obtained in this experiment are useful only for initial comparison between the various algorithms. However, note that the results of this experiment do not count the encoding of new words that are not already on the list.

Each algorithm was tested with respect to various variable length binary prefix encodings. In particular, we have tested each algorithm with 6 different encodings including δ-encoding, Elias' ω, start-step-stop and one new encoding scheme (to the best of our knowledge). Except for the new encoding scheme, all the other schemes are defined in [4]. A brief description of the encoding schemes used appears in Appendix A.

3.4 Previous empirical studies. A few empirical studies of the performance of online list accessing algorithms have been conducted. Bentley and McGeoch [7] tested the performance of MTF, FC and TRANS with respect to request sequences generated from several text files (4 files containing Pascal programs and 6 other English text files). The sequences were generated from the text files by parsing the files into "words", with a word defined as an "alphanumeric string delimited by spaces or punctuation marks". It was found that FC is always superior to TRANS and that MTF is often superior to FC. Tenenbaum [21] tested the performance of various algorithms from the MOVE-AHEAD(k) family (MHD(k)) with respect to request sequences distributed by Zipf's law. A few other simulation results testing various properties of particular algorithms with respect to sequences generated via Zipf's distribution are summarized in a survey by Hester and Hirschberg [13].

There are also a few empirical tests of the performance of some list accessing algorithms applied to text compression. Bentley, Sleator, Tarjan and Wei tested the performance of MTF-compression algorithms with various list ("cache") sizes with respect to several text files (containing several C and Pascal programs, book sections and transcripts of terminal sessions). They also compared the algorithms' performance with that of Huffman coding and found that for a sufficiently large cache size (e.g. 256) MTF compression is "competitive" with Huffman coding. Albers and Mitzenmacher [2] compared the compression performance of algorithm TIMESTAMP with that of MTF with respect to the Calgary Corpus. They considered both word and byte (character) parsings. Using Elias prefix encoding (see Appendix A) they obtained the following results. TIMESTAMP-compression is significantly better than MTF-compression with respect to character (byte) encoding, but both are not "competitive" with standard Unix compression utilities. With respect to "word" encoding TIMESTAMP is often (only marginally) better than MTF compression. The word-based compression performance of both these algorithms is found to be close to that of standard Unix compression utilities. (For each of the algorithms the absolute compression performance is in general somewhat worse than the one reported here.)

A few other studies compare the performance of particular, more sophisticated list accessing compressors that alter the basic scheme. Burrows and Wheeler [10] tested the performance of an MTF compressor that operates on data that is first transformed via a "block-sorting" transformation. Grinberg et al. [12] tested the performance of an MTF compressor that uses "secondary lists". We note (based on their results and ours) that these more sophisticated schemes achieve in general better compression results than the basic scheme.

From the above, the only comparable studies are those of Bentley and McGeoch and of Albers and Mitzenmacher. First we note that our study supports the qualitative results of both these studies. However, they are not comparable quantitatively (the Bentley-McGeoch study is incomparable to ours as it used a different corpus; the quantities reported in the Albers-Mitzenmacher study are incomparable to ours as they have not measured the transmission costs of new words). Compared to the known empirical studies, the results reported here are significantly more comprehensive and provide insights into many algorithms, among which some that have never been tested. In addition, our results test the performance of list accessing algorithms with respect to both dictionary maintenance and compression applications.
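The basic compression scheme can be sketched as follows, with MTF as the list algorithm and an Elias gamma code standing in for the variable length prefix code. The escape mechanism and the secondary byte-level list are omitted, and signalling a new word by the position one past the end of the list is our own simplification, not part of the scheme as described:

```python
def gamma(n):
    """Elias gamma code of a positive integer, as a bit string."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def mtf_compress(words):
    """Encode each word's current 1-based list position, then move it to the front."""
    lst, out = [], []
    for w in words:
        if w in lst:
            i = lst.index(w)
            out.append(gamma(i + 1))          # transmit the current 1-based position
            lst.pop(i)
        else:
            out.append(gamma(len(lst) + 1))   # position past the end signals a new word;
            # the word itself would then be sent by the byte-level coder (omitted here)
        lst.insert(0, w)                      # MTF rearranges as if w were accessed
    return "".join(out)
```

Because frequently repeated words cluster near the front, their positions are small integers, which the prefix code encodes in few bits; any of the list algorithms of Section 2 could be substituted for the MTF step.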

latter two. The performance relation between MTF and FC is somewhat different with respect to the byte-based sequences. Here in most instances FC significantly outperforms MTF (MTF is better than FC w.r.t objl, obj2, paper6 and paper6). One possible hypothesis could be 4 Experiment 1: access cost performance that in the instances where FC outperforms MTF the request sequence resembles more closely one that was genAlthough some of the results are hard to interpret there erated via a probability distribution. Nevertheless, this are certain consistent results that are quite interesting hypothesis is not supported by the fact that in all such and/or surprising. First, consider Table 1 which speciinstances MTF outperforms TRANS.~ For example, this fies the four best algorithms with respect to each of the is also the case w.r.t the two longest byte-based request individual files. The table also specifies the ratio of the book1 and book2 (lengths sequences corresponding to worst cost obtained (always by MTL) to the best. It is interesting to note that the initial hypothesis that the of 768,771 and 610,856 requests, respectively). Although the randomized algorithms performed more relevant (word-level) request sequences are those algorithms it is that exhibit large ratios of total number of requests to poorly relative to the best deterministic still interesting to rank their relative performance. With distinct items is supported by the ratio of best perforrespect to the byte-level sequences the best randomized mance to worst performance in Table 1. For example, the four most “significant” sequences according to the algorithm was always COMB. The worst performance was most often obtained by RMTF and RST. (Note that first measure, corresponding to the files progp, progl, book2 and book1 are exactly those that generated the we do not consider here algorithm RAND, which was always the worst randomized algorithm.) In all sequences largest ratios of best to worst performance. 
It is most striking that in all cases the best performance is obtained by deterministic algorithms. Among the best performing algorithms with respect to word-based sequences are MF(2), FC and members of the PRI and MRI families. Among the best algorithms with respect to byte-based parsing are FC, members of the MRI and PRI families, and MHD(8). Notably, FC(2) is only 4-competitive and FC is not competitive at all. The members of MRI(m), m ≥ 1, and PRI(m) (which includes TIMESTAMP = PRI(1)) are 2-competitive. However, algorithm MRI(0) is also not competitive.

Not only is the best performance often obtained by (deterministic) algorithms that are not optimal, the best performing deterministic algorithms consistently and significantly outperform all randomized algorithms. These results stand in contrast to what could be expected given the worst-case, competitive analysis results. Note that the costs obtained for the randomized algorithms are averages of 15 independent runs for each request sequence (and these averages were found to be statistically significant).

The results obtained in our experiment support the qualitative results obtained in the Bentley-McGeoch experiment. In all instances MTF outperformed TRANS.¹ Also, in the word-based sequences, MTF often outperforms FC. With regard to the more significant word-based sequences (progp, progl, book2 and book1), the ratio of FC to MTF cost is approximately 1.13, 1.19, 0.98 and 0.91, respectively. Thus, MTF outperformed FC in the first two sequences and under-performed FC in the last two.

¹With respect to all non-trivial distributions it is known that TRANS has better asymptotic performance than MTF [18].

Table 1: The four best algorithms for the individual corpus files, with respect to word and byte parsing of each file. (Columns: file name; top four algorithms, word partition; cost ratio 4th/1st, MTL/1st; top four algorithms, byte partition; cost ratio 4th/1st, MTL/1st.)

Among the randomized algorithms, the second best algorithm was either TS(p) or BIT. Note that COMB and TS(p) are the best algorithms in terms of their competitive ratio. In general, the ratio of the worst (randomized) performance to the best one was in the range 1.04-1.08, with an anomaly observed with respect to the files paper4, paper5 and paper6, which resulted in the ratios 1.14, 1.17 and 1.33. (Interestingly, these papers were written by the same author.) Thus, in general the randomized algorithms achieved similar performance. With respect to the four more "significant" word-level sequences (progp, progl, book2 and book1), the best randomized algorithms were BIT (progp, progl) and SPL (book1 and book2). Whenever BIT was the best, the ratio of worst to best (among the randomized algorithms) was significantly higher (1.15, 1.17) than the corresponding ratio for the cases where SPL was the best (1.07, 1.04).

Inspection of the relative performance of the MOVE-TO-FRONT based algorithms (MTF, MTF2, BIT and RMTF) yields the following conclusions. The worst performance was almost consistently obtained by RMTF. In the byte-level sequences the best algorithm was very often MTF2, then MTF, and a few times BIT. Among the more significant word-level sequences MTF was mostly the favorite.

With respect to the MHD(k) algorithms, no consistent monotonicity was observed between performance and the parameter k. It is interesting to note that in various instances of the word-level sequences MHD(k) performed significantly worse than TRANS (= MHD(1)). This fact is quite surprising given that for such sequences (e.g. book2) more "aggressive" algorithms are expected to perform better.

The MRI(m) and PRI(m) families exhibited consistent and almost perfect monotonicity in terms of their performance as a function of the parameter m. For example, perfect monotonicity was obtained with respect to the byte-level sequence generated from book2. In many cases the algorithms MRI(0) and PRI(0) achieved the best performance. With respect to the word-level sequences the performance was sometimes increasing with m and sometimes decreasing with m. Curious results were obtained for the sequence generated by book2, where the PRI(m) algorithms achieved better performance with lower values of m and the MRI(m) algorithms with larger
values. Based on these empirical results, and taking into account the time and space complexities of the better performing algorithms, one of the best algorithms in terms of access cost performance and complexity is MF(2). In fact, after observing this we performed some initial experiments with various other algorithms in the family MF(k) and found that the algorithm MF(3/2) (i.e., the algorithm that advances a requested item 2/3 of its way to the front) was a champion in most of the byte-level sequences. Hence, if we were forced to choose one single list accessing algorithm for the purpose of dictionary maintenance (based on these empirical results), it would have been MF(k) with some k ∈ [3/2, 2]. Nevertheless, it is important to note that although the members of the MRI (and PRI) families are time and space consuming, the combination of their empirical and theoretical performance is among the best.

5 Experiment 2: data compression
Table 2 summarizes the overall average compression ratios (bits/char) obtained by the algorithms with respect to the various binary encoding schemes used. The averages are calculated with respect to the entire corpus. In this table each row corresponds to one algorithm and each column corresponds to one particular binary encoding. Surprisingly, the best average compression ratios are obtained by FC and MRI(0). Similar performances were obtained by various other members of the MRI and PRI families and by MHD(k). These results were obtained using the RB-encoding scheme. We note that, with the exception of algorithms MF(4), MF(3) and PRI'(0), the RB-encoding yielded the best compression ratios for all algorithms. The worst performance (with RB-encoding, and not including algorithms MTL and RAND) was obtained by algorithms MF(3), MF(4) and TRANS. Note that algorithms FC and MRI(0) were the best regardless of the binary encoding used. A striking fact is that the ranking of the randomized algorithms (RB-encoding) almost matched their competitive ratios: from best to worst the performance was obtained by COMB, TS(p), BIT, CTR, RST, SPL and RMTF. Note that algorithm RST is theoretically superior to algorithm CTR. Note also that a lower bound on the competitive ratio of BIT is not known. Nevertheless, it is important to mention that the best performance of COMB was consistently due to the deterministic algorithm TIMESTAMP (PRI(1)), which consistently outperformed BIT. Finally, we note that the results of this experiment support the qualitative results of the Albers-Mitzenmacher compression experiment [2]. Namely, the compression ratio obtained by algorithm TIMESTAMP was consistently better than that obtained by MTF compression (regardless of the binary encoding used).
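The compression scheme of Bentley et al. [8] used in this experiment replaces each symbol by its current position in a self-organizing list and then encodes that integer with a variable-length prefix code. A minimal sketch of the position-emitting stage, using MTF as the list manager (the function name and demo inputs are ours, not from the paper):

```python
def mtf_encode(seq):
    """Encode a symbol sequence as 1-based list positions using Move-To-Front."""
    lst = []   # the self-organizing list, front at index 0
    out = []
    for x in seq:
        if x in lst:
            pos = lst.index(x) + 1   # access cost: 1-based position
            lst.remove(x)
        else:
            pos = len(lst) + 1       # first occurrence: "inserted" at the back
        lst.insert(0, x)             # free exchanges: move accessed item to front
        out.append(pos)
    return out
```

Under locality of reference the emitted positions are small, which is exactly what the variable-length prefix encodings of Appendix A reward with short codewords; a decoder maintaining the same list can invert the mapping.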

Table 2: Overall average results for the compression experiment. Each entry gives the average compression ratio, bits/byte, for the entire corpus. The best results appear in boldface and the three best results are marked with stars.

6 Concluding remarks

Some of the results reported in this paper stand in contrast to various theoretical studies of the list accessing problem. Nevertheless, it is important to keep in mind that these results are based on performance measured with respect to a particular data set that can be criticized. One major criticism is that the sequences generated belong to two rather extreme families of request sequences. In the word-based sequences the typical ratio of total number of requests to distinct requests is on the low side. On the other hand, the byte-based sequences have a small number of distinct requests and, due to the nature of the particular files in the corpus, it may be the case that they do not exhibit enough locality of reference (e.g. many of the files are English text files). Note, however, that this particular criticism does not apply to the compression experiment. Thus, it would be of great interest to investigate the statistical properties of these sequences in order to validate the results presented here. In particular, it would be of major importance to devise a meaningful, quantitative measure of locality of reference that could be used to classify request sequences and to further investigate the correlation between the various algorithms and their performance with respect to such sequences. To the best of our knowledge, no such measure has been studied. Also, it would be of great importance to put together an appropriate corpus that could be used to test the performance of data structures and algorithms for dictionary maintenance.

Although the results concerning compression performance do not give true compression ratios, they do give lower bounds. These results clearly indicate that the list accessing compression scheme by itself will not give compression ratios that are competitive with popular compression algorithms such as those based on Lempel-Ziv schemes. Nevertheless, our very preliminary experimentation with more sophisticated list accessing based compression algorithms suggests that significantly better results can be obtained using more sophisticated variations on the basic scheme. For example, it would be very interesting to experiment with secondary lists [12], the Burrows-Wheeler block-sorting transformations [10], two different list managers (one for new words and one for words already on the list), and dynamic transitions between different basic list accessing algorithms in order to adapt to changing levels of locality of reference. Also, it would be important to explore various other variable-length prefix encodings (such as members of the START-STEP-STOP scheme) and attempt to match them to the list manager behavior.

Acknowledgments

We thank Susanne Albers, Allan Borodin, Brenda Brown, David Johnson and Jeffery Westbrook for useful comments.

References

[1] S. Albers. Improved randomized on-line algorithms for the list update problem. In Proceedings of the 6th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 412-419, 1995.
[2] S. Albers and M. Mitzenmacher. Average case analyses of list update algorithms, with applications to data compression. Technical Report TR-95-039, International Computer Science Institute, 1995.
[3] S. Albers, B. von Stengel, and R. Werchner. A combined BIT and TIMESTAMP algorithm for the list update problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, 1995.
[4] T. Bell, J.G. Cleary, and I.H. Witten. Text Compression. Prentice Hall, 1990.
[5] T. Bell, J.G. Cleary, and I.H. Witten. Text Compression, Appendix B. Prentice Hall, 1990. The Calgary Corpus is available at ftp site: ftp.cpsc.ucalgary.ca.
[6] J.L. Bentley, K.L. Clarkson, and D.B. Levine. Fast linear expected-time algorithms for computing maxima and convex hulls. In Proceedings of the 1st ACM-SIAM Symposium on Discrete Algorithms, pages 179-187, 1993.
[7] J.L. Bentley and C. McGeoch. Amortized analysis of self-organizing sequential search heuristics. Communications of the ACM, 28(4):404-411, 1985.
[8] J.L. Bentley, D.D. Sleator, R.E. Tarjan, and V.K. Wei. A locally adaptive data compression scheme. Communications of the ACM, 29(4):320-330, 1986.
[9] A. Borodin, N. Linial, and M. Saks. An optimal online algorithm for metrical task systems. Journal of the ACM, 39:745-763, 1992.
[10] M. Burrows and D.J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Systems Research Center, 1994.
[11] R. El-Yaniv. There are infinitely many competitive-optimal online list accessing algorithms, June 1996. Submitted to SODA 97.
[12] D. Grinberg, S. Rajagopalan, R. Venkatesan, and V.K. Wei. Splay trees for data compression. In Proceedings of the 6th Annual ACM-SIAM Symposium on Discrete Algorithms, 1995.
[13] J.H. Hester and D.S. Hirschberg. Self-organizing linear search. ACM Computing Surveys, 17(3):295-312, 1985.
[14] S. Irani. Two results on the list update problem. Information Processing Letters, 38(6):202-208, June 1991.
[15] J. McCabe. On serial files with relocatable records. Operations Research, 13:609-618, July 1965.
[16] N. Reingold and J. Westbrook. Randomized algorithms for the list update problem. Technical Report YALEU/DCS/TR-804, Yale University, June 1990.
[17] N. Reingold, J. Westbrook, and D. Sleator. Randomized competitive algorithms for the list update problem. Algorithmica, 11:15-32, 1994.
[18] R. Rivest. On self-organizing sequential search heuristics. Communications of the ACM, 19(2):63-67, February 1976.
[19] D.D. Sleator and R.E. Tarjan. Amortized efficiency of list update and paging rules. Communications of the ACM, 28(2):202-208, 1985.
[20] B. Teia. A lower bound for randomized list update algorithms. Information Processing Letters, 47:5-9, 1993.
[21] A. Tenenbaum. Simulations of dynamic sequential search algorithms. Communications of the ACM, 28(2):790-791, 1978.
[22] J. Westbrook. Personal communication, 1996.

A Variable length prefix free binary encodings

We have used the following six variable-length prefix binary encodings. The precise definition of each of the first five codes can be found in [4], Appendix A. Here we specify the lengths obtained by some of the encodings and provide a brief description if the length is not a closed-form formula. Also, we give the definition of the new encoding, called the RB-encoding.

• δ-encoding: encodes i with 1 + ⌊log i⌋ + 2⌊log(1 + ⌊log i⌋)⌋ bits.

• Elias encoding: encodes i with 2⌊log i⌋ + 1 bits.

• ω-encoding: i is encoded by writing the binary representation of i, b(i), then appending (to the left) the binary representation of |b(i)| − 1. This process is repeated recursively and halts on the left with a coding of 2 bits. A single zero is appended on the right to mark the end of the code.

• J-encoding: similar to the ω-encoding, but halts with 3 bits.

• START-STEP-STOP: the START-STEP-STOP family produces a great variety of codes. Each code in this family is specified by three parameters (see the exact definition in [4], Appendix A). From this family we have tested the (1,2,5)-encoding.

• The RB-encoding: this code assumes that at each time an upper bound B on the maximum possible integer to be encoded is known. The idea is straightforward: we write the length of the binary encoding concatenated with the binary encoding (excluding the most significant bit). In order to exploit all possible codes we use the following variation. Define c = ⌈log(⌈log B⌉ + 1)⌉. Set f = 2^c − ⌈log B⌉ − 1; that is, f is the number of unused codes. Let 1 ≤ i ≤ B be an integer to be encoded. If i ≤ f we encode 2^c − i using c bits. Otherwise, i > f, and we encode the number ⌈log(i − f)⌉ with c bits, followed by i − f in binary, excluding its most significant bit, using ⌈log(i − f)⌉ − 1 bits.
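For concreteness, here is a sketch of the Elias code and the δ-code using the standard constructions consistent with the length formulas above (the function names are ours; see [4], Appendix A, for the exact definitions):

```python
def elias_gamma(i):
    """Elias code for i >= 1: |b(i)|-1 zeros, then b(i).

    Total length: 2*floor(log i) + 1 bits, matching the formula in the text.
    """
    b = bin(i)[2:]                  # the binary representation b(i)
    return "0" * (len(b) - 1) + b

def elias_delta(i):
    """Delta code for i >= 1: Elias-code |b(i)|, then b(i) without its MSB.

    Total length: 1 + floor(log i) + 2*floor(log(1 + floor(log i))) bits.
    """
    b = bin(i)[2:]
    return elias_gamma(len(b)) + b[1:]
```

Both codes are prefix free: a decoder counts the leading zeros to learn how many further bits to read, so codewords for small integers (frequent under locality of reference) stay short.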

B The Calgary Corpus

For each of the 17 files in the Calgary corpus, Table 3 specifies its length, the number of distinct elements and the ratio between these two, according to byte parsing and word parsing. The Calgary Corpus can be downloaded from the ftp site ftp.cpsc.ucalgary.ca.

Table 3: Total number of requests, number of distinct requests, and the ratio between them, as generated from each file via byte and word parsing.
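The statistics of Table 3 can be reproduced for any file along the following lines. This is a sketch: the paper does not specify its exact word tokenization here, so the regex below is an assumption, as is the function name.

```python
import re

def request_stats(text):
    """Total requests, distinct requests, and their ratio, under byte and word parsing."""
    parsings = {
        "byte": list(text.encode("latin-1", "replace")),  # one request per byte
        "word": re.findall(r"\w+", text),                  # assumed word tokenization
    }
    stats = {}
    for name, reqs in parsings.items():
        total, distinct = len(reqs), len(set(reqs))
        stats[name] = (total, distinct, total / distinct if distinct else 0.0)
    return stats
```

As the concluding remarks note, the two parsings sit at opposite extremes: byte parsing yields few distinct requests repeated many times (a high ratio), while word parsing yields many distinct requests each repeated few times (a low ratio).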