Mixed Static/Dynamic Profiling for Dictionary Based Code Compression

E. Netto¹ (CEFET-RN / IC-UNICAMP, [email protected])
R. Azevedo, P. Centoducatte, G. Araujo (IC-UNICAMP, {rodolfo, ducatte, guido}@ic.unicamp.br)

Abstract

Many compression techniques have been proposed to fit ever-growing software into the restricted memory of embedded systems. Recently, these techniques have also been shown to improve other important design constraints, such as energy and performance. This paper proposes a blended dictionary model based on mixed static/dynamic profiling that leads to the best trade-offs among compression, performance, and energy savings. We also propose a new dictionary-based code compression algorithm, independent of the cache organization and processor, to support our experiments. A mix of benchmarks from the Mediabench and MiBench suites reveals that compression ratios of 75% can be obtained while decreasing bus accesses to the cache by 31% for the Leon processor. These results simultaneously approach the best solutions obtained with dictionaries based on purely static or purely dynamic information.

1. Introduction

System-on-Chip (SoC) architectures are expanding from traditional low-end applications to sophisticated, resource-demanding systems. One common choice is replacing 8/16-bit CPUs with robust 32-bit RISC/DSP processors. These modules come from third-party companies (as IP cores) and are integrated into SoCs along with a complex memory subsystem and I/O. The price of high performance is tighter requirements on energy efficiency and chip area. Code density, on the other hand, is not the best feature of RISC processors: their regular, easily decodable instruction set architectures (ISAs) have the side effect of increasing code size. This price is too costly for the embedded market, and so researchers have devoted much effort to reducing the memory footprint while keeping the desirable performance and regularity of these processors.

¹ Supported by CNPq grants #94949493/2001 and #9349494/2001-1.

Code compression is a possible solution to squeeze code size. Techniques from the data compression field have served as the basis for methods that can be applied to code. Unfortunately, special requirements such as random access and on-the-fly (fast) decompression rule out some outstanding data compression methods. Since the Compressed Code RISC Processor (CCRP) [1] was introduced, researchers have shown that the benefits of compression go beyond reducing code size, also reducing energy consumption and improving performance [2,3,4,5]. The best results for energy and performance are obtained when the decompressor engine is placed between the processor and the cache (Processor-Decompressor-Cache architecture, PDC). In this arrangement, instructions are kept compressed in the cache, effectively increasing its capacity. The critical path on which the decompressor resides, however, imposes even tougher restrictions on the design of the compression engine. Usually, the simpler the compression technique, the simpler (and probably faster) the decompressor, although the compression ratio may not be the best possible.

A frequently used way to simplify decompression is a small table that works as an instruction dictionary. The original instructions in the code are replaced by indices into this dictionary; as an index is smaller than the original instruction (for small tables), the result is a smaller (compressed) code. Commonly, indices are 8 bits long for 256-entry dictionaries. Another common idea comes from the "10/90" rule of thumb, which states that most (90%) of the code is rarely (10% of the time) executed. This opens up the opportunity for more sophisticated compression methods, such as those based on arithmetic coding, applied only to the "cold" code, which usually minimizes the execution overhead due to the decompressor [4,5]. This approach uses dynamic profiling as a source of information for the compression algorithm. Most dictionary-based compression techniques gather instruction statistics either by statically counting occurrences in the code [3,6] or by using dynamic profiling information to obtain the most executed instructions [2].
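As a rough illustration of the dictionary idea, and not the specific scheme of any of the cited works, the following Python sketch builds a 256-entry dictionary from static instruction counts and replaces dictionary hits with 8-bit indices. The instruction words and the toy program are hypothetical, and real schemes also need escape bits to distinguish indices from uncompressed words.

```python
from collections import Counter

def build_dictionary(code_words, size=256):
    """Pick the `size` most frequent 32-bit instruction words (static count)."""
    return [word for word, _ in Counter(code_words).most_common(size)]

def compress(code_words, dictionary):
    """Replace dictionary hits with 1-byte indices; keep the rest as 4-byte words."""
    index_of = {word: i for i, word in enumerate(dictionary)}
    return [("idx", index_of[w]) if w in index_of else ("raw", w) for w in code_words]

# Hypothetical toy program: the repeated word compresses to 1-byte indices.
program = [0x01000000, 0x9DE3BF98, 0x01000000, 0x81C7E008, 0x01000000]
packed = compress(program, build_dictionary(program))
size_after = sum(1 if kind == "idx" else 4 for kind, _ in packed)
print(f"compression ratio = {size_after / (4 * len(program)):.0%}")
```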

In the present work we propose blending dictionaries based on static and dynamic profiling, to take advantage of both compressibility and performance, reducing the cache miss ratio and bus traffic simultaneously. We also develop a compression method, based on a dictionary and independent of the cache and processor, that explores the best opportunities related to code compression for both dynamic and static instruction counts. This compression method performs as well as previous PDC-based dictionary methods in terms of compressibility, reaching an average 25% code size reduction (compression ratio = 75%).

We organize this paper as follows: Section 2 presents the related work. In Section 3 we explain the basic observations that led us to propose the blended dictionary, and we outline the algorithm used for unification. The compression method is described in Section 4. Section 5 presents the experimental results and, finally, Section 6 presents our conclusions.

2. Related work

The work of Lekatsas et al. [3,6] and Benini et al. [2] is the closest to ours, as they use small dictionaries and PDC architectures. We describe next some of their main ideas.

In [3] the SPARC ISA is split into four classes: instructions with immediates, branches, fast dictionary, and uncompressed. Instructions with immediates are compressed using arithmetic coding and account for most of the static occurrences in the code. Branches are compressed using an extra field that defines the size of the displacement. Instructions with no immediates that are not branches are allocated to a 256-entry dictionary: the most frequently appearing instructions, based on static counts, occupy the dictionary, while the remaining instructions fall into the uncompressed class. Instructions are then prefixed with a variable-sized, uniquely decodable preamble that identifies their class. Uncompressed instructions receive a 3-bit preamble, yielding a 35-bit instruction, which may imply a performance penalty if the instruction belongs to a frequent execution trace. The decompressor engine is a set of parallel pipelines, one for each class, but neither its area nor its speed is reported.

In [6] the Xtensa 1040 uses a pure static dictionary technique. The advantage of this platform is that it already has a variable-sized ISA with narrower 24-bit and 16-bit instructions, while code is 32 bits wide. The 24-bit instructions are then compressed to 8 or 16 bits. The decompressor is able to identify whether an instruction is compressed or not, along with its size and starting bit, and can access the dictionary when necessary in just one cycle. Auxiliary registers hold pieces of instructions that arrive incomplete from the cache, requiring a second cycle to reassemble the original instruction. The performance improvement, in terms of cycle count, is approximately 25%, and the compression ratio (only vaguely mentioned) is on average 65%. No specific cache performance report is available.

The compression method proposed in [2] uses a PDC architecture for the DLX, which integrates the decompressor engine control with the cache controller. The compression algorithm uses a cache line as the repository for compressed instructions. A dynamic dictionary² of 256 entries supplies information to the compressor. The cache line in the experiments is 4 × 32 bits long. An invalid instruction opcode (a sentinel) identifies whether a cache line is compressed or not. The first 32 bits of each compressed cache line hold the sentinel along with a set of auxiliary flags that signal the presence of compressed/uncompressed instructions in each byte of the remaining line. If the sentinel is not present, the entire cache line is uncompressed. Instructions are not allowed to cross a cache-line boundary, and branch targets are always word-aligned, as in [6]. The compressor only compresses a cache line if more than four instructions, including those that belong to the dictionary, fit into it (at least one has to be compressed); otherwise it keeps the original line, since no compression would result. The experiments show a compression ratio of 72% on average and a cache hit ratio increase of 19%. Finally, energy consumption is reduced by 35%. One side effect of this implementation is that the cache access time is increased.

Our compression algorithm differs from Lekatsas' [3] because it neither splits instructions into classes nor combines different methods, simplifying decompression without much area overhead. It differs from [6] because it uses regular-size instructions and fixed index slots to simplify decompression; moreover, no decompression activity involves more than one access to the cache. It also differs from Benini's [2] because it is not coupled to the memory system, so no overhead in cache access time is incurred. Furthermore, although scalable, Benini's approach depends on the cache organization: for every cache line size, adaptations are necessary. Our approach, on the other hand, is completely independent of the cache organization. Beyond that, we allow unaligned branch targets, which are forbidden in all the aforementioned works. Finally, we could not find any work in code compression that makes effective use of mixed static/dynamic profiling.

² The dictionary is constructed by dynamically profiling instruction execution.

Figure 1: Similarities between dictionaries. Number of redundant instructions (and the corresponding percentage) versus number of dictionary entries, for Search, Pegwit, Djpeg, Cjpeg, Adpcm_enc, Adpcm_dec, and Dijkstra.

3. Static vs. dynamic profiling: how close they are

One of the key insights of our work comes from the fact that a small ordered dictionary (256 entries) based on static instruction counts (most frequent first) has little in common with its counterpart based on dynamic profiling. Notice from the highlighted grid lines in Figure 1 that less than 25% of the instructions are redundant in the first half of each dictionary (first 128 entries), with an average of 13%. This implies that many opportunities for compression are lost when choosing the dynamic profile information. On the other hand, if the choice is a static dictionary, many opportunities for reducing cache misses and bus traffic are lost. Considering these observations, we propose a Unified Dictionary, UniD, to exploit simultaneously the compression benefits of the static counts and the performance improvements due to the dynamic measurements of instruction usage. As far as we know, this is the first work to approximate the best results of both dictionary techniques at the same time.
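For concreteness, here is a hedged sketch of the measurement behind Figure 1: given the two ordered dictionaries, count how many of their first n entries coincide. The construction of the ordered histograms is assumed, not shown.

```python
def redundancy(sd, dd, n):
    """Instruction words present among the first n entries of both ordered dictionaries."""
    return len(set(sd[:n]) & set(dd[:n]))

# Figure 1 plots this count for n = 1..256; an average 13% overlap at n = 128
# means redundancy(sd, dd, 128) is around 17 for a typical benchmark.
```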

3.1. Blending algorithm

The algorithm used to blend the two dictionaries elects the most used instructions from both the dynamic and the static scenarios to fit into the UniD. The assumption is that choosing the instructions that appear most often statically helps the compression ratio, while choosing the instructions that are fetched most often helps performance. The first step in our approach is to prepare two ordered histograms of instructions, for the static dictionary (SD) and the dynamic dictionary (DD). We take the first (most fetched) instruction from the DD and add it to the UniD. Then, we take the first (most frequently appearing) instruction from the SD and check whether it is already in the UniD; if so, we discard it, otherwise we include it. This prevents duplicated instructions from entering the resulting dictionary. We repeat the last step, toggling between the DD and the SD, until 256 distinct instructions have been added to the UniD.
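The blending step can be summarized with the Python sketch below. It assumes the DD and SD lists are already sorted by fetch count and static count, respectively; the names are ours, and the handling of discarded duplicates (advance the toggle rather than retry the same source) is our reading of the description above.

```python
def blend_dictionaries(dd_sorted, sd_sorted, size=256):
    """Interleave the most fetched (DD) and most frequent (SD) instruction words,
    skipping duplicates, until `size` distinct entries form the UniD."""
    unid, seen = [], set()
    sources = [iter(dd_sorted), iter(sd_sorted)]   # start with the DD
    turn = 0
    while len(unid) < size:
        try:
            word = next(sources[turn])
        except StopIteration:
            break  # a source ran dry; in practice both hold at least `size` entries
        if word not in seen:
            seen.add(word)
            unid.append(word)
        turn ^= 1  # toggle DD <-> SD even when a duplicate was discarded
    return unid
```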

4. Compression method

The compression method we devise here can use the SD, the DD, or the UniD dictionary, so comparisons are fairly accurate, without the interference of a better or worse algorithm. The method uses a special 32-bit word that keeps three 8-bit slots for indices into the dictionary and a sentinel (an unimplemented opcode). We name this word ComPacket (Compressed Packet); Figure 2 depicts its structure. Since a sentinel for the SPARC ISA is obtained with only 5 bits (op = 00 and op2 = 000, in binary), we use the remaining 3 bits to signal special characteristics of the ComPacket. Bit S identifies the number of indices present in the ComPacket, allowing us to pack 2 or 3 indices of instructions in the dictionary. Packing only one index would not yield any compression, as a ComPacket is 32 bits long, the same size as an uncompressed instruction. Whenever S = 0, three indices are present; otherwise, only two are available. The TT pair of bits indicates at which index the decompressor should begin decompressing in the case of a branch into the ComPacket. This means that branch targets are not necessarily word-aligned, but it imposes that only one target can be present in each packet. When TT is 00, the first indexed instruction is a target; when TT = 01, the second; and when TT = 10, the third.
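To make the packing concrete, here is a hedged sketch that packs two or three dictionary indices into a 32-bit ComPacket. The exact bit positions of the sentinel, S, and TT fields are an assumption loosely following Figure 2 (first byte holds the sentinel and flags, the remaining three bytes hold slots A, B, and C); they are not a verified SPARC unimp encoding.

```python
def make_compacket(indices, target_slot=0b00):
    """Pack 2 or 3 dictionary indices (0..255) into one 32-bit ComPacket.
    Assumed layout: bits 31..27 = sentinel (00 000), bit 26 = S, bits 25..24 = TT,
    bits 23..16 = slot A, bits 15..8 = slot B, bits 7..0 = slot C."""
    assert len(indices) in (2, 3)
    s = 0 if len(indices) == 3 else 1            # S = 0 -> 3 indices, S = 1 -> 2 indices
    slots = list(indices) + [0] * (3 - len(indices))
    word = (0b00000 << 27) | (s << 26) | (target_slot << 24)
    return word | (slots[0] << 16) | (slots[1] << 8) | slots[2]

print(hex(make_compacket([0x12, 0x34, 0x56])))   # prints 0x123456 (S = 0, TT = 00)
```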

4.1. The compression algorithm

The first step of the compression algorithm marks in the code all the instructions that belong to the dictionary; they receive a (D) mark. Instructions that are targets of branches/calls receive an extra mark (T). In the second step, triples of adjacent instructions that are marked (D) and have at most one (T) mark are found; when this condition applies, they are assembled into a size-3 ComPacket (3 indices into the dictionary). This step is then repeated for pairs of adjacent (D)-marked instructions. Finally, the branches in the code are patched to reflect the new addresses.

Figure 3 depicts the compression algorithm for instructions a to i. It does not compress a, b, and c together because two targets are present, so the smaller ComPacket is chosen to pack a and b. Instructions c, d, and e are compressed using the size-3 ComPacket format. After compression, branch f is patched to reflect the new position of its target instruction. A special observation is that branches and calls are not allowed to participate in the dictionary, because they are PC-relative. The problem comes from the mapping between memory locations before and after compression. For instance, instruction f, B[-3] in Figure 3, has its displacement field changed after compression, and the new instruction B[-1] will probably not belong to the dictionary; in other situations B[-3] would become B[-2], and so on. Thus we discard branches/calls when building the dictionary.
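Below is a minimal sketch of this marking-and-packing pass, written as a per-position greedy that prefers 3-index packets (which reproduces the a-to-i example above). The instruction representation and the in_dictionary and is_branch_target predicates are placeholders, and branch patching is omitted.

```python
def compress_section(insns, in_dictionary, is_branch_target):
    """Group runs of dictionary instructions (mark D) into ComPackets of 3, then 2,
    allowing at most one branch target (mark T) per packet; everything else stays raw.
    Patching branch displacements to the new addresses is a later, separate pass."""
    out, i = [], 0
    while i < len(insns):
        for width in (3, 2):                       # prefer size-3 ComPackets
            group = insns[i:i + width]
            if (len(group) == width
                    and all(in_dictionary(x) for x in group)
                    and sum(is_branch_target(x) for x in group) <= 1):
                out.append(("compacket", group))   # becomes one 32-bit word
                i += width
                break
        else:
            out.append(("raw", insns[i]))          # kept uncompressed
            i += 1
    return out
```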

Figure 2: ComPacket format (32 bits: a 5-bit sentinel encoding op = 00 and op2 = 000, the unimp opcode, followed by the S and TT flag bits and three 8-bit index slots A, B, and C; S = 0 means the ComPacket contains 3 indices, S = 1 means 2 indices; TT indicates which index slot is a branch target).

Figure 3: Compression algorithm example for instructions a to i, showing the (D) and (T) marks, the resulting size-2 and size-3 ComPackets, and branch B[-3] being patched to B[-1] after compression.

Table 1: Compression ratio

Benchmark    Original Size   SD (%)             DD (%)             UniD (%)
             (Bytes)         w/o Dict  w/ Dict  w/o Dict  w/ Dict  w/o Dict  w/ Dict
Search       48,016          86        88       90        92       87        89
Pegwit       88,160          88        89       97        98       89        90
Djpeg        125,232         86        87       98        99       88        89
Cjpeg        112,400         86        87       94        95       88        89
Adpcm_enc    9,136           55        66       67        78       56        67
Adpcm_dec    9,104           55        66       61        72       56        67
Dijkstra     72,592          58        60       95        96       60        61
Average      -               73        78       86        90       75        79
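For clarity, the ratios in Table 1 follow the definition given in Section 5 (compressed size over original size). The sketch below shows how the "w/ Dict" columns can fold in the dictionary overhead, assuming a 256-entry, 4-bytes-per-entry dictionary (1,024 bytes); the example compressed size is hypothetical.

```python
def compression_ratios(compressed_bytes, original_bytes, dict_entries=256, entry_bytes=4):
    """Return (ratio without dictionary overhead, ratio with dictionary overhead)."""
    without_dict = compressed_bytes / original_bytes
    with_dict = (compressed_bytes + dict_entries * entry_bytes) / original_bytes
    return without_dict, with_dict

# With Adpcm_enc-like numbers (compressed size chosen for illustration):
# compression_ratios(5_025, 9_136) -> roughly (0.55, 0.66), matching the SD columns.
```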


5. Results

The platform for our experiments uses a simulator of the Leon processor (SPARC v8) [7] developed in our labs. The simulator has an interface to the DineroIV cache evaluation tool [8], which we use to measure cache misses. The benchmarks are extracted from MiBench [9] and Mediabench [10]: Search, a string search algorithm commonly used in office suites; Pegwit, an encryption tool; Djpeg and Cjpeg, which decompress and compress images from and to the JPEG format; Adpcm, which encodes or decodes audio; and Dijkstra, an algorithm used in network routers.

Table 1 presents the results for compressing the .text section of the object code generated by LECCS (with the -O2 option), a GCC-based compiler for the Leon. We report the compression ratio as the final compressed size over the original size, and we also explicitly account for the overhead caused by the dictionary, since it is part of the memory footprint of each program. First, we ran the compressor with the static dictionary (SD column). This is expected to give good compression, since it is based on static instruction counts; the compression ratio is on average 73% and 78% without and with the dictionary overhead, respectively. We then repeated the same experiments using the dynamic-profile-based dictionary (DD column); the results show compression ratios of 86% and 90%. Finally, we used our unified dictionary (UniD column) and obtained results very close to those of the SD-based method, reaching compression ratios of 75% and 79%.

As a side effect of compressing code in a PDC architecture, we measured the decrease in cache misses caused by the increased spatial locality obtained when caching compressed instructions. We ran simulations for cache sizes from 64 bytes to 32 Kbytes for all benchmarks and found the point beyond which an increase in size would not bring much benefit to cache performance; this size is a 128-byte instruction cache. Table 2 shows the miss ratio of the benchmarks for this cache size in the three scenarios: SD-, DD-, and UniD-based compression. The best miss ratio, 40% of the original, is obtained with the DD. Our UniD performs almost as well, with a reduction to 42%, while the static dictionary achieves a less significant reduction. Thus, by mixing instructions from the SD and the DD we obtain better compression together with a smaller miss ratio, thereby reducing power consumption.

Energy is also affected by bit toggles on buses (the Hamming distance between dynamically adjacent instructions and adjacent addresses), so we measured this parameter as well. Table 3 reports the total percentage reduction, including toggles on both the code and address buses; a reduction of more than 30% is reached by using either the DD or the UniD.

Finally, we ran experiments in which the last word fetched by the decompressor engine is kept in a local buffer, so that no new access to the cache is necessary when the next instruction is part of the same ComPacket.

As this buffer is a simple 32-bit register, its area impact is very small, while it saves considerable energy by minimizing bus activity and leaving the cache unused for 2 or 3 cycles. The results of these experiments are shown in Figure 4, where we again see that the blended dictionary is a better choice than the SD-based method while keeping the benefits obtained with the DD. The results are normalized to the original value and show that 31% of the cache accesses are avoided by using the UniD dictionary.

One might correctly observe that the processor cycle count is also affected by compression. By placing a decompression engine in front of the Instruction Fetch stage of the processor pipeline we incur a one-cycle penalty on every taken branch, but this can be compensated by the reduction in miss ratio. In fact, this is a very sensitive parameter, depending heavily on the cache miss penalty. With a two-cycle miss penalty, we observed an increase of 5% in cycle count for the SD and decreases of 7% and 6% for the DD and the UniD, respectively. The best result reaches a 29% reduction if the penalty is 10 cycles.
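The buffering experiment can be pictured with the hedged sketch below: a one-word buffer holds the last ComPacket read from the cache, so consecutive indices from the same packet are served without a new cache access. The cache interface, the dictionary, and the slot layout (reusing the assumption from the earlier ComPacket sketch) are placeholders.

```python
class Decompressor:
    """Toy model of the PDC decompressor with a single-ComPacket local buffer."""

    def __init__(self, read_cache_word, dictionary):
        self.read_cache_word = read_cache_word   # callable: fetch one 32-bit word
        self.dictionary = dictionary             # 256 original instruction words
        self.buffer_addr = None
        self.buffer_word = None
        self.cache_accesses = 0

    def fetch(self, addr, slot):
        """Return the original instruction for index slot 0/1/2 of the ComPacket at
        `addr`, reusing the buffered word when the address has not changed."""
        if addr != self.buffer_addr:
            self.buffer_word = self.read_cache_word(addr)
            self.buffer_addr = addr
            self.cache_accesses += 1
        index = (self.buffer_word >> (16 - 8 * slot)) & 0xFF   # slots A, B, C
        return self.dictionary[index]
```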

Table 2: Miss ratio

Benchmark    Instructions Executed   Orig. (%)   SD (%)   DD (%)   UniD (%)
Search       8,070,065               15.8        15.9     10.8     12.4
Pegwit       32,976,116              9.9         9.3      7.3      8.1
Djpeg        3,707,977               13.7        13.2     8.3      11.7
Cjpeg        15,070,171              13.3        13.6     7.0      7.1
Adpcm_enc    9,527,331               24.8        15.5     6.2      3.1
Adpcm_dec    7,091,219               18.8        20.8     0.0      0.0
Dijkstra     52,871,103              8.2         8.1      1.7      1.9
Average      -                       14.9        13.8     5.9      6.3

Table 3: Bus toggles

Benchmark    Original Toggles   SD (% orig.)   DD (% orig.)   UniD (% orig.)
Search       111,361,609        97             72             76
Pegwit       409,946,579        99             67             70
Djpeg        45,875,901         99             59             73
Cjpeg        194,645,626        99             75             79
Adpcm_enc    126,567,600        88             60             64
Adpcm_dec    96,667,537         97             59             65
Dijkstra     665,188,662        99             62             66
Average      -                  97             65             70
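As a reading aid for Table 3, the sketch below shows the toggle metric we refer to: the Hamming distance between consecutive words observed on a bus. How the instruction and address streams are obtained from the simulator is not shown.

```python
def bus_toggles(words):
    """Total Hamming distance between consecutive 32-bit words seen on a bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

# Toggles are accumulated separately for the code bus (instruction words) and the
# address bus, then added; Table 3 reports the sum as a percentage of the original.
```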

6. Conclusions

We have presented a new approach for combining the best characteristics of multiple dictionary-based compression methods. The unified dictionary, UniD, is as effective at compression as it is at improving performance. One distinguishing benefit of the UniD is that it may be used by any dictionary-based compression method. We also presented a cache- and processor-independent compression algorithm that produces a best compression ratio of 55% for Adpcm and averages 73% over our set of benchmarks; with the unified dictionary these numbers become 56% and 75%, respectively. Cache misses are reduced to 40% of the original ratio on average, or to 42% when using the UniD. The number of cache accesses is also reduced by 31%, similar to the reduction in bus toggles.

Figure 4: Cache accesses per benchmark and on average, as a percentage of the original bus accesses to the cache (on average, SD reaches 94%, DD 68%, and UniD 69% of the original accesses).

7. References

[1] A. Wolfe and A. Chanin. Executing compressed programs on an embedded RISC architecture. Proc. of the Int'l Symposium on Microarchitecture, pp. 81-91, Dec. 1992.
[2] L. Benini, A. Macii and A. Nannarelli. Cached-code compression for energy minimization in embedded processors. Proc. of ISLPED'01, pp. 322-327, Aug. 2001.
[3] H. Lekatsas, J. Henkel and W. Wolf. Design and simulation of a pipelined decompression architecture for embedded systems. Proc. of ISSS'01, pp. 63-68, Oct. 2001.
[4] S. Debray and W. Evans. Profile-guided code compression. Proc. of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 95-105, Jun. 2002.

[5] Y. Xie, W. Wolf and H. Lekatsas. Profile-driven selective code compression. Proc. of DATE'03, pp. 462-467, Mar. 2003.
[6] H. Lekatsas, J. Henkel and V. Jakkula. Design of one-cycle decompression hardware for performance increase in embedded systems. Proc. of DAC'02, pp. 34-39, Jun. 2002.
[7] G. Gaisler. (2003, May) Leon [Online]. Available: http://www.gaisler.com
[8] M. D. Hill. (2003, May) DineroIV trace-driven simulator [Online]. Available: http://www.cs.wisc.edu/~markhill/DineroIV
[9] M. Guthaus, J. Ringenberg, D. Ernst, T. Austin, T. Mudge and R. Brown. MiBench: a free, commercially representative embedded benchmark suite. Proc. of the IEEE 4th Annual Workshop on Workload Characterization, Dec. 2001.
[10] C. Lee, M. Potkonjak and W. Mangione-Smith. MediaBench: a tool for evaluating and synthesizing multimedia and communications systems. Proc. of MICRO-30, pp. 330-337, Dec. 1997.
