Code Compression and Decompression for Coarse-Grain Reconfigurable Architectures
Nazish Aslam, Mark John Milward, Ahmet Teyfik Erdogan, and Tughrul Arslan, Senior Member, IEEE
Abstract—This paper presents a code compression and on-the-fly decompression scheme suitable for coarse-grain reconfigurable technologies. These systems pose further challenges by having an order of magnitude higher memory requirement, due to much wider instruction words than typical VLIW/TTA architectures. Current compression schemes are evaluated. A highly efficient and novel dictionary-based lossless compression technique is implemented and compared against a previous implementation for a reconfigurable system. This paper examines several conflicting design parameters, such as the compression ratio, silicon area, latency, and power consumption. Compression ratios in the range of 0.32 to 0.44 are recorded with the proposed scheme for a given set of test programs. With these test programs, a 60% overall silicon area saving is achieved, even after the decompressor hardware overhead is taken into account. The proposed technique may be applied to any architecture which shares the common characteristics of the example reconfigurable architecture targeted in this paper.

Index Terms—Data compression, memory architecture, memory management.
I. INTRODUCTION
TRADITIONALLY, algorithm functions are either statically realized in hardware with application specific integrated circuits (ASICs) or temporarily run on general-purpose processors (GPPs). These two cases form the boundaries of a large space of design exploration for possible ways of computing on silicon. The center ground of this space is filled by configurable and reconfigurable computing devices, with field-programmable gate arrays (FPGAs) being a well-known, commercially successful example of the former. Reconfigurable computing technology aims to combine the performance of ASICs with the programmability/flexibility found in digital signal processing (DSP) processors/GPPs in a unified and easy programming environment. Over the last few years, many reconfigurable architectures have been proposed
and developed both in industry and academia, such as Matrix [1], Garp [2], Elixent [3], PACT XPP [4], and SiliconHive [5]. Unlike FPGAs, reconfigurable cores allow the internal logic of the silicon to change during run-time. This flexibility and high performance come from a large number of parallel processing units with their associated interconnects, which in turn require a wide instruction fetch mechanism to concurrently supply multiple instructions from memory to the parallel processing units. As a consequence, these cores suffer from excessively large program memories as well as very wide instruction fetch bitwidths/program buses.

Code compression has played a crucial role in tackling such issues by reducing the amount of information needed to represent the code. A lot of research has gone into code compression for embedded systems, particularly for reduced instruction set computing (RISC)-based architectures [6]–[8], to reduce the silicon area occupied by the program memory and consequently reduce the power consumption. However, code compression for multiple-issue architectures faces extra challenges compared with single-issue architectures, because a very wide instruction word must be decompressed quickly enough not to compromise the speed of the processor. At the same time, a reduced instruction fetch bitwidth is also desired to minimize wiring congestion and power consumption.

This paper presents a code compression and decompression scheme for a coarse-grained reconfigurable architecture [9]. The reconfigurable architecture offers a very high number of parallel processing units and thus has an ultra-wide instruction width. It is dynamically reconfigurable, thus it requires the ability to store many configuration codes in memory which program the processing units for a particular moment in time. Since reconfigurable computing is a relatively new technology, code compression techniques targeting multiple-issue architectures are currently only found for very long instruction word (VLIW) and transport triggered architecture (TTA) processors in the literature. The effectiveness of such existing code compression techniques for the target coarse-grain reconfigurable architecture has already been analyzed in the authors' previous work [10], [11]. A suitable VLIW processor code compression scheme was implemented for the target reconfigurable architecture; this scheme will now be referred to as Compression Scheme 1 (CS1). The performance of this technique was compared to the authors' previously proposed unit-grouping code compression scheme specifically created to target coarse-grain reconfigurable cores; this technique will be referred to as Compression Scheme 2 (CS2). A selection of test programs was used to evaluate both schemes and a comprehensive comparison was performed.
The asymmetric nature of code compression allows the compressor to be made as complex and computationally intensive as required, since compression is performed only once at compile time. However, since the decompressor has a direct effect on the targeted coarse-grain reconfigurable processor's performance, the decompression hardware should be kept as small and simple as possible to minimize the area overhead and latency. By applying code compression, there is consequently an inevitable delay between the program memory and the reconfigurable processor; hence, the aim is to minimize this delay so that it does not become the speed bottleneck for the actual processor. Furthermore, it is desirable to minimize the overall area taken up by the decompressor logic so that the benefits achieved by performing compression, and the sacrifice made in terms of extra latency, are not lost to a large decoder hardware.

It is well recognized that higher compressions and smaller decoders may be achieved if a code compressor and decompressor are customized to an available set of program codes. This is, however, not an option for the target reconfigurable architecture, as for conventional processors, since it should be capable of running several programs, and those programs may change in the future via downloadable software upgrades. Thus, a generic design is needed which could give reasonable compressions across most programs, and would be easily scalable to other generations of the core with an increased number of functional units. In this paper, CS2 is rigorously analyzed and a substantially improved scheme is derived which makes use of new techniques to yield much improved compressions, yet at a significantly lowered silicon cost. The contributions of this paper are as follows:
• improvement of the dictionary selection scheme compared to previous approaches, by observing that higher order bits do not change as frequently as lower order bits in sub-instructions targeting some of the units;
• changed unit-groupings and codeword construction to generate a bitstream which is compression friendly;
• allowing a mixture of compressed and uncompressed codes for sub-instructions within a wide instruction, providing a partial dictionary bypassing mechanism;
• an efficient decompression engine design, primarily due to smaller dictionaries used for only the less frequently changing higher order bits;
• retaining the generality of the decompressor engine to ensure that a broad range of compressed programs can be downloaded and run on the target reconfigurable processor.
This paper is organized as follows. Section II gives an overview of general reconfigurable architectures suitable for using the proposed compression scheme, and later gives specific details about the motivational example reconfigurable architecture. Section III reviews related work on code compression and analyzes its suitability for the targeted architecture. The authors' previously published work on code compression for a coarse-grain reconfigurable architecture is summarized in Section IV. This section also briefly describes the experimental setup. Section V performs an in-depth analysis of the previous code compression scheme targeting coarse-grain reconfigurable architectures. Section VI presents the newly
Fig. 1. Rigid and flexible instruction formats.
proposed code compression scheme which has evolved from the previous work to offer significant improvements. Section VII gives the experimental results for the new compression scheme and performs a comparison with the previous work. Finally, Section VIII concludes this paper.

II. TARGET ARCHITECTURE

A. Targeted Reconfigurable Architectures

The code compression techniques introduced in this paper for coarse-grain reconfigurable architectures can be partially or wholly applied to any architecture which possesses the common characteristics described next. The target architecture is expected to have a high number of independent functional units running concurrently. Each of the units may perform a simple computation or a set of complex computations. All the individual units are expected to be appropriately configured at each execution cycle, where this configuration is performed by a large configuration stream, or a wide instruction, fetched from the program memory. All units should be configured concurrently by their corresponding sub-instructions.

The targeted reconfigurable architectures are expected to have rigid instruction word formats similar to traditional VLIWs, rather than modern VLIWs with flexible instructions [12], i.e., the individual sub-instruction positions within a wide instruction directly correspond to specific functional units; thus, their code is less dense due to many inactive units (see Fig. 1). For the rigid instruction format, the sub-instruction sizes can vary depending upon what type of functional unit they are targeting. For example, some sub-instructions may have 4 bits to configure functional units performing additions, while others may be 9 bits for multiplier units, 24 bits for logic units, etc. Even though the sub-instruction sizes may vary, the size of the wide instruction is expected to always remain fixed.

Furthermore, the coarse-grain core should be dynamically reconfigurable, i.e., the core should be able to reconfigure itself during run-time to execute general-purpose programs efficiently. This is important as the proposed compression scheme
Fig. 2. Traditional VLIW and target core instruction formats. (a) Traditional VLIW instruction format. (b) Target reconfigurable architecture’s instruction format.
has been constructed to ensure a generic design which allows the execution of newly downloaded programs, with no restrictions on the size or types of programs.

B. Motivational Example Reconfigurable Architecture

The recently developed industrial reconfigurable instruction cell based architecture [9] was targeted for applying the proposed code compression techniques. The architecture is able to provide dynamic hardware reconfigurability and a high throughput. It uses a custom-built scheduler to effectively extract instruction-level parallelism from general-purpose high-level language codes [13]. The architecture consists of an array of heterogeneous functional units, where the number and type of these units are parameterizable for the application. For this paper, an architecture with 64 functional units was chosen, although the target architecture could have severalfold more units if desired. This implies significantly more processing units than other existing non-reconfigurable multiple-issue architectures; e.g., VLIWs typically have up to 12 processing units. Furthermore, in a VLIW, each processing unit can perform a series of different functions like an arithmetic logic unit (ALU), whereas the units of the target reconfigurable architecture are more special purpose, such as multiplier, divider, shifter, adder, logic, etc.

Each unit, depending upon its type, can have a varying number of configuration bits associated with it. Each processing unit performs a specific subset of primitive operations, and the associated configuration bits are used to configure that unit's operation appropriately at any moment in time. For the target 64-unit reconfigurable architecture, a unit can have anywhere from 3 up to 32 configuration sub-instruction bits depending upon its type. The number and type of functional units can be chosen as needed but become fixed after fabrication of the chip. Thereafter, the values represented by the configuration sub-instructions can be parameterized to adjust the behavior of the corresponding unit at any given time.

Fig. 2 shows the typical instruction formats for a traditional VLIW processor and the targeted reconfigurable architecture. As mentioned previously, both have rigid instruction formats,
but the reconfigurable architecture instructions do not show the operand fields. This is because the target system is based on the Harvard architecture for accessing memory. This implies that there are two independent memories: one for data (operands) and the other for instructions (opcodes). The code compression in this paper does not concern itself with the operands, which are stored in the data memory. It focuses on compressing the functional units' configuration instructions, whereas the routing interconnects' configuration instructions are handled independently due to their different redundancy characteristics, as well as to keep the architecture modular.

Currently, the target reconfigurable architecture does not employ an instruction cache. Thus, the decompressor will be placed between the program memory and the reconfigurable architecture. As decompression will take place for each instruction fetch, the decompressor becomes a critical part of the instruction execution pipeline and therefore has to be made as fast as possible. This is like having a post-cache decompression architecture, as opposed to a pre-cache one, which can tolerate a slower decompressor as decompression only takes place when a cache miss occurs [14]. A post-cache architecture is known to be advantageous over a pre-cache one, as it gives higher power and memory savings.

The reconfigurable architecture permits the mapping of both dependent and independent datapaths for execution over multiple clock cycles. Given that the decompressor is synthesized to run at the maximum clock frequency of the core, the latency associated with subsequent instruction fetch and decode can be hidden. However, a high latency penalty is observed when a jump is required. This penalty may be reduced by employing a look-ahead technique to either statically or dynamically predict a jump, although this has not been done for the work presented in this paper.

The main program memory for the target reconfigurable architecture holds the compressed versions of an assortment of programs to be executed on the reconfigurable system. This memory may be modified or added to while the reconfigurable system is in use post-manufacture.

C. Typical Code Structure and Redundancies

For the 64-unit version of the reconfigurable architecture used in this paper, each wide instruction is fixed at 474 bits; these wide instructions will now be referred to as steps. Each step can have a maximum of one jump sub-instruction. A jump (a.k.a. branch) instruction is used to change the flow of the program. Examining a typical program code, poor code density is obvious. Most steps contain "No Operation" (nop) instructions for inactive units, similar to traditional VLIW processors. Units are inactive if they are not utilized in a given step, and this can occur frequently due to inter-instruction dependencies or due to having excessive resources. They can be identified by their all-zero configuration bits. This is labeled spatial redundancy (see Fig. 3). The second type of redundancy apparent is the repetition of configuration settings for a given unit several times throughout the lifetime of a program. This is labeled temporal redundancy.

Code profiling revealed a very frequent usage of units expecting large configuration sub-instructions, thus any
Fig. 3. Typical code redundancies.
compression achievable on these is desirable. There are other forms of redundancy present in the program codes. However, those redundancies are either found to be more specific to a given program and cannot be generalized, or occur so infrequently that the overhead of tackling them outweighs the advantage gained by removing them. For example, there are some sub-instruction values which are continuously used in subsequent steps of a program. A special control bit to indicate that the next step will make use of the same sub-instruction(s) as the current step could easily be implemented. This would mean that the particular sub-instruction(s) need not be retransmitted in the following step. The decompressor would only require the addition of a gated reset for some of the design registers. However, the penalty is that of including an extra bit per codeword to indicate whether the sub-instruction is repeated in the following step or not. This extra bit per codeword is only justified if the trait of repeated sub-instructions in consecutive steps is observed frequently, which was found not to be the case.

III. RELATED WORK

Different lossless compression schemes can be found in abundance in the literature. However, it is important to note that code compression has different requirements from other forms of lossless data compression; thus, the same compression schemes cannot be directly applied, though some ideas may be borrowed. Many well-known data compression schemes provide very good compression ratios, but they typically decompress files from beginning to end in a very sequential manner. This is not feasible for code compression, which normally requires each compressed instruction to be encoded such that its decompression and execution can be done independently, without waiting for the subsequent instructions to be decoded, as otherwise an unacceptable time delay will be introduced. Also, programs require the ability to make conditional jumps to new locations within the code. Whether or not a jump is taken depends directly upon how the condition is evaluated at execution. This in turn mandates the previous requirement of decompressing and executing individual instructions immediately without waiting for subsequent instructions to decode, as that effort may be wasted if a jump is taken.

This paper measures compression in terms of the widely used metric of compression ratio (a.k.a. compression efficiency), which is defined as the total size of the compressed program plus the dictionary bits, divided by the original program size (expressed formally below). A lower ratio implies better compression. The dictionary bits in this equation represent the overhead of initializing the dictionaries with the appropriate content. The two general categories of lossless code compression are statistical- and dictionary-based, and they are presented next.
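For reference, the compression ratio just defined can be written as follows; the numbers in the worked example are purely illustrative and are not taken from the reported results.

\[
\mathrm{CR} = \frac{\text{compressed program bits} + \text{dictionary initialization bits}}{\text{original program bits}}
\]

For instance, a hypothetical 10 000-bit program that compresses to 3 000 bits and needs 500 bits of dictionary initialization gives CR = (3000 + 500)/10000 = 0.35, so the dictionary overhead is counted against the scheme.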
A. Statistical Methods

Statistical compression extracts statistical information from the data and uses this information to perform the compression. These techniques usually require the upfront availability of the programs to be compressed during the design of the compression technique, and the corresponding decompressor hardware design is usually made very specific to the currently available programs. This specificity, however, does result in good compression ratios and low decoder area overhead. Many statistical methods result in variable-length codewords. Thus, it becomes necessary to first establish the range of bits for the next instruction, and only then can the extraction and decompression start for that particular instruction. This becomes a very serial operation, and following instructions cannot be decoded until the prior ones have been, increasing the overall processing latency. Only recently has a Tunstall-based variable-to-fixed code compression scheme been implemented by Xie et al. [12] that can perform parallel decode, thus making it suitable for multiple-issue post-cache architectures. This scheme provides modest compression ratios of around 0.825. Nevertheless, this scheme is targeted towards modern VLIWs, whose dense codes are more difficult to compress than those of traditional VLIWs. Statistical methods perform better than dictionary-based methods for dense codes [12]. Other statistical compression methods which are also targeted at modern VLIWs include the work of Larin and Conte [15], who use Huffman coding to compress instructions in three different ways, allowing varying degrees of trade-off between the compressed program size and the decompressor size. Further work is done by Xie et al. [16], who have used arithmetic coding with a Markov model and report compression ratios of 0.673 to 0.697.

B. Dictionary-Based Methods

Dictionary-based schemes compile dictionaries of frequent instructions found in a program and replace those instructions with the corresponding dictionary index. These schemes normally result in poorer compression than statistical methods but tend to give faster and simpler decompression logic. Dictionary-based methods may be static, adaptive, or semi-adaptive. In static methods, the dictionaries are assumed to be predetermined and pre-initialized in ROM. Static methods are normally applied to systems which have a known fixed set of programs running on them, and those programs are available during decoder design time (see [17] for an example). On the other hand, adaptive methods build up their dictionary content at run-time during decompression from the compressed program. Such techniques do not require the storage of any dictionary initialization bits in the program memory, and allow new programs to be compressed and run on the target system. However, good compression is only achievable if the compression unit is sufficiently large. This is not very suitable for code compression, since random access into the program memory is required to be able to jump to new locations when the program flow changes, and to facilitate this, small compression units are necessary. Nonetheless, [18] provides an example. Semi-adaptive dictionary methods are the most common approach, where the dictionaries are initialized with appropriate content just before the program utilizing
it is decompressed by the decoder. The dictionary initialization information is extracted from the original program before performing compression and is stored with the compressed program in the program memory. Such schemes are suitable for allowing new programs to be run on the target system after the decoder design.

Traditional VLIWs only have dictionary-based code compression methods in the literature. Nam et al. [19] propose a semi-adaptive dictionary-based compression scheme for traditional VLIWs, and use isomorphism to create two dictionaries, one for storing operations and the other for operands. Only the frequently occurring instruction words are compressed, while the other instructions bypass the dictionaries. Their compression becomes worse when the number of functional units increases from 4 to 12. Compression ratios from 0.63 to 0.71 are reported, although no discussion on silicon area or decoding speed of the algorithm is found due to the lack of a decoder implementation. This technique cannot be applied to our target reconfigurable architecture since its instruction format comprises only opcodes (configuration values) and no operand fields.

Ishiura and Yamaguchi [17] also use dictionaries, where they apply automatic field partitioning to partition instructions into smaller bit-sets to keep the corresponding dictionaries small. They report very good compression ratios of between 0.46 and 0.60. This is a static technique and expects the upfront availability of the program codes which will be run on the processor. Only once the best compression based on the automatic field partitioning concept has been found can the decompressor be designed, making it very customized to the available programs.

Further dictionary-based compression schemes are provided by Ros and Sutton [20], who investigate compression at three levels of instruction granularity. Their most efficient compression scheme, using instruction factorization, gave an average compression ratio of 0.683. However, it requires sequential decompression of instructions. This technique exploits the opcode/operand instruction format of VLIWs, which is not available for the target architecture. Other design parameters such as area or latency are not shown, and furthermore, dictionary initialization bits are not added to the compression ratio calculation, which would worsen the reported ratios. Their second reported scheme, which performs dictionary compression on whole instructions, gives the lowest delay for retrieving the original code, though at the cost of a worsened compression ratio of 0.815. Nevertheless, as their work is actually targeted towards modern VLIWs, this high ratio is understandable. This technique has been applied to the target reconfigurable architecture in the authors' previous work [10], and the results are summarized in Section IV of this paper. Their third technique, performing operand factorization on instruction words, is very similar to the work of Nam et al. [19], which again exploits the opcode/operand nature of instruction formats, and allows parallel decompression, but at the expense of a worsened compression ratio of 0.847.

IV. PREVIOUS COMPRESSION SCHEME IMPLEMENTATIONS

The authors' previous work undertaken in [10] and [11] looked at implementing an existing code compression scheme for
the target reconfigurable architecture. Due to the lack of code compression schemes specifically targeting coarse-grain reconfigurable architectures, a multiple-issue VLIW processor code compression scheme was applied (CS1) [10]. The results were recorded and compared against the authors' previously proposed unit-grouping code compression scheme (CS2) [10], [11]. It was observed that even though there are existing multiple-issue code compression techniques in the literature, many of the schemes are not suitable for the target reconfigurable architecture due to its different architectural characteristics. Both of these code compression schemes are semi-adaptive dictionary-based and are briefly summarized next.

A. Compression Scheme 1 (CS1)

Spatial redundancy is removed by eliminating all the nop sub-instructions for inactive units from a given step. Special reserved tags are added to identify the unit for which a sub-instruction is intended, and to also signify the completion of a step. Next, the temporal redundancy is removed by using the common technique of applying an independent dictionary for each functional unit. Dictionaries are compiled by recording all the unique sub-instructions used to configure the corresponding unit. Then the entire program code is scanned and the sub-instructions for each unit are replaced by the corresponding dictionary's appropriate index bits.

The dictionaries are implemented as SRAMs. They also serve the dual purpose of acting as a local instruction memory for short uncompressed programs, if desired (<1025 steps). In other words, the decoder can operate in two modes. In one mode, it works as a standard decoder decompressing the compressed programs stored in program memory. In the other mode, the dictionaries can be used as a local (uncompressed) program memory for short programs. Such uncompressed programs would be downloaded into the main program memory in an uncompressed form, and would simply be copied into this local memory at run-time (start up) to speed up future fetches. In this case, the rest of the decompressor logic is disabled and only the dictionaries are active. Consequently, this enables the decompression to be completely bypassed in order to avoid the mandatory latency introduced by the decoder in the path of instruction fetch and execution. The capability of initializing all the dictionaries is already implemented in the decoder.

For the target reconfigurable architecture, 64 independent dictionaries were needed to serve the 64 concurrently running functional units. With this scheme, several codewords can be decoded in parallel. The efficient parallel decode of several codewords is only possible by keeping all the codewords at the same fixed size. Based on the outcome of some code profiling, the dictionary sizes were fixed to 1024 words each. Dictionaries of 1024 words were deemed to have a low probability of overflow for future downloaded programs. An overflow can occur if the dictionary capacity is insufficient to hold all the unique instruction entries for a given program. Each codeword was fixed to 16 bits, as shown in Fig. 4.

B. Compression Scheme 2 (CS2)

This technique is based on the unit-grouping scheme first introduced in [10] by the authors and later improved in [11]. The
TABLE I PERFORMANCE COMPARISON OF PREVIOUS IMPLEMENTATIONS
Fig. 4. Fixed size 16-bit codeword.
Fig. 5. Fixed size 19-bit codeword.
summary of the code compression scheme given here is that of the improved version. In this compression scheme, the temporal redundancy is dealt with first. The main difference from CS1 is that a dictionary is assigned to a group of sub-instructions. A group may be made up of one large sub-instruction, or a set of several sub-instructions corresponding to several units. For the implementation on the target reconfigurable architecture, a group contains up to four individual sub-instructions. The dictionary for each group is compiled from entries of grouped sub-instructions in which at least one of the sub-instructions is active and gives a unique entry. Then the entire program code is scanned and all grouped sub-instructions appearing in the corresponding dictionary are replaced by the appropriate dictionary index.

The dictionary depth for all the groups was determined by the maximum size of dictionary required for the widest individual sub-instruction. In the target 64-unit reconfigurable architecture, the largest individual sub-instruction is 32 bits wide. After some code profiling, a 1024-word dictionary was deemed reasonably sufficient for it. Thus, the remaining units were grouped together to ensure that a 1024-word dictionary would provide sufficient capacity for all the unique sub-instructions of a single test program, with a low probability of an overflow occurrence. Performing the unit-grouping resulted in 25 groups, and thus 25 independent dictionaries. These dictionaries also serve the dual purpose of acting as a local memory for short programs wanting to bypass the code compression scheme.

As discussed in [11], more effective use of the limited dictionary space can be made for grouped unit sub-instructions by making use of special control bits in the compressed codewords. These special control bits allow one dictionary location to hold multiple different sub-instructions, thus increasing the effective capacity of the dictionaries without incurring any silicon area cost. In this version of the design for the target reconfigurable architecture, four control bits were used per codeword, which means that one dictionary location can effectively hold up to 16 unique values. The spatial redundancy is then removed by eliminating all grouped sub-instructions which are made up only of nop instructions. Eight codewords were chosen to be decoded in parallel with this scheme. All the codewords were made of an equal length of 19 bits, as shown in Fig. 5.
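To illustrate the semi-adaptive, fixed-size-codeword approach shared by CS1 and CS2, the C sketch below builds a per-group dictionary from a program and emits fixed-width codewords. It is a simplified model only; the structure names, the toy program, and the treatment of group tags are illustrative and are not the authors' actual compressor.

```c
#include <stdint.h>
#include <stdio.h>

#define DICT_DEPTH 1024            /* fixed dictionary depth, as in CS1/CS2 */

/* Hypothetical per-group dictionary used by the compressor model. */
typedef struct {
    uint32_t entry[DICT_DEPTH];    /* unique (grouped) sub-instruction values */
    int      count;                /* number of entries currently in use */
} group_dict_t;

/* Look up a value, adding it if it is new.  Returns the dictionary index,
 * or -1 on overflow (the case CS1/CS2 must avoid by sizing dictionaries). */
static int dict_index(group_dict_t *d, uint32_t value)
{
    for (int i = 0; i < d->count; i++)
        if (d->entry[i] == value)
            return i;
    if (d->count == DICT_DEPTH)
        return -1;                 /* overflow: no free location left */
    d->entry[d->count] = value;
    return d->count++;
}

/* Emit one fixed-size codeword: a group identification tag followed by the
 * dictionary index.  Only active (non-nop) grouped sub-instructions are
 * emitted, so the spatial redundancy of inactive units is removed. */
static void emit_codeword(unsigned group_tag, int index)
{
    printf("group %2u -> index %4d\n", group_tag, index);
}

int main(void)
{
    group_dict_t dict = { .count = 0 };

    /* Toy program: grouped sub-instruction values for one group over 5 steps. */
    uint32_t program[5] = { 0x012345, 0x0000FF, 0x012345, 0x0000FF, 0x0A0B0C };

    for (int step = 0; step < 5; step++) {
        if (program[step] == 0)            /* all-zero = nop, skip (spatial) */
            continue;
        int idx = dict_index(&dict, program[step]);
        if (idx < 0) { fprintf(stderr, "dictionary overflow\n"); return 1; }
        emit_codeword(1, idx);             /* temporal redundancy -> short index */
    }
    return 0;
}
```

Repeated values map to the same index, which is where the compression comes from; the dictionary contents themselves are what the decoder must be initialized with, i.e., the "dictionary bits" counted in the compression ratio.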
DIK = time required to read a location from a 1024-word dictionary.

C. Experimental Setup

To devise a compression scheme, some test programs for the target reconfigurable architecture were obtained and statically analyzed. The compression of program codes is performed once at compile time before the code is downloaded onto the reconfigurable architecture's program memory. The code compressors have been implemented in the C programming language. The compressor takes a compiled program as its input and outputs a compressed version, along with any dictionary initialization information.

The test programs used for the code profiling and compression scheme evaluations are all DSP applications, namely minimum error, 2-D DCT, WiMax, and H.264. The number of steps representing the programs for execution on the target reconfigurable architecture is 25 for minimum error, 35 for 2-D DCT, 432 for WiMax, and 19921 for H.264. WiMax is a universal technology for wireless broadband internet access and is based on the IEEE 802.16 standards. H.264 is a standard for performing video compression; the H.264 code used is based on the open source FFmpeg. DCT and minimum error are small modules commonly found in many DSP algorithms, like JPEG or MP3. The selection of these test programs fairly represents the type of compute-intensive algorithms which may be run on the target architecture.

All the decompressor designs have been implemented using Verilog HDL and synthesized under normal operating conditions onto UMC 0.13-μm technology using Synplify ASIC. Any large dictionaries in the decoder are implemented as SRAMs, which are automatically generated using Virtual Silicon's Memory Compiler. The power consumption of the designs is estimated using datasheets for the individual SRAMs, and a power estimation tool from Synopsys called Power Compiler for the remaining decoder logic.

D. Comparison of CS1 and CS2

CS1 was implemented for the 64-unit reconfigurable architecture and compared against CS2, which was specifically created for coarse-grain reconfigurable architectures. The results of the comparison are summarized in Table I.

Fig. 6 shows the compression ratios observed for the various test programs with both compression schemes. Note that the
Fig. 7. Dictionary utilization.
Fig. 6. Compressions achieved with test programs.
four test program sizes are normalized. After applying the compression schemes, it was observed that the compressed program sizes are equivalent for a couple of the test programs, while for the 2-D DCT and the H.264, substantial compression improvements were observed with CS2. This occurs because these programs contain a higher level of parallelism, which was evident from the fact that on average there is a higher level of resource utilization per step. As noted in Table I, CS2 performs better in such a situation, as more steps within the program are either at or approaching the worst-case step compression scenario.

V. ANALYSIS OF CS2

Further silicon area savings may be achieved for CS2 by: 1) improving the compression ratios and 2) reducing the size of the decoder hardware. Reducing the area occupied by the decompressor hardware is desirable to justify the use of code compression for medium to short sized programs. The smaller the decoder, the sooner the area overhead introduced can be cancelled by the area savings achieved through performing compression on the program memory. Analyzing CS2, it was noted that the silicon area occupied by the dictionaries contributed over 95% of the total area taken up by the decompressor hardware. Thus, to obtain any significant area reductions for the decoder, the dictionaries need to be addressed.

A. Dictionary Utilization

In CS2, all the group dictionaries were kept at the same depth of 1024 words each. This section discusses why this was originally done and justifies why some of these dictionaries can be completely removed for CS3. 1024-word dictionaries were originally selected for three reasons. First, it allowed the dictionaries to serve a dual purpose of acting as a local memory for storing short uncompressed programs. Second, keeping all the dictionaries at the same fixed depth means the number of index bits for each dictionary is 10 bits. Thus, all the compressed codewords can easily be made of a fixed size, permitting the parallel decode of several codewords. Third, allowing a large enough dictionary for each group ensured that the chance of a dictionary overflow occurring for future downloaded programs was mitigated. This was necessary as CS2 does not allow the mixing of compressed and uncompressed codewords.
Analyzing the dictionary contents for CS2, it was observed that significant portions of the dictionaries remain empty for all the available test programs, whether those programs are large or small. Fig. 7 shows that only the groups 1–8 and group 24 dictionaries are heavily populated for the largest test program, H.264. The very low utilization of the available 1024-word dictionary capacity for some of the groups makes it hard to justify using them at that size. The data in Fig. 7 should not be mistaken to suggest that these group dictionaries are only used infrequently throughout the lifetime of a program; these groups may exhibit heavy temporal redundancy for their limited set of unique grouped sub-instructions, and thus access the same small set of dictionary locations very frequently. By making dictionaries shorter, there remains a risk of an overflow occurring for other programs which may be downloaded onto the program memory in the future and happen to use more unique sub-instructions from their broad range of possible values in a given single program.

Even though in CS2 the grouped dictionaries are all of the same depth, i.e., 1024 words each, the word widths vary. Thus, the areas occupied by these dictionaries differ significantly, since some dictionaries only have 12-bit words whereas others have up to 32-bit words. It is interesting to observe that all the dictionaries with wider word widths have high utilization, whereas the dictionaries with smaller word widths have low utilization. This suggests that larger groupings could be created so as to have a higher number of bits per group. However, doing this would worsen the compression ratio, as a lot more redundant information would have to be stored in the dictionaries, and more control bits would be needed per codeword.

Another option is to reduce the size of these dictionaries significantly. By doing this, the dictionaries can no longer be used as a local memory for short uncompressed programs. This also means that the dictionary index bits will now be reduced from 10 bits down to perhaps 5 or 6 bits. At first glance, this appears to be in favor of achieving better compression; however, this is not the case. Even though the number of index bits needed for smaller dictionaries is fewer, all the codewords are still required to be of the same fixed size in order to allow for the parallel decode of more than one codeword. This mandates the need to add extra "padding" bits to the grouped sub-instruction codewords with dictionary sizes smaller than 1024 words. Thus, no noticeable advantage in the compression ratio is achieved, but the silicon area occupied by the decompressor will be reduced.

Special control bits have been used to indicate which parts of the dictionary contents should be masked off in order to increase
the effective capacity of the dictionaries [11]. These control bits increase the size of the fixed-size codewords to gain this advantage. However, if the dictionary sizes are drastically reduced for some of the under-utilized dictionaries, then, due to the need to keep fixed-size codewords, the extra padding bits which are needed may be more effectively used to also serve this purpose. On further reflection, however, the purpose served by these special control bits for very short dictionaries is actually not worthwhile. Instead, placing the original grouped sub-instructions directly into the codeword is possible. Thus, with the original 19-bit codewords, 5 bits are used for unit identification, whereas the remaining 14 bits may now be used to store the original sub-instructions directly in the codeword. This means that the dictionaries associated with such groups may be completely removed. A minimum of 19-bit fixed codewords is still needed for the retained highly utilized dictionaries and their associated 4 control bits used to increase the effective capacity of the dictionaries, even though the number of these control bits may be varied as desired.

B. Dictionary Overflow

This section discusses the various techniques which may be used to mitigate or eliminate dictionary overflows, and why a dictionary bypassing mechanism is chosen for CS3 over other possible approaches. When using dictionaries, the instructions found in the original code are replaced by indices into the dictionary holding the instruction being replaced. Compression occurs as long as the index is smaller than the instruction being replaced. This implies that the capacity of the dictionary always has to be at most half the size of the whole instruction space in order to provide a minimum 1-bit saving. Hence, if new programs are to be decoded, it can never be guaranteed that a dictionary overflow will not occur. As the target reconfigurable architecture requires a generic code decompressor which can decode a broad range of programs, the dictionaries need to be made large enough to ensure that the chances of an overflow occurring in the future are rare.

It can be seen from Fig. 7 that the dictionary utilization for groups 1–8 is approaching the dictionary capacity of 1024 words with our largest test program, H.264. Thus, for future downloaded programs, there is a reasonable probability that an overflow might occur for one or more of these dictionaries if those programs are also very large and have more unique sub-instructions per group. One option is to increase the dictionary sizes to 2048 words, thus increasing the 10-bit dictionary index to 11 bits. This would mean that the fixed-size codewords for all the groups would need to increase by 1 bit in order to continue allowing for the parallel decode of several codewords. It may or may not worsen the compression ratio by much, as all the remaining units could be regrouped to make more effective use of the extra bit in a codeword. It would, however, certainly increase the silicon area of the decoder, increase the power consumption of the design, and would also take a longer time to read out a value from a deeper dictionary. Furthermore, if the dictionary sizes are increased for the selected highly utilized units, then it can be predicted that most often substantial portions of the dictionary would remain redundant.
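The half-capacity observation can be made precise. Replacing a w-bit field by a dictionary index saves at least one bit only if the index fits in w − 1 bits, which bounds the number of dictionary entries D:

\[
\lceil \log_2 D \rceil \le w - 1 \;\Longleftrightarrow\; D \le 2^{\,w-1},
\]

i.e., the dictionary can hold at most half of the $2^w$ possible field values. Any program that uses more distinct values than this must overflow, which is why a generic decompressor needs an overflow handling or bypass mechanism.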
Another option is to tweak the allocation process used for mapping the sub-instructions to functional units to ensure that it works more favorably with the compression scheme. Thus, in a given step, if there is a choice of more than one unit which may be used to execute a particular sub-instruction, then the unit which has previously already used the same sub-instruction should be picked over a unit which has never executed such an instruction before. This ensures that a new entry is not needlessly added to another dictionary if it is already available for use in the dictionary of another unit of the same type. This can also help reduce the number of initialization bits needed for initializing the dictionaries, and to some extent avoid duplicate sub-instruction initializations in separate dictionaries. This and some other techniques can improve the performance of the decompressor, but cannot be relied upon for preventing dictionary overflows.

An alternative option is to allow some sub-instructions to be stored directly in their original form in the codewords, while others may use the dictionary. This is a well-known dictionary bypassing technique implemented in various dictionary-based compression schemes. The implementation of a dictionary bypassing mechanism for a sequential decoder is easy; however, parallel codeword decoders face extra hurdles. This is because the codewords directly storing the original sub-instructions expand the size of the compressed codeword. Consequently, performing parallel decode of several instructions becomes challenging. The parallel decode dictionary bypass mechanism implemented in [19] performs a complete dictionary bypass for an entire step, since the opcode and operand dictionaries concern the complete wide instruction. The bypassing mechanism presented in this paper can perform a partial bypass of some dictionaries for some of the sub-instructions within a step, and is presented in Section VII.

C. Sub-Instruction Values

The contents of the groups 1–8 and group 24 dictionaries were analyzed using all the available test programs. Groups 1–8 consist of only one unit each, where each unit is of the same type and expects a 32-bit sub-instruction. Therefore, the dictionary contents of all of them were collated and analyzed together. It was observed that the nop sub-instruction was most frequently used, while the most active non-nop sub-instruction values are concentrated in a small region starting at 0 and in one other narrow region of the value range. From this information, it can be deduced that the least significant bits (LSbits) of a sub-instruction value are more dynamic, while the most significant bits (MSbits) are fairly static.

A further analysis was conducted to judge approximately how many LSbits were most frequently changing. This was done by analyzing the number of unique values left in each of the eight dictionaries if the dictionaries were only made up of the MSbits. The largest test program, H.264, was used. As shown in Fig. 8, if a certain number of LSbits are removed and placed directly in a compressed codeword, then the remaining number of unique MSbit entries is reduced significantly. This highlights the fact that most of the value changes occur in the lower order bits of the sub-instruction; thus, such frequently changing portions of the value are not very well suited to be placed in a dictionary.
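The analysis just described can be reproduced with a few lines of C. The sketch below counts how many distinct values remain when only the upper (32 − k) bits of each 32-bit sub-instruction are kept, for a chosen number k of stripped LSbits; the value list and helper names are illustrative, not the authors' profiling code.

```c
#include <stdint.h>
#include <stdio.h>

/* Count distinct MSbit patterns after discarding the k least significant bits.
 * A simple O(n^2) scan is enough for an offline profiling experiment. */
static int count_unique_msbits(const uint32_t *vals, int n, int k)
{
    int unique = 0;
    for (int i = 0; i < n; i++) {
        uint32_t msb_i = vals[i] >> k;
        int seen = 0;
        for (int j = 0; j < i; j++)
            if ((vals[j] >> k) == msb_i) { seen = 1; break; }
        if (!seen)
            unique++;
    }
    return unique;
}

int main(void)
{
    /* Toy dictionary contents: values differing mostly in their low bits. */
    uint32_t dict[] = { 0x00000003, 0x00000007, 0x0000001F,
                        0x80000001, 0x80000004, 0x00000009 };
    int n = sizeof dict / sizeof dict[0];

    for (int k = 0; k <= 16; k += 4)
        printf("strip %2d LSbits -> %d unique MSbit patterns\n",
               k, count_unique_msbits(dict, n, k));
    return 0;
}
```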
Fig. 8. Static portions of sub-instruction values.

The higher order bits of the sub-instructions, in contrast, change infrequently with respect to the number of times they occur, thus making more effective use of the available dictionary space. By separating the 21 MSbits, a dictionary with 64 words is sufficient for holding all the unique values occurring for the H.264; however, 128 words would be needed to mitigate the chance of an overflow occurring for other programs. The distribution of the sub-instruction values for group 24 is not concentrated in any particular region of the spectrum; therefore, this technique cannot be applied to it.

D. Indexing Frequency

128-word dictionaries were constructed for groups 1–8, holding the 21 MSbits of their sub-instructions. The frequency of each index call was recorded and is shown in Fig. 9 for each group. It is observed that a handful of indices are very heavily referenced while others are used only occasionally. The infrequently called indices not only take up valuable on-chip dictionary area and remain redundant most of the time, but some also worsen the compression ratio, since those locations have to be initialized, which requires the storage of extra dictionary initialization bits in the program memory. For such rarely used MSbits, the use of dictionary space is inefficient. Thus, it is better to reduce the dictionary size yet further, and initialize the small dictionary with only the most frequently referenced sub-instruction MSbits. The remaining MSbits may be better stored directly in the codeword in an uncompressed form.

Fig. 9. Frequency of call to each group dictionary index.

Fig. 10. Codewords.

VI. PROPOSED CODE COMPRESSION SCHEME (CS3)
The compression techniques proposed in this paper have been derived from the authors' previous work [10], [11], but offer significant improvements. The silicon area occupied by the decompressor has been drastically reduced, while better compression ratios are observed for all the test programs. Parallel codeword decoding is maintained. The decompressor is general-purpose, and therefore capable of decoding any program, regardless of type and size. Furthermore, CS3 also eliminates the risk of a potential dictionary overflow occurring for a future downloaded program. The proposed approach comes under the dictionary category of lossless compression and is semi-adaptive.

The authors' previously proposed concept of unit-grouping, taken from CS2, is applied to the target reconfigurable architecture. Applying the unit-grouping concept formed 22 groups, where each group may consist of up to five individual functional units. The groupings have been formed to ensure that a fixed 20-bit codeword is suitable. The 20-bit size is dictated by the minimum size of codeword required, which is the codeword in Fig. 10(b). The 20-bit codeword may differ in construction depending upon what type of group is being targeted, to give better compressions, rather than applying the same codeword to all, as was done previously. Three different types of groupings were identified, for which the codewords may look as shown in Fig. 10.

The sub-instructions are now stored directly in the compressed codeword for the grouped units which do not appear to make effective use of the available dictionary space; see Fig. 10(a). Five bits are used to identify all the 22 unique groups, whereas the remaining 15 bits are made up of a combination of sub-instructions. There may be a need to add one or two padding bits for some groups if the total number of bits for their grouped sub-instructions is less than 15.

Fig. 10(b) and (c) show the codewords used for groups 1–8. Given the sub-instruction characteristics identified earlier for these groups, it was decided to create an 8-word dictionary for each group, storing only the 21 MSbits of the sub-instruction values. Thus, a 3-bit dictionary index is required. Five bits are
still required to uniquely identify all the 22 groups. The remaining 11 LSbits of the sub-instruction are directly stored in the codeword, since their dynamic nature makes them less suited for dictionary storage. The "linked" bit is set to logic 0 if the desired MSbits word is present in the corresponding dictionary. Otherwise, it is set to logic 1 to indicate that the required MSbits are being sent in the next codeword. Thus, if a linked codeword is required, the Fig. 10(c) codeword is sent immediately after. This codeword always starts with the two bits "10" to ensure its correct handling. As parallel decode is required, the decoder cannot tell from any other codeword whether the adjacent codeword is linked; therefore, it requires its own unique tag. These two bits are followed by 18 MSbits of the actual codeword, while the remaining 3 bits of the 21 MSbits are sent in place of the 3 dictionary index bits of the Fig. 10(b) codeword. Using this technique guarantees that a dictionary overflow can never occur, while maintaining the parallel decode ability.

For group 22 (previously known as group 24 in [10] and [11]), the characteristics of the data are different and the Fig. 10(d) codeword is used. Here, 5 bits are used to uniquely identify all the 22 groups. A 256-word dictionary has been assigned, requiring eight dictionary index bits. The control bits concept proposed in [11] has been used for this codeword, applying 6 control bits, each masking off 4 adjacent bits of the 24-bit sub-instruction. This means that the 6 status bits effectively increase the 256-word dictionary capacity without actually incurring any silicon area cost; each location is capable of representing up to 64 unique values. The assignment of the status bits may be varied such that some of the status bits are associated with only a couple of the bits of the actual value, while other status bits are associated with four or more. It is better to assign a status bit to a larger set of bits of the actual value if those bits change infrequently, whereas a status bit should be assigned to a smaller number of bits, e.g., 2 bits, if those actual bits change very frequently.

A. Code Compressor Software

Fig. 11 shows a dataflow graph representing the tasks performed by the code compressor. There are two major phases. The first phase determines which sub-instruction values of a given group should be used to initialize the corresponding dictionary in hardware, if one exists. For the CS3 implementation in this paper, groups 1–8 and group 22 have been assigned a dictionary each. Thus, to be able to construct the contents with which the dictionaries should be initialized, the test program to be compressed is examined. The sub-instructions representing the different grouped units are separated. For groups 1–8, the 21 MSbits of the sub-instruction values are extracted, ensuring that all the unique 21-MSbit patterns are recorded once. Next, the test program is re-traversed, and the frequency of occurrence of each unique 21-MSbit pattern is recorded. Finally, the eight most frequently referenced MSbit patterns are selected for hardware dictionary initialization and the appropriate bits are generated. On the other hand, for group 22, all the unique sub-instruction values are recorded. Then an algorithm tries to reduce the number of identified unique sub-instruction patterns by making use of the status control bits concept.
Finally, both the dictionary initialization bits and the compressed program are put together, ready for download onto the program memory.
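The following C fragment sketches how a compressor could emit a group 1–8 codeword under CS3: the 21 MSbits are looked up in the group's 8-entry dictionary, and if they are absent, a linked pair of codewords carries the MSbits directly instead (the dictionary bypass). The field layout follows the description of Fig. 10(b) and (c), but the packing order, function names, and group-tag values are assumptions made only for illustration.

```c
#include <stdint.h>
#include <stdio.h>

#define MSBITS(v)  ((uint32_t)(v) >> 11)     /* upper 21 bits of a 32-bit value */
#define LSBITS(v)  ((uint32_t)(v) & 0x7FF)   /* lower 11 bits */

/* 8-entry MSbit dictionary for one of groups 1-8 (contents chosen offline). */
typedef struct { uint32_t msb[8]; } msb_dict_t;

static int msb_lookup(const msb_dict_t *d, uint32_t msb21)
{
    for (int i = 0; i < 8; i++)
        if (d->msb[i] == msb21) return i;
    return -1;                               /* not in dictionary -> bypass */
}

/* Emit one or two 20-bit codewords for a 32-bit sub-instruction of group 'tag'.
 * Assumed packing: [tag(5) | linked(1) | index(3) | lsb(11)] for the normal
 * codeword, and ["10" | remaining MSbits(18)] for the linked codeword. */
static void emit_group_1_8(const msb_dict_t *d, unsigned tag, uint32_t sub)
{
    uint32_t msb = MSBITS(sub), lsb = LSBITS(sub);
    int idx = msb_lookup(d, msb);

    if (idx >= 0) {                          /* Fig. 10(b): dictionary hit */
        uint32_t cw = (tag << 15) | (0u << 14) | ((uint32_t)idx << 11) | lsb;
        printf("codeword  %05X\n", cw);
    } else {                                 /* bypass: linked pair of codewords */
        uint32_t cw1 = (tag << 15) | (1u << 14) | ((msb & 0x7u) << 11) | lsb;
        uint32_t cw2 = (0x2u << 18) | (msb >> 3);   /* Fig. 10(c): starts with "10" */
        printf("codeword  %05X\nlinked    %05X\n", cw1, cw2);
    }
}

int main(void)
{
    msb_dict_t d = { .msb = { 0x000000, 0x1FFFFF, 0x000001, 0, 0, 0, 0, 0 } };
    emit_group_1_8(&d, 1, 0x0000003F);       /* MSbits 0x000000 -> in dictionary */
    emit_group_1_8(&d, 1, 0x12345678);       /* MSbits not found -> linked pair */
    return 0;
}
```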
Fig. 11. Compressor software flow.
The compression is performed on a per-basic-block basis in order to simplify the handling of conditional jumps. A basic block is a group of instructions between two consecutive potential branch targets. The target addresses for each branch get updated during program compression according to their new locations in the compressed program memory. With this compression scheme, the best compression achievable for a single step is 474 bits reduced down to 20 bits; this happens if only one unit is active. However, the worst-case step compression would actually expand the 474-bit step to 600 bits. This can occur if all 64 units are active within a step and all 8 dictionaries for groups 1–8 are being bypassed, so that two codewords are used to represent each sub-instruction for these groups (22 group codewords plus 8 linked codewords, i.e., 30 codewords of 20 bits each).

B. Corresponding Decompressor Hardware

Completely removing the under-utilized dictionaries for most of the groups and significantly reducing the size of the remaining dictionaries drastically reduces the decompressor area. Furthermore, the demultiplexers needed in the decoder design are also smaller, as now eight 1-to-22 demultiplexers are required, which results in further area savings over the previous implementation. However, some extra decompressor logic has been introduced to handle all the different types of codewords correctly.

Given that the decompressor design should be capable of decoding more than one codeword in parallel, it is desirable for the total number of bits fetched from the program memory to fall on a byte boundary for ease of fetching; however, the codewords themselves may or may not obey byte boundaries. For this reason, CS1 and CS2 expected 128-bit and 152-bit instruction fetch bitwidths. The easiest way to ensure that the byte boundary is always obeyed is by decoding eight codewords concurrently. Each codeword may be made of any length, yet
1606
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 16, NO. 12, DECEMBER 2008
Fig. 13. Codeword processing.
TABLE II BITWIDTH, DECODER AREA, AND COMPRESSION RATIOS
Fig. 12. Decompressor hardware design.
the resulting instruction fetch bitwidth will always obey byte boundary restrictions. The number of codewords fetched concurrently can easily be reduced or increased as desired; more codewords mean a faster decode per step but require a wider bitwidth. Fig. 12 shows a diagram of the implemented decompressor hardware. This decompressor is able to process up to eight codewords in parallel from the compressed program memory; hence it expects a 160-bit fetch bitwidth, which is clearly smaller than the initial 474-bit instruction fetch bitwidth requirement. The design has a one-stage pipeline in order to increase its throughput. The decompressor may be used in a pre- or post-cache architecture. The “step processing” module consists of a finite state machine which controls the entire decoder. The “handshake controller” module ensures that the pipelined sections of the decoder remain synchronized with each other. The “linked codewords detector” module monitors the “linked” bit found in the Fig. 10(b) codewords intended for groups 1–8. To perform this detection, the module examines the sixth MSbit of all eight incoming codewords concurrently. On encountering a logic “1,” and after confirming that the corresponding codeword belongs to groups 1–8, the module indicates to the rest of the decoder that the identified codeword is linked and that the codeword adjacent to it should therefore not be sent to the demultiplexers for normal processing. This adjacent codeword has the form shown in Fig. 10(c). It is important that the adjacent codeword itself is not misinterpreted as a group 1–8 codeword, since this could in turn disable the codeword adjacent to it as well. Thus, the linked codeword must be assigned a unique group identification tag of its own. In general, this would require a 5-bit comparator in the decoder. However, the identification tags of groups 1–8 never begin with the bits “10,” so no codeword whose tag starts with “10” can belong to groups 1–8, and a 2-bit comparator proves sufficient—minimizing the decompressor area and ensuring a better compression ratio.
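A behavioral sketch of the linked codewords detector is given below. Consistent with the description above, it assumes eight 20-bit codewords per 160-bit fetch, the “linked” flag in the sixth MSbit of a group 1–8 codeword, and the two-bit “10” prefix reserved for Fig. 10(c) codewords. Treating every codeword that does not start with “10” as a potential group 1–8 codeword is a simplification of the tag scheme, and the names are illustrative only.

CODEWORD_BITS = 20    # eight 20-bit codewords make up the 160-bit fetch
LINKED_PREFIX = 0b10  # Fig. 10(c) payload codewords always start with "10"

def starts_with_linked_prefix(codeword):
    # 2-bit comparison on the two MSbits, used in place of a full 5-bit comparator.
    return (codeword >> (CODEWORD_BITS - 2)) == LINKED_PREFIX

def linked_bit(codeword):
    # The "linked" flag sits in the sixth MSbit of a group 1-8 codeword.
    return (codeword >> (CODEWORD_BITS - 6)) & 1

def mark_linked_payloads(codewords):
    # Flag codewords that carry the MSbits of a preceding linked codeword and
    # must therefore bypass normal demultiplexer processing (Fig. 13).
    skip = [False] * len(codewords)
    for i, cw in enumerate(codewords):
        if skip[i]:
            continue                      # already identified as a payload codeword
        if starts_with_linked_prefix(cw):
            continue                      # cannot belong to groups 1-8
        if linked_bit(cw) and i + 1 < len(codewords):
            skip[i + 1] = True            # next codeword is the Fig. 10(c) payload
    return skip                           # (a payload spilling into the next fetch is outside this sketch)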
The “codewords bypassing dictionaries” module shown in the diagram is merely a collection of wires (see Fig. 13). If any of the eight incoming codewords belong to groups 1–8 and are “linked,” this module routes the correct portions of those codewords to the rest of the decoder. When the reconfigurable core executes a step and identifies a branch to a new location, it simply places the target address on the “jump to program memory address” line of the decompressor. This causes the decompressor to flush its pipeline entirely and start decompressing steps from the newly specified address. In CS3, not only are the dictionaries utilized more efficiently, but there is also no limit on the size of program the decompressor can handle.
VII. PERFORMANCE ANALYSIS
TABLE III EFFECTIVE AREA SAVINGS
TABLE IV POWER CONSUMPTION (MICROWATTS PER MEGAHERTZ)
TABLE V PERFORMANCE COMPARISON
D1K = time required for reading a location from the 1024-word dictionary. D256 = time required for reading a location from the 256-word dictionary.
The performance evaluation results for CS3 are given in Table II alongside the results for CS2. The total decompressor area for CS3 is 91.8% lower than that for CS2. The bulk of the saving comes from the significant cuts made to the dictionaries, which previously contributed about 95% of the total area and now contribute only about 50%. Furthermore, by reducing the size of the demultiplexers in the CS3 decoder, and thereby easing the wiring congestion, the decompressor logic area increases only slightly, even after the introduction of the new techniques for handling the different forms of codeword. For all the test programs, whether small, medium, or large sized, better compression is achieved with CS3. Since a decompressor introduces its own area overhead, the area savings achieved by compressing short programs are always lost. The effective area savings achieved for the test programs are analyzed in Table III. The size of the original uncompressed program is expressed in terms of equivalent SRAM area. The size of the compressed program, including any dictionary initialization bits, is then measured in terms of SRAM area, and the fixed total decompressor area is added to it. A 128-bit wide SRAM is used for all the estimations. From the area saving
results, it can be seen that CS2 is only beneficial for compressing large programs, whereas CS3 can easily be justified for use with medium or large sized programs. The area occupied by the decoder hardware is equivalent to having approximately 49 kB of data stored in program memory for CS2, and approximately 2.2 kB for CS3. One advantage of using dynamically reconfigurable technology is that one processor can be used to run various programs, provided those programs need not run simultaneously. Thus, the program memory will store instructions for several different programs. If the combined compression achieved for all those programs, large or small, exceeds the 49 kB for CS2 or the 2.2 kB for CS3, then performing compression is justified in terms of the total silicon area saved. If all four test programs are compressed, an overall silicon area saving of approximately 49% is achieved with CS2, and 60% with CS3, taking the decoder area into account. Table IV shows the power consumption when the four test programs are assumed to be stored in the program memory. Approximately 48% power reduction is observed with CS2, whereas CS3 achieves a 58% reduction over the original system. For CS3, an automatically generated SRAM is used only for the 256-word dictionary, whereas the remaining smaller 8-word dictionaries are synthesized with the rest of the logic, which explains the large difference in logic power consumption between the two designs. The silicon area and power savings come at the cost of reduced performance, as shown in Table V. The propagation delays for both designs are equivalent, allowing a 500-MHz clock frequency. Since both decompressors have a single-stage pipeline,
the shortest time to decode a step is two cycles plus the longest time taken to read one of the dictionaries. As all the dictionaries in a design are read in parallel, only one dictionary read time has to be included. The largest dictionary in CS2 is 1024 words, 32 bits wide, so its read time is longer than that of the CS3 design, whose largest dictionary is only 256 words, 24 bits wide. For the worst case scenario, where all the units within a complete step are active, CS2 takes almost the same time as CS3 to decode a single complete step. The reconfigurable architecture executes a step over multiple clock cycles, so the latency associated with fetching and decoding subsequent steps can be hidden. However, a latency penalty is observed when a jump is required or at the start of a program decompression. For CS3 this latency may be two clock cycles, or as high as six clock cycles in the worst case. To gain the area and power savings, some performance degradation is unavoidable whenever a decoder is added to the instruction fetch path, with any scheme, although its effect may be minimized by incorporating other techniques such as branch prediction, by using a pre-cache decoder layout, or simply by increasing the number of instructions decoded in parallel at the expense of reduced area/power savings.
VIII. SUMMARY
A highly efficient lossless code compression scheme has been implemented for a multiple-issue coarse-grain reconfigurable architecture, in an attempt to minimize silicon area and instruction fetch bitwidth. A comprehensive analysis shows that our new technique is considerably more effective than the previously proposed schemes. Our novel technique of unit grouping, combined with the use of special control bits to increase the effective dictionary sizes and a partial dictionary bypassing mechanism, achieved significant compression ratios in the range of 0.32 to 0.44. The design concepts scale easily to an increased number of functional units.
REFERENCES
[1] E. Mirsky and A. DeHon, “Matrix: A reconfigurable computing architecture with configurable instruction distribution and deployable resources,” in Proc. IEEE Symp. FPGAs Custom Comput. Mach., Apr. 1996, pp. 157–166.
[2] J. R. Hauser, “Augmenting a microprocessor with reconfigurable hardware,” M.S. thesis, Comput. Sci. Dept., Univ. California, Berkeley, 2000.
[3] D-Fabrix Processing Array, Reconfigurable Signal Processor. Bristol, U.K.: Elixent Ltd., 2005 [Online]. Available: www.elixent.com
[4] OFDM Decoder for Wireless LAN—Whitepaper. Munich, Germany: PACT XPP, May 2002 [Online]. Available: www.pactcorp.com
[5] Reconfigurable Computing. Eindhoven, The Netherlands: Philips, 2005 [Online]. Available: www.siliconhive.com
[6] C. Lefurgy, P. Bird, I.-C. Chen, and T. Mudge, “Improving code density using compression techniques,” in Proc. 30th Int. Symp. Microarch., Dec. 1997, pp. 194–203.
[7] S. Liao, S. Devadas, and K. Keutzer, “Code density optimization for embedded DSP processors using data compression techniques,” in Proc. 16th Conf. Adv. Research VLSI, 1995, p. 272.
[8] A. Wolfe and A. Chanin, “Executing compressed programs on an embedded RISC architecture,” in Proc. Int. Symp. Microarch., Dec. 1992, pp. 81–91.
[9] S. Khawam, I. Nousias, M. Milward, Y. Ying, M. Muir, and T. Arslan, “The reconfigurable instruction cell array,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 1, pp. 75–85, Jan. 2008.
[10] N. Aslam, M. Milward, I. Nousias, T. Arslan, and A. Erdogan, “Code compression and decompression for instruction cell based reconfigurable systems,” in Proc. IEEE Int. Parallel Distrib. Process. Symp., Reconfig. Arch. Workshop, Mar. 2007, pp. 1–7.
[11] N. Aslam, M. Milward, I. Nousias, T. Arslan, and A. Erdogan, “Code compressor and decompressor for ultra large instruction width coarse-grain reconfigurable systems,” in Proc. IEEE Symp. Field-Program. Custom Comput. Mach., Apr. 2007, pp. 297–298.
[12] Y. Xie, W. Wolf, and H. Lekatsas, “Code compression for embedded VLIW processors using variable-to-fixed coding,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 5, pp. 525–536, May 2006.
[13] Y. Yi, I. Nousias, M. Milward, S. Khawam, T. Arslan, and I. Lindsay, “System-level scheduling on instruction cell based reconfigurable systems,” in Proc. Des., Autom. Test Eur., Mar. 2006, pp. 381–386.
[14] Y. Xie and W. Wolf, “Profile-driven selective code compression,” in Proc. Des., Autom. Test Eur., Mar. 2003, pp. 462–467.
[15] S. Larin and T. Conte, “Compiler-driven cached code compression schemes for embedded ILP processors,” in Proc. 32nd Int. Symp. Microarch., 1999, pp. 82–92.
[16] Y. Xie, W. Wolf, and H. Lekatsas, “Code compression for VLIW processors,” in Proc. Data Compression Conf., 2001, p. 525.
[17] N. Ishiura and M. Yamaguchi, “Instruction code compression for application specific VLIW processors based on automatic field partitioning,” in Proc. Workshop Synthesis Syst. Integr. Mixed Technol., 1997, pp. 105–109.
[18] C. H. Lin, Y. Xie, and W. Wolf, “LZW-based code compression for VLIW embedded systems,” in Proc. Des., Autom. Test Eur., Feb. 2004, vol. 3, pp. 76–81.
[19] S. Nam, I. Park, and C. Kyung, “Improving dictionary based code compression in VLIW architectures,” Trans. Fundam. Electron., Commun. Comput. Sci., vol. E82-A, pp. 2318–2324, Nov. 1999.
[20] M. Ros and P. Sutton, “A Hamming distance based VLIW/EPIC code compression technique,” in Proc. Compilers, Arch. Synthesis Embedded Syst. Conf., Sep. 2004, pp. 132–139.
Nazish Aslam received the M.Eng. degree in computing and electronics from Heriot-Watt University, Edinburgh, U.K., in 2004. She is currently pursuing the Eng.D. degree with the Institute for System Level Integration, Livingston, U.K., which is sponsored by Spiral Gateway Ltd. Her research interests predominantly lie in the field of signal processing and its application in reconfigurable computing. Recent research activities have been focused on the mapping of software and algorithms to reconfigurable computing architectures, and program code compression schemes.
Mark John Milward received the B.Eng. degree in electronic and electrical engineering and the Ph.D. degree in the area of parallel lossless compression from Loughborough University, Loughborough, U.K., in 2000 and 2004, respectively. He then joined the System Level Integration Group in the University of Edinburgh, Edinburgh, U.K., as a Research Associate. His research interests incorporate reconfigurable architectures, parallel hardware architectures, and lossless compression. Currently, he is working as a Senior Engineer with Spiral Gateway Ltd., ETTC, University of Edinburgh, where he is continuing his work on reconfigurable systems.
Ahmet Teyfik Erdogan received the B.Sc. degree in electronics engineering from Dokuz Eylul University, Izmir, Turkey, in 1990, and the M.Sc. and Ph.D. degrees from Cardiff University, Cardiff, U.K., in 1995 and 1999, respectively. He is currently a Senior Research Fellow with System Level Integration, School of Engineering and Electronics, University of Edinburgh, Edinburgh, U.K. He is also a member of the Institute for Integrated Micro and Nano Systems, Edinburgh, U.K., and the Institute for System Level Integration, Livingston, U.K. His research interests include VLSI design, circuits and systems for communications and signal processing, design of energy efficient digital circuits and systems, computer arithmetic, application-specific architecture design, and reconfigurable computing. He has published several journal and conference papers in these areas.
Tughrul Arslan (SM’06) holds the Chair of Integrated Electronic Systems in the School of Engineering and Electronics, University of Edinburgh, Edinburgh, U.K., and is also a cofounder and the Chief Technical Officer of Spiral Gateway Ltd., ETTC, University of Edinburgh. He is a member of the Integrated Micro and Nano Systems (IMNS) Institute and leads the System Level Integration Group (SLIg) in the University. His research interests include low power design, DSP hardware design, system-on-chip (SoC) architectures, evolvable hardware, multi-objective optimization and the use of genetic algorithms in hardware design issues. Prof. Arslan is an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS: I—REGULAR PAPERS, a member of the IEEE CAS Committee on VLSI Systems and Applications, and sits on the editorial board of IEE Proceedings on Computers and Digital Techniques and the technical committees of a number of international conferences. This year he is the General Chair of the NASA/ESA Conference on Adaptive Hardware and Systems, and Co-Chair of ECSIS Bio-Inspired, Learning, and Intelligent Systems for Security Symposium (BLISS). He is a principal investigator on a number of projects funded by EPSRC, DTI, and Scottish Enterprise together with a number of industrial and academic partners.