Des Autom Embed Syst (2006) 10:5–26 DOI 10.1007/s10617-006-8706-8

Estimating data bus size for custom processors in embedded systems∗ †

Emre Özer · Andy P. Nisbet · David Gregg · Owen Callanan

© Springer Science + Business Media, LLC 2006

Abstract We propose a method to tailor the data bus width to the requirements of an application that is to run on a custom processor. The proposed estimation method is a simulation-based tool that uses Extreme Value Theory to estimate the width of an off-chip or on-chip data bus from the characteristics of the application. It finds the minimum number of bus lines needed for the bus connecting the custom processor to other units such that a multi-cycle data transfer on the bus is extremely unlikely. The potential target platforms include embedded systems where a custom processor (i.e. an ASIC or an FPGA) in a system-on-a-chip or a system-on-a-board is connected to memory, I/O and other processors through a shared bus or through point-to-point links. Our experimental and analytical results show that our estimation method can reduce the data bus width and cost by up to 66%, with an average of 38%, for nine benchmarks. The narrower data bus allows us to increase the spacing between the bus lines using the silicon area freed from the eliminated bus lines. This reduces interwire capacitance, which in turn leads to a significant reduction in bus energy consumption: bus energy can be reduced by up to 89% for on-chip data buses, with an average of 74%, for seven benchmarks. The reduction in interwire capacitance also improves the bus propagation delay: on-chip bus propagation delay can be reduced by up to 68%, with an average of 51%, for seven benchmarks using a narrower custom data bus.

Keywords Buses · Custom processors · Embedded systems · Extreme value theory · Statistics

∗ This study was supported by an Enterprise Ireland Research Innovation Fund Grant IF/2002/035.

† Emre Özer is now with ARM Ltd., Cambridge, United Kingdom.

E. Özer, D. Gregg, O. Callanan
Department of Computer Science, Trinity College, Dublin 2, Ireland

A. P. Nisbet
Department of Computing and Mathematics, Manchester Metropolitan University, Manchester, UK


1. Introduction

Wires become a real bottleneck for delay and energy consumption as semiconductor technology scales [8]. As more modules are packed into a chip, the complexity of communication between these modules increases, and both on-chip and off-chip wire delay decrease at a slower rate than logic gate delay as feature size decreases. Communication between the modules in a System-on-a-Chip (SoC), as well as off-chip communication, therefore becomes crucial in terms of bus delay and bus power consumption.

Buses are the interconnection structures used for on-chip and off-chip communication within SoCs, and they have a deep impact on system performance and power consumption. For instance, using a slower on-chip bus between the processor and its cache in an SoC may degrade the overall system performance. Bus power consumption also contributes significantly to the total system power consumption, owing to the high number of transitions and to crosstalk between adjacent bus lines. Wire delay, which determines how fast buses can operate, depends on wire resistance, wire capacitance, interwire capacitance and the simultaneous transitions of bus lines. Bus power consumption depends on bus switching activity, the power supply, and wire and interwire capacitances. In deep sub-micron technologies, the interwire capacitance dominates the total wire capacitance [18]. It becomes a greater contributor to on-chip bus power consumption than wire capacitance, and is also an important factor in determining bus delay.

In this study, we aim at customizing the data bus width of custom processors such as ASICs or FPGAs. We propose a method to estimate the required data bus width for a custom processor in an embedded system. Our estimation method analyzes the I/O and memory access patterns of an application that is designed as a custom core, and performs probabilistic estimation of the data bus width using Extreme Value Theory [10, 20] from applied statistics.
The estimated data bus width is probabilistically optimal in the sense that it guarantees that a multi-cycle data transfer over the bus at any time is extremely unlikely. Our target system can be an SoC where a custom processor communicates with other processor cores and/or on-chip memory modules through on-chip buses, or a system-on-a-board where the custom processor chip communicates with other units through off-chip buses on a board. In the latter case, the number of off-chip bus pins/pads is an important factor in determining the cost of a bus system, which is particularly important when designing processor systems for embedded platforms. Thus, it is essential to keep the data bus width, particularly for off-chip buses, as narrow as possible to reduce the pin cost. There is a trend in the embedded world [14, 27] towards keeping the off-chip bus width small at the cost of performing some multi-cycle data transfers using time-multiplexing techniques. One recent study [1] also suggests that off-chip bus power can be as much as or more than the power consumption of the embedded processor itself.

We have four objectives for performing custom data bus design: 1) a reduction in the number of physical wires, which, in turn, gives 2) a reduction in the cost of I/O pins/pads for off-chip data buses. A narrower on-chip data bus can also reduce interwire capacitance by allowing us to increase the spacing between the bus lines in the silicon area freed from the eliminated bus lines. A reduction in the interwire capacitance, in turn, 3) improves on-chip bus delay and 4) reduces on-chip bus energy consumption.

The rest of the paper is organized as follows: Section 2 discusses related work. Section 3 describes the bus model, bus energy consumption, wire delay and bus cost. Section 4 introduces the data bus width estimation method along with the theoretical background of Extreme Value Theory.
Then, Section 5 presents the experimental framework, our methodology and the statistical and experimental results for several integer embedded applications.


Next, Section 6 describes analytical and comparative models for bus width, cost, energy consumption and bus delay. Finally, Section 7 concludes the paper with a discussion.

2. Related work

Some studies focus on generating automatic bus interfaces for system-level or functional partitioning of an application under certain constraints or conditions. Each partition is considered as a process, and all processes communicate through channels such as buses. These studies examine different bus structures under constraints such as the number of partitions, data transfer rates and communication costs, in order to improve either performance or power consumption. For instance, [16] and [4] investigate different bus structures to find the bus model that best satisfies some given constraints and optimizes the communication time between processes. Givargis et al. [7] explore a design space with multiple communication channels and several bus encoding schemes, and try to find the bus width that minimizes power consumption under such constraints. Similarly, Pandey et al. [17] present a hardware/software partitioning framework where an application is partitioned into several processes communicating through shared memory; they optimize the communication delay between processes for a given bus width and buffer size. Our data bus width customization differs from these techniques in two ways: (1) in our model, the whole application runs on a custom processor, so there is no functional or hardware/software partitioning of the application; (2) we estimate the probabilistically global data bus width of the custom processor without using any system-specific design constraints, e.g. cost or performance.

Recently, [2] proposed a caching technique to reduce a 64-bit bus to a 48-bit bus, which enables a faster bus by widening each individual bus line. The authors observe that the least significant bits on buses show high entropy while the most significant bits show low entropy. They send the high entropy data (i.e. the 44 LSBs) over the bus while keeping the low entropy data in a cache at the transmitting side.
The cache is duplicated at the receiving end of the bus as well. The cache is indexed by the next 3 bits, and the remaining 17 bits are cached. If there is a match in the cache entry indexed by these 3 bits, then the cache at the receiving end stores the same 17 bits, so only the 3-bit index is sent over the bus. A control bit on the bus tells the receiving end whether there is a match in the cache. If there is a match, the receiving end reads the 17 bits from its cache; if not, the 17 bits are sent over the bus in the next cycle. Our technique also exploits the low/high entropy property of data buses when estimating the probabilistically global data bus width. In comparison to this technique, our custom narrow data bus has much lower cost, as it does not require any caching hardware structures.

There have been several studies that focus on reducing the switching activity of consecutive bus transactions in order to reduce data bus power. One of the first bus coding techniques for reducing bus power consumption was proposed by Stan et al. [24]. They present a bus invert coding method that works well for uncorrelated data patterns: the current data value to be transferred is inverted, to reduce the number of transitions, if the Hamming distance between the current and previous data values is more than half the bus width. If data patterns are not random but correlated, the bus invert method does not perform well. Several other encoding techniques have been proposed for correlated data patterns on data buses, such as [11, 15, 19, 25] and [29]. Bus encoding methods are also used to reduce switching activity within a single data bus value, considering the interwire capacitances between the parallel bus lines [9, 21] and [22]. Similarly, [23] and [26] employ encoding techniques to reduce on-chip bus delay for deep sub-micron technologies. Our custom data


Fig. 1 Bus model: N parallel bus lines (Bus Line 1 to Bus Line N), each with a wire capacitance Cwire and an interwire capacitance Cinter between adjacent lines

bus width estimation design, which reduces the number of bus lines to reduce bus power consumption and cost, is not an alternative to the bus encoding techniques. One of the bus encoding techniques can also be layered on top of the narrow custom data bus to reduce bus switching activity. However, this comes at the cost of increasing the data bus width by a few lines in order to fully exploit the features of its coding/decoding algorithms.

3. Bus model

Figure 1 shows the approximate bus model used in this paper. Cwire is the wire capacitance of each bus line and Cinter is the interwire, or crosstalk, capacitance between adjacent bus lines. Interwire capacitance is noise caused by coupling to neighboring wires and has a very important effect on both power consumption and wire propagation delay. Capacitance is defined as described in [8] by the following equation, where ε0, ε_horiz, ε_vert, ILD_thick, K, H, W and S are the permittivity, the horizontal and vertical dielectric constants, the dielectric layer thickness, the Miller multiplication constant, the wire height, the wire width and the spacing between adjacent wires, respectively:

    C = ε0 · (2·K·ε_horiz·(H/S) + 2·ε_vert·(W/ILD_thick))

(1)
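Equation 1 can be sketched directly in code. The geometry and dielectric values below are illustrative placeholders, not taken from the paper; the point is only to show that widening the spacing S shrinks the lateral (coupling) term:

```python
# Wire capacitance per unit length, after equation 1:
# C = eps0 * (2*K*eps_horiz*(H/S) + 2*eps_vert*(W/ILD_thick))
EPS0 = 8.854e-12  # vacuum permittivity, F/m

def wire_capacitance(K, eps_horiz, eps_vert, H, W, S, ild_thick):
    """Per-unit-length capacitance of a bus line (equation 1)."""
    coupling = 2.0 * K * eps_horiz * (H / S)        # lateral (interwire) term
    ground   = 2.0 * eps_vert * (W / ild_thick)     # vertical (wire-to-layer) term
    return EPS0 * (coupling + ground)

# Illustrative geometry: doubling the spacing S halves the coupling term.
c_tight = wire_capacitance(K=1.5, eps_horiz=3.9, eps_vert=3.9,
                           H=0.3e-6, W=0.2e-6, S=0.2e-6, ild_thick=0.5e-6)
c_wide  = wire_capacitance(K=1.5, eps_horiz=3.9, eps_vert=3.9,
                           H=0.3e-6, W=0.2e-6, S=0.4e-6, ild_thick=0.5e-6)
assert c_wide < c_tight  # wider spacing -> lower capacitance
```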

The most significant implication of equation 1 is that capacitance can be reduced by widening the spacing (S) between the adjacent wires in the bus.

3.1. Energy consumption

We use the bus energy consumption model described in [22]. The bus energy is given by the following equation, where SW_i denotes the switching activity of bus line capacitance i, and SW_{j,j+1} represents the concurrent switching activity of the neighboring bus line capacitances j and j+1:

    E_bus = Vdd² · (Cwire · Σ_{i=1..N} SW_i + Cinter · Σ_{j=1..N−1} SW_{j,j+1})    (2)
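Equation 2 maps onto a few lines of code. The switching-activity counts below are hypothetical inputs, e.g. gathered from a simulation trace:

```python
def bus_energy(vdd, c_wire, c_inter, sw_self, sw_coupled):
    """Bus energy after equation 2.

    sw_self[i]    -- switching activity of bus line i (N entries)
    sw_coupled[j] -- concurrent switching of neighbours j, j+1 (N-1 entries)
    """
    assert len(sw_coupled) == len(sw_self) - 1
    return vdd**2 * (c_wire * sum(sw_self) + c_inter * sum(sw_coupled))

# A 4-line bus: the interwire term dominates when C_inter >> C_wire,
# as reported for deep sub-micron technologies [18].
e = bus_energy(vdd=1.2, c_wire=1e-15, c_inter=5e-15,
               sw_self=[100, 80, 60, 40], sw_coupled=[30, 20, 10])
```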

Energy is consumed on each bus line by switching the wire capacitance from on/off to off/on. As the distance between bus lines gets smaller, an interwire capacitance forms between neighboring bus lines. This interwire capacitance takes effect during the switching of the lines, and its contribution to the total bus power grows as the number of bus lines increases. As technology scales down to deep sub-micron levels, the interwire capacitance becomes several orders of magnitude larger than the wire capacitance [18]. This means that interwire


capacitance will become a major contributor to on-chip bus power consumption in future microprocessors and SoC systems. In order to decrease the interwire capacitance, parallel bus lines should be spaced sufficiently far apart, as suggested by equation 1; but since this requires extra silicon area, it may not be a feasible choice for an embedded system, where any extra silicon area usage, no matter how small, is costly and power-inefficient. Instead, a reduction in the bus width allows the chip designer or CAD tools to increase the spacing between the bus lines using the chip area freed from the eliminated bus lines. This reduces the interwire capacitance without claiming any extra silicon space.

3.2. Wire delay

For an evenly spaced bus of N bits, an approximation of the propagation delay of a bus line k is given by the following equation [18]:

    τ_k = g · C_W · (0.38·R_W + 0.69·R_D),  where C_W = c_wire·L and R_W = r_wire·L

(3)
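As a numerical sanity check of equation 3, the sketch below compares the best-case correction factor g = 1 against the worst-case g = (1 + 4·capr) discussed below; the resistances, capacitances and wire length are illustrative values, not from the paper:

```python
def bus_delay(g, c_wire_per_len, r_wire_per_len, r_driver, length):
    """Propagation delay of a bus line after equation 3."""
    c_w = c_wire_per_len * length   # C_W = c_wire * L
    r_w = r_wire_per_len * length   # R_W = r_wire * L
    return g * c_w * (0.38 * r_w + 0.69 * r_driver)

capr = 1.0   # C_inter / C_wire ratio; equal capacitances, as in the text's example
params = dict(c_wire_per_len=2e-10, r_wire_per_len=1e5,
              r_driver=200.0, length=1e-3)

best  = bus_delay(g=1.0, **params)
worst = bus_delay(g=1.0 + 4 * capr, **params)
assert abs(worst / best - 5.0) < 1e-9   # the >500% worst/best gap from the text
```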

Here, L, r_wire and R_D represent the wire length, the per-unit-length wire resistance and the driver resistance, respectively. g is a correction factor that depends on the ratio capr = Cinter/Cwire and on the switching activities of the neighboring bus lines. In the best case, where all three lines (i.e. k−1, k and k+1) transition in the same direction, g is 1. In the worst case, where bus lines k−1 and k+1 transition in the opposite direction to bus line k, g becomes (1 + 4·capr). In the best case, the interwire capacitance has no effect on the bus propagation delay since g is 1. However, the effect of interwire capacitance can be very significant in all other cases, and particularly in the worst case: the bus propagation delay may vary by over 500% between the worst and best cases if the interwire and wire capacitances are equal. This delay gap can be tremendous for on-chip buses designed in deep sub-micron technologies, since there the interwire capacitance becomes several orders of magnitude larger than the wire capacitance. It is clear that reductions in the interwire capacitance obtained by increasing the spacing between the bus lines also improve the bus propagation delay, particularly for on-chip buses.

3.3. Off-chip data bus cost

A custom processor, whether in an SoC system or in a system-on-a-board, communicates with the outside world through off-chip buses. Each extra bus line requires a contact (i.e. a pin or pad) on the chip edge to connect it to off-chip memory or peripherals. Thus, every extra bus line makes the overall system more complex and requires more expensive packaging and possibly a larger die size. Due to the high cost of off-chip buses, embedded and DSP processors [14, 27] generally have narrower off-chip data buses than on-chip buses.
These processors have built-in bus interfaces that perform a multi-cycle data transfer over the bus whenever the bit-width of a data value to be transferred is greater than the off-chip bus width. A reduction in the number of off-chip data bus lines thus directly reduces the number and cost of the pins going off-chip.


4. Data bus width estimation

The previous section pointed out that reducing the number of bus lines yields good improvements in cost, performance and energy consumption. This is particularly relevant to embedded systems, where power and cost must be balanced against system performance. The data bus width estimation technique proposed in this paper computes the required number of data bus lines for a custom processor with a finite probability, chosen so that multi-cycle data transfers over the bus are very unlikely to be required. Thus, it provides a probabilistically optimal data bus width and cost for a given custom processor.

In theory, the application would have to be run with all possible data combinations of its inputs in order to find the optimal data bus width. However, this is an exhaustive search with exponential time complexity. For example, consider a program with one input variable whose value covers the range [−2^(n−1), 2^(n−1) − 1] in two's complement representation. The number of all possible input combinations is M = 2^(s·n), where s represents the number of input elements and n is the bit-width of the input variable. The application under observation would have to be executed with each one of the M data inputs to find the globally maximum absolute data value on the bus, which clearly has exponential time complexity.

Given such complexity, we have chosen to use statistical and probabilistic approximation to estimate the optimal data bus width. Instead of running the application with the population of all possible input data sets, we draw a sufficient number of input data samples from this population and run the application with each sampled input data set. In this paper, our analysis focuses only on integer value traffic on the data buses of integer embedded applications.
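To make the blow-up concrete, a quick back-of-the-envelope check (the values of s and n are illustrative, not from the paper):

```python
# Exhaustive search over all input combinations: M = 2**(s*n)
s, n = 4, 16          # 4 input elements of 16 bits each (illustrative)
M = 2 ** (s * n)      # 2**64 combinations: infeasible to simulate exhaustively
N = 1000              # sample size actually used in Section 5
assert N < M          # sampling replaces the exponential search
```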
After running the application with all sampled input data sets, we collect the data seen on the data bus, analyze it, and estimate the globally maximum absolute data value that may be transferred over the bus. This estimate carries a probability, called the probability of a multi-cycle data transfer, defined as the probability of a data value being wider than the data bus. Our goal is to estimate the probabilistically optimal data bus width, i.e. the optimal data bus width for a given probability of a multi-cycle data transfer. This probability can be chosen so infinitesimally small that the chance of a multi-cycle data transfer on the data bus is extremely unlikely.

4.1. Extreme value theory

Extreme Value Theory [10, 20] is the statistical analysis of extreme values such as maxima and minima. The theory has been used to model wind speeds, maximum temperatures, floods, and risk in finance. An extreme value is the largest or smallest value in a data set, and Extreme Value Theory models these extremes from statistical samples. A sample may consist of data points or values of a random variable. If upper extreme values are to be observed, the maximum value of the sample is taken; if lower extreme values are to be observed, the minimum is taken. When seeking the probabilistically optimal data bus width, we are interested in estimating the globally maximum absolute data value on the data bus. For n samples, n maximum values are extracted and statistically analyzed to find their probability distribution. Extreme Value Theory states that extreme values fit one of three classes of distributions, called the extreme value distributions. These are known as the Gumbel, Fréchet and Weibull distributions, and their cumulative distribution functions (CDFs) are given by the following equations:


Type 1 (Gumbel):

    F(x) = exp(−exp(−(x − μ)/σ))    (4)

Type 2 (Fréchet):

    F(x) = 0                          if x < μ
    F(x) = exp(−((x − μ)/σ)^(−ξ))    if x ≥ μ    (5)

Type 3 (Weibull):

    F(x) = exp(−((μ − x)/σ)^ξ)    if x ≤ μ
    F(x) = 1                       if x > μ    (6)

Here, μ, σ and ξ represent the location, scale and shape parameters. These three extreme value distributions can be represented by a single unified distribution called the generalized extreme value (GEV) distribution, whose CDF is:

    GEV(x) = exp(−[1 + ξ·(x − μ)/σ]^(−1/ξ))    (7)
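The GEV CDF of equation 7 can be written down directly; the Gumbel case is recovered as the ξ → 0 limit (a pure-Python sketch, with the support boundary handled explicitly):

```python
import math

def gev_cdf(x, mu, sigma, xi):
    """Generalized extreme value CDF (equation 7)."""
    if xi == 0.0:                       # Gumbel limit (equation 4)
        return math.exp(-math.exp(-(x - mu) / sigma))
    t = 1.0 + xi * (x - mu) / sigma
    if t <= 0.0:                        # outside the distribution's support
        return 0.0 if xi > 0 else 1.0   # below lower / above upper endpoint
    return math.exp(-t ** (-1.0 / xi))

# The CDF is monotone and approaches 1 in the upper tail.
assert gev_cdf(0.0, 0.0, 1.0, 0.1) < gev_cdf(5.0, 0.0, 1.0, 0.1) < 1.0
```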

When ξ > 0, the distribution is of Fréchet type, with a long right tail. When ξ = 0, the distribution is of Gumbel type, with long tails on either side. When ξ < 0, the distribution is of Weibull type, with a short left tail. The parameters μ, σ and ξ can be estimated with one of several parameter estimation techniques; in this study we use Maximum Likelihood Estimation (MLE) [3]. This technique forms a likelihood function for the set of given extreme values: for a given probability distribution, the likelihood function gives the probability of observing this set. The likelihood function contains unknown parameters, and the parameter values that maximize it are the maximum likelihood estimators. For the GEV distribution the unknown parameters are μ, σ and ξ, and the MLE technique finds the values of these three parameters that maximize the likelihood of the observed extreme values fitting the GEV distribution.

4.2. Tail quantile extrapolation

The pth quantile x_p of a distribution, where p is between 0 and 1, is the value such that a proportion p of all values in the distribution is equal to or smaller than x_p. We use the tail quantile of the generalized extreme value CDF to extrapolate the estimated global maximum value beyond the range of the sample data values, for a chosen probability:

    GEV(x_p) = p  ⇒  x_p = μ − (σ/ξ)·[1 − (−ln p)^(−ξ)]    (8)


4.3. Estimating the global maximum data bus value

The μ, σ and ξ parameters define the location, scale and shape of the GEV distribution. They are computed from the extreme value data samples to which one of the GEV distributions is fitted. Once estimated, they are substituted into equation 8; μ̂, σ̂ and ξ̂ denote the MLE-estimated parameter values. In addition, p in equation 8 is replaced by (1 − p_mdt), where p_mdt is the probability of a multi-cycle data transfer. The tail quantile extrapolation with the estimated GEV parameters and the probability of a multi-cycle data transfer can then be rewritten as:

    x_max = μ̂ − (σ̂/ξ̂)·[1 − (−ln(1 − p_mdt))^(−ξ̂)]    (9)

Equation 9 states that the probability that a data bus value drawn from the fitted GEV distribution is larger than x_max is p_mdt. Here, x_max is the estimated global maximum data bus value for a selected p_mdt. If p_mdt is chosen small, the probability of a data bus value exceeding the estimated global maximum also becomes small. By selecting a sufficiently small p_mdt, it is possible to reach a probabilistically optimal data bus width for which a multi-cycle transfer over the bus is extremely unlikely. Finally, the estimated data bus width is computed by taking the base-2 logarithm of x_max and adding one bit for the sign:

    DataBusWidth = ⌈log₂ x_max⌉ + 1    (10)
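Equations 9 and 10 combine into a few lines of code. The fitted parameter values below are hypothetical stand-ins for the MLE output, used only to exercise the formulas:

```python
import math

def estimate_bus_width(mu, sigma, xi, p_mdt):
    """Estimated data bus width (equations 9 and 10).

    mu, sigma, xi -- MLE-fitted GEV parameters (xi != 0 assumed)
    p_mdt         -- acceptable probability of a multi-cycle transfer
    """
    # Tail quantile extrapolation (equation 9):
    x_max = mu - (sigma / xi) * (1.0 - (-math.log(1.0 - p_mdt)) ** (-xi))
    # Bit-width of x_max plus one sign bit (equation 10):
    return math.ceil(math.log2(x_max)) + 1

# Hypothetical fit: a heavy right tail (xi > 0) and p_mdt = 1e-6.
width = estimate_bus_width(mu=1500.0, sigma=400.0, xi=0.2, p_mdt=1e-6)
```

Note how shrinking p_mdt pushes the extrapolated quantile, and hence the estimated width, upward; choosing p_mdt is a trade-off between bus width and the rate of multi-cycle transfers.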

4.4. Potential platforms for custom data buses

Figure 2 shows the embedded system platforms to which our method is applicable. It can be used in an SoC platform where units such as the CPU, on-chip memory and a custom processor core are connected by a shared bus, as shown in Fig. 2a. The custom core connects to the on-chip shared bus, or system bus, through its own data bus, whose width is determined by our estimation method. Another system type where our method applies is a system-on-a-board, where the CPU, I/O units, memory and the custom core are placed on a board, as shown in Fig. 2b. Here, the custom core is connected to the board-level system bus through an off-chip data bus. Our technique is not limited to shared bus systems; it is also appropriate for point-to-point communication, as shown in Fig. 2c. In this SoC system model, each custom core is connected to memory and I/O by a dedicated data link or bus, and the width of each dedicated data bus can be customized by analyzing its I/O and memory traffic separately.

Although Extreme Value Theory assures that a multi-cycle data bus transfer is extremely unlikely, a data value wider than the custom data bus may still occur, and must then be transferred in multiple bus cycles. In this case, the custom processor breaks the data up and sends it in chunks. This does not require any extra hardware, as we keep the narrower data in the existing wide bus buffers before sending it over the bus. For instance, a 32-bit data bus needs 32-bit bus buffer(s) to hold the data; when we customize the data bus width, we still keep the narrower data in the 32-bit bus buffers. If the data bus width of the custom processor is 12 bits and a 14-bit data value happens to be sent over the data bus, the 14-bit value is written into the 32-bit bus buffer. In the first cycle,

Fig. 2 Potential embedded system platforms for data bus width estimation and customization: (a) a System-on-a-Chip, where the CPU, on-chip memory and custom core share an on-chip system bus and the custom core attaches through its custom data bus; (b) a System-on-a-Board, where the CPU, I/O, off-chip memory and custom core share a board-level system bus, with the custom core on an off-chip custom data bus; (c) a System-on-a-Chip with point-to-point data links, where Custom Cores 1 and 2 connect to the CPU, on-chip memory and I/O through dedicated custom data buses (Custom Data Bus-1 to Bus-4)

the least significant 12 bits are sent over the data bus, followed by the remaining 2 bits in the second cycle. For this particular case, only two bus cycles are needed to complete the transfer, but the total number of multi-cycle bus transfers depends on the custom data bus width. Although some bus performance and some bus energy must be sacrificed to multi-cycle data transfers, the probability of such transfers can be made so small that they rarely occur, as we show in the following section; their contribution to the total bus energy consumption and their overhead on bus performance are therefore minimal.

5. Statistical and experimental analysis

5.1. Experimental framework

We use the nine embedded applications shown in Table 1, which cover a wide range of areas such as telecommunications, DSP, image processing, bioinformatics and security. adpcm encoder, g721decoder, unepic and mpeg2encoder are telecommunications and image processing applications taken from the MediaBench suite [13]. susan and sha, covering the image processing and security areas, are from the MiBench suite [7]. We also select fir and viterbi from DSP, and k-means clustering from bioinformatics, as further benchmark applications. The second column of the table describes the input probability distribution functions used to generate input samples. All applications use integer arithmetic, and the data values


Table 1 Benchmarks

Benchmark        Input probability distribution
adpcm encoder    Uniform
g721decoder      g721encoder-generated
unepic           epic-compression-generated
mpeg2encoder     Uniform
susan            Uniform
sha              Uniform
fir              Uniform
viterbi          viterbi-encoder-generated
k-means          Uniform
communicated over the data bus are all integers. Each application is assumed to be implemented as a custom processor core, such as an ASIC or FPGA processor.

5.2. Experimental methodology

For each application, the data bus traffic is monitored by collecting the data communicated through the 32-bit I/O and array operations tabulated in Table 2. In our model, variables such as local and non-array global variables are kept in local storage (e.g. registers) within the custom processor, so their values are assumed not to be transferred off the processor and are not monitored during the execution of the application.

Table 2 Monitored activities

Input Activity     Reading from a file or an input stream
Output Activity    Writing to a file or an output stream
Load Operation     Reading an element from an array
Store Operation    Writing to an element in an array

Our experimental framework, shown in Fig. 3, is based on the SUIF [28] compiler; the shaded boxes are the blocks that we implemented within this framework. The framework reads an application written in C and converts it into the SUIF Intermediate Representation (SUIF-IR)

Fig. 3 Our experimental framework: Application in C → SUIF Frontend → SUIF-IR → Instrument SUIF-IR → Instrumented SUIF-IR → SUIF-to-C Translator → C → gcc compile → Instrumented Executable → Data Bus Width Estimation Tool → Estimated Data Bus Width

Fig. 4 The data bus width estimation tool: Input Data Generation produces N input samples; the instrumented executable runs on each sample (Execute & Collect Data), producing raw data; the absolute maximum value of each sample is filtered out (Filter), yielding N MAX values; the GEV parameters μ, σ, ξ are computed by MLE (Compute); and the global maximum data bus value is estimated with a finite probability p_mdt (Estimate), giving the estimated data bus width

using the SUIF frontend. The IR is instrumented by inserting C printf library calls at the locations where data values are communicated from/to I/O and by load and store instructions. The instrumented IR is then translated back into C using the SUIF-to-C translator, and the instrumented C code is compiled with gcc to produce an instrumented executable, which is fed into the data bus width estimation tool.

Figure 4 shows the details of the data bus width estimation tool. First, N random input samples are generated based on the input variables of the application. The application's instrumented executable runs with the random input data to conduct a simulation-based study and to perform the statistical analysis of extreme values. The raw data emitted by the Execute & Collect Data stage is collected after the program has run on all samples. Next, the absolute maximum data bus value is filtered out of each sample. After this filtering process in the Filter stage, N MAX values have been gathered for the N samples; they are fitted to the GEV model by computing the μ, σ and ξ parameters using MLE in the Compute stage. We used EVIM [5], a MATLAB software package for extreme value analysis, to fit the generalized extreme value distributions: we feed the N MAX extreme values, where N is the sample size, to the package, which analyzes the extreme data, estimates the μ, σ and ξ parameters using MLE, and outputs them. The Estimate stage substitutes these parameters into equation 9 to estimate the global maximum data bus value for a selected multi-cycle data bus transfer probability p_mdt. Finally, the data bus width is determined as the bit-width of the estimated global maximum data bus value.

5.3. Input data generation strategy

An input data sample must contain all the data for all input variables required to run the application.
Each input data sample is generated randomly using the probability distribution(s) of the input variable(s), but it may not be possible to know the exact distribution of the input variables for a given application. However, a reliable statistical analysis can still be conducted using the strategy described by Law et al. in [12]. They state that if very little or nothing is known about the probability distribution of a random variable except for its interval [a, b], then a beta distribution on [a, b] with two shape parameters α1 and α2 is a sensible choice for generating random


input data for statistical analysis. The Uniform distribution is a special case of the beta distribution with α1 = α2 = 1. Using this statistical rule of thumb, we can safely generate data using the Uniform distribution within the interval of a variable. The interval can be inferred from the variable's C declaration, such as char, short, int or long. Thus, we use the Uniform distribution to generate input data samples for adpcm, mpeg2encoder, susan, sha, fir and k-means, since we know only the data interval of their input variables. For g721decoder, unepic and viterbi, we take a different approach. Since the inputs of these benchmarks are encoded data (for decoders) or decoded data (for encoders), the input data samples should be drawn from the population of the encoded or decoded data space. For instance, the viterbi decoder must consume input data that has been produced by the viterbi encoder. For this purpose, a random data sample is generated using the Uniform distribution and then passed through the viterbi encoder; the encoded output forms a random input data sample for the viterbi decoder.

5.4. Custom data bus width estimation results

In this section, we compute the estimated data bus width using the flow described in Figure 3 for each benchmark. The sample size, N, is set to 1000, which will be tested for statistical reliability in the following section. 1000 distinct data input sets are generated using the probability distribution of the input variables of the application. Then, the application is run with each of the 1000 data input samples. For each run, we monitor the data bus activity and select the absolute maximum value observed on the data bus; this value is the extreme value for that sample. After running the 1000 data input samples, 1000 distinct extreme values are collected in a table.
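As a sketch of this generation strategy (hypothetical helper names; the signed two's-complement intervals for the C types, including a 32-bit long, are our assumption and are not spelled out in the paper):

```python
import random

# Value intervals inferred from C declarations (assuming signed types,
# 32-bit long)
C_TYPE_INTERVALS = {
    "char":  (-2**7,  2**7 - 1),
    "short": (-2**15, 2**15 - 1),
    "int":   (-2**31, 2**31 - 1),
    "long":  (-2**31, 2**31 - 1),
}

def generate_sample(input_vars):
    """Draw one input sample: a uniformly distributed value for every
    input variable (the beta(1, 1) rule of thumb from [12])."""
    return {name: random.randint(*C_TYPE_INTERVALS[ctype])
            for name, ctype in input_vars.items()}

# 1000 samples for a hypothetical application with two input variables
samples = [generate_sample({"coeff": "short", "x": "int"}) for _ in range(1000)]
```

For a decoder such as viterbi, each generated sample would additionally be passed through the matching encoder before use.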
Later, these extreme data samples are analyzed by the extreme value analysis package to find the best-fitting GEV distribution and its MLE-estimated μ̂, σ̂ and ξ̂ parameters. Once these parameters are estimated, the global maximum data bus value for a selected multi-cycle transfer probability can be computed using equation 9. Table 3 shows the estimated parameter values for each application after 1000 data samples. The last column of the table gives the fitted GEV distribution types and shows that the Weibull distribution is the best-fitting GEV model for the majority of the benchmarks. Using the estimated distribution parameters in Table 3, we make a prediction about the global maximum data bus value using equation 9. After placing μ̂, σ̂ and ξ̂ into the equation, the

Table 3 Fitted GEV distribution types and their parameter estimates

Benchmark        μ̂              σ̂            ξ̂       GEV Distribution
adpcm encoder    32747.02       27.08        −1.36    Weibull
g721decoder      31690.78       2078.5       −1.93    Weibull
unepic           32768          8.39         0.15     Fréchet
mpeg2encoder     6059.74        162.38       −0.99    Weibull
susan            9974.15        273.63       −1.21    Weibull
sha              2136920215.33  3120987.33   −0.27    Weibull
fir              2694307.65     888922.65    −0.03    Gumbel
viterbi          1789549734.74  25355.86     −1.27    Weibull
k-means          352.87         44.65        −0.06    Gumbel

Input Sample   All Data Bus Values           Extreme Data Bus Value
0              1479225, 1555506, 1296898     1555506
1              1536401, 1273334, -1876133    1876133
...            ...                           ...
999            7989330, -5787190, 5912340    7989330

Compute GEV Parameters: μ̂ = 2694307.65, σ̂ = 888922.65, ξ̂ = −0.03

Estimated Global Maximum Data Bus Value (x_max), with p_mdt = 0.8:

x_max = 2694307.65 + (888922.65/(−0.03)) · [(−ln(1 − 0.8))^0.03 − 1] = 2268248.61

DataBus_width = ⌈log2(x_max)⌉ + 1 = ⌈log2(2268248.61)⌉ + 1 = 23 bits

Fig. 5 An illustrative example of extreme bus values, estimating parameters and computing data bus width with a probability of 0.8 for fir

only unknown is the probability of a multi-cycle data transfer, which should be set to a small value to minimize the chance of multi-cycle data transfers. Figure 5 shows an illustrative example of filtering extreme bus values, estimating parameters and computing the data bus width with a probability of 0.8 for fir. The table shows results for 1000 data input samples; the first column denotes the data input sample number, the second column gives all values observed on the data bus after running the program with that input sample, and the last column shows the extreme data bus value filtered out of all data bus values, i.e. the absolute maximum. The 1000 extreme data bus values are then passed to the extreme value analysis software package to compute the μ̂, σ̂ and ξ̂ parameters, as shown in the second phase of the example. Once estimated, they are placed into equation 9 to compute the estimated global maximum data bus value, x_max. To compute it, we must supply the value of the multi-cycle transfer probability, p_mdt, which is selected as 0.8 in this example. The data bus width can then be computed as 23 bits after factoring the multi-cycle transfer probability into the equation. Table 4 shows the estimated data bus widths for various multi-cycle data transfer probability values. The sensitivity of the data bus width is measured by varying the multi-cycle data transfer probability from large values to almost infinitesimally small values. Selecting a low probability of multi-cycle data transfer leads to a wider data bus with fewer multi-cycle data transfers. If a higher probability is selected, the data bus width becomes narrower at the cost of a higher chance of multi-cycle data transfers.
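The Estimate phase of Figure 5 is easy to reproduce. The following Python sketch applies equation 9 (written in its quantile form) with the fir parameters from Table 3; EVIM itself is a MATLAB package, so this is an illustrative re-implementation, not the authors' tool:

```python
import math

def gev_max_value(mu, sigma, xi, p_mdt):
    """Equation 9: the bus value exceeded with probability p_mdt under a
    fitted GEV(mu, sigma, xi) model. Assumes xi != 0; log1p keeps
    -ln(1 - p) accurate for very small p_mdt."""
    return mu + (sigma / xi) * ((-math.log1p(-p_mdt)) ** (-xi) - 1.0)

def bus_width(x_max):
    """Data bus width: bit-width of x_max plus one bit, as in Figure 5."""
    return math.ceil(math.log2(x_max)) + 1

# fir parameters from Table 3
mu, sigma, xi = 2694307.65, 888922.65, -0.03
x_max = gev_max_value(mu, sigma, xi, 0.8)   # ~2268250, cf. Figure 5
width = bus_width(x_max)                     # 23 bits

# Sweeping p_mdt reproduces the fir row of Table 4 (23, 25 and 26 bits)
widths = [bus_width(gev_max_value(mu, sigma, xi, p))
          for p in (0.8, 1e-6, 1e-25)]
```

The same two functions, fed the mpeg2encoder parameters, give x_max ≈ 5961.03 and a 14-bit bus, matching the worked example below.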

Table 4 Estimated data bus widths

p_mdt            0.8   0.5   0.1   10^−6   10^−9   10^−25   10^−100
adpcm encoder    16    16    16    16      16      16       16
g721decoder      16    16    16    16      16      16       16
unepic           17    17    17    17      17      17       17
mpeg2encoder     14    14    14    14      14      14       14
susan            15    15    15    15      15      15       15
sha              32    32    32    32      32      32       32
fir              23    23    24    25      26      26       26
viterbi          32    32    32    32      32      32       32
k-means          10    10    10    11      11      12       12

The probability values vary from 0.8 (i.e. a high likelihood of multi-cycle data transfer) to 10^−100 (i.e. multi-cycle data transfers are extremely unlikely). From this table, we can say that most of the applications (adpcm, viterbi, g721decoder, unepic, mpeg2encoder, susan and sha) are immune to the selected multi-cycle data transfer probability value, even though their estimated global maximum data bus values grow with decreasing multi-cycle data transfer probability. This growth, however, is not large enough to increase the data bus width. For instance, let us examine mpeg2encoder with 0.8 probability. Its estimated global maximum data bus value (x_max) and bus width for p_mdt = 0.8 can be computed as follows after inserting the estimated parameters into equation 9:

x_max = 6059.74 + (162.38/(−0.99)) · [(−ln(1 − 0.8))^0.99 − 1] = 5961.033
DataBus_width = ⌈log2(x_max)⌉ + 1 = ⌈log2(5961.033)⌉ + 1 = 14 bits

Now, let us apply the same formulas for a probability of 10^−100:

x_max = 6059.74 + (162.38/(−0.99)) · [(−ln(1 − 10^−100))^0.99 − 1] = 6223.76
DataBus_width = ⌈log2(x_max)⌉ + 1 = ⌈log2(6223.76)⌉ + 1 = 14 bits

This example clearly shows the immunity of the data bus width to variation in the multi-cycle data transfer probability. Although the estimated global maximum data bus values increase with decreasing probabilities (i.e. from 5961.033 to 6223.76), both are represented by the same data bus width (i.e. 14 bits). The same holds for adpcm, viterbi, g721decoder, unepic, susan and sha. fir and k-means are the only applications that seem to be sensitive to the selected probability, and then only slightly so. In fir, the data bus width stabilizes at 26 bits for probabilities of 10^−9 and lower, for reasons similar to those explained above. The difference between the highest (i.e. 0.8) and lowest (i.e. 10^−100) probabilities for fir is 3 bits. For k-means, stabilization occurs at 12 bits for probabilities of 10^−25 and lower, and the difference between the highest and lowest probabilities is only 2 bits.

Table 5 Empirical multi-cycle data transfer probabilities for 1 million input samples for a theoretical multi-cycle data transfer probability of 10^−6

Benchmark        Number of Data Bus Overflows   Empirical Probability
adpcm encoder    0                              0
g721decoder      0                              0
unepic           0                              0
mpeg2encoder     0                              0
susan            0                              0
sha              0                              0
fir              0                              0
viterbi          0                              0
k-means          0                              0

5.5. Monte Carlo simulation-based test

We also perform a Monte Carlo simulation test to measure the empirical multi-cycle data transfer probability, in order to verify the correctness of the theoretical multi-cycle data transfer probability. We generate a large number of random input samples, execute each application with them, and measure how many times a data value on the bus exceeds the range representable by the estimated data bus width for a selected multi-cycle data transfer probability. To measure the empirical probabilities accurately, we must run each application with a very large number of samples. If our selected sample size of 1000 is adequate, the empirical multi-cycle data transfer probabilities should be smaller than or equal to the theoretical probabilities. If the empirical multi-cycle data transfer probability is greater than its theoretical counterpart, a sampling error was made when estimating the GEV parameters; in that case, 1000 random input samples are not sufficient for statistical modelling and for a reliable estimation of the GEV distribution parameters. We execute each application with 1 million inputs and measure the empirical multi-cycle data transfer probability of each application for a selected theoretical multi-cycle data transfer probability of 10^−6. The probability of 10^−6 implies that we expect at most 1 overflow value on the data bus after running an application with 1 million different input sets. Table 5 presents the number of overflows on the data bus and the empirical multi-cycle data transfer probability for a selected theoretical probability of 10^−6. The empirical probability is computed by dividing the number of overflows by 1 million. Not a single overflow is observed for any benchmark after 1 million different runs.
Thus, the empirical probability of zero is much smaller than the theoretical probability of 10^−6, which supports our initial hypothesis that the GEV model can be reliably fitted with only 1000 samples. The significant implication of fitting a GEV model with only 1000 samples is that we can customize the data bus width very quickly: executing an application with 1000 samples, computing the GEV parameters and estimating the probable data bus width take only a tiny fraction^1 of the custom processor design process.
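The check behind Table 5 amounts to counting, over the 1 million runs, the bus values that do not fit in the estimated width. A sketch (the `observed` values are illustrative; we assume one bit of the width is a sign bit, matching the width formula of Figure 5):

```python
def count_overflows(bus_values, width):
    """Count values needing more than `width` bits, with one bit
    reserved for the sign (our assumption)."""
    limit = 1 << (width - 1)
    return sum(1 for v in bus_values if abs(v) >= limit)

# fir at p_mdt = 10^-6 uses a 25-bit bus (Table 4), so the limit is 2^24
observed = [1555506, -1876133, 7989330, 2268248]   # sample fir bus values
overflows = count_overflows(observed, 25)           # 0
empirical_p = overflows / 1_000_000                 # 0.0, as in Table 5
```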

^1 It takes less than half an hour for the longest application using a Pentium-4 PC.

Table 6 Percentage reduction in the number of data bus lines

Benchmark        Original Data Bus Width   Estimated Data Bus Width for p_mdt = 10^−6   Reduction
adpcm encoder    32                        16                                           50%
g721decoder      32                        16                                           50%
unepic           32                        17                                           46.875%
mpeg2encoder     32                        14                                           56.25%
susan            32                        15                                           53.125%
sha              32                        32                                           0%
fir              32                        25                                           21.875%
viterbi          32                        32                                           0%
k-means          32                        11                                           65.625%
MEAN                                                                                    38.2%

6. Analytical and comparative study

6.1. Data bus width and cost reduction

We present the reduction rates in the number of data bus lines in Table 6. We select the multi-cycle data transfer probability as 10^−6, since the Monte Carlo test shows it to be a reliable choice for all applications. This does not, of course, guarantee that there will be no multi-cycle data transfers, but it does imply that their occurrence is very unlikely. The Reduction column in the same table also shows the I/O pin or pad cost reduction rate if the data bus is customized for an off-chip interconnection. The data bus widths and costs for all benchmarks, except viterbi and sha, can be reduced significantly. The average reduction in the data bus width and cost is 38% over nine benchmarks and can be as high as 66%, as in the case of the k-means clustering application.

6.2. Data bus energy reduction

In this section, we present analytical models for off-chip and on-chip data bus energy reduction.

6.2.1. Off-chip data bus

For off-chip buses, wire capacitance is a few orders of magnitude larger than interwire capacitance, so the interwire capacitance term in equation 2 is negligible; we can safely assume C_wire ≫ C_inter. The bus energy model of an off-chip data bus with X lines is then:

E_off-chip = V_dd^2 · C_wire · Σ_{i=1}^{X} SW_i

If we assume the worst-case scenario in which all bus lines switch at the same time, the off-chip bus energy equations for the original and narrower data buses become:

E_off-chip,original = V_dd^2 · C_wire · N,    E_off-chip,reduced = V_dd^2 · C_wire · M    (11)

Here, N and M are the numbers of bus lines of the original and the narrower data buses. The percentage reduction in bus energy consumption using a narrower off-chip data bus is then:

% Reduction in Off-chip Bus Energy = 100 · (N − M)/N    (12)

The bus energy reduction rates for the narrower off-chip data bus are exactly the same as the Reduction column in Table 6, since the reduction rate depends only on the M and N values.

6.2.2. On-chip data bus

In deep sub-micron technologies, large interwire capacitances must be dealt with in order to keep bus energy consumption at an acceptable level. Our claim is that reducing the number of bus lines enables a potentially large reduction in the interwire capacitance, and therefore in the on-chip bus energy consumption. This can be done by increasing the spacing between the lines, since equation 1 states that capacitance is inversely proportional to the spacing of parallel wires. We derive simplified energy consumption equations for both the original and the narrower data bus models from equation 2. As in the off-chip case, if we assume the worst-case scenario in which all bus lines switch at the same time, the on-chip data bus energy equation is:

E_on-chip,original = V_dd^2 · (C_wire · N + C_inter · (N − 1))    (13)

Now, we need to derive the interwire capacitance of the narrower data bus. The main idea is to space the remaining data bus lines equally so as to use the silicon area emptied by the eliminated bus lines. Assuming the original spacing is 1 unit, the new spacing between the wires is:

r = 1 + (N − M)/(M − 1) = (N − 1)/(M − 1)    (14)

Again, N and M represent the original number of data bus lines and the reduced number estimated by our method. For instance, if the original data bus width is 8 and the narrower data bus width, after our estimation method has been applied, is 6, the new spacing between the bus lines, r, becomes 1.4 units; that is, each wire is given an equal share (0.4 units) of the freed silicon area. Since the interwire capacitance is inversely proportional to the spacing between the wires, the energy consumption of the narrower data bus is:

E_on-chip,reduced = V_dd^2 · (C_wire · M + (C_inter/r) · (M − 1))

Replacing r using equation 14:

E_on-chip,reduced = V_dd^2 · (C_wire · M + C_inter · (M − 1)^2/(N − 1))    (15)


[Figure 6: bar chart of the percentage reduction in on-chip data bus energy for adpcm encoder, fir, g721decoder, k-means, unepic, mpeg2enc, susan and MEAN, at feature sizes of 3 μm, 1.5 μm, 1 μm, 0.25 μm, 90 nm and 65 nm]

Fig. 6 On-chip data bus energy reduction rates for different feature sizes

On-chip bus energy reduction depends on the wire and interwire capacitances, which, in turn, depend on feature size. As the feature size drops below the sub-micron level, the interwire capacitance of on-chip buses can become a few orders of magnitude larger than the wire capacitance. Therefore, the energy reduction rates of the narrower data bus are expected to increase as feature size decreases. We use several feature sizes from 3 μm (i.e. above one micron) to 65 nm (i.e. deep sub-micron) when presenting the bus energy reduction rates. The wire and interwire capacitances for each feature size are computed using the capacitance-feature size trend graph provided in [18]. Figure 6 shows the on-chip data bus energy reduction rates for the various feature sizes. viterbi and sha are not shown in the graph since their data bus size cannot be reduced. The last column represents the average results for the seven benchmarks. For the 3 μm, 1.5 μm and 1 μm feature sizes, the reduction rate for each benchmark rises moderately. However, once the feature size goes below 1 μm, all benchmarks see a large boost in the reduction rates because the magnitude of the interwire capacitance becomes much greater than that of the wire capacitance. The narrower data bus can offset this increase in interwire capacitance by increasing the spacing between wires. Thus, the difference in bus energy consumption between the original data bus and the narrower data bus widens in deep sub-micron technologies. In summary, k-means enjoys a reduction rate of 89% at 65 nm but only 75% at 1 μm. Similarly, the reduction rate for mpeg2encoder rises from 66% at 1 μm to 82% at 65 nm. Using 65 nm technology, an average energy reduction of 74% in the custom data bus can be attained.

6.3. On-chip data bus delay reduction

The narrower data bus with increased spacing between the wires also allows the data bus to have a shorter propagation delay than the original data bus. We derive the following bus delay equations for both buses using equation 3, given in Section 3.2, and assuming the worst-case


switching scenario for the neighboring bus lines (i.e. g is replaced with (1 + 4 · C_inter/C_wire)):

τ_original = (1 + 4 · C_inter/C_wire) · [C_wire · L · (0.38 · R_wire · L · C_wire + 0.69 · R_D)]    (16)

When we increase the spacing between the bus lines, the interwire capacitance is reduced by the factor r given by equation 14:

τ_reduced = (1 + 4 · C_inter/(r · C_wire)) · [C_wire · L · (0.38 · R_wire · L · C_wire + 0.69 · R_D)]

τ_reduced = (1 + 4 · ((M − 1)/(N − 1)) · C_inter/C_wire) · [C_wire · L · (0.38 · R_wire · L · C_wire + 0.69 · R_D)]    (17)

Finally, the percentage improvement in the bus propagation delay is:

% Reduction in Bus Propagation Delay = 100 · [(1 + 4 · C_inter/C_wire) − (1 + 4 · ((M − 1)/(N − 1)) · C_inter/C_wire)] / (1 + 4 · C_inter/C_wire)    (18)
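Equation 18 depends only on N, M and the capacitance ratio, since the bracketed RC term cancels. A sketch with a hypothetical C_inter/C_wire ratio (the real per-feature-size ratios come from the trend graph in [18]):

```python
def delay_reduction(n, m, cap_ratio):
    """Percentage bus propagation delay reduction (equation 18);
    cap_ratio = C_inter/C_wire."""
    g_orig = 1 + 4 * cap_ratio                        # original coupling factor
    g_red = 1 + 4 * cap_ratio * (m - 1) / (n - 1)     # spacing widened by r
    return 100.0 * (g_orig - g_red) / g_orig

# k-means (32 -> 11 lines) with a hypothetical C_inter/C_wire of 5
reduction = delay_reduction(32, 11, 5.0)              # ~64.5%
```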

Using equation 18, we can estimate the approximate improvement in the data bus propagation delay of the narrower data bus model. As in Section 6.2.2, we use several feature sizes from 3 μm to 65 nm, and compute the reduction rates in the bus propagation delay, together with the wire and interwire capacitances for each feature size, using the capacitance-feature size trend graph provided in [18]. Figure 7 shows the percentage reduction in the on-chip data bus propagation delay for each application; the last column represents the average results for the seven benchmarks. As with the on-chip data bus energy reduction results, the reduction rate for each benchmark rises moderately for feature sizes above 1 micron. However, as soon as the feature size drops to the sub-micron level, the reduction rates for all benchmarks improve significantly. The greatest reduction in the bus delay is 68% for k-means at the 65 nm feature size. The bus delay of mpeg2encoder can be reduced by 58% at the same feature size. An average reduction of 51% in the bus propagation delay can be obtained for the seven benchmarks in 65 nm technology.

[Figure 7: bar chart of the percentage reduction in on-chip data bus propagation delay for adpcm encoder, fir, g721decoder, k-means, unepic, mpeg2enc, susan and MEAN, at feature sizes of 3 μm, 1.5 μm, 1 μm, 0.25 μm, 90 nm and 65 nm]

Fig. 7 On-chip data bus propagation delay reduction rates for different feature sizes

Improvement in the data bus propagation delay can potentially allow the system designer or CAD tools to drive the narrower data bus at a higher frequency than the original data bus. If bus performance rather than bus power consumption is the major concern, our estimation method enables much faster on-chip data buses.

7. Discussion and conclusion

We have proposed an estimation method to customize the data bus width to the requirements of an application running on a custom processor. Our estimation method is a simulation-based tool that employs statistical sampling and Extreme Value Theory to estimate the data bus width for which a multi-cycle data transfer over the bus is highly unlikely. Reducing the data bus width allows lower-cost, lower-power and potentially faster data buses. The potential target systems are embedded platforms where a custom processor such as an ASIC or FPGA is connected to memory, I/O and other processors through a shared bus or point-to-point data links. The custom processor can be part of a system-on-a-chip or a system-on-a-board, and the estimation method can customize both on-chip and off-chip data buses. Our estimation method is automatic in the sense that it can run as a standalone tool that generates a sufficient number of random input samples, computes the probability distribution parameters and then outputs the custom data bus width.
We also demonstrate that it can be very fast: 1000 random input samples suffice for a reliable statistical estimation. We have performed an experimental analysis of nine embedded applications to estimate bus width and cost reduction, and have presented analytical models of bus energy consumption and bus propagation delay. Our results show that the data bus width and off-chip data bus cost can be reduced by up to 66%, with an average of 38%. Reducing the data bus width creates extra space by eliminating some of the bus lines, which offers the opportunity of increasing the spacing between the remaining bus lines. This reduces interwire capacitance, which leads to a significant reduction in bus energy consumption of up to 89%, with an average of 74%, for on-chip data buses. It is also possible to improve the bus propagation delay if bus speed is the primary concern: on-chip data bus propagation delay can be reduced by up to 68%, with an average of 51%. In this work, we concentrate only on the custom design of data buses with integer data traffic. However, the very same method can be used to customize address buses if the virtual and physical address spaces are the same; in that case, our estimation method can customize the address bus without any change in the methodology. Similarly, our estimation method is also applicable to fixed-point data values on the data bus. The only difference is that the integral and fractional parts of the data values must be analyzed separately. After estimating the maximum values of the integral and fractional parts, the total data bus width is computed as the sum of their individual bit-widths.


References

1. Chatterjee, A., P. Ellervee, V.J. Mooney, J.C. Park, K-W. Choi, and K. Puttaswamy. System level power-performance trade-offs in embedded systems using voltage and frequency scaling of off-chip buses and memory. In Proceedings of the 15th International Symposium on System Synthesis, 2002.
2. Citron, D. Exploiting Low Entropy to Reduce Wire Delay. In Computer Architecture Letters, volume 3, Jan. 2004.
3. Evans, M.J. and J.S. Rosenthal. Probability and Statistics: The Science of Uncertainty. W.H. Freeman and Company, New York, 2004.
4. Gasteier, M. and M. Glesner. Bus-based communication synthesis on system level. ACM Transactions on Design Automation of Electronic Systems, 4(1), Jan. 1999.
5. Gençay, R., F. Selçuk, and A. Ulugülyaǧcı. EVIM: A Software Package for Extreme Value Analysis in MATLAB. Studies in Nonlinear Dynamics and Econometrics, 5(3), Oct. 2001.
6. Givargis, T. and F. Vahid. Interface Exploration for Reduced Power in Core-Based Systems. In International Symposium on System Synthesis (ISSS), Dec. 1998.
7. Guthaus, M.R., J.S. Ringenberg, D. Ernst, T.M. Austin, T. Mudge, and R.B. Brown. MiBench: A Free, Commercially Representative Embedded Benchmark Suite. In the IEEE 4th Annual Workshop on Workload Characterization, Austin, TX, Dec. 2001.
8. Ho, R., W. Mai, and M.A. Horowitz. The Future of Wires. Proceedings of the IEEE, 89(4), April 2001.
9. Komatsu, S., M. Ikeda, and K. Asada. Bus Data Encoding with Coupling-driven Adaptive Code-book Method for Low Power Data Transmission. In 27th European Solid-State Circuits Conference, Villach, Austria, Sep. 2001.
10. Kotz, S. and S. Nadarajah. Extreme Value Distributions: Theory and Applications. Imperial College Press, 2000.
11. Kretzschmar, C., R. Siegmund, and D. Müller. Adaptive Bus Encoding Technique for Switching Activity Reduced Data Transfer over Wide System Buses. In Integrated Circuit Design. Power and Timing Modeling, Optimization and Simulation: 10th International Workshop (PATMOS 2000), Sep. 2000.
12. Law, A.M. and W.D. Kelton. Simulation Modeling and Analysis. McGraw-Hill, 3rd ed., 2000.
13. Lee, C., M. Potkonjak, and W.H. Mangione-Smith. MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems. In Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture (Micro-30), Raleigh, NC, Dec. 1997.
14. Lemieux, J. Introduction to ARM Thumb. http://www.embedded.com/showarticle.jhtml?articleid=15200241, 2003.
15. Lv, T., J. Henkel, H. Lekatsas, and W. Wolf. An Adaptive Dictionary Encoding Scheme for SOC Data Buses. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE), 2002.
16. Narayan, S. and D.D. Gajski. Synthesis of System-Level Bus Interfaces. In Proceedings of the European Conference on Design Automation (EDAC), 1994.
17. Pandey, S., H. Zimmer, M. Glesner, and M. Mühlhäuser. High level hardware/software communication estimation in shared memory architecture. In IEEE International Symposium on Circuits and Systems (ISCAS 2005), Kobe, Japan, May 2005.
18. Rabaey, J.M., A. Chandrakasan, and B. Nikolić. Digital Integrated Circuits. Prentice Hall Electronics and VLSI Series, 2003.
19. Ramprasad, S., N.R. Shanbhag, and I.N. Hajj. A Coding Framework for Low-power Address and Data Busses. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 7(2), June 1999.
20. Reiss, R.D. and M. Thomas. Statistical Analysis of Extreme Values. Birkhäuser Verlag, Basel, Switzerland, 1997.
21. Shin, Y. and T. Sakurai. Coupling-Driven Bus Design for Low-Power Application-Specific Systems. In the 38th Annual ACM/IEEE Design Automation Conference, Las Vegas, 2001.
22. Sotiriadis, P.P. and A. Chandrakasan. Low Power Bus Coding Techniques Considering Inter-wire Capacitances. In CICC 2000, May 2000.
23. Sotiriadis, P. and A. Chandrakasan. Reducing Bus Delay in Submicron Technology Using Coding. In IEEE Asia and South Pacific Design Automation Conference, 2001.
24. Stan, M.R. and W.P. Burleson. Bus-Invert Coding for Low Power I/O. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 3(1), March 1995.
25. Suresh, D.C., B. Agrawal, J. Yang, W. Najjar, and L. Bhuyan. Power Efficient Encoding Techniques for Off-chip Data Buses. In Proceedings of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems, 2003.
26. Victor, B. and K. Keutzer. Bus Encoding to Prevent Crosstalk Delay. In International Conference on Computer-Aided Design (ICCAD '01), Nov. 2001.
27. Williams, J. HCS12 External Bus Design. Application Note, Freescale Semiconductor, Aug. 2004.
28. Wilson, R.P., R.S. French, C.S. Wilson, S. Amarasinghe, J.M. Anderson, S.W.K. Tjiang, S.W. Liao, C.W. Tseng, M.W. Hall, M.S. Lam, and J.L. Hennessy. SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers. Technical report, Computer Systems Laboratory, Stanford University, 1994.
29. Yang, J. and R. Gupta. FV Encoding for Low-Power Data I/O. In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, Aug. 2001.
