Modeling and Optimization Techniques for Yield-Aware SRAM Post-Silicon Tuning

Ashish K. Singh, Ku He, Member, IEEE, Constantine Caramanis, Member, IEEE, and Michael Orshansky, Senior Member, IEEE
Abstract—SRAM cell design is driven by the need to satisfy several stability and performance criteria for all cells in the array in an energy-efficient manner. Significant randomness of FET threshold voltages makes achieving this difficult and limits both the minimum cell size and the minimum array supply voltage. Post-silicon adaptivity in the form of an adaptive-voltage scheme in a partitioned SRAM array can be used to reduce the impact of variability despite the lack of any spatial correlation in its realizations. This paper develops a novel optimization flow for yield-aware cell sizing and voltage selection under variability given the availability of post-silicon voltage tuning. We formulate a two-stage stochastic optimization problem in which the first-stage decision is to select the cell size and the possible voltage levels, and the second-stage decision is to assign each partition to an optimal voltage after manufacturing. We develop closed-form statistical models of array margin behavior and yield as a function of Vdd, cell size, and array size. We solve the problem using dynamic programming that minimizes power while meeting yield constraints on read, write, and static noise margins. The proposed flow allows designs that are on average 8% and up to 17% more power-efficient than designs in which voltages are selected uniformly. The results also indicate that at high yield levels power savings can be up to 32% in the active mode and 71% in the standby mode.

Index Terms—Adaptive optimization, low-power SRAM, post-silicon adaptivity, statistical optimization.
I. INTRODUCTION

In nanometer technologies, increasing process variation significantly impacts circuit yield. The impact of variability on the design of large SRAM arrays is especially severe. Some patterns of variability are highly systematic, such as those in photolithography and chemical-mechanical polishing. These spatially correlated components of variation have a spatial scale that is larger than typical arrays, motivating us to focus on the random, uncorrelated patterns of variation. Among the random variability patterns, the threshold voltage variation due to random dopant placement is paramount. SRAM cells are typically sized to be of minimum area, because of the
tight layout requirements for large arrays. Because the variance of threshold voltage variation is inversely proportional to the transistor area, the Vth variance of small transistors is large and growing [1]. SRAM cell design is driven by the need to satisfy the static noise margin (SNM), write margin (WRM), and read current margin (RCM) over all cells in the array, and these constraints determine both the minimum cell size and the supply voltage. Increasing cell area and supply voltage can ensure that the noise margins are met. The requirement to meet noise margin constraints sets the limit on the smallest possible cell size and also on the minimum usable supply voltage Vdd for the array, commonly known as Vmin. Because the threshold voltage variations impacting the cells in the SRAM array are independent, the margin specification needs to be met at very high sigma corners, five or six sigma, in order to reach acceptable yield, requiring significant cell upsizing and increased Vmin [2], [3]. SRAM cell area is an important metric of the success of technology scaling, and variability makes such traditional scaling hard to sustain. Thus, reducing the negative impact of random Vth variability on SRAM area is an important goal [4].

One effective strategy for dealing with variability is post-silicon adaptivity. Adaptive circuit-level solutions, such as adaptive supply voltage and adaptive body bias, have been employed to increase frequency, reduce standby leakage, and reduce switching power in logic circuits [5]–[8]. Earlier work has addressed techniques for mitigating the impact of global die-to-die variation on the operation of SRAM arrays. In [9] and [10], adaptive body bias is employed to increase SRAM yield. Specifically, the read and hold failures at low Vth corners are reduced through reverse body bias, while at high Vth corners forward body bias reduces the write and read failures. The concept of column-based voltage assignment to reduce the voltage overhead due to intradie variability was described in [11]; however, an analysis of the optimal design strategy was not given.

In [12], we presented a novel architecture in which an intraarray adaptive voltage scheme can be practically realized to reduce the overhead of high-sigma margin design on Vdd and cell area. The key idea is to be able to shift the empirical distributions (realizations) of the design margins in a partition to meet the target specification. Because a partition is smaller than the whole array, the tail of the extreme-value Gumbel distribution is significantly reduced. For the partitions whose worst margin violates the specification, a higher voltage is selected to gain yield; otherwise the voltage is reduced for power
reduction. Thus, we accept a larger single-cell spread, due to sizing down the cell, and compensate for it by post-silicon adjustment of the empirical realization of margins in the partition. Partition-based voltage assignment means that the entire set of realized bitcell margins in a partition is shifted. That ability allows us to accomplish one of two things: 1) for a fixed cell area, we are able to reduce the supply voltage in partitions whose worst realizations are below the relevant constraint, and thus reduce the average supply voltage, or 2) we are able to reduce the cell size, which results in a larger spread in Vth, but which can now be tolerated because of the presence of tunability. The cost of realizing the scheme includes generating additional voltage levels, routing extra signals, and added control logic. The cost, however, is manageable because only a small number of discrete but optimally chosen voltage levels are needed: even with 2–4 voltage levels the architecture allows significant cell area and power reduction.

The work in [12] demonstrated the effectiveness of intraarray tuning for active-mode power reduction in SRAM arrays. Yet post-silicon tuning is also very effective in improving sleep-mode power consumption. In sleep-mode operation, the voltage of SRAM arrays is often reduced to a level that guarantees that cell contents are not destroyed. The minimum voltage at which a cell reliably holds its state, known as the data retention voltage (DRV), is determined by its SNM; the other margins are not relevant in the sleep mode. The DRV for an array determines the amount of power reduction in the sleep mode. The array DRV is set by the largest DRV over all the cells in an array [13]: for large arrays, the 6σ point of the DRV distribution is almost 3× the mean value. At the same time, leakage power has an exponential dependence on Vdd: a small Vdd reduction leads to a significant leakage power reduction. The use of intraarray voltage tuning allows reducing the mean DRV for large arrays, leading to significant leakage power savings.

In this paper, we develop a theoretical modeling and optimization framework for intraarray adaptive voltage design. The developed framework permits rigorously taking into account the availability of post-silicon tuning during SRAM cell sizing. It also allows optimally selecting the voltages available during post-silicon tuning. The selection is important: we show that a naive strategy significantly underperforms in terms of achievable power reduction. We develop models to quantitatively predict the statistics of the noise margins as a function of the partitioning strategy. We use the framework of two-stage stochastic optimization to capture the interaction between design-time sizing and post-silicon voltage adaptivity. The optimization strategy of [12], which is based on nonlinear optimization, works well when only a single design margin is involved. However, it is not easily amenable to simultaneously handling multiple margins, e.g., RCM, WRM, and SNM, that exhibit stochastic correlation due to their common dependence on Vth. The earlier framework is also limited to objective functions that are linear in the mean Vdd. The proposed solution method is based on dynamic programming that exploits the recursive structure of the objective function to solve a much more general optimization problem [14].
Fig. 1. Design specification increases due to long tail of Gumbel distribution for high number of cells.
The paper is organized as follows. In Section II, the architecture of a voltage-tunable SRAM array is described. Section III develops the framework for co-optimization of cell sizing and post-silicon voltage tuning using dynamic programming to handle multiple correlated design margins. Section IV presents the results of numerical experiments.

II. POST-SILICON TUNING OF SRAM ARRAYS

The challenge of SRAM design under variability comes from the need to simultaneously satisfy several noise margins, all of which are impacted by variability. In the active mode, SRAM yield is set by the need to satisfy three noise margins: RCM, WRM, and SNM. The SNM is defined as the minimum voltage noise required to flip the state of the cell. We adopt the dynamic WRM definition [15]. Under this interpretation, the WRM aims to ensure that a successful write operation is performed in the period during which the wordline is turned on. The RCM is needed to ensure that there is enough time to build a sufficient bitline voltage difference for the sense amplifier during the cell read; it is captured by measuring the cell read delay time. In the sleep mode, the supply voltage Vdd is lowered to minimize leakage; the minimum possible Vdd is the voltage at which the SNM is still positive.

The satisfaction of noise margins is driven by the maximum of the cell margins in an array. The maximum over a large number of random variables can be described by extreme order statistics [16], [17]. For a typical SRAM consisting of a large number of cells, the worst-case margins are asymptotically distributed according to a Gumbel distribution, which is characterized by long tails (Fig. 1). This requires that each cell's margins be sufficiently good at high-sigma corners in order to achieve the target yield. Post-silicon tuning is possible through partitioning the SRAM array into a set of tunable blocks which can be set to different supply voltages. Due to the smaller number of cells in each block, the realization of the worst margin is significantly smaller (indeed, exponentially smaller [18]).
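To make the effect of the long tail concrete, the following small numerical sketch (ours, not from the paper; the normalized N(0, 1) margin, the 1 Mb array size, and the 16 kb partition size are illustrative assumptions) computes the yield quantile of the worst-case margin directly from Pr[max ≤ t] = Φ(t)^n, showing how much the required design corner relaxes when only a partition, rather than the full array, must be covered.

```python
# Illustrative sketch only: quantile of the worst-case margin for an array of
# n i.i.d. N(0, 1) cell margins, using Pr[max <= t] = Phi(t)**n.
from scipy.stats import norm

def worst_case_quantile(n_cells: int, q: float) -> float:
    # Solve Phi(t)**n_cells = q for t.
    return norm.ppf(q ** (1.0 / n_cells))

for n in (1 << 20, 1 << 14):   # assumed sizes: full 1 Mb array vs. a 16 kb partition
    print(f"n = {n:8d}: 99%-yield worst-case margin at {worst_case_quantile(n, 0.99):.2f} sigma")
```

Under these assumed sizes the required corner drops from roughly 5.6σ for the full array to about 4.9σ for a partition, which is the slack that the adaptive voltage scheme then converts into power or area savings.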
As shown in Fig. 2, because each partition contains only a subset of all the cells in the array, there will be many partitions in which the worst realizations among the cells will still be less than the upper bound.

Fig. 2. Adjustment of a partition's Vdd shifts access times of all cells in a partition to fix violations or save power.

In the SRAM architecture that implements intraarray voltage islands [12], the array is partitioned in a row-by-row manner, as shown in Fig. 3. The partitions are sets of rows that can operate at distinct voltage levels. The supply voltage of all the bitcells in the partition and the corresponding wordlines can be set to one of the allowed voltage levels. The voltage levels between partitions can be different. The voltages are generated by the on-chip voltage regulators and can be delivered to individual blocks through a pMOS switch network (see [19]), as shown in Fig. 4. After the SRAM is manufactured, the realized values of the design margins in each partition are measured. Based on a partition's worst-case margin realization, a voltage is selected for that particular partition from a set of available voltage levels. The possibility of low-cost measurement of the margins after manufacturing has been demonstrated, for example, in [20].

Fig. 3. SRAM architecture using row-based multiple voltage control.

Fig. 4. pMOS switch network for a partition selects one of the available voltages.

III. CO-OPTIMIZATION OF SRAM SIZING AND POST-SILICON TUNING VIA DYNAMIC PROGRAMMING

This section develops an optimization framework for optimally designing and understanding the impact of a small number of distinct supply voltage levels that can be adjusted in independently tunable blocks of cells. The developed optimization allows optimal voltage-level selection for post-silicon tuning and optimal cell sizing given the availability of tuning.

A. Model of the Margin Distribution and Array Yield

We assume that multiple distinct design criteria, or design margins, need to be simultaneously satisfied for each cell in an array to operate properly. We use the term margin to refer to all performance and stability design criteria that need to be satisfied. Let there be t cell design margins of interest. For each margin, we can assume without loss of generality that we have an upper bound, since a margin for which we have a lower bound can be negated to get an equivalent upper bound (for RCM, there is an upper limit on the cell read time; for SNM and WRM, the acceptable values are lower-bounded). We define a random vector margin = (m1, . . . , mt) that for each cell contains the t design margins as components. We model the vector of margin values as following a multivariate normal distribution. The model is adopted because the underlying random behavior of Vth is Gaussian and because the dependence of individual margins on small deviations of Vth from its mean value can be approximated linearly. We denote the mean and the covariance matrix of margin as μ(W, Vdd) and Σ(W, Vdd), where we explicitly capture the dependence of the moments of the Gaussian margin vector distribution on the decision variables. The mean vector and the covariance matrix are characterized empirically; we describe this process in the following section.

Meeting the design objectives for an entire SRAM array requires considering the maximum margin values over all cells in the array. Given the randomness of the margin vector, the maximum margin is a random vector obtained by taking the component-wise maximum over the random margin vectors for each cell. Formally, we define the vector
margin^max(W, Vdd) = max(margin_1, . . . , margin_n), where the max is taken component-wise for each of the t components of the cell margin vector and n is the total number of cells in the array. The array yield is given by

\[ Y(W, V_{dd}) = \Pr\big( \mathrm{margin}^{\max}(W, V_{dd}) \le \mathrm{margin}^{t} \big) \tag{1} \]

where margin^t is the vector of target values for the t components of the margin vector. The above probability can be easily computed from the single-cell probabilities because of the independence of the random behavior of the individual cells. We rely on the fact that for independent and identically distributed random variables x_i, Pr[max(x_1, . . . , x_n) < t] = Pr[x_1 < t]^n. Therefore, we can write

\[ Y(W, V_{dd}) = \Pr\big( \mathrm{margin}(W, V_{dd}) \le \mathrm{margin}^{t} \big)^{n}. \tag{2} \]
B. Adaptive Power Optimization: Continuous Formulation

We start by stating the problem of minimization of mean power under yield constraints for a traditional design paradigm, i.e., when voltage levels cannot be set independently for individual blocks. In this case, the goal is to find the optimal cell sizing and supply voltage so that power is minimized while the yield constraints are met. Taking into account the impact of threshold voltage variability on cell-level power consumption, we denote by Pmean the single-cell mean power under Vth variability. Letting Yt be the target yield, we formulate the following optimization problem:

\[ \min_{W, V_{dd}}\; n\, P_{mean}(W, V_{dd}) \quad \text{s.t.}\quad Y(W, V_{dd}) \ge Y_t. \tag{3} \]
This is the point of departure for the novel architecture we propose. We extend this formulation to a setting where voltage levels can be set independently in p tunable blocks. Initially, for the purpose of exposition, we assume that the voltage can be tuned continuously in each block. Then we formulate the full problem, where each block's voltage can be tuned to one of a finite set of preselected voltages. Computing the optimal choice for these preselected voltages is the core technical problem we solve.

Suppose, then, that the voltage in each block can be tuned continuously after manufacturing. Adapting the voltage happens in post-silicon tuning, so we have a two-stage optimization problem: the sizing, W, is determined at design time, and in the second stage, after manufacturing, each of the p tunable blocks, with m cells in each block so that p · m = n, selects the optimal Vdd in order to minimize power while meeting the yield constraints. We denote this optimal, adaptive voltage in each block by Vadapt,i = Vnominal + ΔVadapt,i. Note that here we assume it can take a continuous value; the next section considers the key setting where it can only take one of k preselected values. The modified margin vector is represented by margin_i(W, Vadapt,i). The yield can now be computed as

\[ Y(W, V_{adapt}) = \Pr\Big( \max_{1 \le i \le p} \mathrm{margin}_i(W, V_{adapt,i}) \le \mathrm{margin}^{t} \Big). \tag{4} \]

Since Vadapt is uniquely specified by the threshold voltages of all array transistors, by a slight abuse of notation, instead of writing Vth explicitly as a random parameter and Vadapt as a function of it, we simply treat Vadapt directly as a random variable. Thus, accounting for the different supply voltages across array blocks and the post-silicon adaptation, the mean power is now computed as E_{Vadapt}[Pmean(W, Vadapt)]. The optimization problem, including the optimal (continuous) adaptation in each block to the realizations of the threshold voltage in each cell, now becomes

\[ \min_{W}\; n \cdot \mathbb{E}_{V_{adapt}}\big[ P_{mean}(W, V_{adapt}) \big] \quad \text{s.t.}\quad Y(W, V_{adapt}) \ge Y_t. \tag{5} \]
We next move to the core issue: the optimization problem when Vadapt cannot be chosen continuously, but instead must take values from a preselected set. The critical problem will then be to optimally select this finite set.

C. Adaptive Power Optimization: Discrete Formulation and Solution

This section contains the technical core of the paper: solving the problem for the setting where Vadapt,i is restricted to k preselected voltage levels {v1, . . . , vk}. Thus, the value of the optimization is a function of these k voltage levels:

\[ \min_{W,\{v_1,\ldots,v_k\}}\; n \cdot \mathbb{E}_{V_{adapt}}\big[ P_{mean}(W, [v_1, \ldots, v_k]) \big] \quad \text{s.t.}\quad Y(W, V_{adapt}) \ge Y_t. \tag{6} \]
We now formally introduce the function φ(v1, . . . , vk, W) = E_{Vadapt}[Pmean(W, [v1, . . . , vk])], which encodes the objective. Optimizing φ over W and the k voltage levels is nontrivial, since the {v1, . . . , vk} are effectively parameters of the optimization problem. As is well known, the value of an optimization problem as a function of its parameters is typically not convex; this is no different here: φ is not a convex function of {v1, . . . , vk}. However, as we show, φ has a particular recursive structure that allows us to solve the problem optimally and efficiently using a dynamic programming approach.

Due to the equal size of each partition, the adaptive voltage is independent and identically distributed across the different blocks (again, we recall that we are treating Vadapt as a random variable, rather than explicitly writing it as a function of the random variable Vth). Hence, the yield requirement splits equally among the partitions, and we let Yt,i = Yt^{1/p} denote the yield target for partition i. Let Yi(W, Vadapt,i) represent the yield of block i assuming that voltage Vadapt,i is used. This yield is given by

\[ Y_i(W, V_{adapt,i}) = \Pr\big( \mathrm{margin}(W, V_{adapt,i}) \le \mathrm{margin}^{t} \big)^{m}. \tag{7} \]
Let [Vmin, Vmax] be the allowed range for the adaptive voltage. First, we note two degenerate cases. Suppose Yi(W, Vmax) is smaller than Yt,i. Then the problem is infeasible, since even with the highest available voltage the desired yield cannot be achieved. On the other hand, if Yi(W, Vmin) > Yt,i, we can simply use Vmin as the supply voltage for all partitions and still meet the yield constraints while keeping the average power minimal. In these two cases, we obtain the result immediately without recourse to the full optimization flow.
Ruling out these two possibilities, there must exist an intermediate voltage Vupper in the allowed supply voltage range such that Yi(W, Vupper) = Yt,i. It is clear that we never need to use an adaptive voltage above Vupper, and that the range of possible values of the adaptive voltage is [Vmin, Vupper]. The key to our derivation is the realization that for any voltage v in this range we can compute the conditional probability that we are able to use it or a voltage below it, and that this probability is given by Yi(W, v)/Yt,i. To summarize, the distribution of the adaptive voltage, under the assumption that the assigned voltages are continuous, is given by

\[ F(v) = \begin{cases} 0 & \text{if } v < V_{min} \\ \dfrac{Y_i(W, v)}{Y_{t,i}} & \text{if } v \in [V_{min}, V_{upper}] \\ 1 & \text{if } v \ge V_{upper}. \end{cases} \tag{8} \]

We can now compute the mean power for the above probability distribution of Vadapt, first stating it for the case of continuous Vadapt:

\[ \mathbb{E}_{V_{adapt}}\big[ P_{mean}(W, V_{adapt}) \big] = \int_{V_{min}}^{V_{upper}} P_{mean}(W, v)\, dF(v). \tag{9} \]
We can now rewrite the above equation for the case when the voltages can take only discrete values from a predefined set {v1, . . . , vk} of size k. Given the general form of the expectation, this is immediate:

\[ \mathbb{E}_{V_{adapt}}\big[ P_{mean}(W, [v_1, \ldots, v_k]) \big] = \sum_{i=1}^{k} P_{mean}(W, v_i) \cdot \big( F(v_i) - F(v_{i-1}) \big) \tag{10} \]

where F(v_0) is taken to be zero.
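A short sketch of how (8) and (10) translate into code; block_yield plays the role of Yi(W, v) from (7), Y_ti is the per-partition target Yt^{1/p}, and all callables are placeholders for the fitted models rather than the authors' implementation.

```python
# Sketch: adaptive-voltage CDF (8) and discrete expected power (10).
def F_adapt(v, W, block_yield, Y_ti, V_min, V_upper):
    if v < V_min:
        return 0.0
    if v >= V_upper:
        return 1.0
    return block_yield(W, v) / Y_ti

def expected_power_discrete(levels, W, power_mean, F_cdf):
    # 'levels' sorted increasing; F(v_0) is taken as 0, matching the k = 1 base case.
    total, F_prev = 0.0, 0.0
    for v in sorted(levels):
        F_v = F_cdf(v)
        total += power_mean(W, v) * (F_v - F_prev)
        F_prev = F_v
    return total
```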
The question that we address now is how to optimally select a set of voltage levels such that: 1) the number of voltage levels is minimized and 2) the selected values lead to the minimization of the mean power. The trivial solution is to space the voltage levels {vi} uniformly in the allowed voltage range. We demonstrate below that this naive strategy significantly underperforms an optimization-driven flow in which the allowed voltages {vi} are determined by solving a formal optimization problem. The problem is discrete since only a finite set of voltages is allowed: the optimal choice of {vi} is the one that minimizes E[Pmean(W, [v1, . . . , vk])] while ensuring that the yield constraints are satisfied. An exhaustive search over the complete set of possible voltage levels is costly: the runtime of the search grows as O(s^k), where s is the number of possible voltage levels. Given that we may encounter problem instances with k = 2–6 and s = 10–50, the runtime cost is substantial. This motivates us to solve the problem more efficiently.

We develop an efficient solution method using dynamic programming. The formulation exploits the fact that the cost function of the problem, namely the mean power, can be expressed in a recursive form, which makes the problem amenable to dynamic programming. Recall that the range of allowed voltages is constrained to lie in [Vmin, Vupper]. The yield constraint dictates that vk, the highest voltage used, must be identical to Vupper. Let the vector v represent the set of discrete voltage levels v = {v1, . . . , vk} in increasing order. The optimal choice of discrete voltages is obtained by minimizing the objective φ(v, W) = E[Pmean(W, v)].
The key observation is that for k > 1 the cost function can be written recursively as

\[ \phi(v_1, \ldots, v_k, W) = \phi(v_1, \ldots, v_{k-1}, W) + P_{mean}(W, v_k) \cdot \big( F(v_k) - F(v_{k-1}) \big). \tag{11} \]

For k = 1, φ(v1, W) = Pmean(W, v1) · F(v1). Let opt_cost(k, v) denote the optimal value of φ under the constraint that the number of discrete voltages is k and the highest voltage among them is v. Let opt_voltage(k, v) denote the optimal set of voltages that achieves opt_cost(k, v). Clearly, our goal is to find opt_cost(k, Vupper). We note that the expression for φ(v1, . . . , vk, W) above also allows a recursive formulation for opt_cost (for k > 1):

\[ \mathrm{opt\_cost}(k, v) = \min_{u \le v} \Big[ \mathrm{opt\_cost}(k-1, u) + P_{mean}(W, v) \cdot \big( F(v) - F(u) \big) \Big] \tag{12} \]

\[ \mathrm{opt\_voltage}(k, v) = \{v\} \cup \mathrm{opt\_voltage}(k-1, u^{*}), \qquad u^{*} = \arg\min_{u \le v} \Big[ \mathrm{opt\_cost}(k-1, u) + P_{mean}(W, v) \cdot \big( F(v) - F(u) \big) \Big]. \tag{13} \]

The base case for k = 1 is given by

\[ \mathrm{opt\_cost}(1, v) = P_{mean}(W, v) \cdot F(v) \tag{14} \]

\[ \mathrm{opt\_voltage}(1, v) = \{v\}. \tag{15} \]
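The recursion (12)–(15) maps directly onto a bottom-up dynamic program over the candidate voltage set, filling one table row per additional voltage level. The sketch below is ours, not the authors' implementation; power_mean and F_cdf are the same placeholder callables used in the earlier sketch.

```python
# Sketch of the dynamic program (12)-(15): choose k voltage levels from the
# sorted candidate set S (whose last element must be V_upper) minimizing the
# expected power, in O(k * |S|**2) time.
def optimal_voltage_levels(k, S, W, power_mean, F_cdf):
    S = sorted(S)
    P = [power_mean(W, v) for v in S]
    F = [F_cdf(v) for v in S]

    # Base case (14)-(15): a single level, the highest being S[j].
    cost = [P[j] * F[j] for j in range(len(S))]
    volt = [[S[j]] for j in range(len(S))]

    for _ in range(2, k + 1):                 # add one allowed level per round
        new_cost, new_volt = [], []
        for j in range(len(S)):               # S[j] is the highest level used
            best_u = min(range(j + 1), key=lambda u: cost[u] + P[j] * (F[j] - F[u]))
            new_cost.append(cost[best_u] + P[j] * (F[j] - F[best_u]))
            new_volt.append(volt[best_u] + [S[j]])
        cost, volt = new_cost, new_volt

    return cost[-1], volt[-1]                 # opt_cost(k, V_upper), opt_voltage(k, V_upper)
```

For small instances, an exhaustive O(s^k) enumeration over the same candidate set is a convenient cross-check of the table values.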
Based on the achievable voltage resolution, we can preselect a large number of equi-spaced voltages within the range [Vmin, Vmax], at a separation interval of Δ. The optimal choice of k voltages has to come from among them. Let the set of allowed voltages be denoted by S. We set Vupper as the minimum voltage in S such that Yi(W, Vupper) ≥ Yt,i. If no such voltage exists, we report infeasibility. Otherwise, we are interested in computing opt_cost(k, Vupper) and opt_voltage(k, Vupper). In order to allow a bottom-up computation of these functions using the recursive formulation, we keep arrays for opt_cost(k′, v′) and opt_voltage(k′, v′) with v′ ∈ S and k′ ∈ [1, k]. Given the values of these arrays for k′ and all the voltages in S, we can compute all the array values for k′ + 1 in time O(s^2), where s is the size of S. Thus, the total runtime is O(k·s^2). This is a significant improvement over an exhaustive search.

IV. MODELS AND EXPERIMENT RESULTS

A. Statistical Modeling of Cell Margins

Efficient probability evaluations are needed to compute the single-cell probability of meeting the design constraints, Pr(margin(W, Vdd) ≤ margin^t). The direct way to evaluate this probability is through a SPICE-based Monte Carlo simulation at every pair of W and Vdd values of interest. Given that the number of Monte Carlo runs required to evaluate events occurring with probability p_o scales as 1/p_o, it is exceedingly expensive to carry out these simulations for small p_o. We deal with this challenge by introducing intermediate closed-form models of the moments of the distribution of the margins in terms of W and Vdd.
The use of dynamic programming for optimization allows handling models with arbitrary dependence on the fitted parameters. That allows using arbitrary nonlinear models for high accuracy. The models are constructed by empirical least-squares fitting to data generated via SPICE simulations. Separate models are generated for the active and sleep modes. The randomness in the design margins and power is assumed to be due entirely to the threshold voltage Vth. We find that the random behavior of the different margins of the same cell is correlated. Therefore, the covariance is also explicitly modeled.

For yield optimization aimed at active-mode power consumption, we characterize the mean power and the mean, standard deviation, and mutual covariance of all the SRAM design margins (e.g., RCM, WRM, and SNM) as a function of a single width scaling factor and the supply voltage Vdd. The transistor width is treated as a normalization factor that scales all transistors in the cell by the same amount. For optimization aimed at sleep-mode power minimization, we additionally characterize the sleep-mode SNM. Thus, there are two distinct characterizations of SNM: one is the active-mode (read) SNM and the other is the hold SNM. To measure the read SNM, the access transistor is turned on, while for the sleep-mode SNM it is turned off [21].

Using SPICE-generated data, we use an iterative procedure to determine the fitted model. We start with a model containing only the first-order terms (W and Vdd) and add higher-order terms to reduce the fitting error. We remove terms whose coefficients are close to zero while adding new terms. The procedure terminates when the fitting error is sufficiently small, e.g., below 5%. We find that higher-order polynomial models are generally required to provide high accuracy in modeling the mean, variances, and covariance of the design margins and the mean of power in terms of width and Vdd. The delay and leakage power cannot be modeled well using polynomial functions directly. However, we found that modeling the inverse of delay and the logarithm of leakage using polynomial functions results in good fits. The rms fitting error for the mean of the design margins ranges from 0.16% to 4.23%. The error for the variance and covariance models ranges from 1.37% to 3.07%. Below we show several of the fitted models, with the model coefficients k, l, and q obtained directly from the regression:

\[ \mathbb{E}[\mathrm{SNM}(W, V_{dd})] = k_4 W^2 + k_5 V_{dd} W + k_6 V_{dd}^2 W + \sum_{i=-2}^{3} k_i V_{dd}^{i} \tag{16} \]

\[ \sigma_{\mathrm{SNM}}(W, V_{dd}) = l_0 W + l_1 V_{dd} + l_2 W V_{dd} \tag{17} \]

\[ \sigma_{\mathrm{mutual}}(W, V_{dd}) = \exp\Big( q_4 W + q_5 W^2 + q_6 W V_{dd} + \sum_{i=1}^{3} q_i V_{dd}^{i} \Big). \tag{18} \]
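The fitting loop described above reduces, at each iteration, to an ordinary linear least-squares problem over the currently selected basis terms. The following is a minimal sketch with a fixed, illustrative basis of our choosing; the real models use the terms of (16)–(18), including negative powers of Vdd, and fit the inverse of delay and the logarithm of leakage.

```python
# Sketch: least-squares fit of a margin moment over a (W, Vdd) characterization grid.
import numpy as np

def fit_margin_model(W, Vdd, y):
    # Basis chosen for illustration only: 1, W, Vdd, W^2, W*Vdd, Vdd^2.
    A = np.column_stack([np.ones_like(W), W, Vdd, W**2, W * Vdd, Vdd**2])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rel_rms = np.sqrt(np.mean((A @ coef - y) ** 2)) / np.mean(np.abs(y))
    return coef, rel_rms   # add/remove basis terms and refit until rel_rms is small enough
```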
The cost of fitting the polynomial models depends on the grid size chosen for W and Vdd. The fitting accuracy improves as the grid becomes denser. The polynomials are obtained using standard linear least-squares regression, whose computational cost increases linearly with the grid size. In addition, the number of SPICE simulations needed to evaluate the model terms also increases proportionally with the grid size.

Using the models of the moments of the margin vector, the calculation of the probability is carried out via numerical integration of the Gaussian vector with the estimated moments. A MATLAB procedure is used for this step. The numerical integration is inexpensive since the random vector has three components in the case of active-mode analysis and one component in the case of sleep-mode analysis.

B. Array Power and Area Models

For the experimentation and the extraction of power models, we assumed that the array size is 1 Mb and that it is divided into 16 banks. The nominal Vdd is 1 V for the active mode and 0.4 V for the sleep mode. The adaptive voltage is limited to ±20% of the nominal value. The ratio of transistor widths in the 6T cell is kept constant. Area changes are produced by varying the normalized width (w), which uniformly sizes all transistors in the cell. The models presented in the previous section were fitted for a cell designed in a 32 nm process using the PTM BSIM model [22]. The mean fitting error of the models for all the design margins was below 3%. We evaluated the active and sleep modes separately.

We note here the differences between the models used in this paper and our earlier work [12]. This paper improves modeling accuracy in two ways. First, it models four margins rather than only one. Second, it utilizes models of leakage power and of wordline and bitline power, in addition to the cell dynamic power that was the focus of the modeling effort in [12].

The cost of the tuning circuitry in terms of area overhead is a function of the number of partitions (s), the number of distinct voltage levels (v), and the normalized bitcell width (w):

\[ AO = c_0 \cdot s \cdot v \cdot w + c_1 \cdot (v-1) \cdot w \tag{19} \]

where c0 is the area of the pMOS switch when w = 1, c1 is the area of the voltage-dividing network when w = 1, and c0 = 1.8 and c1 = 3.5. The estimated area overheads are shown in Table I for different partition complexities and can be seen to be quite small.

TABLE I. AREA OVERHEAD OF VOLTAGE DISTRIBUTION

The power consumption of an entire array is computed by combining the cell-level power models described previously with the other power components involved in an entire read/write cycle. In the active mode, we have

\[ P_{tot} = \alpha \cdot (0.5 \cdot P_{read} + 0.5 \cdot P_{write}) + (1 - \alpha) \cdot P_{leak} \tag{20} \]
where α is the activity factor; an activity factor of 0.2 is used in the experiments. Pread is the read power. We assume that the SRAM array is divided into banks of size 256×256. During a read operation, only one row in one bank is accessed; the rest of the banks consume only leakage power.
Fig. 5. Power reduction achievable by the proposed approach outperforms the result of the uniform voltage allocation.
Fig. 6. Savings in mean Vdd for different switch complexities at voltage complexity v = 4.
Therefore, the read power consists of the bitline power, which is consumed when charging and discharging the bitline, the power consumed when driving the wordline, the cell power of the bank being accessed, and the leakage power of the rest of the banks:

\[ P_{read} = P_{bitline} + P_{wordline} + P_{cell} + P_{leak}. \tag{21} \]
Similarly, the power consumed during a write is given by the total of the bitline, wordline, cell, and leakage power, with each component extracted through circuit simulation of a write operation.

C. Experiment Results

We use the developed optimization framework and statistical models to evaluate the effectiveness of the intraarray voltage tuning scheme. We also demonstrate that the joint design-time/post-silicon tuning optimization methodology we have developed is superior to a naive design approach. The primary capability enabled by the proposed flow is joint cell sizing and selection of a finite number of voltage levels for post-silicon tuning for a given amount of variability and array size. We compare the results obtained by the developed flow, in terms of achievable mean power consumption and cell area, with those obtained by applying a naive voltage optimization scheme. For this purpose, we assume that the minimum and maximum allowed Vdd values are fixed. The naive scheme uses voltage levels that are equi-spaced within the voltage range.

Fig. 5 shows the benefits of the proposed optimization flow in terms of achievable power consumption in the active mode. The proposed method identifies solutions that are significantly better than what the naive scheme produces. To achieve the minimum power, a uniformly spaced voltage scheme would require a cell that is approximately 33% larger than what is possible with the nonuniform solution. At a given area, the nonuniform solution reduces mean power by up to 17%, and on average by 8%, across a range of cell areas (widths). We find that as the cell area becomes larger, the optimal designs produced by the two methods converge. This is because at higher cell areas the yield target can be met with a smaller Vdd, thereby reducing the allowed voltage range and making the difference in technique performance less pronounced.

Fig. 7. Expected power versus bitcell area Pareto curves for different voltage and switch complexities in the active mode.

The experimental results indicate a rich set of trade-offs in optimizing the SRAM array, depending on the needs of the designer to minimize power or bitcell area. The most direct manifestation of the effectiveness of active tuning is the reduction of the equivalent (expected) minimum supply voltage Vmin. This is the voltage that is used, on average, across multiple partitions and is shown in Fig. 6. The key result, which confirms our initial intuition, is that at reasonable voltage complexity and switch complexity (number of pMOS switches), the tuning strategy allows a power reduction of up to 23% and on average of 17% (Fig. 7) at the same area. Alternatively, at the same power level, the cell area can be reduced by up to 50% by using an adaptive scheme with a higher voltage and switch complexity (these results are comparable to those reported in our previous work [12], where the mean power reduction was 21% iso-area).

One important question we sought to answer is the dependence of the improvements in area and expected power on the number of voltage levels (v) available.
Fig. 8. Expected power as a function of switch complexity.
Fig. 9. Expected power versus yield in the active mode: adaptivity effectively reduces dependence of power on required yield level.
Fig. 7 investigates this question. We find that when no spatial partitioning is available (switch complexity s = 1), there is little improvement with higher v. However, once spatial partitioning is available (s = 1024), the area savings are larger for a higher number of voltage levels. Yet the difference between v = 4 and v = 2 is not dramatic, which indicates that even a small number of distinct voltage levels can be effective in tuning circuits. In Fig. 8, we find a roughly logarithmic reduction in power as we increase the switch complexity. The rate of savings is almost the same for different cell sizes, but the actual power savings are typically higher for larger widths because of the larger scope for Vdd reduction.

We also investigate the dependence of the results on the targeted yield level. In Fig. 9, we see that in an untunable array the power required to guarantee yield at a higher level, e.g., 0.99, is significantly higher than for a yield target of 0.9. However, using the tunable architecture effectively eliminates the dependence of power on the yield level, with nearly identical power sufficient to achieve yield levels of 0.9 and 0.99. The reason for this is that when there is only one partition, in order to meet the noise margins for a few worst-case bitcells, we have to increase the supply voltage and pay a power penalty for the entire array.
Fig. 10. Expected leakage power versus bitcell area in the sleep mode.
In a tunable array, the supply voltage can be raised for just a single partition, effectively reducing the dependence of power on the target yield.

Fig. 9 also explores the dependence of power and yield on the size of the SRAM array. It contains the results for the 1 Mb array that is studied throughout the paper and also for a 4 Mb array. To aid the comparison, the expected power of the 4 Mb array is normalized to the 1 Mb case (it is effectively divided by 4). If adaptivity is not employed, we observe that the larger array requires a substantially higher voltage, and thus power, to achieve the same level of yield. The proposed technique effectively reduces power at higher yield values despite the increase in the array size: in the larger array, the mean effective voltage is only slightly (about 0.4%) higher compared to the smaller array.

Finally, we use the developed optimization flow to demonstrate the potential of tuning techniques for leakage power reduction in the sleep mode. Fig. 10 presents the power savings achievable at voltage complexity v = 4 and switch complexity s = 1024. We find that at a yield level of 0.99, the tunable architecture allows a 71% reduction in power consumption. Furthermore, by comparing the results at two different yield levels, we again see that the tuning architecture effectively reduces the power penalty for achieving higher yield, while the conventional design requires paying a heavy power penalty.

V. CONCLUSION

We developed a novel optimization flow for yield-aware cell sizing and voltage selection under variability given the availability of post-silicon voltage tuning. We developed closed-form statistical models of array margin behavior and yield as a function of Vdd, cell size, and array size to drive the optimization, and solved the problem as a two-stage stochastic optimization problem. We are able to identify designs that are more power-efficient than naive designs in which voltages are selected uniformly. The results also show significant promise for the tuning architecture, indicating that at high yield levels
power savings can be substantial both in the active and standby modes.

REFERENCES

[1] K. Kuhn et al., "Managing process variation in Intel's 45nm CMOS technology," Intel Technol. J., vol. 12, no. 2, pp. 93–109, Jun. 2008.
[2] G. Chen, D. Sylvester, D. Blaauw, and T. Mudge, "Yield-driven near-threshold SRAM design," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 11, pp. 1590–1598, Nov. 2010.
[3] S. Mukhopadhyay, H. Mahmoodi, and K. Roy, "Modeling of failure probability and statistical design of SRAM array for yield enhancement in nanoscaled CMOS," IEEE Trans. Computer-Aided Design Integr. Circuits Syst., vol. 24, no. 12, pp. 1859–1880, Dec. 2005.
[4] A. Bhavnagarwala et al., "Fluctuation limits & scaling opportunities for CMOS SRAM cells," in Proc. IEDM Tech. Dig., Washington, DC, USA, 2005, pp. 659–662.
[5] V. Khandelwal and A. Srivastava, "Variability-driven formulation for simultaneous gate sizing and postsilicon tunability allocation," IEEE Trans. Computer-Aided Design Integr. Circuits Syst., vol. 27, no. 4, pp. 610–620, Apr. 2008.
[6] S. Borkar et al., "Parameter variation and impact on circuits and microarchitecture," in Proc. Design Automation Conf., vol. 40, Jun. 2003, pp. 338–342.
[7] S. Kumar, C. Kim, and S. Sapatnekar, "Mathematically assisted adaptive body bias (ABB) for temperature compensation in gigascale LSI systems," in Proc. Asia South Pacific Conf. Design Automation, Yokohama, Japan, 2006, p. 6.
[8] S. Kulkarni, D. Sylvester, and D. Blaauw, "Design-time optimization of post-silicon tuned circuits using adaptive body bias," IEEE Trans. Computer-Aided Design Integr. Circuits Syst., vol. 27, no. 3, pp. 481–494, Mar. 2008.
[9] S. Mukhopadhyay, H. Mahmoodi, and K. Roy, "Reduction of parametric failure in sub-100-nm SRAM array using body bias," IEEE Trans. Computer-Aided Design Integr. Circuits Syst., vol. 27, no. 1, pp. 174–183, Jan. 2008.
[10] N. Mojumder, S. Mukhopadhyay, J. Kim, C. Chuang, and K. Roy, "Self-repairing SRAM using on-chip detection and compensation," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 1, pp. 75–84, Jan. 2010.
[11] B. Mohammad, S. Bijansky, A. Aziz, and J. Abraham, "Adaptive SRAM memory for low power and high yield," in Proc. IEEE ICCD, Lake Tahoe, CA, USA, Oct. 2008, pp. 176–181.
[12] A. K. Singh, K. He, C. Caramanis, and M. Orshansky, "Mitigation of intra-array SRAM variability using adaptive voltage architecture," in Proc. IEEE/ACM ICCAD, San Jose, CA, USA, Nov. 2009, pp. 637–644.
[13] J. Wang, A. Singhee, R. Rutenbar, and B. Calhoun, "Statistical modeling for the minimum standby supply voltage of a full SRAM array," in Proc. ESSCIRC, Munich, Germany, Sep. 2007, pp. 400–403.
[14] M. Sniedovich, Dynamic Programming: Foundations and Principles. Boca Raton, FL, USA: Chapman & Hall/CRC Press, 2010.
[15] J. Wang, S. Nalam, and B. Calhoun, "Analyzing static and dynamic write margin for nanometer SRAMs," in Proc. ACM/IEEE ISLPED, Bangalore, India, 2008, pp. 129–134.
[16] R. Aitken, A. Singhee, and R. Rutenbar, "Extreme value theory: Application to memory statistics," in Extreme Statistics in Nanoscale Memory Design. Boston, MA, USA: Springer, 2010, pp. 203–240.
[17] A. Singhee, "Extreme statistics in memories," in Extreme Statistics in Nanoscale Memory Design. Boston, MA, USA: Springer, 2010, pp. 9–15.
[18] A. Dembo and O. Zeitouni, Large Deviations Techniques and Applications. New York, NY, USA: Springer-Verlag, 1998.
[19] O. Hirabayashi et al., "A process-variation-tolerant dual-power-supply SRAM with 0.179 μm² cell in 40nm CMOS using level-programmable wordline driver," in Proc. ISSCC, San Francisco, CA, USA, 2009, pp. 458–459.
[20] M. Yamaoka, N. Maeda, Y. Shimazaki, and K. Osada, "65nm low-power high-density SRAM operable at 1.0V under 3σ systematic variation using separate Vth monitoring and body bias for NMOS and PMOS," in Proc. IEEE ISSCC, San Francisco, CA, USA, Feb. 2008, pp. 384–385.
[21] E. Seevinck, F. J. List, and J. Lohstroh, "Static-noise margin analysis of MOS SRAM cells," IEEE J. Solid-State Circuits, vol. 22, no. 5, pp. 748–754, Oct. 1987.
[22] W. Zhao and Y. Cao, "New generation of predictive technology model for sub-45 nm early design exploration," IEEE Trans. Electron Devices, vol. 53, no. 11, pp. 2816–2823, Nov. 2006.
Ashish K. Singh received the B.Tech. degree in computer science from the Indian Institute of Technology, India, the M.S. degree from the Royal Institute of Technology, Stockholm, Sweden, and the Ph.D. degree in electrical engineering from the University of Texas at Austin, Austin, TX, USA, in 2001, 2003, and 2007, respectively. He is a Senior Researcher at Terra Technology, Chicago, IL, USA. His current research interests include inventory optimization in supply chain networks under demand uncertainty. Dr. Singh received the IEEE/ACM William J. McCalla Best Paper Award at the International Conference on Computer-Aided Design in 2006.
Ku He (M'12) received the B.E. and M.E. degrees in electrical engineering from Tsinghua University, Beijing, China, in 2004 and 2007, respectively, and the Ph.D. degree from the University of Texas at Austin, Austin, TX, USA, in 2012. Since 2012, he has been a Mixed-Signal Design Engineer with Cirrus Logic, Inc., Austin, TX, USA, where his tasks include designing high-resolution and low-power audio-band integrated circuits. His current research interests include low-power and robust circuit design.
Constantine Caramanis (M'06) received the Ph.D. degree in electrical engineering and computer science from the Massachusetts Institute of Technology, Cambridge, MA, USA, in 2006. Since then, he has been on the faculty of the Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX, USA. His current research interests include robust and adaptable optimization, machine learning and high-dimensional statistics, with applications to large-scale networks and computer-aided design. Prof. Caramanis received the NSF CAREER Award in 2011.
Michael Orshansky (SM'12) received the Ph.D. degree in electrical engineering and computer science from the University of California, Berkeley, Berkeley, CA, USA, in 2001. He is an Associate Professor of Electrical and Computer Engineering at the University of Texas at Austin, Austin, TX, USA. Prior to joining UT Austin, he was a Research Scientist and Lecturer with the Department of Electrical Engineering and Computer Sciences, UC Berkeley. His current research interests include design optimization for robustness and manufacturability, statistical timing analysis, and design in fabrics with extreme defect densities. He has authored the book Design for Manufacturability and Statistical Design: A Constructive Approach with S. Nassif and D. Boning. Prof. Orshansky received the National Science Foundation CAREER Award in 2004 and the ACM SIGDA Outstanding New Faculty Award in 2007. He received the 2004 IEEE TRANSACTIONS ON SEMICONDUCTOR MANUFACTURING Best Paper Award, as well as Best Paper Awards at the Design Automation Conference 2005, the International Symposium on Quality Electronic Design 2006, and the International Conference on Computer-Aided Design 2006.