High-Fidelity Markovian Power Model for Protocols

Jing Cao

Albert Nymeyer

School of Computer Science and Engineering The University of New South Wales, Australia Email: [email protected]

School of Computer Science and Engineering The University of New South Wales, Australia Email: [email protected]

Abstract—We formally define a high-level power-aware protocol model based on a Markov chain, and consider 2 aspects of power consumption: the general switching activity, and the cost of data transfers. A state-assignment algorithm is devised that results in a state encoding that is near the theoretical lower bound (in the protocols we have studied). We have analyzed a set of protocol ‘converters’ that have been synthesised for the AMBA protocol, and compared the (high-level) predicted power consumption with the power actually used during (low-level) simulation of these converters. We observe high fidelity.

I. INTRODUCTION

Low-power and low-level designs have been studied for years and many techniques have been developed to reduce power consumption. Not well studied, however, is the modelling of power at the high level of design, before the chip is synthesised. Larger power savings can be made by correct design decisions made at the high, pre-synthesis levels of abstraction. At low levels, generally only 30% power savings can be made [1]. Once the chip is synthesised, it is generally too late to change high-level decisions.

Can a high-level specification of a chip provide enough information for an accurate estimate of power consumption to be made? In absolute terms this is not possible because so much gate-level detail is missing from a high-level specification, but a high-level model of power estimation can have high fidelity, which means that a low-power specification will result in a low-power implementation, and a high-power specification in a high-power implementation. The chip designer can therefore be confident that 'early' decisions that reduce power will actually result in a more efficient implementation.

The aim of this research is to investigate high-level power modelling: firstly, how it can be done, and secondly, how good a predictor of actual power consumption it is. To find the total power consumption, the average power consumed by the $n$ logic gates in the circuit during a clock cycle is summed [2], [3]:

$$q \propto V^2 f \sum_{i=1}^{n} C_i D_i$$

where $V$ is the supply voltage, $f$ is the clock frequency, $C_i$ is the capacitive load of gate $i$ and $D_i$ is the switching activity at gate $i$. Using this low-level model, the average power consumption is proportional to the average Hamming distance at the high level.

Related work: The survey by Benini and Micheli [4] of energy-efficient circuit design techniques considers computation, communication and storage units as the main consumers of power. Our model of power can loosely be mapped to this system-level view. In [3], the same authors present a column-based state-assignment algorithm, 'pow3', that is based on a Markov model. They use this algorithm to do state assignments that minimise the switching activity and a notion of area. Marculescu et al. [5] also use a Markov model to study power. Like [3], they do not formally define their Markov model. Their focus instead is to formulate theoretical bounds on the total Hamming distance. We have used these bounds in our state-assignment algorithm. Over the last decade many state-assignment methods have been proposed. For example, [6] presents algorithms called 'fast' and 'greedy' that use a method based on spanning trees. In our experiments, we compare our results to those from 'pow3' and the 'greedy' technique. Liveris and Banerjee [7] develop an interface synthesis method for AMBA protocols and estimate power consumption at the bus level, which is a different level from ours.

978-3-9810801-6-2/DATE10 © 2010 EDAA

II. A MARKOV MODEL OF PROTOCOLS

The definition of a protocol that we use extends the definition given in [8] by adding probabilities to transitions. These probabilities are assumed to exist and are supplied by the chip designers. We also change the formalism from finite-state machines to discrete-time Markov chains.

Definition 1 (Protocol). A protocol is represented by a discrete-time Markov chain $\langle S, \Sigma, \delta, s_0 \rangle$, where
• $S$ is a set of states.
• $\Sigma$ is a finite set of control actions $\Sigma_c$ and data actions $\Sigma_{d_1} \cup \Sigma_{d_2}$, where each action is either a send or a receive action.
• $\delta \subseteq S \times 2^{\Sigma} \times [0, 1] \to S$ is a function that labels transitions between states. Transitions are labelled by actions $A$ and probabilities $P$. The sum of the probabilities of all the outgoing transitions of a state is equal to 1.
• $s_0$ is the initial and final state, which can reach any state in $S$ and be reached by any state in $S$.

The actions in the alphabet $\Sigma$ define the interface of the protocol. Each action may be considered a port. A control action is a single-bit Boolean. A receive action $a \in \Sigma_c$ is graphically written a0? or a1?, which means a 0 or 1 is received at port a. A send control action is written a0! or a1!, and means port a is reset or set (resp.) by the protocol. Data actions in $\Sigma_{d_1}$ and $\Sigma_{d_2}$ have data widths $d_1$ and $d_2$ resp., and represent binary data. Data that is received by the protocol is written d?, and data that is sent is written d!.¹

¹ The reason for two different data widths is the setting of this research: the protocol is synthesised and acts as a converter between different protocols that may have different data widths.
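Definition 1 can be read as a small data structure with two well-formedness checks: every state's outgoing probabilities sum to 1, and the initial state reaches, and is reached by, every state. The following is a minimal sketch under that reading. The class names, the method names and the three-state toy protocol are illustrative assumptions and do not come from the paper.

```python
from dataclasses import dataclass, field
from fractions import Fraction


@dataclass(frozen=True)
class Transition:
    source: str
    target: str
    actions: frozenset      # labels such as "req1?" or "d!"; the empty set stands for tau
    probability: Fraction


@dataclass
class Protocol:
    states: list
    initial: str            # s0, the initial and final state
    transitions: list = field(default_factory=list)

    def is_stochastic(self):
        """Check that the outgoing probabilities of every state sum to 1."""
        return all(
            sum(t.probability for t in self.transitions if t.source == s) == 1
            for s in self.states
        )

    def _closure(self, forward):
        """States reachable from s0 (forward) or able to reach s0 (backward)."""
        seen, stack = {self.initial}, [self.initial]
        while stack:
            current = stack.pop()
            for t in self.transitions:
                src, dst = (t.source, t.target) if forward else (t.target, t.source)
                if src == current and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    def is_well_formed(self):
        """Both conditions of Definition 1 that can be checked mechanically."""
        everything = set(self.states)
        return (self.is_stochastic()
                and self._closure(True) == everything
                and self._closure(False) == everything)


# Hypothetical toy protocol, not one of the paper's examples.
toy = Protocol(
    states=["s0", "s1", "s2"],
    initial="s0",
    transitions=[
        Transition("s0", "s1", frozenset({"req1?"}), Fraction(7, 10)),
        Transition("s0", "s2", frozenset({"d?"}), Fraction(3, 10)),
        Transition("s1", "s0", frozenset({"ack1!"}), Fraction(1)),
        Transition("s2", "s0", frozenset(), Fraction(1)),   # tau transition
    ],
)
print(toy.is_well_formed())
```

Exact fractions are used so that the "sums to 1" check is not affected by floating-point rounding.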

The set of actions that label a transition from state $s_i$ to state $s_j$ is denoted $A_{i,j}$. If there is no action present then the transition is labelled by the empty action $\tau$. A protocol satisfies the Markov property, which states that the future evolution of the protocol depends only on the current state, and is independent of past states [9]. The 1-step probability of moving from state $s_i$ at time $m \in \mathbb{N}$ to state $s_j$ at time $m + 1$ is denoted by $P_{i,j}$. The transition probability $P$ represents the 1-step probabilities of all the transitions. For example, the transition probability $P$ of P1 in Fig. 1 is:

$$P = \begin{bmatrix}
0 & 0.6 & 0.3 & 0 & 0 & 0 & 0.1 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0.8 & 0 & 0.2 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}$$

[Figure: state-transition diagrams of protocols P1 and P2]
Fig. 1. Protocol P1 (left) and P2 (right).

We can go further and denote the k-step probability of moving from state $s_i$ at time $m$ to state $s_j$ at time $m + k$ by $P^k_{i,j}$. In general, the elements of $P^k$ are the probabilities $P^k_{i,j}$. We can write, for example, $P^2_{i,j} = \sum_{s_l} P_{i,l} P_{l,j}$, which defines the elements of $P^2$. For example, in Fig. 1, we observe in P1 that $P^3_{1,5} = \sum_{s_i,s_j} P_{1,i} P_{i,j} P_{j,5} = P_{1,3} P_{3,4} P_{4,5} = 0.24$. We denote the probability that we are in a state $s_i$ at time (or step) 0 by $\pi^0_i$, and the probability that we are in state $s_j$ after $k$ steps by $\pi^k_j$. We can write $\pi^k_j = \sum_{s_i} \pi^0_i P^k_{i,j}$. If we let $k$ approach infinity, then we compute the so-called steady-state probability $L(s_j) = \lim_{k\to\infty} \pi^k_j = \lim_{k\to\infty} \sum_{s_i} \pi^0_i P^k_{i,j}$. If we include all states, then we express this asymptotic equation using the transition probability, as follows:

$$L(S) = \lim_{k\to\infty} \pi^0 P^k = \lim_{k\to\infty} \pi^0 P^{k+1} = L(S)P \qquad (1)$$

It can be proved that the steady-state probability $L(S)$ is unique for a given protocol. This is important because we use these probabilities to measure power. Based on $L(S)$, we compute the so-called steady-transition probability $P^s_{i,j} = L(s_i) P_{i,j}$ for the transition from state $s_i$ to $s_j$. The steady-state probability of P1 in Fig. 1 is computed by the following linear equations:

$L(s_1) = L(s_5) + L(s_8) + L(s_2)$, $L(s_2) = 0.6 L(s_1)$, $L(s_3) = 0.3 L(s_1)$, $L(s_4) = 0.8 L(s_3)$, $L(s_5) = L(s_4)$, $L(s_6) = 0.2 L(s_3)$, $L(s_7) = 0.1 L(s_1)$, $L(s_8) = L(s_7) + L(s_6)$ and $\sum_{i=1..8} L(s_i) = 1$,

which has solution $L(S) = [\frac{10}{27}, \frac{2}{9}, \frac{1}{9}, \frac{4}{45}, \frac{4}{45}, \frac{1}{45}, \frac{1}{27}, \frac{8}{135}]$. The steady-transition probability is

$$P^s = \begin{bmatrix}
0 & \frac{2}{9} & \frac{1}{9} & 0 & 0 & 0 & \frac{1}{27} & 0 \\
\frac{2}{9} & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & \frac{4}{45} & 0 & \frac{1}{45} & 0 & 0 \\
0 & 0 & 0 & 0 & \frac{4}{45} & 0 & 0 & 0 \\
\frac{4}{45} & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & \frac{1}{45} \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & \frac{1}{27} \\
\frac{8}{135} & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}$$

III. POWER ESTIMATION

We first need to compute the weights $W$ and the Hamming distance $H$ of each transition. To compute the power, we then use $q^{swit} = \sum_{s_i,s_j} W^{swit}_{i,j} H_{i,j}$ and $q^{data} = \sum_{s_i,s_j} W^{data}_{i,j} H_{i,j}$.

Switching activity: We compute the weights as:

$$W^{swit} = P^s \qquad (2)$$

Data transfer cost: In general, data transfers consume more power than control signals. The set of data actions in a protocol is given by $\Sigma_{d_1} \cup \Sigma_{d_2}$, and the set of actions that label a transition from state $s_i$ to $s_j$ is given by $A_{i,j}$. The number of bits $D_{i,j}$ that are transferred in a transition from state $s_i$ to $s_j$ can therefore be expressed by

$$D_{i,j} = d_1 \times |A_{i,j} \cap \Sigma_{d_1}| + d_2 \times |A_{i,j} \cap \Sigma_{d_2}|$$

Given $D_{i,j}$, we can define the weight function for data transfer:

$$W^{data}_{i,j} = P^s_{i,j} D_{i,j} \qquad (3)$$

Given a steady-transition probability $P^s$, the amount of data that is expected to be transferred per time unit is defined as $\sum_{s_i,s_j} D_{i,j} P^s_{i,j}$. For example, assuming the data widths $d_1 = 8$ and $d_2 = 4$ in Fig. 1, the weights corresponding to data of P1 are:

$$W^{data} = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & \frac{12}{45} \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & \frac{4}{9} \\
\frac{32}{135} & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}$$

IV. STATE ENCODING FOR LOW POWER

The power of a circuit will depend on the state encoding. In this section, we present a state-assignment algorithm that is low power for each of $q^{swit}$ and $q^{data}$. The aim is to determine k-bit state encodings $E$ that minimise the Hamming distances between adjacent states. We use the theoretical lower bound to globally determine if the encoding is near-optimal. Formally, the Hamming distance $H_{i,j}$ of each transition is computed by $H_{i,j} = \sum_{l=1}^{k} |E_i^l - E_j^l|$, where $E_i^l$ is the $l$-th bit of the code of state $s_i$, and similarly for $E_j^l$. We use the undirected weight $\overline{W}_{i,j}$ in this algorithm, which is computed by $\overline{W}_{i,j} = W_{i,j} + W_{j,i}$ between states $s_i$ and $s_j$, where $i \neq j$. Algorithm 1 presents a simplified state-assignment algorithm that considers encodings in a global sense, in contrast to existing methods that make local selections. This algorithm shows just one loop of an algorithm that selects the next minimum codes and assigns them to the corresponding states.
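As a cross-check of the quantities defined in Sections II to IV, the following sketch recomputes, for P1, the steady-state probabilities, the 3-step probability from $s_1$ to $s_5$, the steady-transition probabilities and the undirected weights. The transition probabilities are the ones that appear in the linear equations of Section II. The function and variable names are illustrative, and exact Gauss-Jordan elimination is just one convenient way of solving $L(S) = L(S)P$ together with the normalisation constraint; the paper does not say how the system was solved.

```python
from fractions import Fraction as F

N = 8
# Nonzero 1-step probabilities of P1 (states numbered 1..8).
EDGES = {
    (1, 2): F(6, 10), (1, 3): F(3, 10), (1, 7): F(1, 10),
    (2, 1): F(1), (3, 4): F(8, 10), (3, 6): F(2, 10),
    (4, 5): F(1), (5, 1): F(1), (6, 8): F(1), (7, 8): F(1), (8, 1): F(1),
}
P = [[EDGES.get((i, j), F(0)) for j in range(1, N + 1)] for i in range(1, N + 1)]


def mat_mul(a, b):
    """Plain matrix product, used for k-step probabilities."""
    n = len(a)
    return [[sum(a[i][l] * b[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]


def steady_state(p):
    """Solve L = L P with sum(L) = 1 by exact Gauss-Jordan elimination."""
    n = len(p)
    # n-1 balance equations (one is redundant) plus the normalisation row.
    rows = [[p[i][j] - (1 if i == j else 0) for i in range(n)] + [F(0)]
            for j in range(n - 1)]
    rows.append([F(1)] * n + [F(1)])
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        scale = rows[col][col]
        rows[col] = [v / scale for v in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]


L = steady_state(P)
P3 = mat_mul(mat_mul(P, P), P)

# Steady-transition probabilities P^s_ij = L(s_i) * P_ij (equal to W^swit).
P_S = {(i, j): L[i - 1] * p for (i, j), p in EDGES.items()}

# Undirected weights: W_ij + W_ji, stored under the key (min, max).
UNDIRECTED = {}
for (i, j), w in P_S.items():
    key = (min(i, j), max(i, j))
    UNDIRECTED[key] = UNDIRECTED.get(key, 0) + w

print([str(x) for x in L])
print(float(P3[0][4]))
```

Under these assumptions the computed vector agrees with the solution $L(S)$ given in Section II, and the entry `P3[0][4]` corresponds to $P^3_{1,5}$.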

After each loop, the power is computed and compared with the theoretical lower bound, which is obtained using an algorithm in [5]; the algorithm terminates if the power is close enough to this bound.

Algorithm 1 State assignment
Input: Constant
  S: a set of states
  W : S × S → N0 is a set of undirected weights
  N : S → Pow(S) is a set of neighbour states of each state
  k: length of the encoding
Output: E(S) // state encoding
 1: // H() returns the Hamming distance between two codes
 2: // unallocated() returns 1 if the argument code is unallocated, otherwise 0
 3: // nextbyweightconn() returns the pair of states with the next biggest weight in W and, for equal weights, the largest number of neighbour states
 4: // select1() selects the minimum pair of codes from a set of pairs of codes
 5: // select2() selects the minimum code from a set of codes
 6: E(si) = −1, ∀si ∈ S
 7: unallocated(m) = 1, ∀m ∈ [0, 2^k − 1]
 8: for each (si, sj) = nextbyweightconn() do
 9:   if (E(si) = −1) ∧ (E(sj) = −1) then
10:     for x = 1..k do
11:       C(x) = {(m, n) | m, n ∈ [0, 2^k − 1] ∧ (H_{m,n} = x) ∧ unallocated(m) ∧ unallocated(n)}
12:       if C(x) ≠ ∅ then
13:         h = x; break
14:       end if
15:     end for
16:     (E(si), E(sj)) = select1(C(h))
17:     unallocated(E(si)) = 0, unallocated(E(sj)) = 0
18:   else if (E(si) = −1) ∨ (E(sj) = −1) then
19:     sa = (E(si) == −1) ? si : sj
20:     sb = (E(si) ≠ −1) ? si : sj
21:     for x = 1..k do
22:       C′(x, m) = {n | n ∈ [0, 2^k − 1] ∧ (H_{m,n} = x) ∧ unallocated(n)}
23:       if C′(x, E(sb)) ≠ ∅ then
24:         h = x; break
25:       end if
26:     end for
27:     E(sa) = select2(C′(h, E(sb)))
28:     unallocated(E(sa)) = 0
29:   end if
30: end for

TABLE I
STATE ENCODINGS FOR PROTOCOL P1

states   E^swit   E^data   E^total   pow3   greedy
s1       000      100      010       011    000
s2       001      111      011       111    001
s3       010      011      110       110    010
s4       110      101      111       100    011
s5       100      110      101       101    110
s6       111      010      100       010    111
s7       101      001      001       001    101
s8       011      000      000       000    100

Individual state assignments: We apply Algorithm 1 to determine the encodings $E^{swit}$ using $W^{swit}$, which is calculated by Equation 2. We note $k = 3$. The undirected weights $\overline{W}^{swit}$ are ordered as $\overline{W}_{1,2}$, $\overline{W}_{1,3}$, $\overline{W}_{1,5}$, $\overline{W}_{3,4}$, $\overline{W}_{4,5}$, $\overline{W}_{1,8}$, $\overline{W}_{1,7}$, $\overline{W}_{7,8}$, $\overline{W}_{6,8}$, $\overline{W}_{3,6}$. Thus, state $s_1$ is first selected and assigned. We compute $C(1) = \{(0, 1), (0, 2), (0, 4), (1, 3), (1, 5), (2, 3), (2, 6), (3, 7), (4, 5), (4, 6), (5, 7), (6, 7)\}$. The pair (0, 1) is selected at line 16 in Algorithm 1 to assign $E(s_1)$ and $E(s_2)$. The next weight is then selected, and we compute $C'(1, 0) = \{2, 4\}$ (code 1 is already allocated). The encoding $E(s_3)$ will be assigned 2 at line 27. Similarly for other states. The full encoding $E^{swit}$ for the switching activity is shown in the first column of Table I. Encoding $E^{data}$ has also been computed and is also shown in the table.

Individual power computations: The power $q^{swit}$ that uses the encoding $E^{swit}$ is denoted $q^{swit}|E^{swit}$ and computed using $\sum_{s_i,s_j} W^{swit}_{i,j} H_{i,j}$. Similarly for $q^{data}|E^{data}$. The results are $q^{swit}|E^{swit} = 1.15$ and $q^{data}|E^{data} = 0.95$. We also compute the 'cross encodings' $q^{swit}|E^{data} = 1.84$ and $q^{data}|E^{swit} = 1.63$.

Total power: The encodings $E^{swit}$ and $E^{data}$ are generally different. As only one state encoding is possible, we need to determine which encoding will result in the minimum total power $q^{total} = q^{swit} + q^{data}$. To compute $E^{total}$, we need to recompute the weights. We assume that

$$q^{swit}|E^{total} = q^{swit}|E^{swit} + \Delta^{swit}$$
$$q^{data}|E^{total} = q^{data}|E^{data} + \Delta^{data}$$

where $\Delta^{swit}$ and $\Delta^{data}$ correspond to the increased component power cost through the use of a possibly non-optimal (for that component) encoding. We minimise the total change $\Delta = \Delta^{swit} + \Delta^{data}$, which is equivalent to minimising $q^{total} = \sum_{s_i,s_j} (W^{swit}_{i,j} + W^{data}_{i,j}) H_{i,j}$, and define the total weight as:

$$W^{total} = W^{swit} + W^{data} \qquad (4)$$
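One pass of Algorithm 1 can be sketched in Python as follows. This is a reading of the pseudocode under several assumptions that the paper leaves open: `select1()` and `select2()` are taken to return the numerically smallest candidate, ties in `nextbyweightconn()` are broken by the summed neighbour count of the two states, and the comparison against the lower bound of [5] is omitted. The undirected weights are those of $W^{swit}$ for P1. Because of these assumptions the sketch reproduces only the first assignments of the walk-through (codes 0, 1 and 2 for $s_1$, $s_2$ and $s_3$), not the complete $E^{swit}$ column of Table I.

```python
from fractions import Fraction as F
from itertools import combinations


def hamming(m, n):
    """Hamming distance between two integer codes."""
    return bin(m ^ n).count("1")


def assign_states(states, weights, k):
    """One pass of the state-assignment loop; weights maps (si, sj), si < sj, to a weight."""
    neighbours = {s: set() for s in states}
    for a, b in weights:
        neighbours[a].add(b)
        neighbours[b].add(a)

    def priority(pair):
        # Heaviest pair first; for equal weights, more neighbours first (assumed rule).
        a, b = pair
        return (-weights[pair], -(len(neighbours[a]) + len(neighbours[b])), pair)

    code = {}
    free = set(range(2 ** k))          # unallocated codes
    for si, sj in sorted(weights, key=priority):
        if si not in code and sj not in code:
            for x in range(1, k + 1):
                pairs = [(m, n) for m, n in combinations(sorted(free), 2)
                         if hamming(m, n) == x]
                if pairs:
                    code[si], code[sj] = min(pairs)      # stand-in for select1()
                    free -= {code[si], code[sj]}
                    break
        elif si not in code or sj not in code:
            sa, sb = (si, sj) if si not in code else (sj, si)
            for x in range(1, k + 1):
                near = [n for n in free if hamming(code[sb], n) == x]
                if near:
                    code[sa] = min(near)                 # stand-in for select2()
                    free.remove(code[sa])
                    break
    return code


# Undirected switching weights of P1 (W_ij + W_ji of the steady-transition probabilities).
UNDIRECTED_SWIT = {
    (1, 2): F(4, 9), (1, 3): F(1, 9), (1, 5): F(4, 45), (3, 4): F(4, 45),
    (4, 5): F(4, 45), (1, 8): F(8, 135), (1, 7): F(1, 27), (7, 8): F(1, 27),
    (6, 8): F(1, 45), (3, 6): F(1, 45),
}
codes = assign_states(range(1, 9), UNDIRECTED_SWIT, k=3)
print({s: format(c, "03b") for s, c in sorted(codes.items())})
```

A fuller implementation would repeat this pass with different candidate selections and stop once the resulting power is close to the theoretical lower bound, as the text describes.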

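The total weight for P1 is

$$W^{total} = \begin{bmatrix}
0 & \frac{2}{9} & \frac{1}{9} & 0 & 0 & 0 & \frac{1}{27} & 0 \\
\frac{2}{9} & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & \frac{4}{45} & 0 & \frac{1}{45} & 0 & 0 \\
0 & 0 & 0 & 0 & \frac{4}{45} & 0 & 0 & 0 \\
\frac{4}{45} & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & \frac{13}{45} \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & \frac{13}{27} \\
\frac{8}{27} & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}$$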
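The weight matrices and the power figures for P1 can be recomputed from the definitions with a few lines of Python. The sketch below takes the nonzero entries of $W^{swit} = P^s$, builds $W^{data}$ and $W^{total}$ from Equations 3 and 4, and evaluates $q = \sum W_{i,j} H_{i,j}$ for three encodings of Table I. The per-transition bit counts in `D` are an assumption: they are inferred from the nonzero entries of the $W^{data}$ example with $d_1 = 8$ and $d_2 = 4$, not read from a specification. All names are illustrative.

```python
from fractions import Fraction as F

# Nonzero entries of W^swit = P^s for P1, indexed by (i, j) for s_i -> s_j.
W_SWIT = {
    (1, 2): F(2, 9), (1, 3): F(1, 9), (1, 7): F(1, 27), (2, 1): F(2, 9),
    (3, 4): F(4, 45), (3, 6): F(1, 45), (4, 5): F(4, 45), (5, 1): F(4, 45),
    (6, 8): F(1, 45), (7, 8): F(1, 27), (8, 1): F(8, 135),
}

# Bits transferred per transition (assumed: d1 + d2 = 12 bits, or d2 = 4 bits).
D = {(6, 8): 12, (7, 8): 12, (8, 1): 4}

W_DATA = {t: W_SWIT[t] * bits for t, bits in D.items()}           # Equation 3
W_TOTAL = {t: w + W_DATA.get(t, 0) for t, w in W_SWIT.items()}    # Equation 4

# Encodings of s1..s8, copied from the first three columns of Table I.
ENCODINGS = {
    "swit":  ["000", "001", "010", "110", "100", "111", "101", "011"],
    "data":  ["100", "111", "011", "101", "110", "010", "001", "000"],
    "total": ["010", "011", "110", "111", "101", "100", "001", "000"],
}


def hamming(a, b):
    """Number of bit positions in which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def power(weights, codes):
    """q = sum over transitions of W_ij * H_ij for the given state codes."""
    return sum(w * hamming(codes[i - 1], codes[j - 1])
               for (i, j), w in weights.items())


for label, weights, enc in [
    ("q_swit | E_swit", W_SWIT, "swit"),
    ("q_data | E_data", W_DATA, "data"),
    ("q_data | E_swit", W_DATA, "swit"),
    ("q_swit | E_total", W_SWIT, "total"),
    ("q_total | E_total", W_TOTAL, "total"),
]:
    print(label, round(float(power(weights, ENCODINGS[enc])), 3))
```

Under these assumptions the computed values agree with the reported 1.15, 0.95, 1.63 and 1.21 to within 0.01, and the last line corresponds to the $q^{total}$ entry for P1 in Table II.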

Applying Algorithm 1, the resulting encoding $E^{total}$ is shown in Table I, and we compute $q^{swit}|E^{total} = 1.21$ and $q^{data}|E^{total} = 0.95$. We have compared our state-assignment algorithm with the 'pow3' algorithm [3] and the 'greedy' algorithm [6] by applying these algorithms to P1 using $W^{swit}$ and $W^{data}$ resp. The resulting encodings can be seen in the last two columns of Table I. The switching-activity power cost using the 'pow3' encoding is 1.26, and the data-transfer power cost using the 'greedy' encoding is 1.18; our $q^{swit}|E^{swit}$ and $q^{data}|E^{data}$ are 8.7% and 19.5% lower, resp.

V. EXPERIMENTS

Avnit and Sowmya [10], [11] develop a different but also formally-based methodology to synthesise protocol converters. They have built a tool that generates a set of possible converters, given two protocols, and they use the tool to generate converters for the well-known industrial protocols AMBA ASB and APB [12]. Their tool in fact synthesises 92 combinational choices for converters, which constitute the "design space" that the engineer needs to explore. Due to time constraints, we have predicted the power consumption of just 17 of these choices, with the aim of testing the fidelity of our power model and, of course, identifying the one that consumes the lowest power. Because of space constraints we show the results for just 10 of them in Table II. In column 1 we list the protocols, in column 2 the number of states (ns), in columns 3–5 the individual power predictions and the predicted total power, and in the final column the value of the dynamic power computed by the simulator.

TABLE II
POWER PREDICTION AND SIMULATION RESULTS FOR AMBA PROTOCOLS

ID    ns   q^swit   q^data   q^total   q^simu
C1    10   1.08     6.32     7.40      35
C2    10   1.08     8.12     9.20      39
C3    9    1.21     5.44     6.65      33
C4    9    1.00     5.00     6.00      27
C5    8    1.20     5.60     6.80      29
C6    9    1.08     5.76     6.84      30
C7    9    1.16     2.72     3.88      21
C8    9    1.14     4.38     5.52      23
C9    8    1.11     7.22     8.33      42
C10   8    1.18     6.54     7.72      35
P1    8    1.21     0.95     2.16      8
P2    4    1.23     3.69     4.92      17
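Fidelity, in the sense used here, means that the predicted ordering of designs matches the simulated ordering. One way to check this on the ten converter rows of Table II is sketched below, using a Pearson correlation coefficient and the fraction of converter pairs that are ranked the same way by $q^{total}$ and $q^{simu}$. Both measures are our choice for illustration; the paper itself reports the comparison graphically.

```python
from itertools import combinations
from math import sqrt

# Converter rows of Table II: ID -> (q_swit, q_data, q_total, q_simu).
TABLE_II = {
    "C1": (1.08, 6.32, 7.40, 35), "C2": (1.08, 8.12, 9.20, 39),
    "C3": (1.21, 5.44, 6.65, 33), "C4": (1.00, 5.00, 6.00, 27),
    "C5": (1.20, 5.60, 6.80, 29), "C6": (1.08, 5.76, 6.84, 30),
    "C7": (1.16, 2.72, 3.88, 21), "C8": (1.14, 4.38, 5.52, 23),
    "C9": (1.11, 7.22, 8.33, 42), "C10": (1.18, 6.54, 7.72, 35),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def concordance(xs, ys):
    """Fraction of pairs ordered the same way by both measures (ties count as disagreement)."""
    pairs = list(combinations(range(len(xs)), 2))
    agree = sum((xs[i] - xs[j]) * (ys[i] - ys[j]) > 0 for i, j in pairs)
    return agree / len(pairs)


ids = list(TABLE_II)
q_total = [TABLE_II[c][2] for c in ids]
q_simu = [TABLE_II[c][3] for c in ids]

r = pearson(q_total, q_simu)
same_order = concordance(q_total, q_simu)
lowest_predicted = min(ids, key=lambda c: TABLE_II[c][2])
lowest_simulated = min(ids, key=lambda c: TABLE_II[c][3])

print(f"Pearson r = {r:.2f}, concordant pairs = {same_order:.2f}")
print(lowest_predicted, lowest_simulated)
```

The same functions can be applied to the $q^{swit}$ and $q^{data}$ columns to see how much each component contributes to the agreement.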

The simulations were carried out using the FPGA design tool Xilinx ISE® (version 10.1) [13]. Note that the toggle rate of each input signal in the testbench is determined by the steady-transition probabilities of the converter. For the sake of comparison we choose the same FPGA device 'XC3S400' in 'Spartan3' for each converter, and an identical configuration for the other settings, e.g. clock frequency f = 25 MHz. After simulation, a vcd file is automatically generated, which is then analyzed by Xilinx's 'XPower Analyzer'. The output of the analyzer includes the quiescent (static) power and dynamic power. In this work we consider only the dynamic power as the quiescent power estimates vary little.

In the table, we see that C7 is predicted to consume the lowest power. It also has the lowest dynamic power in the simulation. In contrast, C2 and C9 (in that order) are predicted to consume the most power. They also consume the most power in the simulation (but in reverse order). The ratio between the lowest and highest power consumers in the simulation is 50%, and in the prediction 42%. The last two rows in the table show the power consumption of the protocols P1 and P2, shown in Fig. 1.

We show graphs that plot each of our power predictions $q^{total}$, $q^{swit}$ and $q^{data}$ versus the simulator's dynamic power $q^{simu}$ in Fig. 2. Each of the 17 data points in the graphs represents a combinational choice for a converter. We see that there is good correlation between the predicted results and the simulation. The solid line indicates perfectly correlated data.

[Figure: three scatter plots of $q^{simu}$ (vertical axis) against $q^{swit}$, $q^{total}$ and $q^{data}$ (horizontal axes)]
Fig. 2. The estimated power versus the simulator's dynamic power

VI. CONCLUSION

We have used a high-level Markovian formalism to predict the power consumption of a protocol in relative terms. This approach enables the design engineer to make early design decisions that can make large differences to the power efficiency and, importantly, to make decisions that can be reversed as no synthesis has taken place. The results of our power analysis and actual simulation for a case study based on the AMBA protocol are surprisingly close given that the analysis is still 'crude'. Other factors such as buffers and the area (which measures the cost of sending control signals) also need to be included in the analysis, but that is future work. We also need to investigate what proportion of the total power is contributed by switching, data transfers, area and buffers. More case studies need to be investigated and other simulation tools need to be used. The background of this work is protocol converter synthesis, and we plan to integrate this approach into our protocol synthesis framework [14] to enable the lowest-power converter to be synthesised automatically.

Acknowledgment: we are grateful to Karin Avnit and Jorgen Peddersen from the University of New South Wales, and Zong Wang from the University of Edinburgh for their assistance with the implementation.

REFERENCES

[1] R. Goering, "System-level synthesis scheme homes in on low-power IC design," 2007, http://www.eetimes.com/news/design.
[2] M. Nemani and F. N. Najm, "High-level area and power estimation for VLSI circuits," IEEE Trans. Computer-Aided Design, vol. 18, pp. 697–713, 1999.
[3] L. Benini and G. De Micheli, "State assignment for low power dissipation," IEEE Journal of Solid State Circuits, vol. 30, pp. 258–268, 1995.
[4] L. Benini and G. De Micheli, "System-level power optimization: techniques and tools," ACM Trans. Des. Autom. Electron. Syst., vol. 5, no. 2, pp. 115–192, 2000.
[5] D. Marculescu, R. Marculescu, and M. Pedram, "Theoretical bounds for switching activity analysis in finite-state machines," in Int'l Symp. on Low Power Electronics and Design, 1998, pp. 36–41.
[6] W. Nöth and R. Kolla, "Spanning tree based state encoding for low power dissipation," in DATE '99: Proc. of the Conf. on Design, Automation and Test in Europe. ACM, 1999, pp. 168–174.
[7] N. D. Liveris and P. Banerjee, "Power aware interface synthesis for bus-based SoC designs," in Conf. on Design, Automation and Test in Europe (DATE'04), vol. 2. IEEE Computer Society, 2004, pp. 864–869.
[8] J. Cao and A. Nymeyer, "Formal model of a protocol converter," in 15th Computing: The Australasian Theory Symposium (CATS'09), ser. CRPIT, vol. 94, 2009, pp. 107–117.
[9] H. Hermanns, Interactive Markov Chains. Springer Berlin, 2002.
[10] K. Avnit, V. D'Silva, A. Sowmya, S. Ramesh, and S. Parameswaran, "A formal approach to the protocol converter problem," in DATE'08: Proc. of the Conf. on Design, Automation and Test in Europe. ACM, 2008, pp. 294–299.
[11] K. Avnit and A. Sowmya, "A formal approach to design space exploration of protocol converters," in Conf. on Design, Automation and Test in Europe (DATE'09). ACM, 2009, pp. 129–134.
[12] ARM, "AMBA specification," 2002, http://www.arm.com.
[13] Xilinx, "Xilinx ISE tutorial," 2008, http://www.xilinx.com.
[14] J. Cao and A. Nymeyer, "Formally synthesising a protocol converter: a case study," in Conf. on Implementation and Application of Automata (CIAA'09), ser. LNCS, 2009, pp. 249–252.