A global bus power optimization methodology for physical design of memory dominated systems by coupling bus segmentation and activity driven block placement

Hua Wang∗, Antonis Papanikolaou, Miguel Miranda, Francky Catthoor†
IMEC, Kapeldreef 75, Leuven, Belgium
∗ Also PhD student at the Katholieke Universiteit Leuven, Belgium
† Also Professor at the Katholieke Universiteit Leuven, Belgium
Email: wanghua,papaniko,miranda,[email protected]

Abstract— This paper presents a methodology that can substantially reduce the bus power consumption in memory dominated systems. It systematically combines an activity driven placement of the memories with a bus segmentation approach for the interconnect, in order to localize the wire switching activity and minimize the associated capacitive load of the memory bus. A factor of 2.8 in bus power reduction is achieved for a real-life design while maintaining the same performance.

I. Introduction

As feature size scales down, interconnect starts to consume relatively more power than logic [1]. This is especially significant for on-chip inter-block wires such as memory buses, due to their relatively large capacitance and high activity. Much effort has been invested in reducing the power consumed by the memory subsystem during the application mapping phase [2, 3, 14]. Here, we tackle this problem at the physical design level.

Various methods addressing the bus power problem have been proposed in the literature [4, 5, 6, 8, 9]. Among these, bus segmentation can effectively reduce bus power consumption by dividing the buses into smaller pieces and activating only the segments that are needed at any given time [8, 10, 11]. Current research in this area mainly focuses on the bus segmentation scheme [5, 8, 11] and segment control [12]. With this method, bus switching activity is mostly confined to a small number of bus segments and inter-block switching is minimized. However, bus segmentation by itself cannot improve the power of systems where highly communicating blocks are physically far apart on the chip, since their communication still has to traverse a large distance (many segments) on the bus.

Another way of optimizing bus power is to use activity driven placement in the traditional system design process. This approach adds another optimization factor to the cost function of the placement stage: the product of activation frequency and wire capacitance of each bus. Minimizing this function has the effect of localizing activity and bringing highly communicating blocks closer together [9, 13]. However, although highly active blocks can have short connections after a power efficient placement, the bus energy gain is always counteracted by the shared bus, which forces every access to be spread over the entire bus.

As neither of the above two methodologies alone can achieve optimal bus power results, a possible way out is to combine them to further optimize bus power. In this paper, we propose a methodology to achieve this and we show that the overall savings are substantial compared to applying either approach alone.

II. Bus power reduction methodology

In order to achieve good results at the physical design stage, the application should first be properly optimized with respect to data transfer and storage related concerns, given the dominance of the storage related blocks. The goal is to have most of the activity localized in the smaller memories of the memory organization during the mapping phase. The least accessed data is then mapped to the larger memories. This hierarchy in activity can be exploited later during the physical design phase. This precondition is essential to allow placing memory blocks in the floorplan so as to minimize the distance to the source of the activity for the highly active blocks. In our case (see Section III), the driver application we consider has first been optimized using the Data Transfer and Storage Exploration (DTSE) methodology [14], which is an approach for system level memory management of data dominated applications with a focus on lowering power. In this way, the more active memories are also smaller in size, allowing them to be placed in floorplan “layers” closer to the functional units and therefore requiring shorter wire lengths.

A. Problem formulation

The dynamic power consumption of a bus is given by the following formula:

$$P \propto C_{ul} \cdot \mathit{length} \cdot V_{dd} \cdot V_{swing} \cdot f_{activation} \cdot \mathit{bitwidth},$$

where $C_{ul}$ is the capacitance per unit length of the wire. Clearly, one way to reduce the bus power consumption is to minimize the product $\mathit{length} \cdot f_{activation} \cdot \mathit{bitwidth}$ (although voltage and capacitance per unit length can also be exploited, our current

focus is at the physical design step). However, the activation frequency and bitwidth of the buses are usually dictated by the application, the high-level optimizations that are applied to it and the bus organization itself. We therefore focus on reducing the power by assigning short connections to buses that have a large activity times bitwidth product, and vice versa [7]. This will lead to an overall reduction in bus power consumption.

Although bus segmentation and activity driven block placement both aim at lowering bus power by minimizing the aforementioned product, each of them on its own overlooks the interaction between bus organization, placement strategy and the system level, activity related data transfer and storage exploration. Moreover, although a significant power reduction has been reported in [11] by applying power aware placement in a segmented bus approach, no detailed exploration was presented there.

B. Activity driven memory placement methodology using bus segmentation

In order to overcome these problems we propose an activity driven placement approach combined with bus segmentation. This combination can effectively reduce the physical distance between the highly communicating blocks on the segmented bus. On the other hand, adding this optimization step may also increase the distances between some other blocks on the segmented bus. However, these blocks usually communicate less frequently due to the DTS-like optimizations. Thus, this overhead will not cancel the gain in bus power consumption. From these observations, we conclude that the following steps should be applied to minimize bus power consumption.

Step 1: Optimize the range of access frequency using a DTS-like methodology. By mapping highly activated data blocks into small memories, a large difference in access frequencies between small memories and large ones can be achieved.
This enables shortening the distance between the datapath and the highly accessed memories in the later steps, by increasing the freedom for the floorplan.
Step 2: Place all the blocks according to a floorplan template. This template consists of several layers (like an onion). Fig. 1 shows an instance of that floorplan, where the most accessed memories (M1-M4) are placed at the core of the structure, close to the source of activity (e.g., register files and functional units). Within each layer, all blocks having physical connections should be clustered together, e.g. those connected to the same bus. The principle underlying this step is already standard practice in modern chip design, but we advocate applying it even more aggressively and consistently by exploiting all the system level characterization that the designer can obtain (in our case from the DTSE toolset [16]).
Step 3: Order the blocks within each cluster according to their access frequency. The memories should be placed in descending order of access frequency, with the most accessed memory closest to the datapath.
Step 4: Apply bus segmentation considering the actual floorplanning decisions and template constraints. Without the third step, optimal results may not be obtained.
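As an illustration of how Steps 3 and 4 interact, the following minimal Python sketch evaluates the length ∗ factivation ∗ bitwidth metric for a single linear bus. It is a toy model and not the flow used for the design in this paper: the `Memory` class, the access counts, the per-block bus spans and the assumption that an access on a segmented bus only drives the wire between the datapath and the target memory are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    name: str
    accesses: int   # hypothetical access count over a profiled run
    span: float     # hypothetical bus length added by this block (arbitrary units)


def order_by_activity(mems):
    """Step 3: put the most accessed memory closest to the datapath."""
    return sorted(mems, key=lambda m: m.accesses, reverse=True)


def shared_bus_cost(mems, bitwidth):
    """Shared bus: every access toggles the whole bus, whatever the placement."""
    total_len = sum(m.span for m in mems)
    return sum(total_len * m.accesses * bitwidth for m in mems)


def segmented_bus_cost(mems, bitwidth):
    """Segmented bus: an access toggles only the wire up to its memory.

    The list order is the order along the bus, starting at the datapath.
    """
    cost, dist = 0.0, 0.0
    for m in mems:
        dist += m.span
        cost += dist * m.accesses * bitwidth
    return cost


# Hypothetical placement with the largest, least accessed memory first.
mems = [Memory("big", 10, 4.0), Memory("mid", 100, 2.0), Memory("small", 1000, 1.0)]

print(shared_bus_cost(mems, 16))                        # 124320.0
print(segmented_bus_cost(mems, 16))                     # 122240.0
print(segmented_bus_cost(order_by_activity(mems), 16))  # 21920.0
```

In this toy configuration segmentation alone barely helps because the most active memory sits at the far end of the bus, whereas segmentation applied after the activity driven ordering lowers the metric considerably. The actual ratios for a real design depend on its access profile and floorplan.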

The final step is the conventional place&route step, which is required to further minimize the interconnect length of the segments, to perform the detailed placement and routing, and to compact the final layout while taking care of the timing constraints. This last step should follow the relative placement of the blocks and switches decided up front.

Still, the results of this methodology depend heavily on how the activity is divided among the memories of the system. If the highly accessed blocks have an access frequency much higher than that of the other memories, the power gain of our methodology can potentially be large. This happens in practice because the bus length reduction of our methodology is “amplified” by the large difference in access frequency of the blocks. On the other hand, if the difference is not that large (small access frequency gradient), the power gain will be limited compared to bus segmentation alone. In other words, the larger the access frequency gradient, the bigger the gain in power. Hence the need for the first step (DTS-like method).

C. Bus segmentation overhead

The major overhead of our methodology comes from the switches and their controller. Since switches are implemented using a few gates, the overhead they impose will be small, taking into account the increasing dominance of interconnect in power and delay as technology scales [1]. Also, the impact of the control wires for the switches (normally one per switch) is marginal compared to the buses, which are usually composed of 16 or 32 wires. Moreover, the memory controller in today's shared bus implementations can also provide the activation controls of the switches with little additional complexity. Although some overhead is clearly introduced by the addition of switches on the bus, this overhead can be very limited. Still, further research is required to decide whether switches should be added everywhere a bus splits.
Therefore the optimal segmentation structure may not introduce switches everywhere.

[Figure 1 block labels: Datapath, RF, M1, M2, M3, M4, M5, M6; 1st layer of memories, 2nd layer of memories]

Fig. 1. Onion-like floorplan template, gathering most of the activity in the middle of the die
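The dependence of the power gain on the access frequency gradient, discussed in Section II-B, can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: every block is assumed to add one unit of wire to a linear segmented bus, the access counts are hypothetical, and the reference point is the least favourable ordering rather than any placement produced by a real tool.

```python
def segmented_cost(accesses):
    """Metric for a segmented bus with one unit of wire per block.

    accesses[0] belongs to the block next to the datapath, so the block at
    index i is reached through i + 1 units of wire.
    """
    return sum((i + 1) * a for i, a in enumerate(accesses))


def ordering_gain(accesses):
    """Cost of the least favourable order divided by the activity driven order."""
    best = segmented_cost(sorted(accesses, reverse=True))   # most active first
    worst = segmented_cost(sorted(accesses))                # most active last
    return worst / best


steep = [1000, 100, 10]   # hypothetical: large access frequency gradient
flat = [100, 90, 80]      # hypothetical: small access frequency gradient

print(round(ordering_gain(steep), 2))  # 2.61
print(round(ordering_gain(flat), 2))   # 1.08
```

With a steep gradient the ordering changes the metric by a large factor, while with a nearly flat gradient the ordering hardly matters, which is why the DTS-like first step is a precondition for the later steps.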

III. Experimental setup and results

For the experiments, we have used a Digital Audio Broadcast (DAB) receiver as the driver application [15]. The application has been optimized using the DTSE methodology, and the access frequencies of the SRAMs, register files and ROMs are determined at the same time. Following the formulation in Section II, we have used $\mathit{length} \cdot f_{activation} \cdot \mathit{bitwidth}$ as the bus power metric. The bus length is estimated based on the geometry of the blocks in the floorplan.

The final layout using our approach is shown in Fig. 2. The power consumption for the four combinations of the two bus schemes and the two placement approaches is shown in Fig. 3. The combination of bus segmentation and activity driven block placement achieves the lowest bus power. Compared to a shared bus scheme using conventional placement, it achieves about a factor of 3 in bus power reduction, while segmentation alone only achieves about a factor of 2. Moreover, this methodology does not have a significant impact on area: area increases marginally (less than 5% compared to the minimum of the three other combinations). Delay, the most crucial requirement, does not get worse; it is actually improved. This holds because small memories can respond faster, while the larger ones are assigned more cycles to respond, using the timing slack created at the system level with the DTSE methodology.
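The text above does not specify how the bus length is derived from the block geometry. As one possible reading, the following Python sketch estimates a connection length as the Manhattan distance between block centers and feeds it into the bus power metric. The `Block` class, the coordinates, the access count and the choice of Manhattan distance are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    name: str
    x: float        # lower-left corner, arbitrary units
    y: float
    width: float
    height: float

    def center(self):
        return (self.x + self.width / 2, self.y + self.height / 2)


def wire_length(a, b):
    """Manhattan distance between block centers (an assumed length estimate)."""
    (ax, ay), (bx, by) = a.center(), b.center()
    return abs(ax - bx) + abs(ay - by)


def bus_metric(length, accesses, bitwidth):
    """The length x activation x bitwidth product used as the bus power metric."""
    return length * accesses * bitwidth


# Hypothetical blocks: a memory placed directly next to the datapath.
datapath = Block("Datapath", 0, 0, 2, 2)
m1 = Block("M1", 2, 0, 2, 2)

length = wire_length(datapath, m1)
print(length)                        # 2.0
print(bus_metric(length, 1000, 16))  # 32000.0
```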

Fig. 2. Layout of activity driven placement with segmented buses

[Figure 3 bar chart: vertical axis from 0 to 1.1; bars for Shared bus + Conventional placement, Segmented bus + Conventional placement, Shared bus + Activity driven placement, Segmented bus + Activity driven placement]

Fig. 3. Normalized bus power in different combinations of placement and bus scheme

IV. Conclusion

In this paper we propose a new methodology which combines an activity driven placement strategy with a bus segmentation approach to reduce the overall power consumed on the bus. Experimental results have shown that this methodology can substantially reduce the bus power consumption with little area overhead, while still ensuring that the timing constraints are met. Following this approach, bus power consumption was reduced by a factor of 2.8 for the same overall performance using a real-life application driver.

References

[1] International Technology Roadmap for Semiconductors, 2001 edition.
[2] P. Panda, F. Catthoor, N. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandecappelle, P. G. Kjeldsberg, “Data and memory optimizations for embedded systems”, ACM Trans. on Design Automation for Embedded Systems, Vol. 6, No. 2, pp. 142-206, April 2001.
[3] L. Benini, G. De Micheli, “System-level power optimization techniques and tools”, ACM Trans. on Design Automation for Embedded Systems, Vol. 5, No. 2, pp. 115-192, April 2000.
[4] Y. Aghaghiri, F. Fallah, M. Pedram, “Irredundant address bus encoding for low power”, ACM/IEEE International Symposium on Low Power Electronics and Design, pp. 182-187, August 2001.
[5] Hui Zhang, Jan Rabaey, “Low-swing interconnect interface circuits”, Proceedings of the 1998 International Symposium on Low Power Electronics and Design, 10-12 Aug. 1998, pp. 161-166.
[6] Benjamin Bishop, Victor Lyuboslavsky, N. Vijaykrishnan, Mary Jane Irwin, “Design considerations for databus charge recovery”, IEEE Trans. on Very Large Scale Integration (VLSI) Systems, Vol. 9, No. 1, Feb. 2001.
[7] A. Papanikolaou, M. Miranda, F. Catthoor, H. Corporaal, H. De Man, D. De Roest, M. Stucchi, Karen Maex, “Global interconnect trade-off for technology over memory modules to application level: case study”, Proceedings of ACM/IEEE Workshop on System Level Interconnect Prediction 2003, Monterey, California.
[8] J. Y. Chen, W. B. Jone, J. S. Wang, H.-I. Lu, T. F. Chen, “Segmented bus design for low-power systems”, IEEE Transactions on VLSI Systems, Vol. 7, No. 1, Mar. 1999, pp. 25-29.
[9] G. Holt, A. Tyagi, “GEEP: a low power genetic algorithm layout system”, IEEE 39th Midwest Symposium on Circuits and Systems, 1996, Vol. 3, pp. 1337-1340.
[10] Yan Zhang, Wu Ye, Mary Jane Irwin, “An alternative architecture for on-chip global interconnect: segmented bus power modeling”, Conference Record of the Thirty-Second Asilomar Conference on Signals, Systems & Computers, 1998, Vol. 2, pp. 1062-1065.
[11] W.-B. Jone, J. S. Wang, Hsuen-I Lu, I. P. Hsu, J.-Y. Chen, “Design theory and implementation for low-power segmented bus systems”, ACM Transactions on Design Automation of Electronic Systems, Vol. 8, No. 1, Jan. 2003, pp. 38-54.
[12] T. Seceleanu, J. Plosila, P. Liljeberg, “On-chip segmented bus: a self-timed approach”, 15th Annual IEEE International ASIC/SOC Conference, 2002, pp. 216-220.
[13] M. Jiménez, M. Shanblatt, “Integrating a low-power objective into the placement of macro block-based layouts”, Proceedings of the 44th IEEE 2001 Midwest Symposium on Circuits and Systems, 2001, Vol. 1, pp. 62-65.
[14] F. Catthoor, K. Danckaert, C. Kulkarni, E. Brockmeyer, P. G. Kjeldsberg, T. Van Achteren, T. Omnes, “Data access and storage management for embedded programmable processors”, ISBN 0-7923-7689-7, Kluwer Acad. Publ., Boston, 2002.
[15] Radio broadcasting systems; digital audio broadcasting to mobile, portable and fixed receivers. Standard RE/JTC-00DAB-4, ETSI, ETS 300 401, May 1997.
[16] http://www.imec.be/design/atomium/