Energy Consumption for Transport of Control Information on a Segmented Software-Controlled Communication Architecture

Kris Heyrman1, Antonis Papanikolaou2, Francky Catthoor3, Peter Veelaert4, Koen Debosschere5, and Wilfried Philips6

1 Hogeschool Gent, Schoonmeersstraat 52, B-9000 Gent, Belgium, and IMEC, Kapeldreef 75, B-3001 Leuven, Belgium. [email protected]
2 IMEC. [email protected]
3 IMEC, and Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, B-3001 Heverlee, Belgium. [email protected]
4 Hogeschool Gent. [email protected]
5 ELIS, Ghent University, St-Pietersnieuwstraat 41, B-9000 Gent, Belgium. [email protected]
6 TELIN, Ghent University. [email protected]

Abstract. The segmented bus is a power-efficient architecture for intra-tile SoC communication, where energy is saved by switching off unused bus segments cycle by cycle. We determine the pattern of switch control bits and calculate the cost of transporting them. A test case indicates that this cost is much lower than the gain obtained from the segmentation, and that the prospects of segmented buses remain promising.

1 Trends in Communication Architecture: Software Control, Heterogeneity and Parallelism

In embedded systems, minimal energy dissipation and fast performance are challenging targets, while flexibility is necessary for quick time-to-market. Software control over the architecture is a definite trend in processor and memory architecture design, addressing the need to reduce energy consumption. It makes sense for the communication network to be controlled by software as well. The drawbacks are the extra functionality required from the compiler and possibly increased design complexity. Once the necessary design tools are developed, these drawbacks will be overcome.

A second trend in embedded system design is a move towards heterogeneous architectures. Heterogeneity is a step towards power efficiency, since it allows circuit activity to be confined to portions of a hierarchy, while unused portions of the hierarchy do not consume power. Also, to optimize energy consumption, memory organizations are becoming multi-layered, with each layer consisting of memories of different size and type, as in the TI C5510 [13] or in those designed according to the DTSE methodology described in [2].

Finally, hardware parallelism enables designers either to decrease an application’s execution time or to trade off execution time for lower energy consumption, e.g. by decreasing the supply voltage. In both cases, the amount of data to be transferred per cycle is high, since a number of parallel resources need to be kept busy in order to achieve the expected volume. This implies that the communication network should provide sufficient bandwidth, and that this bandwidth should be achieved by parallelism in communication, not by overclocking the network.

2 The Segmented Bus Architecture

Fig. 1. The architecture of the segmented buses communication network includes parallel buses, segmentation switches, control wires and a communication controller.

In this paper we advocate the use of heavily segmented buses, which make for a very energy-efficient software-controlled architecture. As technology scales down, the power lost in driving the capacitance of long lines will ultimately outweigh the power lost in active circuits. Isolating those segments of a bus that are not in use, cycle by cycle, can deliver important savings in the power budget. This architecture is shown in Figure 1. To illustrate the type of environment where segmented buses become useful, Figure 2 shows a 14-bus system in the context of a VLIW processor cluster. The energy gains achieved by segmenting long bus interconnect wires have been reported in [3, 11], but the energy needed for steering the buses was neglected.
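The isolation of unused segments can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single linear bus cut into segments, one hypothetical module attached to each segment, and one switch between neighbouring segments that is simply open or closed (the switches in this paper carry 4 control bits each and also a data direction). The function names and segment lengths are invented for illustration.

```python
def closed_switches(src, dst, n_segments):
    """Switch k sits between segment k and segment k+1 (assumed layout).

    Only the switches lying between the source and sink segments are
    closed; all others stay open so their segments are isolated.
    """
    lo, hi = min(src, dst), max(src, dst)
    return [lo <= k < hi for k in range(n_segments - 1)]


def driven_fraction(src, dst, lengths_mm):
    """Fraction of the total wire length that must be driven for one transfer.

    A non-segmented bus would drive the full length (fraction 1.0).
    """
    lo, hi = min(src, dst), max(src, dst)
    driven = sum(lengths_mm[lo:hi + 1])
    return driven / sum(lengths_mm)


# Hypothetical segment lengths in mm; not taken from the paper's floorplans.
lengths_mm = [1.0, 2.0, 1.0, 4.0]

# Transfer between the modules on segments 1 and 2: one switch closes.
switches = closed_switches(1, 2, len(lengths_mm))   # [False, True, False]
fraction = driven_fraction(1, 2, lengths_mm)        # 3 mm of 8 mm = 0.375
print(switches, fraction)
```

Under these assumptions, a transfer between neighbouring modules drives only the wire between them, which is why short active connections (as produced by activity-aware floorplanning) translate directly into a smaller driven capacitance.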

Fig. 2. VLIW processor cluster with a multiple segmented bus

Four control bits are required per bus and per individual segmentation switch. The control bit format has been chosen to save energy (frequent changes require only one bit to change), but it could still be improved upon if the energy to transport the bits turned out to be excessive.

We will call $E_{seg}$ the total energy consumed on a segmented bus system. $E_{unseg}$ is the energy consumed on an equivalent bus which has no segmentation switches. $SG$, the segmentation gain, consists of the energy consumed by all segments that do not need to be driven, integrated over the duration of the application. At first sight, it looks like

$$SG = E_{unseg} - E_{seg}$$

There is, however, also a segmentation loss $SL$, which has two components:

$$SL = SL_{transport} + SL_{gen}$$

$SL_{transport}$ is the cost incurred to transport control bits from the source of control to the switches, which would not have to be done if the bus were non-segmented; $SL_{gen}$ is the cost to generate the bits from memory addresses or from the instruction flow. The cost of decoding the control bits is considered negligible here, as it certainly is in the limit when technology scales down: we only need a small number of minimum-size gates, and in the limit their consumption is much smaller than $E_{seg}$. $SL_{transport}$ does not depend on the method of generating the bits.

The controller can be implemented either as an address-based path decoder using a memory-mapped look-up table, or as an instruction-driven component which is part of the program flow generated by the compiler: the required control bits for the switches are inserted in the application code as separate network configuration instructions. In either case, scheduling is done at compile time and can be seen to be under software control.

With the first solution (path decoder), the routing need not be fully determined at compile time, although the scheduling must have been done. The solution is compatible with common register-indirect addressing modes, where the addresses at run time are not necessarily known to the compiler at compile time.
The second solution (fully compiled control) incurs a cost that the first does not: not only the schedule but also the actual path information must be fetched from instruction memory. It does offer more possibilities for power efficiency: hierarchical activity clustering [7] is possible, and distributed loop buffers can be employed to bring instruction decoding costs down in data-intensive applications of the sort that we contemplate on SoCs.

We will not consider $SL_{gen}$ further in this paper; it is the subject of further work. We suspect that there is a scalability issue at work here: for simple networks an address-based path decoder is probably optimal, but for complex networks instruction-driven switching may well be better. Suffice it to say that, at least for small networks, $SL_{gen}$ scales well with technology, and that, whatever its scalability with network complexity, $SL_{transport}$ does not depend on it.

3 Method: Segmented Bus Analysis

When developing a method for segmented bus analysis, we perform tasks that are part of design-time analysis or that need to be done at compile time by the control-bit-emitting compiler. Since we have no such compiler as yet, we designed a program that undertakes some of the extra tasks on a profiled run of the application.

We use an optimization toolset (Atomium/MA [1]) to allocate and assign the memories for the application, i.e. to design the memory hierarchy for a given application (or set of applications). In our experiments, the functional units are assigned from the C code; ultimately, this is of course also a task for the compiler. Storage bandwidth optimization is again done by the optimization toolset. The methodology presented in [14] is used to define the number of parallel communication resources needed to satisfy the application bandwidth requirements: based on high-level application mapping, the peak bandwidth is extracted and a sufficient number of parallel buses is allocated. The bus connections are defined based on the synthesis of the memory organization. The layouts are made according to the practice of activity-aware floorplanning [7]. All segment lengths, including control bit segments, are extracted from a commercial routing tool [10].

From the data obtained from the optimization tool, which includes profiling together with floorplanning information, we recover the execution schedule for each basic block, recover the memory assignment, reconstruct the access tree, and collapse it into a per-cycle node tree of concurrent accesses. Then, for each path through the bus geometry, we decide what the switch positions must be. Walking in time sequence through the tree of cycles that the application consists of, we resolve each transfer to a set of control bits to be emitted, and from the dynamic behavior of the control bits we calculate the energy required to transport the information, taking into account the physical lengths involved.

In essence, then, our power figures are derived from the characteristic capacitances and resistances of the technology node employed (130 nm CMOS in our case), the activity of the segments as derived from the access schedule, and the wire lengths from the floorplan. These figures do not come about automatically; they result from following a physical design methodology optimized for low power throughout.
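The last step of this flow, going from the per-cycle control bits to a transport energy, can be sketched roughly as follows. This is an illustrative simplification, not the program used for the results in this paper: it assumes that each toggle on a control wire costs 0.5·C·V², that wire capacitance is proportional to length, and it ignores wire resistance, repeaters and coupling. The function names, the capacitance per millimetre and the supply voltage are placeholders, not values from the 130 nm technology data.

```python
def toggle_counts(control_words):
    """Count 0->1 and 1->0 transitions per control wire.

    control_words is a list with one tuple of bits per scheduled cycle,
    one bit per control wire (hypothetical representation).
    """
    if not control_words:
        return []
    counts = [0] * len(control_words[0])
    for prev, curr in zip(control_words, control_words[1:]):
        for wire, (a, b) in enumerate(zip(prev, curr)):
            if a != b:
                counts[wire] += 1
    return counts


def transport_energy_joules(control_words, wire_lengths_mm,
                            cap_per_mm_farad, vdd_volt):
    """Estimate control-bit transport energy from switching activity.

    Each toggle is charged 0.5 * C * V^2, with C = length * cap_per_mm.
    """
    energy_per_toggle_per_mm = 0.5 * cap_per_mm_farad * vdd_volt ** 2
    counts = toggle_counts(control_words)
    return sum(n * length * energy_per_toggle_per_mm
               for n, length in zip(counts, wire_lengths_mm))


# Four cycles on two control wires of 2 mm and 3 mm (made-up numbers).
words = [(0, 0), (1, 0), (1, 1), (1, 1)]
energy = transport_energy_joules(words, [2.0, 3.0],
                                 cap_per_mm_farad=0.2e-12, vdd_volt=1.2)
print(toggle_counts(words), energy)
```

In this simplified model a control word that repeats from one cycle to the next costs nothing, so only changes in the switch settings contribute to the estimate.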

4 Application: Digital Audio Broadcast (DAB) Receiver

In order to estimate control power, we have made 4 different floorplans for a DAB receiver. Each represents a different trade-off between power efficiency and circuit area, resulting in a different on-chip memory count and complexity of the bus structure. The DAB receiver has three functional units: an FFT subsystem, a Viterbi decoder, and a deinterleaver. After data storage and bandwidth exploration (DTSE) analysis [2] of the problem, 4 different sets of optimizations are chosen to set 4 alternative tasks for the design process. The solutions all feature 3 parallel buses and an unequal number of memories: 4, 8, 10 and 12. They all have an 8-bit bus, which in some solutions is extended to include some 16-bit segments, and two 32-bit buses.

Fig. 3. Comparison of $E_{seg}$ vs. $E_{unseg}$ for 4 design choices. (Bar chart "Bus Energy, Segm. vs. Non-Segm. Bus": energy in mJ against number of memories, 4, 8, 10 and 12; series: non-segmented, segmented.)

5 Observations

When analyzing typical activity patterns from this design, observing local switch activity encourages us to look for clusters of switches that can be efficiently driven from loop buffers. We find that if any two switches change data direction, all switches in between also change data direction and show activity. This would in general be bad for the locality of switching activity. The effect is counteracted by power-aware floorplanning, thanks to which active connections will be short and the switches in between them few.

Using the switch control patterns and the segment and control bit line lengths, we can calculate the energies. In Figure 3, $E_{seg}$ is compared with $E_{unseg}$ for all 4 design choices. The comparison confirms the advantage of a segmented bus from the standpoint of power efficiency. We see that $E_{seg}$ and $E_{unseg}$ both reach a minimum which is not radically different between the 4 solutions. This would indicate that segmentation does not impose different optimization targets for physical layout than a non-segmented solution.

In Figure 4, we compare $SL_{transport}$ with $E_{seg}$. It is of a lower order of magnitude. Intuitively, this can be attributed to a good choice of the switch codes, reducing the number of active control wires. Moreover, there are many more active data and address lines than control bit lines. Also, there is only limited activity on some buses in some branches of the program, thanks to activity-aware placement.

Fig. 4. Comparison of $SL_{transport}$ vs. $E_{seg}$ for 4 design choices. (Bar chart "Switch Cntr. Energy vs. Segm. Bus Energy": energy in mJ against number of memories, 4, 8, 10 and 12; series: switch control energy, segmented bus energy.)

We can observe from the test case of the DAB receiver that:

– $E_{seg}$ is 17-21% of $E_{unseg}$.
– $SL_{transport}$ is much smaller than $SG$. It is in the range of 1.5-6% of $E_{seg}$. So a further reduction of $SL_{transport}$ is not required (footnote 7).
– Clustering the switches may make sense both locally and per bus. Often the switches do not change state because the bus is not in use. At other times, there are frequent patterns on a section of a bus because only short sections are being used.
– For long periods, switching often occurs on every cycle. This follows from what the storage bandwidth optimization considers to be a cycle: a period through which accesses to external memory are scheduled. Consecutive cycles during which only internal registers are accessed are not counted. Only if the access schedule were completely the same for two cycles, with respect to sources and sinks as well as data direction, would there be no switching activity (or else when the bus is simply not used).
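To put the first two observations side by side, the reported ranges can be combined in a few lines of arithmetic. This is a back-of-the-envelope sketch under two assumptions that the text does not state explicitly: that the first-sight relation SG = Eunseg − Eseg is used for the gain, and that the extremes of the two reported ranges may be paired freely (the paper does not say which design choice yields which percentage), so the results are rough bounds only.

```python
def loss_to_gain_ratio(eseg_over_eunseg, sl_over_eseg):
    """Return SL_transport / SG, with all energies normalised to E_unseg."""
    sg = 1.0 - eseg_over_eunseg              # SG / E_unseg (first-sight relation)
    sl = sl_over_eseg * eseg_over_eunseg     # SL_transport / E_unseg
    return sl / sg


# Most favourable pairing: E_seg at 17% of E_unseg, SL_transport at 1.5% of E_seg.
low = loss_to_gain_ratio(0.17, 0.015)
# Least favourable pairing: E_seg at 21% of E_unseg, SL_transport at 6% of E_seg.
high = loss_to_gain_ratio(0.21, 0.06)

print(f"SL_transport is roughly {low:.1%} to {high:.1%} of SG")
```

Under these assumptions the transport cost of the control bits comes out at roughly 0.3% to 1.6% of the segmentation gain, which is consistent with the statement that it is much smaller than the gain.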

Footnote 7: This is due to our judicious choice for software control. The picture would have been quite different had we used a full hardware-based NoC routing solution.

6 Related work

Segmented buses are not novel as such, having been developed in the mid-1990s in the context of supercomputing to speed up parallel computations; cf., for instance, Li [9]. Chen et al. [3] have illustrated their potential for energy optimization, but did not show how to program or control such an architecture. Most research on communication architectures has focused on inter-tile communication. The architecture discussed in this paper is intended for intra-tile communication and uses finer-grain segmentation and simpler control. In this area, the literature is limited. Current industrial SoC implementations rely on textbook [4] solutions such as point-to-point connections [5], shared buses [12] and crossbars [8]. These are general-purpose architectures that do not provide the energy efficiency and scalability required for massively parallel processing [6].

The term “segmented bus” is at times used to refer to multiple inter-tile buses interconnected by bridges. Our “segmented bus” is different, taking segmentation to its logical conclusion: not only are intra-tile and inter-tile buses decoupled, but every segment of the intra-tile bus can be decoupled to save power.

7 Conclusions

We have discussed a software-controlled energy-efficient segmented bus communication architecture for SoC designs, and compared the energy required to distribute the switch control bits with the energy consumed by the segmented bus itself, and also with the energy that would be consumed by the bus, were it not segmented. From a test case design we found that the energy costs of driving the switches are appreciably lower than the gain obtained. If we take the viewpoint of control energy, the case for the segmented bus still stands. We should not go for optimization of the transport component, but instead look for the ways to optimize the energy cost of fetching the control information.

References

1. “The ATOMIUM tool suite”, http://www.imec.be/design/atomium/
2. F. Catthoor et al., “Custom memory management methodology: exploration of memory organization for embedded multimedia system design”, Kluwer, June 1998.
3. J.Y. Chen et al., “Segmented bus design for low-power systems”, IEEE VLSI, Mar 1999.
4. J. Duato et al., “Interconnection networks, an engineering approach”, IEEE Computer Society, Jun 1997.
5. S. Dutta et al., “Viper: a multiprocessor SoC for advanced set-top box and digital TV systems”, IEEE Design & Test, Sep 2001.
6. A. Gangwar et al., “Evaluation of bus based interconnect mechanisms in clustered VLIW architectures”, DATE, 2005.
7. J. Guo et al., “Physical design implementation of segmented buses to reduce communication energy”, ASP-DAC, 2006.
8. B. Khailany et al., “Imagine: media processing with streams”, IEEE Micro, Mar 2001.
9. Y. Li et al., “Prefix computation using a segmented bus”, Southeastern Symposium on System Theory, Apr 1996.
10. “Blast Chip 4.0 User Guide”, Magma Design Automation, Cupertino, CA 95014, pp. 271-351, http://www.magma-da.com
11. A. Papanikolaou et al., “Architectural and physical design optimizations for efficient intra-tile communication”, Proc. Intnl SoC Symp., Finland, Nov 2005.
12. “TMS320VC5471 fixed-point digital signal processor data manual”, http://focus.ti.com/docs/prod/folders/print/tms320vc5471.html
13. “TMS320VC5510/5510A Fixed-Point Digital Signal Processors”, http://focus.ti.com/docs/prod/folders/print/tms320vc5510.html
14. T. Van Meeuwen et al., “System-level interconnect architecture exploration for custom memory organisations”, ISSS, 2001.