Performance Gain from Data and Control Dependency Elimination in Embedded Processors

Valeriu Codreanu, Radu Hobincu
Faculty of Electronics, Telecommunications and Information Technology
University "Politehnica" of Bucharest
Bucharest, Romania
[email protected],
[email protected]
Abstract— This paper presents a way of increasing overall performance in embedded processors by introducing an interleaved multithreading execution model that can be applied to any instruction set architecture. Common acceleration techniques such as superpipelining or branch prediction are not well suited to embedded machines because of their inherent inefficiency. We show that by removing dependencies within the processor, and thus eliminating the extra hardware required to maintain overall coherence, performance increases noticeably (up to 450%) while size and power consumption decrease. This approach also maintains backwards compatibility with legacy software, keeping the required software changes to a minimum.

Keywords- embedded; processor; interleaved; multithreading
I. INTRODUCTION
As the mobile ecosystem has developed, mobile computing architectures have become more and more complex. From the classic 3-stage pipelined ARM7 core with minimal speculation support found in the mobile devices of 2000, technology has pushed today's mobile architectures to superscalar ARM Cortex cores with 13-stage pipelines, branch prediction schemes and L2 caches, operating at frequencies past the gigahertz mark [2]. These structures, however, are power hungry and not always suited for mobile devices. A potential approach to reducing their area and power consumption is presented below.
A. Current Embedded Architectures

There are literally hundreds of embedded processors available, but in the last decade ARM cores have dominated the market, offering the most power-efficient architecture as well as the best software support. Other embedded approaches that have achieved some success are MIPS and, more recently, Intel x86. Although ARM has faced strong competition, it is now by far the most widely used embedded architecture: there is at least one ARM chip in almost every mobile phone produced in the last decade [12].

In recent years the mobile space has evolved rapidly, with ever more complex applications requiring ever more computing power. To keep pace with these software advances, application processors have grown in size and complexity, becoming more power hungry and less efficient. The memory gap has also widened, as processor frequencies are now at the gigahertz level and memory wait times have become longer. To achieve such high operating frequencies, the pipeline is very deep, comparable with desktop processors.

Today's highest-performance embedded cores are represented by ARM's Cortex family. The Cortex-A8 is a 32-bit core that implements the ARMv7 instruction set. Its pipeline architecture is presented in fig. 1. It is the first ARM core that uses a superscalar, dual-issue, in-order execution pipeline. The pipeline, unusually long, comprises a 13-stage main pipeline and a 10-stage NEON pipeline designed to accelerate multimedia tasks. In contrast, the ARM11 pipeline has only 8 stages, the ARM9's has 5 and the ARM7's has 3. According to ARM, the Cortex-A8's long pipeline enables high clock rates, potentially exceeding 1 GHz in a 65 nm process. It is also designed with two 32-KByte L1 caches, a dedicated L2 cache with 9-cycle latency, a dynamic branch predictor and all key forwarding paths [2].

B. Limitations introduced by hazards

Hazards occur when two or more instructions present in a processor's pipeline conflict. There are three types of hazards: data hazards, control hazards and structural hazards. Unresolved hazards result in pipeline stalls (bubbles). When the pipeline is stalled, the program stops advancing, resulting in a low Instructions Per Cycle (IPC) count.
Figure 1. Cortex-A8 pipeline architecture [2]
Data hazards occur when instructions in the pipeline depend on the results of prior instructions still in the pipeline. The Cortex-A8 implementation minimizes data hazards by adding most of the possible forwarding paths and a 32-KByte data cache, along with a dedicated L2 cache to reduce memory latency.
Over 20% of the instructions of a typical program are loads, each a possible cache miss [5]. A cache miss negatively affects performance, introducing up to tens or hundreds of pipeline bubbles; the Cortex-A8's L2 cache has a 9-cycle latency.

Control hazards occur because the CPU typically fetches the next instruction before a potential branch target is known. As can be seen from fig. 2, a mispredicted branch costs a 13-cycle penalty, so on a branch mispredict 13 upcoming instructions are flushed. To achieve a high prediction rate, ARM's dynamic branch prediction includes a 512-entry BTB (Branch Target Buffer), a 4K-entry Global History Buffer and an 8-entry return stack. Branch target buffers are hash tables that map program counters to branch target addresses; the memory is usually associative, is addressed every clock cycle, and every entry consists of a {pc, target_address} pair. It is equivalent to a 4-KByte memory structure plus additional logic. The Global History Buffer keeps a global branch history based on 2-bit saturating counters and is equivalent to a 1-KByte memory structure, along with the necessary logic. Both of these structures are power consuming, being addressed every clock cycle, and more than 20% of the dynamic instructions in the SPECint benchmarks are branches [5].

Structural hazards occur when a part of the processor is needed by two or more instructions in the same cycle. To minimize this type of hazard, the Cortex employs multiple execution units.

Hazards negatively affect performance, and to overcome them ARM introduced the complex mechanisms and structures presented above. This solution for increasing IPC is proven for desktop CPUs, but for low-power embedded cores it is not optimal. Most of the core area (L2 cache, branch predictor, etc.) is spent resolving or minimizing dependencies: reference [1] states that the L2 cache of the ARM Cortex-based Apple A4 SoC accounts for almost 50% of the CPU core area, and the total SRAM macro area occupies over 60% of it. If another mechanism can guarantee an independent stream of instructions in the pipeline, many of these inefficient structures can be removed or at least reduced, resulting in a lower-power, lower-cost processor.
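To make the cost of these structures concrete, the following C model sketches the two memories described above: a direct-mapped table of {pc, target_address} pairs sized like the 512-entry BTB, and a 4K-entry array of 2-bit saturating counters standing in for the Global History Buffer. The field names, indexing and hashing are our illustrative assumptions, not ARM's actual design.

```c
#include <stdint.h>
#include <stdbool.h>

#define BTB_ENTRIES 512    /* Cortex-A8 figure quoted above */
#define GHB_ENTRIES 4096   /* Cortex-A8 figure quoted above */

typedef struct {
    uint32_t pc;       /* tag: address of the branch instruction */
    uint32_t target;   /* predicted branch target address        */
    bool     valid;
} btb_entry_t;

static btb_entry_t btb[BTB_ENTRIES];
static uint8_t     ghb[GHB_ENTRIES];   /* 2-bit counters, 0..3 */

/* Look up a fetch address; returns true with a predicted target
 * if the branch is known and its counter says taken (2 or 3). */
static bool predict(uint32_t pc, uint32_t history, uint32_t *target)
{
    btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES];
    if (!e->valid || e->pc != pc)
        return false;                          /* unknown: fall through */
    uint32_t idx = (history ^ (pc >> 2)) % GHB_ENTRIES;
    if (ghb[idx] >= 2) {                       /* weakly/strongly taken */
        *target = e->target;
        return true;
    }
    return false;
}

/* On branch resolution, update the BTB entry and train the counter. */
static void train(uint32_t pc, uint32_t history, uint32_t target, bool taken)
{
    btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES];
    e->pc = pc; e->target = target; e->valid = true;
    uint32_t idx = (history ^ (pc >> 2)) % GHB_ENTRIES;
    if (taken  && ghb[idx] < 3) ghb[idx]++;    /* saturate upward   */
    if (!taken && ghb[idx] > 0) ghb[idx]--;    /* saturate downward */
}
```

Even in this toy model every fetch probes both arrays, which is why such predictor memories weigh so heavily on the power budget.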
Figure 2. Cortex-A8 control flow [2]

II. APPROACH

A. Interleaved multithreading

Generally, the loss in performance is caused by bubbles in the pipeline. These bubbles are of two types: small bubbles, produced by data dependencies of type RAW (Read After Write: a register operand must be read before one of the previous instructions finishes writing it), WAR (Write After Read: a register operand is written before one of the previous instructions finishes reading it; this only appears in out-of-order execution) and WAW (Write After Write: a register operand of an earlier instruction is written after a later instruction has already written it, leaving the register in a wrong state; this also only appears in out-of-order execution); and large bubbles, which appear due to control dependencies such as branches and mispredicts. Pipeline stalling caused by load-type instructions is another kind of dependency that can negatively affect performance.

Data dependencies can be eliminated by making sure that all instructions in all pipeline stages are independent. The solution we propose is the hardware interleaving of multiple threads whose instructions are all mutually independent. Each thread is allocated distinct hardware resources, sharing only the execution units.

T ≥ WSi - RSi (1)

If relation (1) is verified, that is, if there are at least as many available threads (T) as pipeline levels between the read stage (RSi) and the write stage (WSi), then all instructions in these stages are data independent. This design can be implemented by a generic block we call the Thread Control Unit, which is responsible for inserting independent threads into the execution pipeline. This block removes the need for a forwarding mechanism. Pipeline flushes due to branches can be handled in a very similar way:

T ≥ ESi - FSi (2)

Just like relation (1), if relation (2) is verified, that is, if there are at least as many available threads (T) as pipeline levels between the fetch stage (FSi) and the execute stage (ESi) (considering that this is the stage where branches are decided and the new program counter is computed), then all instructions in these stages are control independent and there is no need to flush on a branch instruction. This also removes the need for a branch prediction block.
T ≥ max(WSi - RSi, ESi - FSi) (3)

So if there are enough execution threads to verify (3), which simply combines (1) and (2), the execution pipeline is guaranteed to always be full. To maximize efficiency we must ensure not only that the pipeline is full, but also that it is constantly shifting instructions. This is achieved by a special background unit we call the Data Transfer Unit, whose role is to execute the load and store instructions, as well as program fetching, outside the main pipeline flow. A thread waiting for this unit to complete a task will not be selected for execution by the Thread Control Unit.
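The following C sketch illustrates the scheduling policy implied above, under our own naming: each clock cycle the Thread Control Unit picks the next thread that has no instruction left in the guarded read-to-write stages and is not parked on the Data Transfer Unit. The in_flight bookkeeping enforces relation (1) dynamically; the real unit would be a small piece of combinational logic, not software.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_THREADS 8   /* T: satisfies (1)-(3) for this pipeline      */
#define GUARD_DEPTH 2   /* stages between read and write in the sketch */

typedef struct {
    bool    waiting_on_dtu; /* parked on a background load/store/fetch */
    uint8_t in_flight;      /* cycles until its last instruction clears
                               the guarded read..write stages           */
} thread_state_t;

static thread_state_t threads[NUM_THREADS];
static unsigned next_thread = 0;

/* One decision per clock cycle: returns the id of the thread whose
 * instruction enters the pipeline, or -1 for a genuine bubble. */
static int tcu_select(void)
{
    /* the pipeline advances one stage per clock for every in-flight op */
    for (unsigned t = 0; t < NUM_THREADS; t++)
        if (threads[t].in_flight > 0)
            threads[t].in_flight--;

    /* round-robin scan for a thread with no instruction left in the
     * guarded stages and no pending Data Transfer Unit request */
    for (unsigned i = 0; i < NUM_THREADS; i++) {
        unsigned t = (next_thread + i) % NUM_THREADS;
        if (!threads[t].waiting_on_dtu && threads[t].in_flight == 0) {
            threads[t].in_flight = GUARD_DEPTH;
            next_thread = (t + 1) % NUM_THREADS;
            return (int)t;
        }
    }
    return -1; /* a real bubble: every thread is parked on the DTU */
}
```

When all eight threads are runnable the selection degenerates to pure round-robin and the pipeline never stalls; as threads park on the Data Transfer Unit, bubbles reappear, which is exactly the behaviour measured in the first test below.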
To generalize the idea of the Data Transfer Unit, we require that every instruction take exactly one clock cycle to pass through any pipeline stage, and that instructions cannot be expanded, as that would add further dependencies. If such a need arises, a background unit should receive and handle the request while putting the thread on hold. Fig. 3 shows a generic block diagram of this machine, named BEAM (Bubble-free Embedded Architecture for Multithreading), without the traditional modules such as branch prediction and data forwarding. It is worth noting that this approach places no restriction on the instruction set architecture, so it can be applied to any processor in use today.

Figure 3. BEAM block diagram: the Thread Control Unit feeds the pipeline; the register file has distinct regions for every hardware thread, 2 read and 1 write ports for the execution units and at least 1 read-write port for the background execution units; the ALU and generic execution units take 1 clock cycle per instruction; the Data Transfer Unit and generic background execution units take multiple clock cycles per instruction and handle requests from and to the execution units.

B. Testing environment

In order to evaluate this design we implemented a processor core with a minimal RISC instruction set. It supports 8 hardware threads with sixteen 32-bit registers per thread, an integer-only ALU, dedicated memory load/store instructions and direct addressing, in a 2-level pipeline. The core is connected to two 8-KByte caches. Using an assembler for this machine we developed several test programs. The same algorithms were implemented in ARM assembly code and simulated on the official RealView ARM1176 simulator, which reports the number of clock cycles needed to complete the program.

III. DESIGN RESULTS
The first test is a sequential scalar vector multiplication: it generates 8000 integer values in two memory arrays and then computes their scalar product with overflow. The multiplication task is split among 4 threads. This test was designed as a worst-case scenario, since most of the time the threads are waiting for the Data Transfer Unit to fetch data.
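In C, the work split of this test looks roughly like the fragment below. The actual test was written in BEAM assembler, so the names, the slice-per-thread partitioning and the wrap-around reading of "with overflow" are our reconstruction of the algorithm, not the original source.

```c
#include <stdint.h>

#define N           8000  /* integer elements per array, as in the test */
#define NUM_WORKERS 4     /* the multiplication is split among 4 threads */

static int32_t  a[N], b[N];
static uint32_t partial[NUM_WORKERS];

/* Each hardware thread multiply-accumulates one contiguous slice.
 * On BEAM every a[i]/b[i] access goes through the Data Transfer
 * Unit, parking the thread, which is why this test is the worst
 * case: the pipeline is data starved most of the time. */
static void worker(unsigned tid)
{
    uint32_t acc = 0;
    unsigned lo = tid * (N / NUM_WORKERS);
    unsigned hi = lo + (N / NUM_WORKERS);
    for (unsigned i = lo; i < hi; i++)
        acc += (uint32_t)a[i] * (uint32_t)b[i]; /* 32-bit wrap-around */
    partial[tid] = acc;
}

/* Sequential stand-in for what the 4 hardware threads do in parallel. */
int32_t scalar_product(void)
{
    uint32_t sum = 0;
    for (unsigned t = 0; t < NUM_WORKERS; t++) {
        worker(t);
        sum += partial[t];
    }
    return (int32_t)sum;
}
```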
The results shown in fig. 4, correlated with an analysis of the waveforms, show that the machine is data starved and the whole algorithm is I/O bound. The ARM core finished the scalar multiplication test in 141,590 cycles.

Figure 4. Scalar multiplication cycle count on BEAM: 51,329 total cycles (pipeline empty 54.00%, 1 level full 14.00%, 2 levels full 32.00%)

The second test was designed to be closer to a real embedded application: 8 threads running 4 different algorithms, namely a pseudo-random number generator, an arithmetical computation of 3 to the 1000th power, a Fibonacci series generator and a memory parser. As the results in fig. 5 show, the pipeline efficiency rises to 99.8%. In this case, the ARM core executed the program in 2,139,780 cycles.

Figure 5. Multi-program cycle count on BEAM: 480,861 total cycles (pipeline empty 0.05%, 1 level full 0.14%, 2 levels full 99.81%)

Comparing these results with the total cycle counts of the same programs running on an ARM processor, we obtained the numbers shown in fig. 6. The execution time on BEAM is from 2.5 times better in the worst-case scenario (scalar array multiplication) up to 4.5 times better in a multithreading-heavy application.

Figure 6. Execution time comparison (kilo-cycles): scalar multiplication, ARM11 141.59 vs. BEAM 51.329; multi-program, ARM11 2139.78 vs. BEAM 480.861

But BEAM is not only about raw performance. BEAM's area is less than half that of an ARM11, and its power consumption is also approximately half, while the operating frequency is similar to an ARM11 implementation. BEAM was synthesized in a TSMC 90 nm process, using the Design Compiler flow; the synthesis results are presented in Table I. Based on these numbers we can compute MIPS/mm2 or MIPS/mW metrics: for example, for the highly threaded program, BEAM consumes approximately 10% of the energy consumed by an ARM11.

TABLE I. SYNTHESIS COMPARISON BETWEEN BEAM AND ARM11 (TSMC 90 nm)

                               ARM1176    BEAM
Max frequency (MHz)            ~500       481
Area (mm2)                     ~3         1.3
Power consumption (mW/MHz)     ~0.4       0.2
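The energy figure quoted above can be checked from Table I and the measured cycle counts: for a fixed program, E = P * t = (P/f) * cycles, so (mW/MHz) times cycle count is proportional to energy, with the operating frequency cancelling out. A quick sanity check of the multi-program case:

```c
#include <stdio.h>

/* Sanity check of the energy claim, using Table I and the measured
 * cycle counts: (mW/MHz) * cycles is proportional to the energy
 * consumed for a given program, independent of clock frequency. */
int main(void)
{
    const double arm_mw_per_mhz  = 0.4;    /* Table I, ARM1176   */
    const double beam_mw_per_mhz = 0.2;    /* Table I, BEAM      */
    const double arm_cycles  = 2139780.0;  /* multi-program test */
    const double beam_cycles =  480861.0;

    double ratio = (beam_mw_per_mhz * beam_cycles) /
                   (arm_mw_per_mhz  * arm_cycles);
    printf("BEAM energy relative to ARM11: %.1f%%\n", ratio * 100.0);
    return 0;
}
```

The result, about 11%, is consistent with the approximately 10% stated above.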
To further see the impact of complex computing structures, we can outline the main processor blocks of a modern embedded processor, the Intel Atom. Atom is the largest embedded CPU, with an area of 25 mm2.
Atom's main components are labeled in the Intel die micrograph shown in fig. 7:
FEC – front-end cluster
FPC – floating point cluster
IEC – instruction execution cluster
MEC – memory execution cluster (plus L1 data cache)
BIU – bus interface unit

Figure 7. Silverthorne micrograph [10]
The L2 cache is 512 KBytes in size, occupying about 40% of the core area. The Front End Cluster includes the L1 instruction cache and the branch prediction logic; it is the second biggest block in Atom, and about 50% of its area could be saved by using an approach like BEAM. On smaller cores like ARM, the impact of these memory structures is up to 70%. BEAM's only area overhead compared with a classic CPU is a larger register file, to fit each thread's registers, and the Thread Control Unit, which stores and manages each thread's internal state.
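That register file overhead amounts to nothing more than a wider index: the physical register address is the thread id concatenated with the architectural register number, as in the C sketch below. The sizes are taken from our prototype; the helper names are illustrative.

```c
#include <stdint.h>

#define NUM_THREADS     8   /* hardware threads in the prototype */
#define REGS_PER_THREAD 16  /* 32-bit registers per thread       */

/* One flat SRAM holds every thread's architectural registers; the
 * Thread Control Unit supplies the thread id, so no register copying
 * or software context switch is ever needed. */
static uint32_t regfile[NUM_THREADS * REGS_PER_THREAD];

/* The physical register address is {thread id, register number}. */
static inline uint32_t reg_read(unsigned tid, unsigned r)
{
    return regfile[tid * REGS_PER_THREAD + r];
}

static inline void reg_write(unsigned tid, unsigned r, uint32_t v)
{
    regfile[tid * REGS_PER_THREAD + r] = v;
}
```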
IV. CONCLUSIONS
The BEAM architecture shows a great increase in performance when executing many heterogeneous threads. The thread count requirement is extremely important for this approach to give good results: if either (1) or (2) is not verified, performance begins to drop towards the worst-case scenario, where only one thread is available. However, in several embedded segments multithreading is an important requirement, especially when the system has many peripherals and also runs an operating system.

A similar solution was developed by the former Sun Microsystems and implemented in the Niagara chip, which later became the open-source OpenSPARC T1. However, that chip was designed for server applications and many of its features are not suited for embedded use [11].

Perhaps the most important achievement of BEAM's design is its energy efficiency. By shrinking the big memory structures required by dynamic branch predictors and L1/L2 caches (over 60% of a modern CPU's area) and by keeping the pipeline full of instructions at all times, BEAM looks very promising in terms of MIPS/mW and MIPS/mm2, especially in heavily multithreaded environments. Future research will evaluate the BEAM execution model in real-time embedded applications.

REFERENCES
[1] P. Boldt, D. Scansen and T. Whibley, "Apple's A4 dissected, discussed...and tantalizing," EETimes, 17 June 2010. [Online; cited 16 July 2010.] http://www.eetimes.com/electronicsnews/4200451/Apple-s-A4-dissected-discussed--and-tantalizing
[2] Texas Instruments, "Cortex-A8 Architecture," Texas Instruments Embedded Processors Wiki. [Online; cited 16 July 2010.] http://processors.wiki.ti.com/index.php/Cortex-A8_Architecture
[3] ARM Ltd., Cortex-A8 Technical Reference Manual. [Online; cited 16 July 2010.] http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/DDI0344K_cortex_a8_r3p2_trm.pdf
[4] S. Furber, ARM System-on-Chip Architecture, 2nd ed. Boston: Addison-Wesley, 2000.
[5] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. Elsevier, 2007.
[6] R. Moussali, N. Ghanem and M. A. R. Saghir, "Supporting multithreading in configurable soft processor cores," in CASES '07: Proceedings of the 2007 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, 2007.
[7] J. E. Thornton, "Parallel Operation in the Control Data 6600," in AFIPS Conference Proceedings FJCC, part 2, vol. 16, 1964.
[8] B. J. Smith, "A Pipelined, Shared Resource MIMD Computer," in Proc. of the 1978 Intl. Conf. on Parallel Processing, pp. 6-8, Aug. 1978.
[9] G. Stefan, A. Paun, A. Birnbaum and V. Bistriceanu, "DIALISP - a LISP Machine," in Proceedings of the ACM Symposium on LISP and Functional Programming, Austin, Texas, Aug. 1984, pp. 123-128.
[10] Ars Technica, "Small wonder: inside Intel's Silverthorne ultramobile CPU." [Online; cited 20 July 2010.] http://arstechnica.com/gadgets/news/2008/02/small-wonder-insideintels-silverthorne-ultramobile-cpu.ars
[11] Oracle, OpenSPARC T1. [Online; cited 20 July 2010.] http://www.opensparc.net/opensparc-t1/index.html
[12] CNET News, "ARMed for the living room." [Online; cited 20 July 2010.] http://news.cnet.com/ARMed-for-the-living-room/2100-1006_36056729.html
[13] K. R. Hirst, J. W. Haskins, Jr. and K. Skadron, "dMT: Inexpensive Throughput Enhancement in Small-Scale Embedded Microprocessors with Differential Multithreading," IEE Proceedings on Computers and Digital Techniques, vol. 151, no. 1, pp. 43-50, Jan. 2004.