A Self-Adaptive Extensible Embedded Processor

Lars Bauer, Muhammad Shafique, Dirk Teufel and Jörg Henkel
University of Karlsruhe, Chair for Embedded Systems, Karlsruhe, Germany
{lars.bauer, shafique, henkel}@informatik.uni-karlsruhe.de

Abstract
Extensible embedded processors allow the designer to adapt the instruction set to a certain application profile. This adaptation is done either at design time or at run time. In the latter case, it is fixed in advance when which part of the instruction set is used, and the processor is then configured according to a predefined schedule. Our approach goes a step further: our extensible processor is self-adaptive. That means that during run time the processor analyzes the usage of Special Instructions and self-adapts when and how these are used and configured. We show that this kind of self-adaptation leads to high efficiency (e.g. performance per chip area, etc.) and is superior to state-of-the-art extensible processors. In this paper we present the main techniques of our novel self-adaptive approach. We evaluate it by means of an H.264 Video Encoder.
1. Introduction and Related Work
The term ASIP (Application Specific Instruction Set Processor) was introduced in the early 1990s and denoted processors that are application specific, i.e. whose instruction set is efficient in terms of performance per chip area, etc. Since the early 2000s, the term ASIP has taken on a far broader meaning. Major vendors like Tensilica [4], ARC [1], CoWare [2], etc. offer tool suites and processor cores, and the user can now create a very specific instruction set that is tailor-made for a certain application. Typically, these tool suites come with a whole set of retargetable tools, such that code can easily be generated for the specific extensible processor. As a result, extensible processors are more efficient than the first generation of ASIPs. However, the designer still decides at design time which part of the instruction set is invoked at which time. Reconfigurable computing addresses this problem by means of run-time reconfiguration, and indeed there are approaches that aim to adapt the instruction set during run time in order to overcome this limitation [3], [12], [14]. However, the designer still has to decide when reconfiguration takes place so that the processor can run more efficiently. The designer also decides at design time the granularity of a Special Instruction (SI) and offers exactly one implementation per SI, which precludes dynamic adaptation at run time. Investigating large real-world applications, we have found that it is hard or even impossible to exactly predict the requirements (performance, etc.) of an embedded processor at design time. As a result, an extensible processor may run inefficiently for large parts of an application. In order to overcome these shortcomings, we have designed a self-adaptive extensible processor that analyzes the SI usage during run time and self-adapts when and how a certain SI is used and configured. Our novel contributions are as follows:
• A self-adapting scheme that decides when and which Special Instructions to deploy at certain run-time points, depending upon the changing application requirements.
• A run-time monitoring technique that checks the current state of the system, collects the relevant data, and forwards it to the self-adapting scheme.

Besides the previously mentioned commercial tool suites and processor cores, there are also academic approaches like LISA [6] and many more. A general overview of the benefits and challenges of ASIPs is given in [7]. A major research focus has been the automatic generation of Special Instructions (SIs) for application speedup. A library of reusable functions is used in [8], whereas in [9], [10] the authors describe methods to generate SIs by matching operation patterns. Overviews of reconfigurable computing can be found in [11]. The Molen processor couples a reconfigurable processor to a core processor via a dual-port register file and an arbiter for shared memory [12]. The run-time reconfiguration is explicitly predetermined by additional instructions in the application. An overview of reconfigurable computing with a more fine-grained connection to the CPU is given in [13]. The OneChip98 project [14] uses a reconfigurable functional unit coupled to the host processor and obtains speedup mainly for memory-streaming applications.
2. Motivational Case Study
Our extensible embedded processor provides the foundation for time-multiplexed utilization of the available hardware resources with self-adaptation to the dynamic needs of an application. We explicate our concept and justify its advantages by means of an H.264 Video Encoder [5] case study. Within the H.264 Video Encoder, every frame is segmented into Macro Blocks (MBs, i.e. 16x16 pixels). Each MB is either predicted from the neighboring MBs in the same frame (I-MB) or from an MB in the previous frame (P-MB). The execution paths for encoding I- and P-MBs differ significantly and are implemented with different sets of Special Instructions (SIs). The ratio of I- to P-MBs in a frame cannot be predicted statically. In high-motion scenes, the Motion Estimator normally fails to provide a good match, i.e. an MB with a small pixel-wise Sum of Absolute Differences (SAD). Then the resulting residue (pixel difference of the current and the predicted MB) is too high, which for a given bitrate deteriorates the encoded quality. In this case the Rate Controller decides to insert an I-MB if it gives a smaller residue. There can be two types of I-MB injections, Random and Contiguous, depending upon the object size. A simple decision rule looks like the following:

    if (SAD_P - SAD_I > Threshold) then MB_Type := INTRA;
    else MB_Type := INTER;
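For concreteness, the following C sketch shows the SAD computation and the decision rule above (the function names, the enum, and the threshold handling are our illustration; the encoder's actual interfaces may differ):

    #include <stdlib.h>

    #define MB 16  /* Macro Block edge length in pixels */

    /* Pixel-wise Sum of Absolute Differences (SAD) between the current MB
     * and a predicted MB (16x16 arrays of 8-bit luma samples). */
    static unsigned sad_16x16(const unsigned char cur[MB][MB],
                              const unsigned char pred[MB][MB])
    {
        unsigned sad = 0;
        for (int y = 0; y < MB; y++)
            for (int x = 0; x < MB; x++)
                sad += (unsigned)abs((int)cur[y][x] - (int)pred[y][x]);
        return sad;
    }

    enum mb_type { MB_INTER, MB_INTRA };

    /* Rate Controller decision from the text; the threshold is an encoder
     * tuning parameter that the paper does not specify. */
    static enum mb_type decide_mb_type(unsigned sad_p, unsigned sad_i,
                                       int threshold)
    {
        return ((int)sad_p - (int)sad_i > threshold) ? MB_INTRA : MB_INTER;
    }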
We have carried out a study of various video sequences with different types of motion content. Figure 1 shows a typical distribution of a smooth-to-hectic motion scene with a varying number of I-MBs per frame. It is apparent from the figure that a sudden variation in the motion content of a video scene radically changes the ratio of I- to P-MBs.
Figure 1: Distribution of I- and P-MBs (I-MBs in frame [%] over the frame number; CIF video scene, 352x288: 396 MBs)
Analysis: Extensible processors and static reconfiguration schemes either provide tailor-made hardware designed for a fixed average ratio of I- to P-MBs, or provide full hardware for both. The first case suffers from performance degradation for differing ratios, while the second case requires more silicon area (which is then inefficiently utilized). We solve this problem with our novel self-adaptive extensible embedded processor platform (described in the subsequent sections).
3. Fine-grained Instruction composition
We noticed that nearly all architectures discussed in the related work use exactly one implementation for each Special Instruction (SI). This severely limits the potential for self-adaptation. Our architecture is instead based on offering the SIs in a much more fine-grained manner. Instead of implementing full SIs independently, we implement data paths as our elementary processing units in reconfigurable hardware (therefore called Atoms) and then combine these Atoms to create an implementation (called Molecule) of an SI. This allows us to reuse the Atoms for different SIs and to increase the efficiency compared to traditional approaches. We additionally observed that an SI often needs certain Atoms multiple times. We exploit this property by making these SIs available as different Molecules that vary in their level of parallelism (temporal vs. spatial computation). The system can then automatically adapt the SI performance at run time. As a special Molecule we offer an execution with Base Instructions (BIs, i.e. instructions of the base processor) that does not consume any dynamic hardware but is therefore typically the slowest kind of execution for a certain SI. Figure 2 shows an example of 3 SIs with their implementing Molecules and their composition out of the Atoms; a data-structure sketch of this composition follows after the figure.
Figure 2: Exemplary SI composition
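To make this composition concrete, the following C sketch models the Atom/Molecule/SI hierarchy and a simple Molecule choice (all type and field names are our assumptions; the real Selection described in Section 4.2 is driven by the adapted forecasts):

    #include <stddef.h>

    /* Hypothetical data model; the paper defines the concepts, not these types. */

    typedef struct {
        const char *name;          /* e.g. "BytePack", "PointFilter", "Clip" */
    } Atom;                        /* elementary data path in reconfig. HW   */

    typedef struct {               /* one implementation variant of an SI    */
        int num_atoms;             /* Atoms required; 0 => execution with    */
                                   /* Base Instructions (BIs) only           */
        int latency;               /* cycles; fewer Atoms (temporal reuse)   */
                                   /* typically means higher latency         */
    } Molecule;

    typedef struct {               /* a Special Instruction (SI)             */
        const char     *name;
        const Molecule *molecules; /* available variants, incl. the BI one   */
        size_t          num_molecules;
    } SpecialInstruction;

    /* Pick the fastest Molecule whose Atom demand fits the number of free
     * Atom Containers; the BI Molecule (num_atoms == 0) always fits. */
    static const Molecule *select_molecule(const SpecialInstruction *si,
                                           int free_acs)
    {
        const Molecule *best = NULL;
        for (size_t i = 0; i < si->num_molecules; i++) {
            const Molecule *m = &si->molecules[i];
            if (m->num_atoms <= free_acs &&
                (best == NULL || m->latency < best->latency))
                best = m;
        }
        return best;
    }

Modeling the special BI Molecule with num_atoms == 0 ensures that some Molecule is always selectable, even when no Atom Containers are free.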
4. Self-adaptive Processor
As stated, our processor self-adapts during run time. We aim to make only as many decisions as necessary at run time and as many as possible at design time, in order to reduce the run-time overhead.
4.1. Preparation at compile time
As the Atom re-loading takes nearly 1 ms [15], the system needs information about upcoming SIs in order to avoid penalties (i.e. waiting). Therefore, we perform a compile-time analysis of the application control-flow graph, where each node is a basic block (BB). Each BB contains profiling information about its execution time, its execution frequency, the invocation probability of its successors, and the SIs that are needed within this BB. From this information, a tool automatically determines the BBs where future SI usages are forecast. For the selected BBs, Forecast Instructions (FCs) are automatically embedded into the application, containing information about the SIs (i.e. when, how often, and with which probability they will be executed) [15].
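As an illustration, the following C sketch shows the kind of information such an embedded FC could carry (a hypothetical layout; the actual instruction encoding is described in [15]):

    /* Hypothetical layout of the information carried by one Forecast
     * Instruction (FC); the actual encoding is defined in [15]. */
    typedef struct {
        unsigned si_id;       /* which Special Instruction is forecast     */
        unsigned exp_count;   /* expected number of executions until the   */
                              /* next Forecast Block                       */
        float    probability; /* probability that the SI executes at all   */
    } ForecastInstruction;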
4.2. Run-time architectural overview

A functional overview of the run-time system is given in Figure 3. We extend a typical CPU pipeline by Atom Containers (ACs, i.e. containers that can be dynamically re-loaded with different Atoms) and the Rotation Manager, which controls the run-time behavior, i.e. the rotation (adaptation) of the SI implementations. The ACs are tightly connected to the execution data path of the pipeline. The main tasks of the Rotation Manager are: i) controlling the execution of SIs, ii) observing and adapting to changing situations, and iii) determining the rotation decisions.
Figure 3: Overview of the run-time system
The pipeline executes the Base Instructions (BIs), provides the parameters for SIs, and writes their computed results to the register file. The Rotation Manager recognizes SIs and FCs by decoding the instructions. When an SI or FC is noticed, this information is forwarded to the Execution Control and the Monitoring. The Execution Control takes care of the actual execution of the SIs (either with accelerating hardware, when all necessary Atoms are available, or with base instructions, triggered by a synchronous trap). The Monitoring records the SI usages and the Forecasting updates the FCs to adapt them to changing run-time situations, as we will see in Section 4.3. The Selection uses the adapted FCs to choose a set of Molecules to implement the forecasted SIs. The Scheduling and Replacing finally determine the AC re-loadings to offer the Atoms that are needed to implement the selected Molecules.
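A minimal, runnable C sketch of the dispatch decision taken by the Execution Control (the hooks are stubs standing in for the real pipeline interfaces):

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical hooks; stubs stand in for the real pipeline interfaces. */
    static bool atoms_available(unsigned si_id) { (void)si_id; return false; }
    static void execute_on_molecule(unsigned si_id) { printf("SI %u in HW\n", si_id); }
    static void trap_to_base_instructions(unsigned si_id) { printf("SI %u emulated\n", si_id); }

    /* Execution Control: dispatch one decoded Special Instruction. */
    static void execute_si(unsigned si_id)
    {
        if (atoms_available(si_id))
            execute_on_molecule(si_id);        /* all required Atoms loaded     */
        else
            trap_to_base_instructions(si_id);  /* synchronous trap: BI fallback */
    }

    int main(void)
    {
        execute_si(0); /* with the stub above, this takes the trap path */
        return 0;
    }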
4.3. Observation and Self-Adaptation
For self-adaptation the system needs to observe its changing surroundings. Specifically, knowledge of future SI usages is needed to start reconfiguration just in time in order to avoid delays. At compile time, FCs are embedded into the application. We call a set of subsequent FCs a Forecast Block (FB). Figure 4 shows a sequence of FBs B_i that are executed sequentially. The Monitoring inside the Rotation Manager considers the time between the last executed FB B_t and the next executed FB B_{t+1}, as indicated by the sliding window. The actual usages U_{t+1} of the SIs are counted between two FBs and compared against the forecast from B_t. The difference between FC(B_t) and U_{t+1} + γ·FC(B_{t+1}) denotes the forecast error E_t:

    E_t = U_{t+1} + γ · FC(B_{t+1}) − FC(B_t)        (1)

The parameter γ adjusts how strongly the forecast of B_{t+1} contributes to computing the error E_t, which is then used to update FC(B_t). The strength of this back propagation is adjusted with the parameter α:

    FC(B_t) := FC(B_t) + α · E_t                     (2)

The parameter γ in (1) makes sure that the usage of an SI is not back propagated too far before its actual usage, whereas the parameter α in (2) makes sure that no thrashing between two extreme values can occur. The correction term α·E_t may be back propagated not only to the directly preceding FB B_t, but potentially to all previously executed FBs. The strength of the back propagation to FBs farther away is diminished by the parameter λ, as shown in Figure 4. Techniques that use differences between distinct points in time are called Temporal-Difference (TD) methods [16]. The TD method is based on the Markov property, i.e. the conditional probability to reach a specific FB and to count a specific number of SIs (between the last and the reached FB) depends only on the previous FB and not on the chain of preceding FBs. This Markov property cannot be guaranteed in real-life problems, but experiments suggest that the TD scheme nevertheless achieves good results in practice [17]. This is because the Markov property is mainly needed to derive the TD scheme and to prove its convergence, which is not relevant for our approach: the behavior of executed applications varies, and thus we are not interested in convergence but in adaptation [18].
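The following C sketch illustrates the update scheme for the special case λ=0, which only updates the directly preceding FB (the data layout and NUM_SIS are our assumptions):

    #include <stddef.h>

    #define NUM_SIS 3  /* assumption: one forecast value per Special Instruction */

    typedef struct {
        float fc[NUM_SIS];  /* forecast values FC(B) of one Forecast Block */
    } ForecastBlock;

    /* TD update for lambda = 0 (equations (1) and (2)). Called when FB
     * b_next is reached; u[] holds the SI usages counted since b_prev. */
    static void td_update(ForecastBlock *b_prev, const ForecastBlock *b_next,
                          const unsigned u[NUM_SIS], float alpha, float gamma)
    {
        for (size_t i = 0; i < NUM_SIS; i++) {
            /* (1)  E_t = U_{t+1} + gamma * FC(B_{t+1}) - FC(B_t) */
            float e_t = (float)u[i] + gamma * b_next->fc[i] - b_prev->fc[i];
            /* (2)  FC(B_t) := FC(B_t) + alpha * E_t              */
            b_prev->fc[i] += alpha * e_t;
        }
    }

For λ>0, the error E_t would additionally be propagated to earlier FBs with geometrically decreasing weights λ, λ², etc., which is the hardware overhead evaluated next.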
Figure 4: Temporal-Difference (TD) method
Figure 5: TD parameter evaluation for λ=0 (execution time [million cycles] over α and γ)
For a hardware implementation, λ>0 accounts for a significant overhead due to its ability to update multiple previous FBs, whereas the special case λ=0 only updates the directly preceding FB. We therefore evaluated different values of λ to appraise the impact of restricting λ to 0. For λ=0.6 we noticed an average execution-time improvement of 4.8%, and for λ=1 (i.e. updating all previous FBs) an average performance improvement of 8.1% compared to λ=0. The peak performance improvement (44.3%) as well as the peak performance degradation (-8.6%) were found for λ=1. As an average performance improvement of 8.1% does not justify the hardware overhead of tracking potentially all previously executed FBs in their execution sequence, all further benchmarks are obtained with λ=0. As seen in Figure 5, the execution time is dominated by the parameter α for λ=0. The fastest execution is found for α=1, i.e. a rapid self-adaptation to varying situations. It is on average (over the γ values) 24.6% faster than α=0, i.e. no self-adaptation.
4.4. Self-x Properties of our Architecture

At compile time, certain system parameters cannot be determined. For example, the number of Atom Containers (ACs) that the Operating System (OS) assigns to the application is unknown, as it depends on task priorities and the number of available ACs. Therefore, the system has to be able to configure itself at start-up. We call this property Self-Configuration: the application starts with the initially assigned ACs, constraints, and information about which SIs will most probably be needed at the beginning of the application (i.e. before Self-Monitoring has gathered sufficient information to adapt the static profiling information). According to this information, the system decides which Molecules to choose for implementing the requested SIs and rotates the instruction set accordingly.

Self-Monitoring is key to Self-Adaptation, as it provides feedback to the system which is then used for updating the predictions of the future. Self-Adaptation is implemented at two different levels: at OS level, parameters for selecting Molecules are modified, and at control-flow level, the forecasts for the SI usages are continuously adapted. OS-level Self-Adaptation is achieved by an Observer-Controller loop, where the monitor corresponds to sensors while the controller is part of the OS. When the observer reports non-optimal system behavior or missed constraints, the OS modifies the parameters for choosing Molecules. If the observer, for example, reports unacceptable power consumption, then the parameter for the performance-per-mW trade-off is adjusted to reduce the number of Atom re-loadings, etc. If one of the running tasks consistently misses its timing constraints, the controller can increase the number of ACs that are dedicated to this task. Control-flow-level Self-Adaptation supports the parameter changes made by the OS-level Self-Adaptation: if, for instance, the trade-off between power consumption and performance changes, or the number of available ACs varies, then the system needs good knowledge of the future SI usages to adapt its rotation policy to these guidelines from the OS level.

Finally, Self-Repairing is a natural feature of our system: as our architecture can adapt to a varying number of ACs, it can easily handle a detected failure in an AC by simply avoiding its future usage and adapting the contents of the functional ACs to optimally support the requested SIs. The failure detection can be done with conventional approaches, e.g. applying a test vector to an AC and inspecting the computed result. As every SI can be executed with the base instructions, the Self-Repairing reaction is fast and inherent to the system.
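To illustrate the OS-level Observer-Controller loop, a schematic C sketch follows (the report fields, parameters, and adjustment policy are purely illustrative assumptions, not the actual OS interface):

    /* Hypothetical monitor report and per-task knobs. */
    typedef struct {
        float power_mw;        /* observed power consumption                */
        int   deadline_misses; /* timing-constraint misses since last check */
    } MonitorReport;

    typedef struct {
        float perf_per_mw_weight; /* trade-off used when selecting Molecules */
        int   assigned_acs;       /* Atom Containers dedicated to this task  */
    } TaskParams;

    /* One iteration of the OS-level Observer-Controller loop. */
    static void os_controller_step(const MonitorReport *r, TaskParams *p,
                                   float power_budget_mw, int free_acs)
    {
        if (r->power_mw > power_budget_mw)
            p->perf_per_mw_weight *= 0.5f; /* favor fewer Atom re-loadings */
        if (r->deadline_misses > 0 && free_acs > 0)
            p->assigned_acs += 1;          /* grant the task another AC    */
    }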
5. Evaluation and Results
Figure 6: Block diagram for Motion Compensation, composed of different Atoms
Figure 7: Performance vs. area trade-offs (SI latency [cycles] over hardware resources [Atom Containers]) for the different Molecules implementing the SIs
In Section 2 we have introduced and analyzed the scenario of encoding I- and P-MBs. To evaluate this scenario, we have designed and implemented 3 Special Instructions (SIs) to execute the I- and P-MB computation paths in dynamic hardware. The SI for the Motion Compensation (MC) of P-MBs in Figure 6 consists of three different Atom types: the BytePack Atom performs data alignment, the PointFilter Atom is an FIR operation, and the Clip Atom limits the output to a certain min/max value. The MC SI may be implemented by different Molecules with different levels of parallelism, ranging from temporal to spatial computation.
The trade-offs between these extremes are shown in Figure 7 for all three SIs, with the Pareto-optimal solutions that are determined at compile time (during the composition of the Molecules). We have evaluated our architecture with a video sequence with the analyzed I- and P-MB ratios from Figure 1 and the three implemented SIs for encoding the MBs. We have simulated the total execution time of the SIs in this scene for varying numbers of Atom Containers (ACs) and compared the results of our self-adapting system with the behavior of a reference system with static reconfiguration decisions. Our forecasts are updated after each whole frame. With only 3 ACs, our architecture already gains a performance improvement of 12.6% (compared to the static reconfiguration decisions), and on average we observed a 5.6% performance improvement. It is noteworthy that the benchmarks for the static reconfiguration decision had to be modified every time we increased the number of ACs, to explicitly force the static system to use the newly available hardware resources (otherwise the static system would not have used the additional ACs at all). Our self-configuring system instead automatically utilized the available ACs. Additionally, the static system was highly optimized for the used video sequence; for a video sequence with a different I- to P-MB ratio it would have performed worse. Our self-optimizing system instead was always simulated with the same binary and automatically decided which Molecules to choose in which situation. Comparing our dynamic system with the corresponding traditional embedded processor without hardware accelerators, we achieve a speedup from 10.5x (5 ACs) up to 17.5x (10 ACs).
Figure 8: Dynamically updated forecast values over the frame number (static vs. dynamically adapted forecasts for MC Hz 4 (P-MB) and IPred VDC & HDC 16x16 (I-MB))
To demonstrate how the updating scheme for the forecasts from Section 4.3 performs (as it is the basis for our dynamic rotation decisions), we have evaluated the dynamically updated forecast values for I- and P-MBs, as shown in Figure 8. Comparing these forecasts with the actually executed I- to P-MB ratio from Figure 1 (which is a priori unknown to the system), we see that the forecasts match the actual distribution of I- and P-MBs for the simulated frames. The major difference is that the forecasts are smoothed compared to the original ratio, which is due to the fact that we do not back propagate the full error (parameter α in Section 4.3). The static system does not update the forecasts but keeps them fixed, as indicated in Figure 8. We have also analyzed the run-time behavior of our self-adaptive system, i.e. which Molecule is chosen at which time. In Figure 9 the latencies of the chosen Molecules for 7 available ACs are plotted. For instance, in the hectic scene, IPred HDC is upgraded in two steps from the Molecule with 2 Atoms to the Molecule with 4 Atoms, at the cost of downgrading MC to the Molecule with 3 Atoms. After the hectic scene ends, the system rotates back to its original configuration, which then fits the situation best. A noteworthy situation can be seen in the smooth scene in Figure 9, where the benefit of executing MC by rotating from the 5-Atom Molecule to the 7-Atom Molecule overcomes the penalty of executing the I-MBs with base instructions.
Figure 9: Dynamically chosen SI latencies [cycles] at different frames of the video sequence (smooth vs. hectic scene; IPred VDC 16x16, IPred HDC 16x16, MC Hz 4)
6. Conclusion

We have presented a self-adapting extensible embedded processor. It goes beyond the capabilities of state-of-the-art embedded processor approaches, which fail when an application's behavior cannot be predicted at design time. As a result of its self-adaptation, our approach is more efficient. We have evaluated its functionality using an ITU-T H.264 Video Encoder and an FPGA platform. Our future work will aim at minimizing the additional overhead in order to further increase efficiency.
7. References
[1] ARCtangent processor, ARC International: www.arc.com/configurablecores/
[2] CoWare Inc., LISATek: www.coware.com
[3] Stretch processor: www.stretchinc.com
[4] Xtensa processor, Tensilica Inc.: www.tensilica.com
[5] ITU-T Rec. H.264 and ISO/IEC 14496-10 (E), "Advanced video coding for generic audiovisual services", 2005
[6] A. Hoffmann et al., "A novel methodology for the design of application-specific instruction-set processors (ASIPs) using a machine description language", IEEE Trans. on CAD of Integrated Circuits and Systems, 2001
[7] J. Henkel, "Closing the SoC Design Gap", IEEE Computer, Vol. 36, Issue 9, September 2003
[8] N. Cheung, J. Henkel, S. Parameswaran, "Rapid Configuration & Instruction Selection for an ASIP: A Case Study", DATE 2003
[9] K. Atasu, L. Pozzi, P. Ienne, "Automatic Application-Specific Instruction-Set Extensions Under Microarchitectural Constraints", DAC 2003
[10] F. Sun, A. Raghunathan, S. Ravi, N. K. Jha, "A scalable application specific processor synthesis methodology", ICCAD 2003
[11] K. Compton, S. Hauck, "Reconfigurable computing: a survey of systems and software", ACM Computing Surveys, 2002
[12] S. Vassiliadis et al., "The MOLEN polymorphic processor", IEEE Transactions on Computers, Issue 11, 2004
[13] F. Barat, R. Lauwereins, "Reconfigurable Instruction Set Processors: A Survey", RSP 2000
[14] J. E. Carrillo, P. Chow, "The Effect of Reconfigurable Units in Superscalar Processors", FPGA 2001
[15] L. Bauer, M. Shafique, S. Kramer, J. Henkel, "RISPP: Rotating Instruction Set Processing Platform", DAC 2007
[16] R. S. Sutton, "Learning to Predict by the Methods of Temporal Differences", Machine Learning 3, Springer
[17] R. S. Sutton, A. G. Barto, "Reinforcement Learning: An Introduction", MIT Press, Cambridge, MA, 1998
[18] T. Sherwood, S. Sair, B. Calder, "Phase Tracking and Prediction", Int'l Symp. on Computer Architecture, 2003