Techniques for Software Thread Integration in Real-Time Embedded Systems
Alexander G. Dean and John Paul Shen
Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213
[email protected] [email protected]
Abstract
This paper presents how to perform thread integration to provide low-cost concurrency for general-purpose microcontrollers and microprocessors. A post-pass compiler interleaves multiple threads of control at the machine instruction level for concurrent execution on a uniprocessor and provides very fine-grain multithreading without context switching overhead. Such efficient concurrency allows implementation of real-time functions in software rather than dedicated peripheral hardware. We investigate a set of code transformations which allow systematic integration of a real-time guest thread into a host thread, producing an integrated thread which meets all real-time requirements. The thread integration concept and the associated code transformations have been applied to example functions chosen from three application domains to evaluate the method’s feasibility.
1.0 Introduction
Embedded system designers must work within a design space tightly bounded by system requirements such as cost, size, weight, power and development time. They must trade off design costs with recurring costs, both of which vary over time for each application. Any real-time requirements place additional pressure on these constraints by adding hardware or software components dedicated to timely operation. Hardware solutions increase system size, cost, weight and power. Software solutions complicate timing analysis and verification, and are limited in performance by context switch and interrupt service routine response times. However, CPU chip technology and architecture improvements continue to lower the cost of CPU cycles needed to execute software by about 50% every 18 months. This paper demonstrates how thread integration (first introduced in [4]) can simplify embedded systems by eliminating hardware dedicated to real-time functions and
transferring the work to software. Thread integration is a compiler technology which can automatically interleave multiple threads of computation at the machine instruction level and provide implicit multithreading on a generic uniprocessor. This interleaving involves scheduling at the instruction level and eliminates context switch overhead to allow efficient operation on CPUs ranging from 8 to 64 bits. Three factors make thread integration difficult. First, the control flow behavior of the two threads may be very different and difficult to reconcile. We use a program representation which simplifies the transformations needed to reconcile differing control flow structures. Second, manually integrating assembly code is tedious and error prone. A high-level language is not an appropriate source code representation for real-time thread integration due to the coarse time resolution of its statements. Assembly code must be used as it provides cycle-accurate instruction placement and therefore allows precise scheduling of real-time events. The compiler we use operates on assembly code and performs data flow analysis and register renaming, removing this burden from the programmer. Finally, the behavior of the host thread over time must be known a priori, but may be impossible to predict, as it is equivalent to the halting problem. Embedded system software tends to be deterministic and straightforward to characterize, as it typically has bounded loop iterations and no recursion. This paper demonstrates how to integrate threads and shows the benefits and costs for three embedded applications. The ultimate goal of our research is the creation of theory and tools for automatically merging a real-time function thread with a host thread, so that the integrated thread requires no extra task-switching once it has begun, allowing nearly all processor cycles to be dedicated to useful work and dramatically increasing the maximum frequency and temporal accuracy of real-time events.
The guest thread can emulate a hardware function, allowing elimination of special circuits added to satisfy real-time requirements. This reduces system cost, weight and size, as well as providing device selection flexibility by reducing
Copyright 1998 IEEE. Published in the Proceedings of RTSS’98, 1 December 1998 in Madrid, Spain. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 732-562-3966
hardware requirements. This technology is applicable to a broad range of embedded real-time control applications using low- to high-end microprocessors. Previous work on task fusion by compiler is limited. [8] presents a technique called interleaving for compile-time task scheduling which identifies a task’s idle times and schedules other tasks at those times. It is a coarse grain approach and incurs context-switching penalties for each task switch. [19] extends this work to provide non-intrusive real-time task monitoring. However, as the fragment size decreases, the performance penalty exacted by context switching increases. We have found no work other than ours [4] which eliminates this penalty by merging two threads at the assembly instruction level into a single integrated thread. Previous work in code motion with a PDG ([1], [6], [7], [12], [14], [19]) focuses upon reducing program run time.
2.0 Research Methodology
We examine real-time functions from three hypothetical embedded systems to illustrate the benefits of thread integration. In each system, a software function implements a real-time function using busy-waiting code for event scheduling. Thread integration replaces this padding code with instructions from the host function to improve run-time efficiency. The resulting code is then analyzed to determine cycle count and code size. Simulation verifies the correct functioning of the integrated threads.
2.1 Sample Applications
The first application is a small portable computing device such as a hand-held PC (HPC) or portable video game with a high resolution graphic liquid crystal display (LCD). Current HPCs use CPUs with performance of nearly 100 MIPS [16], and future devices will grow faster. Thread integration can use part of this growing CPU capacity to refresh the LCD, eliminating the need for a dedicated LCD controller and its local frame buffer memory if any exists. Thread integration has promise for these markets as it eliminates hardware, cutting system size, cost, weight and time-to-market.
The second application, a digital cellphone integrated into a car, features GSM 06.10 speech compression [5] and uses the CAN 2.0A [2] protocol for communication with other devices within the vehicle. CAN is popular in automotive applications for multiplex communications, reducing the wiring harness size, cost, weight and failure rate while simplifying assembly. Providing a common databus in a vehicle encourages incorporation of features such as hands-free cellphone, navigation aids, entertainment and travel information as well as vehicle customization, optimization, and diagnostics. Thread integration is especially appropriate for the automotive market because of its price sensitivity and size constraints.
The third application, a stand-alone digital cellphone first examined in [4], offers GSM 06.10 speech compression and an I2C (inter-IC) network [15] for communication with a smart battery. I2C is popular for embedded systems due to its standardized interface, simplification of hardware, software and printed circuit board design, small device packages, and the wide variety of ICs available. Thread integration can support this application by reducing system hardware cost and increasing software efficiency.
2.2 The Pedigree Compilation Tool
The Pedigree tool suite [12] [13] is being extended to support thread integration. Pedigree is a retargetable, post-pass, program dependence graph-based code transformation and evaluation environment. The Pedigree compiler accepts assembly language programs and uses interprocedural information, profiling information, machine and instruction set architecture descriptions, and scheduling hints to produce optimized code and automatically parallelized optimized code, and to provide program visualization information and statistics. Pedigree also contains a functional timing simulator which can be used to evaluate performance. The Pedigree compiler uses the program dependence graph (PDG) [6], a representation which explicitly identifies control and data dependences. The program’s control structure is represented hierarchically, with summary dataflow information added to describe subgraphs. This hierarchical representation allows efficient code motion and transformations. The Pedigree tool suite is post-pass; it accepts compiled assembly code as input. It is currently targeted for DEC’s Alpha instruction set architecture (ISA), which is a clean 64-bit load-store RISC architecture. The Alpha ISA was chosen for Pedigree for its simplicity and straightforward simulation. Although Alpha processors are not representative of most embedded applications, the Alpha ISA can serve as an expedient research vehicle for illustrating the concept of thread integration for real-time embedded systems. In the future we plan to retarget Pedigree to an ISA more representative of the broad range of embedded controllers (e.g. MIPS, Atmel AVR, Microchip PIC).
3.0 Integration of Real-time Threads
The goal of thread integration (i.e. integrating a guest thread into an existing host thread) is to produce a program with integrated real-time guest thread events which execute at the proper time. In existing systems, this event scheduling can be implemented through busy waiting (the CPU polls inputs to detect event occurrence and executes nops to pass time), or through interrupt-driven context switching (a timer or other signal source triggers execution of the real-time code). Thread integration replaces the filler instructions of the busy-wait version with instructions from the host function, enabling more efficient program execution. Because the integrated thread has real-time constraints, it is assumed to be uninterruptable. This lack of preemption increases the CPU’s maximum interrupt response latency by the worst-case duration of the integrated function. Any other real-time functions requiring a response faster than this must be implemented in hardware. Future work will loosen this constraint. Host and guest function selection are discussed further in [4].
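The difference between the busy-wait scheme and integration is only in what fills the padding slots. The toy cycle-accounting model below illustrates this; the 8-cycle guest period and its 2/2/4-cycle split are invented for illustration and do not correspond to any example in this paper.

```c
#include <assert.h>

/* Toy timing model: a counter stands in for elapsed CPU cycles.
   Assumed guest: drive one data value per 8-cycle period. */
static long cycle = 0;        /* elapsed cycles */
static int  clock_edges = 0;  /* guest events performed */
static long host_work = 0;    /* cycles of useful host computation */

static void spend(long n) { cycle += n; }

/* Discrete busy-wait guest: 4 of every 8 cycles are nop padding. */
static void guest_busy_wait(int values) {
    for (int v = 0; v < values; v++) {
        spend(2);               /* shift data value onto bus */
        spend(2);               /* assert, then release, the clock */
        clock_edges++;
        spend(4);               /* nops: pure padding to hold the period */
    }
}

/* Integrated guest: the same 4 padding cycles now hold host instructions,
   so guest timing is preserved while host work advances "for free". */
static void guest_integrated(int values) {
    for (int v = 0; v < values; v++) {
        spend(2);
        spend(2);
        clock_edges++;
        spend(4); host_work += 4;   /* host code fills the former nop slots */
    }
}
```

Both versions take the same 8 cycles per value, so the guest's real-time behavior is unchanged; only the integrated version converts padding cycles into host progress.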
3.1 Program Representation and the PDG
This paper demonstrates compiler-assisted thread integration using the PDG representation of a program, which stores program control and data dependence information concisely and hierarchically, enabling efficient code analysis and motion. The example code listing from the accompanying figure (basic blocks B1-B3 of a simple filter; the garbled loop tests are reconstructed, and unrecoverable parts are elided):

B1: FiltSize = 15; BufSize = 1000; k = 0;
    do {
B2:     sum = 0; p = &Sample[k-FiltSize+1]; i = 0;
        do {
B3:         sum += (*p++) * Filt[i++];
        } while (i < FiltSize);
        ...
    } while (...);
FIGURE 4. Summary of loop integration with unrolling. (Diagram: host and guest loops characterized by iteration times Titer(LoopH), Titer(LoopG) and computation times Tcomp(LoopH), Tcomp(LoopG), with Tcomp(LoopG) compared against Titer(LoopH); two strategies, optimizing for time (b) or memory (c): b1 "Unroll Host" and b2 "Guard Guest (effective host unrolling)", each performing integrated iterations and then finishing the remaining guest and host iterations.)
The period of the guard counter is n host loop iterations; the counter ranges from 0 to n-1:

    n = ( Titer(LoopG) - Tcomp(LoopG) ) / Titer(LoopHnew)
The delay Idel(Guesti) is measured in host loop iterations from the beginning of the new host loop’s first iteration to the guest event Guesti, taking into account the execution of any previous guests Guestj in this iteration of the guest loop body:

    Idel(Guesti) = ( Tev(Guesti) - Tsta(LoopHnew) ) / Titer(LoopHnew)
                   - ( Σ[j=0..i-1] Tcomp(Guestj) ) / Titer(LoopHnew)
These delays determine the placement of the guard predicates (and hence guest nodes) within the new host body, as well as the value for which the predicates check. The predicate for Guesti is set to match the proper host loop iteration counter value and is placed in the correct location of the loop body by calling Integrate(). Finally, the last guest added to the host loop body is padded with nops lasting Tpad. This padding takes into account the computation time of each guest and the time to reset the guard counter variable.
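The guard machinery described above can be sketched in C with invented numbers: n = 4 host iterations per guest period and Idel(Guest0) = 1, so the single guest event fires in the iteration where the guard counter equals 1. The real transformation operates on assembly and places the predicate by cycle counting; the summing host body here is arbitrary.

```c
#include <assert.h>

enum { N = 4 };              /* guard counter period, in host iterations */
enum { IDEL_GUEST0 = 1 };    /* Idel(Guest0): fire when guard == 1 */

static int guest_events = 0;
static void guest0(void) { guest_events++; }   /* e.g. toggle an output pin */

/* Host loop body with an integrated, guarded guest event. */
static long host_sum(const int *a, int len) {
    long sum = 0;
    int guard = 0;                 /* guard counter, ranges over 0 .. n-1 */
    for (int i = 0; i < len; i++) {
        sum += a[i];               /* host loop body */
        if (guard == IDEL_GUEST0)  /* guard predicate placed by Integrate() */
            guest0();              /* followed by Tpad of nops in real code */
        if (++guard == N)          /* guard counter reset (CodeGreset) */
            guard = 0;
    }
    return sum;
}
```

Over eight host iterations the guard counter cycles twice through 0..3, so the guest event fires exactly twice, once per guard period.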
    Tpad = Titer(Guest) - n · Titer(LoopHnew) - Σ[∀i] Tcomp(Guesti) - Tcomp(CodeGreset)

The loop control tests for the integrated loop are formed as the logical product of the tests of both host and guest loops. Unrolling a loop requires changing its counter variable target value to match n-1 iterations earlier. Any remaining iterations are performed by the original loops, located after the integrated loop. The guest loop is padded and placed immediately after the integrated loop to allow satisfaction of real-time requirements.

4.0 Experimental Results
Three hypothetical applications have been processed to implement real-time guest functions in software to demonstrate benefits of thread integration. This eliminates dedicated hardware in each application to reduce system cost, size and weight. Table 1 summarizes the cost of implementing each application’s real-time function in software. The discrete, or non-integrated, software implementation (baseline) in each case uses busy waiting to meet guest function timing requirements, while thread integration increases the software implementation efficiency, reducing the run time at the expense of increased code memory. The “Run Time Change” columns show how much time is needed to perform the guest’s instructions compared with a busy-wait guest implementation. An integrated implementation overlaps host and guest execution and so reduces the time difference below 100%. For example, adding software LCD refresh support to the handheld computer using a discrete busy-wait version requires 2984 cycles per display row. Integrating the LCD refresh function with a graphics line drawing routine reduces the refresh cost to 1673 cycles per display row; now 45% of the refresh work is performed during line drawing. The “Code Memory Change” columns show how much code memory is needed to add the guest function. Thread integration increases code size (e.g., by splitting or unrolling loops), resulting in significant code expansion. Note that typically only one or two functions will need to be integrated, so the 2x to 6x memory increase is only incurred for the code of those functions, and hence is minor. For example, the handheld cellphone’s 1148 byte increase is small in comparison with the 60 kilobytes needed for the voice codec functions.

TABLE 1. Costs of implementing guest functions in software, without and with thread integration.

System              | Discrete Run Time | Discrete Code Mem | Integrated Run Time | Integrated Code Mem
Handheld Computer   | +100% of Guest    | +336 bytes        | +55.0% of Guest     | +1796 bytes
Vehicle Cellphone   | +100%             | +498              | +80.9%              | +2460
Handheld Cellphone  | +100%             | +280              | +32.9%              | +1148
4.1 Handheld Computer
The first application is a small portable computing device such as a hand-held PC (HPC) or portable video game with a high resolution graphic liquid crystal display (LCD). Current HPCs use CPUs with performance of nearly 100 MIPS [16], and future devices will grow faster. Thread integration uses part of this growing CPU capacity to refresh the LCD, eliminating the need for a dedicated LCD controller and its local frame buffer memory. A line drawing routine is integrated with the LCD row refresh function, improving system efficiency. Thread integration has promise for these markets as it eliminates hardware, cutting system size, cost, weight and time-to-market.
Figure 5 shows the original system hardware architecture. The CPU communicates with an LCD controller (LCDC) [9], which generates control and data signals for the LCD based upon data stored in the frame buffer.

FIGURE 5. Handheld computer hardware components. (Diagram: hardware LCD refresh, with CPU, ROM, RAM and I/O plus an LCD controller and frame buffer driving the LCD; software LCD refresh, with CPU, ROM, RAM and I/O plus a latch driving the LCD directly.)

A high resolution monochrome LCD (640 by 480 pixels, 1 bit per pixel) displays information and must be refreshed 70 times each second to avoid flickering. Column pixel data is loaded serially into a shift register and then latched every 59.5 µs, driving an entire row simultaneously. The data and control signals are generated by a dedicated LCD controller which requires its own memory or else arbitrated access to the CPU’s memory. Using this dedicated hardware solution increases chip count, size and weight, which are typically at a premium in this type of device. Some microcontroller makers address this problem by integrating the LCD controller with the microcontroller. The main disadvantages of such hardware integration are that it limits the designer’s options in selecting a microcontroller and may increase device cost.
It is possible to generate the LCD control signals in software. A periodic interrupt every 59.5 µs calls a function which shifts a row of data out and clocks the shift registers. The primary bottleneck of this scheme is the low maximum clock speed for common LCD driver shift registers, which ranges from 4 to 12 MHz. As seen in Figure 6, this bottleneck forces a 100 MIPS CPU to spend nearly half of its time during LCD refresh waiting for the shift register, in the form of one and two nop busy waits. Every 59.5 µs, the CPU spends 1280 of its 2984 LCD refresh cycles as busy waits. As a result, the 100 MIPS CPU has 50 MIPS remaining for applications, with 29 MIPS used for the display refresh and 21 MIPS used for busy waits.

FIGURE 6. LCD refresh timing and CPU activity. (Diagram: 80 ns (8 cycles) per data value: 2 cycles to shift a data nibble onto the bus, 1 cycle to assert the clock, 1 cycle to release the clock, the remainder overhead; data values 0 through 319 per row; row data timing min 26.7 µs, max 59.5 µs.)

Thread integration enables the CPU to use those wasted cycles to perform useful work. In this example a fast line drawing routine [3] is integrated with the LCD refresh thread to take advantage of the free time. Figure 7 shows remaining CPU capacity as a function of line-drawing activity; refreshing the LCD in software requires 50% of the processor’s time. The integrated software solution uses the idle time in the refresh function to plot up to 40 pixels per display refresh row, freeing up to 21% of the CPU’s capacity for other functions. Table 2 presents information on the discrete and integrated threads.

FIGURE 7. CPU capacity vs. line drawing activity. (Plot: CPU time remaining (0% to 100%) vs. line pixels drawn per refresh row (0 to 200) for three cases: HW LCD with SW draw, integrated LCD and draw, and SW LCD with SW draw.)

TABLE 2. Handheld computer thread statistics.

Thread                                     | Cycle Count | Idle Cycle Count | Size in Bytes
DrawLine (40 pixels long)                  | 1390        | 0                | 616
LCDRow                                     | 2984        | 1280             | 336
Discrete DrawLine (40 pixels) and LCDRow   | 4374        | 1280             | 952
Integrated DrawLine (40 pixels) and LCDRow | 3063        | 0                | 2412

Thread integration allows elimination of the LCD controller and its frame buffer. The cost for this integration is 1580 bytes of program memory, 38,400 bytes of data memory for the new frame buffer and 29 MIPS of processor throughput. Thread integration uses nearly all of the idle cycles in LCDRow to perform useful work, and mitigates about half of the performance impact of implementing LCD refresh in software.
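The LCDRow busy-wait structure can be modeled with a simple cycle counter. Assumptions of this sketch: 320 data values per row (the value indices shown in Figure 6) and the 8-cycle per-value budget; per-row setup and the row latch are omitted, which is why the model's 2560 cycles fall short of the measured 2984. The modeled padding, 320 × 4 = 1280 nop cycles, matches LCDRow's idle cycle count in Table 2; these are the slots thread integration fills with DrawLine instructions.

```c
#include <assert.h>

enum { VALUES_PER_ROW = 320 };   /* data values shifted out per display row */

static long cycles = 0;      /* total cycles spent in the refresh function */
static long nop_cycles = 0;  /* cycles spent busy waiting */

static void nops(int n) { cycles += n; nop_cycles += n; }

/* One row of software LCD refresh, per the Figure 6 budget:
   8 cycles per value = 2 (shift data nibble onto bus) + 1 (assert clock)
   + 1 (release clock) + 4 nops waiting on the slow shift register. */
static void lcd_row_refresh(void) {
    for (int v = 0; v < VALUES_PER_ROW; v++) {
        cycles += 2;   /* shift data nibble onto bus */
        cycles += 1;   /* assert clock */
        cycles += 1;   /* release clock */
        nops(4);       /* busy wait until the shift register can accept more */
    }
    /* per-row setup and the 59.5 µs row latch are omitted from this model */
}
```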
4.2 Vehicle Cellphone with External Network
The two digital cellphone examples use GSM 06.10 lossy speech compression [5] integrated with a communication protocol. These applications have tight cost, size, weight and power constraints yet benefit from protocol inclusion. Thread integration is used to eliminate network interface hardware by performing such functions efficiently in software. Both examples integrate a message transmission function into a GSM function which is called once per 20 ms frame; this introduces a message transmission delay of up to 20 ms, which is acceptable for many applications. Message reception is asynchronous and is not integrated; instead a discrete interrupt service routine is used.
This example features a cellphone embedded into a vehicle. With its CAN interface, the phone can signal and react to events in the vehicle, such as muting the stereo during phone calls and automatically calling emergency service dispatchers upon airbag deployment. As the automotive application domain is very price-sensitive, the CPU speed is chosen to be 33 MHz for a 72% load from speech compression. Figure 8 shows a block diagram of the phone’s digital architecture for the two network implementations.

FIGURE 8. Vehicle cellphone hardware components. (Diagram: hardware CAN protocol, with CPU, ROM, RAM and I/O plus a CAN controller feeding a CAN transceiver on the CAN bus; software CAN protocol, with CPU, ROM, RAM and I/O plus a latch/buffer feeding the CAN transceiver directly.)

CAN 2.0A [2] is a robust multimaster bus designed for real-time communication with short messages (up to eight bytes). Transmitters perform bitwise arbitration on unique 11 bit message identifiers to gain access to the bus. During message transmission, the sending node monitors the bus for errors. At the end of the message frame (up to 131 bits), all nodes on the bus assert an acknowledgment flag. Figure 9 shows the timing of operations within each bit cell for this application when using a 33 MHz CPU and 500 kbps CAN bus. The CAN code requires only 29 cycles of work per 66 cycle bit time, so the CPU utilization of a discrete, busy-wait version is only 44% during message transmission. Integrating the CAN function with a GSM function (Reflection_Coefficients) allows the CPU to reclaim some of these idle cycles. Figure 10 presents the CPU capacity remaining after message transmission and GSM compression.

FIGURE 9. CAN message timing and CPU activity. (Diagram: 2 µs (66 cycles) per bit: 10 cycles to shift the next bit onto the bus and 19 cycles to confirm the bus state, the remainder overhead.)

FIGURE 10. CPU capacity vs. message activity. (Plot: CPU time remaining (97% to 100%) vs. messages per second (0 to 100) for three cases: HW CAN with SW GSM, integrated CAN and GSM, and SW CAN with SW GSM.)

Thread integration replaces 19% of the idle cycles with useful work. As summarized in Table 3, an additional 2062 bytes of program memory are needed for the integrated version as compared with the discrete. Thread integration enables system designers to eliminate a dedicated CAN protocol controller chip, reducing system size, weight and cost.

TABLE 3. Vehicle cellphone thread statistics.

Thread                                     | Cycle Count | Idle Cycle Count | Size in Bytes
Reflection_Coefficients                    | 6344        | 0                | 1264
CAN                                        | 8674        | 4847             | 498
Discrete Reflection_Coefficients and CAN   | 14990       | 4847             | 1762
Integrated Reflection_Coefficients and CAN | 14436       | 0                | 3824
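The CAN bit cell can be modeled the same way. The 66-cycle budget (33 MHz at 500 kbps) and the 10-cycle shift and 19-cycle bus confirmation come from Figure 9; the simulated bus wire below is an assumption standing in for real arbitration and error monitoring. Note that a maximal 131-bit frame gives 131 × 37 = 4847 idle cycles, exactly the CAN idle count in Table 3.

```c
#include <assert.h>

static long cycles = 0;      /* total cycles in the CAN transmit code */
static long idle_cycles = 0; /* busy-wait padding cycles */
static int  bus_wire = 1;    /* simulated CAN bus (recessive = 1) */

/* One bit cell: 66 cycles at 33 MHz and 500 kbps.
   Returns nonzero if the bus carried the bit we sent. */
static int can_send_bit(int bit) {
    cycles += 10;                 /* shift next bit onto bus */
    bus_wire = bit;               /* (lost arbitration would overwrite this) */
    cycles += 19;                 /* confirm bus state: arbitration/error check */
    int ok = (bus_wire == bit);
    cycles += 37;                 /* busy wait out the rest of the bit cell */
    idle_cycles += 37;
    return ok;
}
```

In the integrated version, the 37-cycle wait of each bit cell is occupied by instructions from Reflection_Coefficients instead of nops.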
4.3 Cellphone with Internal Network
This handheld cellphone application was presented in detail in [4] and is revisited here for comparison with the other applications. The cellphone communicates with its smart battery using the I2C protocol, a 100 kbps multimaster bus popular for communication within small embedded devices. The message transmission function (I2C) is integrated with an autocorrelation function (Fast_Autocorrelation). I2C implements a subset of the protocol, being limited to sending one byte messages (called Quick Commands in the SMBus extension to I2C) in a system with only one master and regular speed peripherals. The CPU runs at 66 MHz; voice compression requires 30% of the CPU’s cycles. The remaining capacity might be used for advanced features such as voice recognition, soft modem/fax, image compression/decompression, and encryption.
Figure 11 shows the two hardware architectures which support the hardware and software implementations of I2C. The hardware I2C version contains a dedicated bus controller, while the software version reduces system hardware.

FIGURE 11. Handheld cellphone hardware components. (Diagram: hardware I2C protocol, with CPU, ROM, RAM and I/O plus a dedicated I2C controller on the I2C bus; software I2C protocol, with CPU, ROM, RAM and I/O plus a latch/buffer on the I2C bus.)

TABLE 4. Statistics for original and integrated threads.

Thread                                   | Cycle Count | Idle Cycle Count | Size in Bytes
Fast_Autocorrelation                     | 25768       | 0                | 268
I2C                                      | 6612        | 6404             | 280
Discrete Fast_Autocorrelation and I2C    | 32380       | 6404             | 548
Integrated Fast_Autocorrelation and I2C  | 27943       | 0                | 1416

Table 4 summarizes characteristics of the two software implementations while Figure 12 plots CPU time required for I2C message transmission based on message rate. The discrete software message transmission function presents a small load for the high-performance CPU (chosen to support the advanced features mentioned previously), but it is reduced further through thread integration. The integrated version supports rates of up to 50 messages per second and is limited by the call frequency of its host function Fast_Autocorrelation. At higher message rates the surplus messages are transmitted by the less efficient discrete I2C function, which is beyond the knee in the plot. Integration improves the run time efficiency of the discrete functions but expands the code memory requirement for the two functions from 548 to 1416 bytes. This integration fills the idle time with code from Fast_Autocorrelation efficiently enough to mask 67% of the I2C message transmission time. In this example, thread integration allows system designers to eliminate a dedicated I2C controller or increase the efficiency of a software implementation at the price of slightly more program memory.

FIGURE 12. CPU time vs. message activity. (Plot: CPU time remaining (99.0% to 100.0%) vs. I2C messages transmitted per second (0 to 100) for three cases: HW I2C with SW GSM, integrated I2C and GSM, and SW I2C with SW GSM.)
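The discrete busy-wait transmission of one Quick Command (start condition, eight address/R-W bits, acknowledge clock, stop) can be sketched as follows. The 660-cycle bit cell follows from 66 MHz at 100 kbps; the 20 active cycles per cell and the 11-cell message shape are assumptions of this sketch, so its totals only approximate the measured 6612-cycle, 6404-idle I2C thread of Table 4.

```c
#include <assert.h>

static long cycles = 0, idle_cycles = 0;
static int  sda = 1, scl = 1;   /* simulated open-drain bus lines */

/* One I2C bit cell at 66 MHz / 100 kbps: 660 cycles, of which an
   assumed 20 are pin manipulation and the rest busy-wait padding. */
static void bit_cell(void) { cycles += 660; idle_cycles += 660 - 20; }

/* Send one Quick Command: 7-bit address plus R/W bit, no data byte. */
static void i2c_quick_command(int addr7, int rw) {
    int byte = (addr7 << 1) | (rw & 1);
    sda = 0; bit_cell();                 /* start: SDA falls while SCL high */
    for (int i = 7; i >= 0; i--) {       /* address + R/W, MSB first */
        sda = (byte >> i) & 1;
        scl = 1; bit_cell(); scl = 0;    /* clock the bit out */
    }
    bit_cell();                          /* slave acknowledge clock */
    sda = 1; bit_cell();                 /* stop: SDA rises while SCL high */
}
```

Integration fills the padding inside each bit cell with Fast_Autocorrelation instructions, which is what masks 67% of the message transmission time.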
5.0 Conclusions and Future Work
In this paper we present techniques for integrating software threads to replace dedicated real-time hardware in embedded systems as well as overlapping the execution of multiple threads to increase overall performance. These techniques can be automated in a compilation tool. We examine the method’s feasibility by integrating example real-time threads in several hypothetical embedded systems. These examples demonstrate the potential savings of hardware components in integrated software implementations. Thread integration allows the system designer to implement new real-time functions in software, speeding time to market and reducing development costs. Moving functions into software enables more of the system cost to match the falling cost of CPU throughput. We are in the process of adding thread integration techniques to the Pedigree compiler to automate thread integration. We plan to retarget Pedigree to support an ISA more representative of embedded systems. We will extend thread integration to efficiently handle dynamic events which cannot tolerate long latencies.
Acknowledgments
This work was funded by a grant from United Technologies Research Center, East Hartford, CT, 06108, and in part by ONR (N00014-96-1-0347).
References
[1] V.H. Allan, J. Janardhan, R.M. Lee and M. Srinivas, “Enhanced Region Scheduling on a Program Dependence Graph”, Proceedings of the 25th International Symposium and Workshop on Microarchitecture (MICRO-25), Portland, OR, December 1-4, 1992.
[2] Robert Bosch GmbH, CAN Specification, Version 2.0.
[3] J.E. Bresenham, “Algorithm for Computer Control of a Digital Plotter”, IBM Systems Journal, 4(1), 1965, pp. 25-30.
[4] Alexander G. Dean and John Paul Shen, “Hardware to Software Migration with Real-Time Thread Integration”, Proceedings of the 24th EUROMICRO Conference, Västerås, Sweden, August 25-27, 1998, pp. 243-252.
[5] Jutta Degener, “Digital Speech Compression”, Dr. Dobb’s Journal, December 1994, http://www.ddj.com/ddj/1994/1994_12/degener.htm, http://kbs.cs.tu-berlin.de/~jutta/toast.html.
[6] Jeanne Ferrante, Karl J. Ottenstein and Joe D. Warren, “The Program Dependence Graph and Its Use in Optimization”, ACM Transactions on Programming Languages and Systems, 9(3), July 1987, pp. 319-349.
[7] Rajiv Gupta and Mary Lou Soffa, “Region Scheduling”, Proceedings of the Second International Conference on Supercomputing, 1987, pp. 141-148.
[8] Rajiv Gupta and Madalene Spezialetti, “Busy-Idle Profiles and Compact Task Graphs: Compile-time Support for Interleaved and Overlapped Scheduling of Real-Time Tasks”, 15th IEEE Real-Time Systems Symposium, 1994, pp. 86-96.
[9] Hitachi, HD61830/HD61830B LCDC (LCD Timing Controller) Data Sheet.
[10] Sharad Malik, Margaret Martonosi and Yau-Tsun Steven Li, “Static Timing Analysis of Embedded Software”, ACM Design Automation Conference, June 1997.
[11] D. Niehaus, “Program Representation and Translation for Predictable Real-Time Systems”, Proceedings of the 12th IEEE Real-Time Systems Symposium, December 1991, pp. 53-63.
[12] Chris J. Newburn, Derek B. Noonburg and John P. Shen, “A PDG-Based Tool and Its Use in Analyzing Program Control Dependences”, International Conference on Parallel Architectures and Compilation Techniques, 1994.
[13] Chris J. Newburn, “Pedigree Documentation”, Technical Report CMµART-97-03, Carnegie Mellon Microarchitecture Research Team, Electrical and Computer Engineering Department, Carnegie Mellon University, November 1997.
[14] Chris J. Newburn, “Exploiting Multi-Grained Parallelism for Multiple-Instruction Stream Architectures”, Ph.D. Thesis, CMµART-97-04, Electrical and Computer Engineering Department, Carnegie Mellon University, November 1997.
[15] Philips Semiconductors, “The I2C-bus and how to use it (including specifications)”, 1995.
[16] Philips Semiconductors, “Optimized MIPS RISC-based TwoChipPIC Powers Pen-Based, Pocket-Sized Personal Companion Devices”, press release, January 26, 1998.
[17] P. Puschner and C. Koza, “Calculating the maximum execution time of real-time programs”, The Journal of Real-Time Systems, 1(2), September 1989, pp. 160-176.
[18] K. Ramamritham, “Allocating and Scheduling Complex Periodic Tasks”, Proceedings of the 10th International Conference on Distributed Computing Systems, 1990, pp. 108-115.
[19] Madalene Spezialetti and Rajiv Gupta, “Timed Perturbation Analysis: An Approach for Non-Intrusive Monitoring of Real-Time Computations”, ACM SIGPLAN Workshop on Language, Compiler, and Tool Support for Real-Time Systems, Orlando, Florida, June 1994.