Balancing Register Pressure and Context-Switching Delays in ASTI Systems∗

Siddhartha Shivshankar, Sunil Vangara and Alexander G. Dean
Center for Embedded Systems Research
Department of Electrical and Computer Engineering
North Carolina State University
Raleigh, NC, USA

alex [email protected]

ABSTRACT

This paper makes two contributions to Asynchronous Software Thread Integration (ASTI). First, it presents methods to calculate worst-case secondary thread performance statically. This will enable real-time performance guarantees for the system in future work. Second, it improves the run-time performance of integrated threads by partitioning the register file, allowing faster coroutine calls. Determining the ideal partitioning of the register file is non-trivial if the registers are heterogeneous, which is a common case. We use an exhaustive search to explore the limits of performance possible. We have implemented these analyses in our research compiler Thrint and a shell script to create an automated system for design space exploration. We present experimental results showing the secondary thread performance attainable on an 8-bit embedded microcontroller of the AVR architecture. We automatically integrate two embedded protocols (CAN and MIL-STD-1553B) and two secondary threads (PID controller and serial host interface) and find that in most cases the AVR's 32 registers are adequate for both threads with no slowdown. In two cases slowdowns reach 1.8%, a negligible penalty.

Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors – code generation, compilers, optimization

General Terms: Algorithms, Design

Keywords: Asynchronous software thread integration, hardware to software migration, fine-grain concurrency, software-implemented communication protocols

1. INTRODUCTION

Partitioning a register file to support multiple threads in ASTI [1] requires balancing several factors, making it a challenging problem. Dedicating a register to a single thread rather than sharing it eliminates the need to save and restore it during context switches (coroutine calls in ASTI), accelerating them. However, it reduces the number of registers available for other threads, potentially increasing their register-memory traffic and slowing execution. Sharing a register allows the compiler to generate more efficient code by holding more data in registers, yet it increases the time for a context switch. Traditional methods of register allocation (e.g. coloring [2]) do not work here as the asynchronous nature of thread progress in ASTI makes it possible for far more live ranges to overlap, leading to much more interference among variables. This is complicated by heterogeneous register files; certain registers may have different capabilities (e.g. address pointers, immediate value support, concatenation), requiring a more sophisticated partitioning approach. A banked register file provides one solution to this problem. Although this exists in some instruction set architectures (M16C, Z80), it is absent in others (AVR, MSP430). A fundamental theme of our work is developing software techniques which make generic processors lacking special hardware support more efficient, allowing a wider range of devices to be used.

∗Supported by NSF CAREER award CCR-0133690

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CASES'05, September 24–27, 2005, San Francisco, California, USA. Copyright 2005 ACM 1-59593-149-X/05/0009 ...$5.00.

1.1 Improving Uniprocessor Concurrency

The problem of efficiently allocating a number of concurrent potentially real-time tasks onto one or a few processors is important to the fields of real-time and multiprocessor scheduling, hardware/software co-synthesis and compilation. Dynamically scheduling and then performing each context switch consume time, reducing system throughput. However, embedded systems designers can sometimes take advantage of application information to statically schedule or even eliminate context switches. This reduces the instruction throughput required, allowing the use of a slower processor clock or even a less complex and less expensive processor. Slower processors have a lower demand for instruction and data throughput, potentially averting the need for memory hierarchies, superscalar instruction issue and branch prediction. Using a processor without such features simplifies system design, performance prediction and analysis. Lower clock rates lead to benefits such as reduced power and energy consumption, reduced electromagnetic emissions and susceptibility, easier hardware design and debugging, and simplified power supplies and distribution networks. One example of an embedded application subsystem with frequent context switches which can be reduced significantly

[Figure 1 shows four architectures, each connecting analog & digital I/O (and optionally a system MCU) through a physical layer transceiver to the communication network: a) I/O Expander; b) Discrete protocol controller with MCU; c) MCU with on-board protocol controller; d) Generic MCU with ASTI S/W protocol controller.]

[Figure 2 shows timelines of the Executive, ReceiveMessage and ReceiveBit functions (prepare message buffer; read bit from bus, sampling three times and voting; check for errors, save bit, update CRC; sample bus for resynchronization): a) timeline of original primary thread execution with subroutine calls and fragmented idle time; b) idle time recovered with coroutine calls (ASTI), alternating between primary and secondary threads.]
Figure 1: Overview of three dedicated protocol controller architectures which ASTI replaces.

is a dedicated communication protocol controller. Hardware to software migration of such a device can improve system cost, size, weight, power, time to develop and upgradeability. For example, although the CAN automotive protocol [3] has been standardized since 1991, it took a decade to become a common peripheral on low-cost microcontrollers. Until that time had passed a designer had to buy a dedicated CAN controller IC, which often cost more than the microcontroller for which it was a peripheral. The processing needed per bit for common embedded network protocols is typically quite small, ranging from 10 to 200 instruction cycles [4–8]. Given the means to achieve adequate concurrency, a simple and inexpensive microcontroller running at a clock frequency of tens to hundreds of times the network bit rate could implement these protocols and perform other processing as well. 8- and 16-bit microcontrollers with clock rates of 1 MHz to 100 MHz are quite inexpensive, with prices ranging from $0.25 (US) to $5.00 (US). ASTI can make it possible to use these MCUs efficiently enough for this application, eliminating the need for faster processors and their determinism-sapping microarchitectural features such as caches, superscalar execution and branch prediction. These microcontrollers are low in cost in comparison with discrete network interface ICs or network processors, offering cost savings on top of flexibility. We call such migrated devices software-implemented communication protocol controllers (SWICPCs); they feature one or more real-time threads which access the network bus and are characterized by fine-grain idle time imposed upon them by the bit rate of the protocol. Although this paper concentrates on SWICPCs, these methods are applicable to the other fields mentioned earlier, especially hardware/software co-synthesis. Figure 1 shows various ways in which a SWICPC can replace dedicated hardware; more detail appears in [1].
Figure 1a shows the simple “I/O expander” (or “Serially Linked I/O”) device, which connects discrete analog and digital inputs and outputs to a network [9–12]. Figure 1b shows a discrete protocol controller, which acts as a bridge between the communication bus and the system’s microcontroller.

[Figure 2, panel c): Executive, ReceiveMessage and ReceiveBit with the secondary thread integrated; idle time recovered with coroutine calls and integration.]
Figure 2: Traditional method for sharing processor incurs two context switches per idle time fragment. ASTI eliminates most context switches and also recovers finer grain idle time.

Figure 1c shows a microcontroller with an on-board protocol controller. Figure 1d shows the architecture of the system based on ASTI.

1.2 ASTI

Software Thread Integration (STI) [13] and Asynchronous STI (ASTI) [1] are compiler methods which interleave functions from separate program threads at the assembly language level. This creates implicitly multithreaded functions which provide low-cost concurrency on generic hardware. STI provides synchronous thread progress, requiring the software developer to gather work to ensure an integrated function is frequently data-ready. ASTI provides asynchronous thread progress through the use of lightweight context switches (coroutine calls [14]) between primary and secondary threads. Each thread has its own call stack. The primary thread has hard-real-time constraints which are guaranteed by integration, unlike the secondary thread. This paper provides an analytical foundation for predicting a secondary thread's performance, opening the door to evaluating its real-time characteristics (left for future work).

In ASTI, the cost of context switching is mitigated by reducing the number of these switches, as shown in Figure 2. ASTI uses the idle time T_Idle within a frequently-called primary thread function as a window in which to execute a segment of a secondary thread via a coroutine call (or cocall). There will be T_SegmentIdle = T_Idle - 2 * T_Cocall of that time available, given that two cocalls (each T_Cocall long) must execute for each segment. After padding to equalize timing of conditionals and loops modulo T_SegmentIdle, the entire secondary thread is broken into segments of duration T_SegmentIdle. Intervening primary code within the idle time window is removed and integrated into each segment of the secondary thread, ensuring that running any segment of the secondary thread will still result in the proper intervening primary code executing at the correct times. In addition, coroutine calls are integrated in the secondary thread to ensure that it yields control back to the primary thread at the end of the segment, just before the available idle time expires. The coroutine calls are responsible for saving the context of one thread before switching to the other during the idle time slots. These two types of code are integrated by copying them into each path of control flow in a location which ensures they execute at the correct times regardless of the primary and secondary threads' progress.

Conditionals in the primary code and secondary thread segments are padded to eliminate timing jitter, the code is statically analyzed for timing behavior, and then the primary code regions are inserted at the appropriate times. Integration involves control-dependence graph traversal, and transformations are cumulative. Non-looping primary code regions are handled individually. Moving a region into a conditional requires replicating it into both sides, while entering a loop requires either guarding the execution with a conditional which triggers on a specific loop iteration or else splitting and peeling the loop. Looping primary function regions are unrolled and treated as non-looping regions unless they overlap with secondary function loops. In the latter case the overlapping iterations of the two loops are unrolled as needed to match the secondary function loop body work to available idle time in the primary function loop body. This increases efficiency. In certain cases, integrated code may be guarded by a conditional test to allow execution based upon mode flags in a status register. More details appear in [1].

We assume that instructions take a predictable number of cycles to execute. We target applications with only one hard-real-time thread (the primary thread, used for the communication protocol), although recent extensions to STI [15] support multiple hard-real-time primary threads. During message transmission or reception, any other (secondary) threads will only run in the available idle time, slowing them from their original speed. (Note that this is an improvement over the non-integrated case, where threads may be completely blocked.)

These methods eliminate most of the context switches needed to share the processor, enabling recovery of finer-grain idle time for use by the secondary thread. The two benefits of these recovered compute cycles are improved performance of the secondary thread and reduced minimum clock speed for the microprocessor. The former allows more work to be done, while the latter enables the use of a lower cost processor or a more power-efficient operating frequency. These benefits enable embedded system designers to use existing processors more efficiently with less software development effort. Designers can implement their communication software methodically without resorting to ad hoc solutions which are brittle and difficult to change.
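The segment-budget arithmetic above can be sketched numerically. The cycle counts here are hypothetical; in practice they come from Thrint's static timing analysis:

```python
import math

# Hypothetical cycle counts for illustration only.
T_IDLE = 200      # one idle-time fragment in the primary thread (cycles)
T_COCALL = 34     # duration of one coroutine call (cycles)

# Two cocalls (yield in, yield out) must fit inside each idle fragment,
# so each secondary thread segment gets what is left over.
t_segment_idle = T_IDLE - 2 * T_COCALL

# A secondary thread loop body of T_SEC cycles is cut (after padding)
# into an integral number of segments of t_segment_idle cycles each.
T_SEC = 1000
segments = math.ceil(T_SEC / t_segment_idle)
print(t_segment_idle, segments)  # 132 cycles per segment, 8 segments
```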

1.3 Paper Overview

This paper is organized as follows. Section 2 presents our methods for predicting the performance of a secondary thread and then the techniques for finding a register file partition leading to efficient code. Section 3 describes the target microcontroller architecture and the software tools used. It then describes and quantifies the software implementations of the two embedded communication protocols and the two secondary thread applications. Section 4 analyzes the results of the various register file partition options and the search for the best partition. Section 5 describes related work in compiling for fine-grain concurrency and register file partitioning. Finally, Section 6 summarizes the impact of the results and identifies areas for future work.

2. METHODS

This section presents the methods to predict worst-case secondary thread performance statically, enabling real-time performance guarantees for the system. It then presents the register file partition selection methods used.

2.1 Factors Which Determine Secondary Thread Performance

The secondary thread runs only when the primary thread yields control, so predicting its performance depends on several factors. All of these factors are affected by the register file partitioning approach used. Performance depends on (1) the execution schedule of the primary thread after compilation with a partitioned register file, (2) how large the usable fragments of idle time in the primary thread are, (3) how long a coroutine call takes to execute, and (4) the execution schedule of the secondary thread after compilation with a partitioned register file.

2.2 Primary Thread: Maximum Inter-Call Processing

ASTI reclaims a fragment of idle time from the primary thread by inserting at its beginning a coroutine call, which resumes execution of a secondary thread. After a fixed amount of time T_SegmentIdle the secondary thread yields control back to the primary thread with another coroutine call. The duration of T_SegmentIdle depends upon both the (potentially hard) real-time instruction-level requirements of the primary thread (e.g. input/output operations which communicate on a bus, determined by the bus bit rate), and the amount of time required for instruction execution between them (determined by the source code, the compiler, the ISA and the processor clock rate). Supporting hard-real-time requirements requires supporting the worst case, hence the amount of idle time depends upon the longest-duration execution path between any two successive idle segments. This section describes the method to enumerate the paths and find the largest. Although for this work only the worst-case path needs to be determined, in other work we use the path set for code motion to shorten the worst-case path and hence increase the minimum idle time fragment size, improving performance [16]. As described previously [1], the input and output instructions with timing constraints are located within (potentially call-scheduled) fragmented primary functions (CSFPF). We divide the path into two portions: that within the CSFPF and that within the calling function. Analyzing the execution time within the CSFPF is straightforward with tree-based static timing analysis, and has already been implemented in Thrint. Analyzing the execution time within the calling function (between calls) is novel and the focus of this algorithm. The timing analysis algorithm which we use to identify the run times of the different inter-call paths that exist in the program is simple yet efficient.
We define an inter-call path as the set of nodes (CODE/LOOP/PRED) between subsequent CALL nodes (which call CSFPFs) in the CDG of a program. By subsequent we mean successive during the run time of the program. Not all inter-call paths in a program may be traversed, but all must be accounted for when timing

analysis is performed. Timing analysis becomes complicated for inter-call paths as there can be numerous possible ways in which program flow can take place. For example, for the simple CDG shown in Figure 3, the number of inter-call paths possible is 6. To simplify the task we do not explore the various paths possible for PRED and LOOP nodes which do not have any CALL nodes below them. In our example PRED node P1 has two children B2 and B3, but has no CALL node. Instead of considering the two possible paths available at the PRED node (along either branch), we treat the PRED node as a single node whose run time is the greater of the run times of its branches. In the case of LOOP nodes which do not contain CALL nodes, we determine the run time of the entire LOOP node by using data-flow analysis or directions from the user about the loop iteration counts. The user can provide this information through the integration directives (ID) file in our implementation, Thrint. After this simplification each PRED or LOOP node that does not have any CALL node below it is represented as a single node with finite run time. The definitions of Figure 4 are used in the algorithm. The algorithm for computing the various inter-call paths is presented in Figure 5. This path information is used to determine the worst-case duration T_MaxInterCallProcessing between successive calls to CSFPFs, which in turn determines the amount of idle time available in those functions. This idle time is used to execute the secondary thread.

[Figure 3 shows an example CDG containing CALL nodes C1-C3, CODE blocks B1-B6, PRED nodes P1-P2 (with true and false branches), and a LOOP node.]

Figure 3: Inter-call static timing analysis

LivePaths - Set of paths for which the execution time is being calculated.
AppendNew(pathlist,i) - Appends a new path to pathlist. The new path is a copy of path number i in pathlist.
AddDuration(pathlist,dur) - Adds dur cycles to the duration of all the paths in pathlist.
RemovePath(pathlist,i) - Removes path i from pathlist.
Type(node) - Gives the type of node.
NextChild(node,tv) - Gives the next child of node with truth value tv.
LoopReturn(node) - True if node is the LOOP node and is already visited or is being visited.
HasCALL(node) - True if node has a CALL node as one of its descendants.
Duration(node) - Duration of node in cycles.
SDuration(node) - Duration of the entire subgraph with node as the root.
IsLCP(node) - True if node is a loop-closing PRED node.
node - current node (the algorithm is started with node=start).
createnew - indicates whether new paths have to be created (initially false).

Figure 4: Definitions of terms used in algorithm
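As a concrete illustration (our sketch, not Thrint's implementation), the enumeration can be expressed as a depth-first traversal of a toy graph in which PRED and LOOP nodes without CALL descendants have already been collapsed into single CODE nodes carrying their worst-case durations, as described above:

```python
# Toy inter-call path enumeration: nodes are ('CALL',) or ('CODE', cycles);
# edges form a DAG. Every path between successive CALL nodes is timed.
def inter_call_paths(graph, nodes, start):
    """Return the durations (in cycles) of all CALL-to-CALL paths."""
    results = []

    def dfs(node, elapsed, started):
        if nodes[node][0] == 'CALL':
            if started:
                results.append(elapsed)   # one inter-call path ends here
            elapsed, started = 0, True    # a new path begins at this CALL
        else:
            elapsed += nodes[node][1]     # accumulate CODE cycles
        for succ in graph.get(node, []):
            dfs(succ, elapsed, started)

    dfs(start, 0, False)
    return results

# Hypothetical CDG fragment: C1 -> B1 (3 cyc) -> {B2 (8 cyc), B3 (5 cyc)} -> C2
graph = {'C1': ['B1'], 'B1': ['B2', 'B3'], 'B2': ['C2'], 'B3': ['C2']}
nodes = {'C1': ('CALL',), 'B1': ('CODE', 3), 'B2': ('CODE', 8),
         'B3': ('CODE', 5), 'C2': ('CALL',)}
paths = inter_call_paths(graph, nodes, 'C1')
print(max(paths))  # worst-case inter-call processing: 11 cycles
```

The worst-case result plays the role of T_MaxInterCallProcessing; the full algorithm of Figure 5 additionally handles loops and loop-closing predicates.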

2.3 Secondary Thread: Slowdown after Partitioning and Segmentation

The amount of processing performed in the secondary thread per coroutine call depends upon the idle time present in the primary thread functions [1]. It is the sum of the idle time segments (a through b) minus the context-switching time T_CS:

$$T_{SegmentIdle} = \sum_{i=a}^{b} T_{idle}(F,i) - 2 \cdot T_{CS} \quad (1)$$

The secondary thread is structured as an infinite event-processing loop. We use the worst-case duration of that loop's body (T_Sec cycles) to measure performance, as it will determine maximum update rates and response times. During integration, that loop body is divided into segments of length T_SegmentIdle cycles. Hence it will take an integral number of segments to execute the loop body, raising the iteration duration to T_Sec-Seg. This is the base performance against which we will compare the effects of register file partitioning.

$$T_{Sec\text{-}Seg} = \left\lceil \frac{T_{Sec}}{T_{SegmentIdle}} \right\rceil \cdot T_{SegmentIdle} \quad (2)$$

When partitioning the register file between primary and secondary threads, the duration for both threads may increase because the full register set is unavailable to either, leading to increased memory accesses. An increased primary thread duration reduces idle time, decreasing segment duration from T_SegmentIdle to T_SegmentIdle-Part (Part. implies register partitioning). The secondary thread duration increases from T_Sec to T_Sec-Part because the full register set is unavailable. The secondary thread is segmented into pieces of duration T_SegmentIdle-Part, leading to a secondary thread loop iteration duration of T_Sec-Seg-Part.

$$T_{Sec\text{-}Seg\text{-}Part} = \left\lceil \frac{T_{Sec\text{-}Part}}{T_{SegmentIdle\text{-}Part}} \right\rceil \cdot T_{SegmentIdle\text{-}Part} \quad (3)$$

Finally, the slowdown is the ratio of the secondary thread's worst-case loop iteration times, before and after compilation with a partitioned register file and integration by Thrint (which includes segmentation):

$$Slowdown = \frac{\left\lceil T_{Sec\text{-}Part} / T_{SegmentIdle\text{-}Part} \right\rceil \cdot T_{SegmentIdle\text{-}Part}}{\left\lceil T_{Sec} / T_{SegmentIdle} \right\rceil \cdot T_{SegmentIdle}} \quad (4)$$
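Equations (2)-(4) can be exercised with a short sketch; the cycle counts below are hypothetical, chosen only to show partitioning shrinking the idle segment and lengthening the secondary loop body:

```python
import math

def slowdown(t_sec, t_seg_idle, t_sec_part, t_seg_idle_part):
    """Equation (4): ratio of segmented loop-iteration times after and
    before register file partitioning."""
    before = math.ceil(t_sec / t_seg_idle) * t_seg_idle                 # eq (2)
    after = math.ceil(t_sec_part / t_seg_idle_part) * t_seg_idle_part   # eq (3)
    return after / before

# Partitioning cuts the idle segment from 132 to 124 cycles and grows
# the secondary loop body from 1000 to 1050 cycles (hypothetical values).
ratio = slowdown(t_sec=1000, t_seg_idle=132,
                 t_sec_part=1050, t_seg_idle_part=124)
print(ratio)  # about 1.057, i.e. a 5.7% slowdown
```

Note how the ceiling makes the slowdown quantized: a tiny increase in T_Sec-Part can add a whole extra segment.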

3. EXPERIMENTAL METHOD

3.1 Target Processor

We target 8-bit microcontrollers with the AVR architecture (e.g. the ATmega128 [17]), as in our previous work [1]. AVR devices have a load-store architecture, a two-stage pipeline, separate on-chip program and data memories, no cache memory and easily predictable timing characteristics. Most instructions take one cycle to execute. As shown in Table 1, the register file has 32 8-bit registers. Of these, sixteen

Inter-CallAnalysis(node, createnew):
  if Type(node) = PROC {
    next = NextChild(node, TV_T)
    while next {
      Inter-CallAnalysis(next)
      next = NextChild(node, TV_T)
    }
  } else if createnew {
    for every path p in LivePaths
      AppendNew(LivePaths, p)
  }
  if LoopReturn(node)
    exit
  if Type(node) = CODE
    AddDuration(LivePaths, Duration(node))
  if Type(node) = LOOP {
    if not HasCALL(node)
      AddDuration(LivePaths, SDuration(node))
    else {
      next = NextChild(node, TV_T)
      while next {
        Inter-CallAnalysis(next)
        next = NextChild(node, TV_T)
      }
    }
  }
  if Type(node) = PRED and IsLCP(node) {
    if node is visited for the first time {
      if not HasCALL(node)
        AddDuration(LivePaths, SDuration(node))
      else {
        AddDuration(LivePaths, Duration(node))
        tv = truth value of first child of node
        next = NextChild(node, tv)
        Inter-CallAnalysis(next, true)
        next = NextChild(node, tv)
        while next {
          Inter-CallAnalysis(next)
          next = NextChild(node, tv)
        }
      }
    } else
      AddDuration(LivePaths, Duration(node))
  }
  if Type(node) = PRED and not IsLCP(node) {
    if not HasCALL(node)
      AddDuration(LivePaths, SDuration(node))
    else {
      AddDuration(LivePaths, Duration(node))
      next = NextChild(node, TV_T)
      Inter-CallAnalysis(next, true)
      while next {
        Inter-CallAnalysis(next)
        next = NextChild(node, TV_T)
      }
      next = NextChild(node, TV_F)
      Inter-CallAnalysis(next, true)
      while next {
        Inter-CallAnalysis(next)
        next = NextChild(node, TV_F)
      }
    }
  }
  if Type(node) = CALL
    for every path p in LivePaths ending in node
      RemovePath(LivePaths, p)
Figure 5: Algorithm for finding inter-call paths

Table 1: AVR Register Capabilities

Capability                      r0-r15   r16-r25   r26-r31
Use as pointer                  no       no        yes
Immediate operands supported    no       yes       yes
All other capabilities          yes      yes       yes
can be used with immediate operands. Of these, three pairs can be used as 16-bit pointers to memory. The remaining 16 registers can be used neither with immediate operands nor as pointers. Coroutine calls take from 34 to 158 cycles, depending upon how many registers must be swapped (each register takes 4 cycles to swap). The exact coroutine call duration is T_CS = 30 + 4 * NumberOfSwitchedRegs cycles. The AVR architecture has no hardware support for faster switches. (If used on a system with single-cycle cocalls, ASTI would still help by providing automated static scheduling of context switches, significantly improving the development process.) Our target devices support clock rates of up to 20 MHz.
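The coroutine call cost formula above is easy to check against the quoted 34-158 cycle range:

```python
# Coroutine call cost on the AVR, from the text: a fixed 30-cycle
# overhead plus 4 cycles per register that must be swapped.
def cocall_cycles(swapped_regs):
    return 30 + 4 * swapped_regs

# One swapped register gives the minimum; swapping all 32 gives the maximum.
print(cocall_cycles(1), cocall_cycles(32))  # 34 158
```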

[Figure 6 flow: register file partitioning decisions drive avr-gcc compilation of the primary thread C code (s_m.c, r_m.c, s_b.c, r_b.c) and the secondary thread C code; Thrint's inter-call timing analysis (ICTA) yields T_SegmentIdle; Thrint then computes the original secondary thread performance T_Sec-Seg (full register set) and the segmented, partitioned performance T_Sec-Seg-Part, whose comparison gives the slowdown.]
Figure 6: Overview of register file partitioning design exploration

3.2 Tools

C source code for the threads is compiled to AVR assembly code with avr-gcc 3.2 [18] with maximum speed optimization (-O3). We have developed a back-end thread-integrating compiler (Thrint) which processes AVR assembler code generated by avr-gcc. It automatically performs control-flow, data-flow and static timing analyses, creates visual representations of function structures (control-flow and control-dependence graphs) and performs integration using code motion and loop transformation. The outputs are integrated assembly code, statistics (timing, code size, etc.) and program structure graphs. The work in this paper enhances Thrint in two ways: Thrint now calculates secondary thread performance, and it is built into a larger framework which automatically evaluates different register file partitioning approaches.

The exploration of register partitioning options, presented in Figure 6, is performed by a shell-script tool which in turn invokes two other tools. It calls avr-gcc to compile the primary and secondary threads' C code to assembler code based upon a specified register file partition. In the figure, s_m.c refers to the send message C function, r_b.c refers to the receive bit C function, and so forth. Thrint then analyzes the primary thread's assembly code statically to determine timing information, which the tool uses to calculate the available idle time (T_SegmentIdle). The tool then runs Thrint to perform static timing analysis of the secondary thread and calculate its performance based upon T_SegmentIdle. Future work will extend the tool to use this and previous performance information to guide partitioning decisions.

Table 2: Best and Average Slowdowns

                      Host Interface       PID Controller
Protocol  Activity    Best      Average    Best      Average
1553      Send        0.00%     21.7%      0.00%     21.0%
1553      Receive     0.00%     15.4%      0.00%     16.8%
CAN       Send        1.52%     28.8%      1.86%     44.8%
CAN       Receive     0.00%     14.7%      -0.20%    15.3%

3.3 Protocols

We have developed C software which implements two common embedded communication protocols. MIL-STD-1553B [19] is a command-response communication protocol with a bit rate of 1 Mbps. This protocol is used in aircraft. A Master Bus Controller coordinates traffic. Data is transferred in 16-bit words, with parity checking and potential message retransmission. CAN (Controller Area Network) [20] is a carrier-sense / multiple-access with collision resolution communication protocol commonly used in automobiles. CAN features strong error control support, including a 15-bit cyclic redundancy check code, bit stuffing, error flag signalling and other features. Bit rates range from 10 kbps to 1 Mbps; for these experiments we use 62.5 kbps.
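CAN's bit stuffing, mentioned above, can be illustrated with a short sketch (our illustration, not the paper's C implementation): after five consecutive identical bits the transmitter inserts one complementary bit, which the receiver later strips out.

```python
# CAN-style bit stuffing: insert a complementary bit after any run of
# five identical bits. The stuff bit itself starts the next run.
def stuff_bits(bits):
    out = []
    run_bit, run_len = None, 0
    for b in bits:
        out.append(b)
        if b == run_bit:
            run_len += 1
        else:
            run_bit, run_len = b, 1
        if run_len == 5:                 # five identical bits in a row:
            out.append(1 - b)            # insert the complementary stuff bit
            run_bit, run_len = 1 - b, 1
    return out

print(stuff_bits([1, 1, 1, 1, 1, 0]))  # [1, 1, 1, 1, 1, 0, 0]
```

This per-bit bookkeeping is part of the 10-200 instruction cycles of protocol work per bit cited in the introduction.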

3.4 Applications

As previously [1], our applications are structured with two threads. A primary thread performs protocol processing and communicates with the secondary thread through shared message buffers. Interrupts are disabled to ensure proper timing is maintained. Each secondary thread is structured as a cyclic executive, with an infinite loop to handle activities. Two secondary threads from our previously developed benchmark suite [21] are used, with procedure calls inlined. Typical activities include acting upon received messages from the buffer, sampling and responding to inputs, computing outputs (PID controller), driving output signals, and servicing the UART. The host interface thread acts as one side of an embedded network-to-UART bridge [1]. It receives messages from the host microcontroller (via UART) for transmission on the bus and relays messages received from the bus to the host microcontroller. The PID controller application provides a simple closed-loop proportional/integral/derivative control system. The feedback input is FIR-filtered for noise cancellation and then linearized before entering the PID control computation.
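A minimal sketch of the control computation in the PID secondary thread (hypothetical gains and timestep; the paper's thread additionally FIR-filters and linearizes the feedback before this step):

```python
# Textbook discrete PID step; not the paper's fixed-point AVR code.
class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_err = 0.0

    def step(self, setpoint, feedback, dt):
        err = setpoint - feedback
        self.integral += err * dt            # integral term accumulates
        deriv = (err - self.prev_err) / dt   # derivative of the error
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

pid = PID(kp=2.0, ki=0.5, kd=0.1)
out = pid.step(setpoint=1.0, feedback=0.8, dt=0.01)
```

In the ASTI system each call to step runs only in the primary thread's recovered idle time, so its worst-case cycle count determines the control loop's achievable update rate.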

4. ANALYSIS

We use our tools to evaluate the performance of all possible partitionings of the register file between primary and secondary threads. We investigate eight cases, varying protocol (1553B, CAN), bus activity (send, receive), and secondary thread (Host interface, PID Controller).

4.1 General Analysis

The goal of this paper is to find the best performance possible; future work will evaluate methods for finding a near-optimal partitioning quickly. Several parameters are varied: allocation of the three pointer register pairs (zero to three register pairs per thread), the ten immediate-addressing-capable registers (zero to ten registers per thread) and the 16 other registers (zero to 16 per thread). Some of these partitionings require registers to be swapped in each context switch. For each combination of primary and secondary thread we evaluate

[Figure 7 is a surface plot: slowdown (z-axis, roughly 0.98 to 1.14) vs. the number of immediate registers allocated to the primary (0-10) and secondary (0-10) threads.]

Figure 7: Slowdown for 1553 Send with PID Controller, allocating “immediate” registers

100,980 possible allocations of registers. This exhaustive search takes at least a day on a 2 GHz Pentium 4 processor. We examine the slowdowns and find the smallest for a given register allocation approach. Table 2 shows the best (smallest) slowdown for each of the eight integration cases; in five cases the worst-case performance was unaffected by partitioning and integration (including segmentation). In two cases it worsened slightly (up to 1.86%) and in one case it actually improved by 0.2%. Hence, even with their limitations and heterogeneity, the 32 registers present in the AVR register file are more than adequate for supporting efficient execution of these protocols and secondary threads. Table 2 also shows the average slowdown for all of the successful register partitionings. This number is given to provide an idea of what kind of performance might be expected by randomly choosing a partition. The performance is from about 15% to 45% worse than the best case, motivating the search for a better partition.
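The 100,980 figure can be reproduced by direct enumeration, assuming the per-class ranges stated above (0-3 pointer pairs, 0-10 immediate registers, 0-16 other registers per thread) and the constraint, noted in Section 4.2, that each partitioning uses at least the full register file:

```python
from itertools import product

# Count register file partitionings: for each register class, pick a
# primary-thread and a secondary-thread allocation; keep only those
# whose sum covers the whole class (any excess is swapped per cocall).
def count_partitionings(class_sizes=(3, 10, 16)):
    total = 0
    for alloc in product(*(product(range(m + 1), repeat=2)
                           for m in class_sizes)):
        if all(p + s >= m for (p, s), m in zip(alloc, class_sizes)):
            total += 1
    return total

print(count_partitionings())  # 100980, matching the text's 100,980
```

Per class the count is (M+1)(M+2)/2 valid pairs, giving 10 * 66 * 153 = 100,980; this also confirms that the immediate class spans ten registers (r16-r25).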

4.2 Detailed Analysis

We now examine detailed performance of specific cases. Each surface plot (e.g. Figure 7) shows the slowdown (z-axis) as a function of two parameters: one horizontal axis (right) indicates the number of registers of a given type available to the primary thread, while the other horizontal axis (left) shows the number available to the secondary thread. Data points along the diagonal correspond to no swapped registers and hence the quickest coroutine calls. Moving to the right beyond the diagonal increases the number of registers available to each thread, yet also raises the number swapped at each coroutine call. The surface plots require two additional explanations for clarity. First, only register partitionings which use at least the full register file size (32 registers) were examined. Hence

[Surface plots omitted: slowdown (z-axis) vs. number of "other" registers allocated to the primary and secondary threads.]

Figure 8: Slowdown for 1553 Receive with HI, allocating “other” registers

Figure 9: Slowdown for CAN Send with HI, allocating “other” registers

the points in the left half (triangle) of each graph were not examined. Second, the surface plot clips values to a minimum slowdown “floor”. Points on this “floor” represent infeasible schedules, rather than actual data points. We find the following characteristics:

• Integrating with the CAN Send thread is most sensitive to smaller register partitions: Other registers are most important, followed by Immediate registers. Integrating with the CAN Receive thread is somewhat sensitive to the number of Immediate registers, less sensitive to Pointer registers, and Other registers have a negligible impact.

• In general, reducing the number of registers available to the secondary thread slows it negligibly until a threshold is reached. For example, Figure 7 shows that secondary thread (PID controller) performance is not affected significantly (slowdown ≤ 2%) until fewer than six immediate registers are available, at which point slowdown rises past 10%.

• For some cases (e.g. 1553 Receive + Host Interface), there is a slight benefit to providing more registers to the secondary thread, even though the overhead of context switching rises. As shown in Figure 8, slowdown drops to 1x after more than three registers are swapped. This is likely a result of the register requirements of both threads: the slowdown falls from 1.076x to 1.0x as the secondary thread partition rises from zero to seven immediate registers.

• However, in both CAN Send cases, the available idle time in the primary thread is short enough that coroutine call time must be minimized, so only a few registers can be swapped. Attempting to swap more registers makes the code impossible to schedule (i.e., there are no idle time segments long enough for two coroutine calls). For example, the CAN Send + HI case (shown in Figure 9) is the most demanding. Performance is good (less than 7% slowdown) if at least two "other" registers are allocated to the secondary thread and two to the primary thread, with none swapped. The minimum slowdown (1.0152x) appears at several points along the diagonal (no swapping, with 10 to 14 other registers allocated to the primary thread). At most two other registers can be swapped before slowdowns grow rapidly to 1.26x–1.51x. The code shows similar behavior for immediate and pointer registers.

• Integrating with the 1553B Receive or Send thread is slightly sensitive to Immediate registers; pointer registers have a lesser impact.

Overall, we find that it is often possible to find a register file partitioning approach which has no performance penalty (i.e. the slowdown is exactly 1). The exceptions are the CAN Send functions, which have higher register pressure.
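The scheduling-feasibility constraint described above (an idle-time segment must hold two coroutine calls, whose cost grows with the number of swapped registers) can be sketched with a toy cost model. The cycle constants below are assumptions for illustration, not measured AVR values:

```python
# Illustrative cost model: each swapped register adds save + restore work
# to every coroutine call. Constants are assumptions, not measurements.
SWAP_CYCLES_PER_REG = 4   # assumed push + pop cost per register
CALL_BASE_CYCLES = 8      # assumed fixed coroutine-call overhead

def coroutine_call_cost(swapped_regs):
    """Cycles for one coroutine call with the given number of swapped regs."""
    return CALL_BASE_CYCLES + SWAP_CYCLES_PER_REG * swapped_regs

def feasible(swapped_regs, shortest_idle_segment):
    """A schedule is infeasible when no idle-time segment in the primary
    thread can hold two coroutine calls (switch out and back)."""
    return 2 * coroutine_call_cost(swapped_regs) <= shortest_idle_segment
```

Under such a model, tight primary-thread idle segments (as in the CAN Send cases) cap the number of registers that may be swapped, matching the infeasible "floor" regions in the surface plots.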

5. RELATED WORK

Snyder et al. [22] seek to minimize context switching time by using dataflow analysis to identify register live ranges and hence find the instructions with the fewest live registers. These fast context switch points require fewer registers to be saved and restored, accelerating switching. Such a method could be adapted to reduce context switching times for ASTI: rather than relying upon hardware support to enable a delayed context switch, one could schedule the coroutine calls at the ideal times, potentially reordering instructions to fill available time while minimizing register liveness.

A technique called register relocation allows flexible partitioning of the register file into variable-size contexts [23]. It follows the RISC philosophy of maintaining a simple processor architecture and relying on the compiler and the runtime system to manage the allocation and use of contexts. Possible organizations range from static partitioning into contexts of fixed or varying size to dynamic allocation of variable-size contexts as needed. The register relocation mechanism is very simple: instruction operands contain context-relative register numbers, which are bitwise ORed with a Register Relocation Mask (RRM) to form the absolute register numbers used during instruction execution. This bitwise OR is supported in hardware and occurs during the instruction decode phase. Context switching, context allocation/deallocation and context loading are all done in software. Compilers balance the gain in performance per thread due to a higher number of registers against a potentially greater degree of multithreading (allowing more resident contexts with fewer registers per context). Experiments are run on a coarse-grained multithreaded processor which switches context only when a high-latency operation or synchronization fault occurs. Register relocation significantly outperforms fixed-size hardware contexts.

Executing mini-threads [24] on a simultaneous multithreading (SMT) processor permits all of the partitioning schemes in the Waldspurger paper above, but the work focuses on statically partitioning each register set in half between two mini-threads. Applications are compiled to use only half of the register set. Two factors describing mtSMT performance relative to a traditional SMT processor are presented: the IPC benefit of increased TLP and the increased instruction count due to fewer registers per mini-thread. Instruction throughput improvement due to the extra mini-threads alone (increased thread-level parallelism) rises when the number of mini-threads is doubled, with the benefit diminishing as the number of contexts increases. The benchmarks considered are, however, rather insensitive to the increase in dynamic instructions due to fewer registers. The results show that trading registers for contexts improves performance, especially on the smaller SMTs that are characteristic of the first commercial implementations. Compared to Waldspurger, a more modern hardware and software environment is used: intra-context architectural register partitioning is evaluated for mini-threads on an out-of-order simultaneous multithreading CPU with multiple hardware contexts and register renaming. Real parallel programs are simulated, including a multi-threaded server compiled for mini-threads and OS code, with detailed performance analysis.

Albrecht et al. [25] seek to reduce context switching time and code size in systems with cooperative software multithreading. Because the multitasking is cooperative, there are explicit context switch points within the program. Registers are re-allocated using coloring such that the registers live immediately before a context switch point are located contiguously at the end of the register file. This enables the use of a single move-multiple-register instruction, rather than multiple move-single-register instructions, to save the context, reducing both time and code size.
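The register relocation mechanism of Waldspurger and Weihl can be modeled in a few lines. The hardware performs this OR during instruction decode; the context base values below are hypothetical examples:

```python
def relocate(reg_relative, rrm):
    """Register relocation: the context-relative register number in the
    instruction operand is bitwise ORed with the Register Relocation
    Mask (RRM) to form the absolute register number. Done in hardware
    at decode time; modeled here in software."""
    return reg_relative | rrm

# Hypothetical example: a context of 8 registers based at absolute
# register 16 uses RRM = 0b10000, mapping relative r0..r7 onto
# absolute r16..r23.
```

Because the mapping is a plain OR rather than an add, context base addresses must be aligned to the context size, which is why the scheme favors power-of-two context sizes.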

6. CONCLUSIONS AND FUTURE WORK

We present methods to calculate the impact of register file partitioning and integration on a secondary thread's execution speed. This will be used in the future to help derive real-time performance guarantees. We find that the register file of the AVR microcontroller is adequate for ASTI with the threads examined: there is a negligible performance impact, given an exhaustive search of all partitioning approaches. However, the average performance is 15% to 45% slower than the original code. Future work includes deriving efficient heuristics to find a near-optimal register partitioning scheme, dramatically improving search time. Alternatively, a data-flow-driven approach to determining register liveness (as described in the related work) could optimize the context switch time, although the impact on overall performance would need to be evaluated.

7. ACKNOWLEDGMENTS

We thank Karthik Sundaramoorthy for enhancing Thrint with code to automatically implement ASTI for the secondary thread.

8. REFERENCES

[1] N. J. Kumar, S. Shivshankar, and A. G. Dean, "Asynchronous software thread integration for efficient software implementations of embedded communication protocol controllers," in Proceedings of the 2004 ACM SIGPLAN/SIGBED Conference on Languages, Compilers and Tools for Embedded Systems. ACM Press, June 2004. [Online]. Available: http://www.cesr.ncsu.edu/agdean/TechReports/p55kumar.pdf
[2] G. J. Chaitin, "Register allocation and spilling via graph coloring," in Proceedings of the SIGPLAN '82 Symposium on Compiler Construction. ACM, 1982, pp. 98–105.
[3] "CAN specification version 2.0," Robert Bosch GmbH, 1991. [Online]. Available: http://www.motsps.com/csic/techdata/refman/can2spec.pdf
[4] "Application note: Software LIN slave," Atmel Corporation, Feb. 2000.
[5] AVR320 Software SPI Master, Atmel Corporation, May 2002.
[6] AVR304 Half Duplex Interrupt Driven Software UART, Atmel Corporation, Aug. 1997.
[7] AVR410 RC5 IR Remote Control Receiver, Atmel Corporation, May 2002.
[8] A. G. Dean and R. R. Grzybowski, "A high-temperature embedded network interface using software thread integration," in Second Workshop on Compiler and Architectural Support for Embedded Systems, Washington, DC, Oct. 1999.
[9] P82C150 CAN Serial Linked I/O Device (SLIO) with Digital and Analog Port Functions Data Sheet, Philips Semiconductors, Jun. 1996.
[10] MCP2502X/5X CAN I/O Expander Family Data Sheet, Microchip Technology, Inc., Aug. 2000.
[11] MIC74 2-Wire Serial I/O Expander and Fan Controller, Micrel, Inc., Aug. 2000.
[12] PCF8574 Remote 8-bit I/O Expander for I2C-bus, Philips Semiconductors, Nov. 2002.
[13] A. G. Dean, "Software thread integration for hardware to software migration," Ph.D. dissertation, Carnegie Mellon University, Feb. 2000.
[14] M. E. Conway, "Design of a separable transition-diagram compiler," Communications of the ACM, vol. 6, no. 7, pp. 396–408, 1963.
[15] B. Welch, S. Kanaujia, A. Seetharam, D. Thirumalai, and A. G. Dean, "Extending STI for demanding hard-real-time systems," in Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems. ACM Press, Nov. 2003, pp. 41–50.
[16] S. Vangara, "Code motion techniques for software thread integration for bit-banged communication protocols," Master's thesis, North Carolina State University, May 2003.
[17] Atmega 128: 8-Bit AVR Microcontroller with 128K Bytes In-System Programmable Flash, Atmel Corporation, 2002. [Online]. Available: http://www.atmel.com/dyn/resources/prod documents/doc2467.pdf
[18] avr-gcc 3.2. [Online]. Available: http://www.avrfreaks.net/AVRGCC/index.php
[19] Aircraft Internal Time Division Command/Response Multiplex Data Bus, MIL-STD-1553B, Military Standard, Sep. 1978.
[20] "CAN specification version 2.0," Robert Bosch GmbH, 1991. [Online]. Available: http://www.motsps.com/csic/techdata/refman/can2spec.pdf
[21] V. Asokan and A. G. Dean, "Providing time- and space-efficient procedure calls for asynchronous software thread integration," in Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, Sep. 2004.
[22] J. S. Snyder, D. B. Whalley, and T. P. Baker, "Fast context switches: Compiler and architectural support for preemptive scheduling," Microprocessors and Microsystems, pp. 35–42, 1995. [Online]. Available: citeseer.ist.psu.edu/33707.html
[23] C. A. Waldspurger and W. E. Weihl, "Register relocation: Flexible contexts for multithreading," in Proceedings of the 20th Annual International Symposium on Computer Architecture, May 1993. [Online]. Available: http://jukebox.lcs.mit.edu/papers/RegisterRelocation.ps
[24] J. Redstone, S. Eggers, and H. Levy, "Mini-threads: Increasing TLP on small-scale SMT processors," in HPCA '03: Proceedings of the Ninth International Symposium on High-Performance Computer Architecture. IEEE Computer Society, 2003, p. 19.
[25] C. Albrecht, R. Hagenau, and A. Döring, "Cooperative software multithreading to enhance utilization of embedded processors for network applications," in 12th Euromicro Workshop on Parallel, Distributed and Network-Based Processing (PDP 2004). IEEE Computer Society, 2004, pp. 300–307.
