Statically Calculating Secondary Thread Performance in ASTI Systems ∗

Siddhartha Shivshankar, Sunil Vangara and Alexander G. Dean

Center for Embedded Systems Research, Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 27695-7256 alex [email protected]

Abstract. This paper makes two contributions to Asynchronous Software Thread Integration (ASTI). First, it presents methods to calculate worst-case secondary thread performance statically, enabling real-time performance guarantees for the system. Second, it improves the run-time performance of integrated threads by partitioning the register file, allowing faster coroutine calls. We present experimental results showing the secondary thread performance attainable on an 8-bit embedded microcontroller of the AVR architecture. We examine two embedded protocols (CAN and MIL-STD-1553B) and two secondary threads (PID controller and serial host interface).

∗ Supported by NSF CAREER award CCR-0133690

1. Introduction

Partitioning a register file to support multiple threads in ASTI [Kumar et al. 2004] requires balancing several factors, making it a challenging problem. Dedicating each register to one thread rather than sharing it eliminates the need to save and restore it during context switches (coroutine calls in ASTI), accelerating them. However, it reduces the number of registers available for other threads, potentially increasing their register-memory traffic and slowing execution. Sharing a register allows the compiler to generate more efficient code by holding more data in registers, yet it increases the time for a context switch. Traditional methods of register allocation (e.g. coloring [Chaitin 1982]) do not work here, as the asynchronous nature of thread progress in ASTI makes it possible for far more live ranges to overlap, leading to much more interference among variables. This is complicated by heterogeneous register files; certain registers may have different capabilities (e.g. address pointers, immediate value support, concatenation), requiring a more sophisticated partitioning approach. A banked register file provides one solution to this problem. Although this exists in some instruction set architectures (M16C, Z80), it is absent in others (AVR, MSP430). A fundamental theme of our work is developing software techniques which make generic processors lacking special hardware support more efficient, allowing a wider range of devices to be used.

1.1. ASTI

Software Thread Integration (STI) [Dean 2000] and Asynchronous STI (ASTI) [Kumar et al. 2004] are compiler methods which interleave functions from separate program threads at the assembly language level, creating implicitly multithreaded functions which provide low-cost concurrency on generic hardware. STI provides synchronous thread progress, requiring the software developer to gather work to ensure an integrated function is frequently data-ready. ASTI provides asynchronous thread progress through the use of lightweight context switches (coroutine calls [Conway 1963]) between primary and secondary threads. Each thread has its own call stack. The primary thread has hard real-time constraints which are guaranteed by integration, unlike the secondary thread. This paper provides an analytical foundation for predicting a secondary thread's performance, opening the door to evaluating its real-time characteristics (left for future work).

In ASTI, the cost of context switching is mitigated by reducing the number of these switches. ASTI uses the idle time TIdle within a frequently-called primary thread function as a window in which to execute a segment of a secondary thread via a coroutine call (or cocall). There will be TSegmentIdle = TIdle − 2 ∗ TCocall of that time available, given that two cocalls (each TCocall long) must execute for each segment. After padding to equalize the timing of conditionals and loops modulo TSegmentIdle, the entire secondary thread is broken into segments of duration TSegmentIdle. Intervening primary code within the idle time window is removed and integrated into each segment of the secondary thread, ensuring that running any segment of the secondary thread will still result in the proper intervening primary code executing at the correct times. In addition, coroutine calls are integrated into the secondary thread to ensure that it yields control back to the primary thread at the end of the segment, just before the available idle time expires. The coroutine calls are responsible for saving the context of one thread before switching to the other during the idle time slots.
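The TSegmentIdle budget above reduces to simple cycle arithmetic. The following C sketch is purely illustrative (the cycle counts in the usage note are invented, not measurements from this paper):

```c
#include <assert.h>

/* Usable secondary-thread time per idle window, in cycles:
   T_SegmentIdle = T_Idle - 2 * T_Cocall, since a cocall into the
   secondary thread and a cocall back out must both fit in the window. */
long segment_idle(long t_idle, long t_cocall)
{
    long t = t_idle - 2 * t_cocall;
    return t > 0 ? t : 0;  /* a window shorter than two cocalls yields no usable time */
}
```

For example, a 500-cycle idle window with 50-cycle cocalls leaves 400 cycles of secondary-thread progress per segment.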
These two types of code are integrated by copying them into each path of control flow in a location which ensures they execute at the correct times regardless of the primary and secondary threads' progress. More details appear in [Kumar et al. 2004]. These methods eliminate most of the context switches needed to share the processor, enabling recovery of finer-grain idle time for use by the secondary thread. The two benefits of these recovered compute cycles are improved performance of the secondary thread and a reduced minimum clock speed for the microprocessor. The former allows more work to be done, while the latter enables the use of a lower-cost processor or a more power-efficient operating frequency. These benefits enable embedded system designers to use existing processors more efficiently with less software development effort. Designers can implement their communication software methodically without resorting to ad hoc solutions which are brittle and difficult to change.

1.2. Paper Overview

This paper is organized as follows. Section 2 presents our methods for predicting the performance of a secondary thread and then the techniques for finding a register file partition leading to efficient code. Section 3 describes the target microcontroller architecture and the software tools used. It then describes and quantifies the software implementations of the two embedded communication protocols and the two secondary thread applications. Section 4 analyzes the results of the various register file partition options and the search for the best partition. Finally, Section 5 summarizes the impact of the results and identifies areas for future work.

2. Methods

This section presents the methods to predict worst-case secondary thread performance statically, enabling real-time performance guarantees for the system. It then presents the register file partition selection methods used. The secondary thread runs only when the primary thread yields control, so predicting its performance depends on several factors, all of which are affected by the register file partitioning approach used. Performance depends on (1) the execution schedule of the primary thread after compilation with a partitioned register file, (2) how large the usable fragments of idle time in the primary thread are, (3) how long a coroutine call takes to execute, and (4) the execution schedule of the secondary thread after compilation with a partitioned register file.

2.1. Primary Thread: Maximum Inter-Call Processing

ASTI reclaims a fragment of idle time from the primary thread by inserting at its beginning a coroutine call, which resumes execution of a secondary thread. After a fixed amount of time TSegmentIdle, the secondary thread yields control back to the primary thread with another coroutine call. The duration of TSegmentIdle depends upon both the (potentially hard) real-time instruction-level requirements of the primary thread (e.g. input/output operations which communicate on a bus, determined by the bus bit rate), and the amount of time required for instruction execution between them (determined by the source code, the compiler, the ISA and the processor clock rate). Supporting hard real-time requirements requires supporting the worst case, hence the amount of idle time depends upon the longest-duration execution path between any two successive idle segments. This section describes the method to enumerate the paths and find the largest.
Although for this work only the worst-case path needs to be determined, in other work we use the path set for code motion to shorten the worst-case path and hence increase the minimum idle time fragment size, improving performance [Vangara 2003]. As described previously [Kumar et al. 2004], the input and output instructions with timing constraints are located within (call-scheduled) fragmented primary functions (CSFPF). We divide the path into two portions: that within the CSFPF and that within the calling function. Analyzing the execution time within the CSFPF is straightforward with tree-based static timing analysis, and has already been implemented in Thrint. Analyzing the execution time within the calling function (between calls) is novel and the focus of this algorithm. The timing analysis algorithm which we use to identify the run times of the different inter-call paths that exist in the program is simple yet efficient. We define an inter-call path as the set of nodes (CODE/LOOP/PRED) between subsequent CALL nodes (which call CSFPFs) that can occur in the CDG of a program. By subsequent we mean consecutive during the run time of the program. Not all inter-call paths in a program may be traversed, but all must be accounted for when timing analysis is performed. Timing analysis becomes complicated for inter-call paths, as there can be numerous possible ways in which program flow can take place. For example, for the simple CDG shown in Figure 1, the number of inter-call paths possible is 6. To simplify the task we do not explore the various paths possible for PRED and

[Figure 1 (CDG diagram): CODE nodes B1-B6, CALL nodes C1-C3, PRED nodes P1-P2 with true/false edges, and a LOOP node.]

Figure 1. Inter-call static timing analysis

LivePaths - Set of paths for which the execution time is being calculated.
AppendNew(pathlist,i) - Appends a new path to pathlist. The new path is a copy of path number i in pathlist.
AddDuration(pathlist,dur) - Adds dur cycles to the duration of all the paths in pathlist.
RemovePath(pathlist,i) - Removes path i from pathlist.
Type(node) - Gives the type of node.
NextChild(node,tv) - Gives the next child of node with truth value tv.
LoopReturn(node) - True if node is the LOOP node and is already visited or is being visited.
HasCALL(node) - True if node has a CALL node as one of its descendants.
Duration(node) - Duration of node in cycles.
SDuration(node) - Duration of the entire subgraph with node as the root.
IsLCP(node) - True if node is a loop-closing PRED.
node - current node (the algorithm is started with node=start).
createnew - indicates whether new paths have to be created (initially false).

Figure 2. Definitions of terms used in algorithm

Inter-CallAnalysis(node, createnew)
  if Type(node) = PROC
    next = NextChild(node, TV_T)
    while next do
      Inter-CallAnalysis(next)
      next = NextChild(node, TV_T)
  else if createnew
    for every path p in LivePaths
      AppendNew(LivePaths, p)
  else if LoopReturn(node)
    exit
  if Type(node) = CODE
    AddDuration(LivePaths, Duration(node))
  if Type(node) = PRED
    if not HasCALL(node)
      AddDuration(LivePaths, SDuration(node))
    else
      AddDuration(LivePaths, Duration(node))
      tv = truth value of first child of node
      next = NextChild(node, tv)
      while next do
        Inter-CallAnalysis(next)
        next = NextChild(node, tv)
      if IsLCP(node) and node called first time
        next = NextChild(node, TV_F)
        Inter-CallAnalysis(next, true)
  if Type(node) = LOOP
    if not HasCALL(node)
      AddDuration(LivePaths, SDuration(node))
    else
      AddDuration(LivePaths, Duration(node))
      next = NextChild(node, TV_T)
      while next do
        Inter-CallAnalysis(next, true)
        next = NextChild(node, TV_T)
  if Type(node) = CALL
    for every path p in LivePaths ending in node
      RemovePath(LivePaths, p)

Figure 3. Algorithm for finding inter-call paths

LOOP nodes which do not have any CALL nodes below them. In our example, PRED node P1 has two children, B2 and B3, but no CALL node. Instead of considering the two possible paths available at the PRED node (along either branch), we take the PRED node as a single node whose finite run time is the greater run time of the branches of the PRED node. In the case of LOOP nodes which do not contain CALL nodes, we determine the run time of the entire LOOP node by using data-flow analysis or directions from the user about the loop iteration counts. The user can provide this information through the integration directives (ID) file in our implementation, Thrint. After this simplification, each PRED or LOOP node that does not have any CALL node below it is represented as a single node with a finite run time. The definitions of Figure 2 are used in the algorithm. The algorithm for computing the various inter-call paths is presented in Figure 3. This path information is used to determine the worst-case duration TMaxInterCallProcessing between successive calls to CSFPFs, which in turn determines the amount of idle time available in those functions. This idle time is used to execute the secondary thread.
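To make the path-enumeration idea concrete, here is a simplified C sketch of our own (far less general than the recursive CDG algorithm of Figure 3, and with invented example data): call-free PRED nodes are first collapsed to their worst-case branch, and the resulting straight-line node sequence is scanned for the longest run of cycles between successive CALL nodes.

```c
#include <assert.h>

/* Simplified inter-call timing analysis over an already-collapsed node
   sequence: PRED/LOOP nodes without CALLs below them have been replaced
   by single nodes carrying their worst-case duration. Returns the longest
   run of cycles between successive CALL nodes -- a straight-line
   approximation of T_MaxInterCallProcessing. */

struct node { long duration; int is_call; };

/* Collapsing a call-free PRED node: its worst case is the longer branch. */
long pred_duration(long t_true, long t_false)
{
    return t_true > t_false ? t_true : t_false;
}

long max_intercall(const struct node *nodes, int n)
{
    long cur = 0, max = 0;
    for (int i = 0; i < n; i++) {
        cur += nodes[i].duration;          /* CALL nodes may also consume cycles */
        if (nodes[i].is_call) {
            if (cur > max) max = cur;      /* close the inter-call path at this CALL */
            cur = 0;
        }
    }
    if (cur > max) max = cur;              /* tail path after the last CALL */
    return max;
}

/* Invented example: CODE(10), CALL(5), CODE(20), CODE(7), CALL(3), CODE(4).
   The longest inter-call path is 5+... wait -- it is 20+7+3 = 30 cycles,
   closed at the second CALL. */
static const struct node example[6] = {
    {10, 0}, {5, 1}, {20, 0}, {7, 0}, {3, 1}, {4, 0}
};
```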

2.2. Secondary Thread: Slowdown after Partitioning and Segmentation

The amount of processing performed in the secondary thread per coroutine call depends upon the idle time present in the primary thread functions [Kumar et al. 2004]. It is the sum of the idle time segments (a through b) minus the time for the two context switches (TCS each):

    TSegmentIdle = Σ(i=a..b) TIdle(F, i) − 2 ∗ TCS    (1)

The secondary thread is structured as an infinite loop. We use the worst-case duration of that loop's body (TSec cycles) to measure performance, as it will determine maximum update rates and response times. During integration, that loop body is divided into segments of length TSegmentIdle cycles. Hence it will take an integral number of segments to execute the loop body, and the new duration TSec−Seg will likely rise.

    TSec−Seg = ⌈ TSec / TSegmentIdle ⌉ ∗ TSegmentIdle    (2)

After partitioning the register file between primary and secondary threads, the duration for primary as well as secondary threads may increase because the full register set is unavailable to either thread, leading to increased memory accesses. It is apparent from the relation for TSegmentIdle that increased primary thread duration causes a decrease in the segment duration (from TSegmentIdle to TSegmentIdle−Part., where Part. implies register partitioning). The secondary thread duration increases from TSec to TSec−Part. because the full register set is unavailable.

    TSec−Seg−Part. = ⌈ TSec−Part. / TSegmentIdle−Part. ⌉ ∗ TSegmentIdle−Part.    (3)

Finally, the slowdown is the ratio of the secondary thread's worst-case loop iteration time after it is compiled with a partitioned register file and integrated by Thrint (which includes segmentation) to the original time TSec.

    Slowdown = ( ⌈ TSec−Part. / TSegmentIdle−Part. ⌉ ∗ TSegmentIdle−Part. ) / TSec    (4)
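Equations (2)-(4) reduce to straightforward ceiling arithmetic. The following C sketch (our own helper names, integer cycles, slowdown scaled by 1000 to stay in integer arithmetic) shows the computation:

```c
#include <assert.h>

/* Ceiling of a/b for positive integers: the number of T_SegmentIdle-sized
   segments needed to hold the secondary thread's loop body. */
long ceil_div(long a, long b) { return (a + b - 1) / b; }

/* Eq. (3): worst-case loop-body duration after partitioning + segmentation. */
long t_sec_seg_part(long t_sec_part, long t_seg_idle_part)
{
    return ceil_div(t_sec_part, t_seg_idle_part) * t_seg_idle_part;
}

/* Eq. (4): slowdown versus the original worst-case loop time t_sec,
   scaled by 1000 (so a return value of 1000 means no slowdown). */
long slowdown_permil(long t_sec, long t_sec_part, long t_seg_idle_part)
{
    return 1000 * t_sec_seg_part(t_sec_part, t_seg_idle_part) / t_sec;
}
```

For instance, a 1000-cycle loop body that grows to 1100 cycles after partitioning, executed in 400-cycle segments, needs 3 segments (1200 cycles): a slowdown of 1.2.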

3. Experimental Method

3.1. Target Processor

We target 8-bit microcontrollers (e.g. the ATmega128 [atm]) with the AVR architecture, as used in our previous work [Kumar et al. 2004]. AVR devices have a load-store architecture, a two-stage pipeline, separate on-chip program and data memories, no cache memory and easily predictable timing characteristics. Most instructions take one cycle to execute. The register file has 32 8-bit registers. Of these, sixteen can be used with immediate operands. Of these, three pairs can be used as 16-bit pointers to memory. The remaining 16 registers can be used neither with immediate operands nor as pointers. Coroutine calls take from 34 to 158 cycles, depending upon how many registers must be swapped (each register takes 4 cycles to swap). The AVR architecture has no hardware support for faster switches. (If used on a system with single-cycle cocalls, ASTI would still help by providing automated static scheduling of context switches, significantly improving the development process.) Our target devices support clock rates of up to 20 MHz.

3.2. Tools

C source code for the threads is compiled to AVR assembly code with avr-gcc 3.2 with maximum speed optimization (-O3). We have developed a back-end thread-integrating compiler (Thrint) which processes AVR assembly code generated by avr-gcc. It automatically performs control-flow, data-flow and static timing analyses, visualizes function structures (control-flow and control-dependence graphs) and performs integration using code motion and loop transformations. The outputs are integrated assembly code, statistics (timing, code size, etc.) and program structure graphs. The work in this paper enhances Thrint in two ways: Thrint now calculates secondary thread performance, and it is built into a larger framework which automatically evaluates different register file partitioning approaches. The exploration of register partitioning options is performed by a shell script tool which in turn invokes two other tools. It calls avr-gcc to compile the primary and secondary threads' C code to assembly code based upon a specified register file partition. Thrint then analyzes the primary thread's assembly code statically to determine timing information, which the tool uses to calculate the available idle time (TSegmentIdle). The tool next runs Thrint to perform static timing analysis of the secondary thread and predict its performance based upon TSegmentIdle. Based on this and past performance, the tool tries a modified partitioning scheme.

3.3. Protocols

We have developed C software which implements two common embedded communication protocols.
MIL-STD-1553B [155 1978] is a command-response communication protocol with a bit rate of 1 Mbps. This protocol is used in aircraft. A Master Bus Controller coordinates traffic. Data is transferred in 16-bit words, with parity checking and potential message retransmission. CAN (Controller Area Network) [CAN 1991] is a carrier-sense / multiple-access with collision resolution communication protocol commonly used in automobiles. CAN features strong error control support, including a 15-bit cyclic redundancy check code, bit stuffing, error flag signalling and other features. Bit rates range from 10 kbps to 1 Mbps; for these experiments we use 62.5 kbps.

3.4. Applications

As previously [Kumar et al. 2004], our applications are structured with two threads. A primary thread performs protocol processing and communicates with the secondary thread through shared message buffers. Interrupts are disabled to ensure proper timing is maintained. Each secondary thread is structured as a cyclic executive, with an infinite loop to handle activities. Two secondary threads from our previously developed benchmark suite [Asokan and Dean 2004] are used, with procedure calls inlined. Typical activities include acting upon received messages from the buffer, sampling and responding to inputs, computing outputs (PID controller), driving output signals, and servicing the UART. The host interface thread acts as one side of an embedded network-to-UART

Table 1. Best and Average Slowdowns

Protocol  Activity   Host Interface       PID Controller
                     Best     Average     Best     Average
1553      Send       0.00%    21.7%       0.00%    21.0%
1553      Receive    0.00%    15.4%       0.00%    16.8%
CAN       Send       1.52%    28.8%       1.86%    44.8%
CAN       Receive    0.00%    14.7%      -0.20%    15.3%
bridge [Kumar et al. 2004]. It receives messages from the host microcontroller (via UART) for transmission on the bus and relays messages received from the bus to the host microcontroller. The PID controller application provides a simple closed loop proportional/integral/derivative control system. The feedback input is FIR-filtered for noise cancellation and then linearized before entering the PID control computation.
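As a concrete (hypothetical) picture of what Thrint segments, the loop body of a PID secondary thread like the one described might look as follows in C. The tap count, coefficients and fixed-point gains are illustrative inventions, not the benchmark's actual values:

```c
#include <assert.h>

/* Sketch of the PID secondary thread's loop body: FIR-filter the feedback,
   then run the PID update. In the real thread these steps execute inside an
   infinite cyclic-executive loop, which Thrint cuts into T_SegmentIdle-sized
   segments. All constants here are illustrative. */

#define FIR_TAPS 4

static const int fir_coeff[FIR_TAPS] = { 1, 3, 3, 1 };  /* coefficients sum to 8 */
static int fir_hist[FIR_TAPS];

int fir_filter(int sample)
{
    long acc = 0;
    for (int i = FIR_TAPS - 1; i > 0; i--)   /* shift the sample history */
        fir_hist[i] = fir_hist[i - 1];
    fir_hist[0] = sample;
    for (int i = 0; i < FIR_TAPS; i++)
        acc += (long)fir_coeff[i] * fir_hist[i];
    return (int)(acc / 8);                   /* divide by coefficient sum: unity gain */
}

int pid_update(int setpoint, int feedback)
{
    static long integral;
    static int prev_error;
    int error = setpoint - feedback;
    integral += error;
    int deriv = error - prev_error;
    prev_error = error;
    /* illustrative fixed-point gains: Kp = 2, Ki = 1/8, Kd = 1 */
    return 2 * error + (int)(integral / 8) + deriv;
}
```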

4. Analysis

We use our tools to evaluate the performance of all possible partitionings of the register file between primary and secondary threads. We investigate eight cases, varying the protocol (1553B, CAN), bus activity (send, receive), and secondary thread (host interface, PID controller). The goal is to find the best performance possible; future work will evaluate methods for finding a near-optimal partitioning quickly. Several parameters are varied: the allocation of the three pointer register pairs (zero to three register pairs per thread), the ten immediate-addressing-capable registers (zero to ten registers) and the 16 other registers (zero to 16 each). Some of these partitionings require registers to be swapped in each context switch. For each combination of primary and secondary thread we evaluate 100,980 possible allocations of registers. This exhaustive search takes at least a day on a 2 GHz Pentium 4 processor. We examine the slowdowns and find the smallest for a given register allocation approach. Table 1 shows the best (smallest) slowdown for each of the eight integration cases; in five cases the worst-case performance was unaffected by partitioning and integration (including segmentation). In two cases it worsened slightly (by up to 1.86%) and in one case it actually improved, by 0.2%. Hence, even with their limitations and heterogeneity, the 32 registers present in the AVR register file are more than adequate for supporting efficient execution of these protocols and secondary threads. Table 1 also shows the average slowdown for all of the successful register partitionings. This number is given to provide an idea of what kind of performance might be expected by randomly choosing a partition. The performance is from about 15% to 45% worse than the best case, motivating the search for a better partition. Overall, we find that it is often possible to find a register file partitioning approach which has no performance penalty (i.e. the slowdown ratio is exactly 1).
The exceptions are the CAN Send functions, which have high register pressure (described below). Integrating with the CAN Send thread is most sensitive to smaller register partitions: Other registers are most important, followed by Immediate registers. Integrating with the CAN Receive thread is somewhat sensitive to the number of Immediate registers, and less sensitive to the number of Pointer registers; Other registers have a negligible impact. Integrating with either the 1553B Send or Receive thread is slightly sensitive to Immediate registers; Pointer registers have a lesser impact.

5. Conclusions and Future Work

We present methods to calculate the impact of register file partitioning and integration on a secondary thread's execution speed. This will be used in the future to help derive real-time performance guarantees. We find that the register file of the AVR microcontroller is adequate for ASTI using the threads examined: there is a negligible performance impact, given an exhaustive search of all partitioning approaches. However, average performance ranges from 15% to 45% slower than the original code. Future work includes deriving efficient heuristics to find a near-optimal register partitioning scheme.

6. Acknowledgments

We thank Karthik Sundaramoorthy for enhancing Thrint with code to automatically implement ASTI for the secondary thread.

References

[atm] ATmega128: 8-Bit AVR Microcontroller with 128K Bytes In-System Programmable Flash. Atmel Corporation.

[155 1978] Aircraft Internal Time Division Command/Response Multiplex Data Bus, MIL-STD-1553B. Military Standard, 1978.

[CAN 1991] CAN Specification Version 2.0. 1991.

Asokan, V. and Dean, A. G. (2004). Providing time- and space-efficient procedure calls for asynchronous software thread integration. In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems.

Chaitin, G. J. (1982). Register allocation and spilling via graph coloring. In Proceedings of the SIGPLAN '82 Symposium on Compiler Construction, pages 98-105. ACM.

Conway, M. E. (1963). Design of a separable transition-diagram compiler. Communications of the ACM, 6(7):396-408.

Dean, A. G. (2000). Software Thread Integration for Hardware to Software Migration. PhD thesis, Carnegie Mellon University.

Kumar, N. J., Shivshankar, S., and Dean, A. G. (2004). Asynchronous software thread integration for efficient software implementations of embedded communication protocol controllers. In Proceedings of the 2004 ACM SIGPLAN/SIGBED Conference on Languages, Compilers and Tools for Embedded Systems. ACM Press.

Vangara, S. (2003). Code motion techniques for software thread integration for bit-banged communication protocols. Master's thesis, North Carolina State University.