Asynchronous Software Thread Integration for Efficient Software Implementations of Embedded Communication Protocol Controllers∗

Nagendra J. Kumar†, Siddhartha Shivshankar and Alexander G. Dean
Center for Embedded Systems Research, Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 27695-7256
[email protected], [email protected], [email protected]

∗Sponsored by NSF CAREER award CCR-0133690
† Now at Intel.

ABSTRACT
The overhead of context-switching limits efficient scheduling of multiple concurrent threads on a uniprocessor when real-time requirements exist. Existing software thread integration (STI) methods reduce context switches, but only provide synchronous thread progress within integrated functions. For the remaining, non-integrated portions of the secondary threads to run and avoid starvation, the primary thread must have adequate amounts of coarse-grain idle time (longer than two context-switches). We have developed asynchronous software thread integration (ASTI) methods which address starvation through the efficient use of coroutine calls and integration. ASTI allows threads to make independent progress efficiently and reduces the number of context switches needed through integration. Software-implemented protocol controllers are crippled by this problem; the primary thread “bit-bangs” each bit of a message onto or off of the bus, leaving only fragments of idle time shorter than a bit time. This fragmented time may be too short to recover through context switching, so only the primary thread can execute during message transmission or reception, slowing the secondary threads and potentially making them miss their deadlines. ASTI simplifies the implementation of embedded communication protocols on low-cost, moderate speed (1 - 100 MHz, 8- and 16-bit) microcontrollers. We demonstrate ASTI by replacing a standard automotive communication protocol controller (J1850) with software and generic hardware. Secondary thread performance improves significantly when compared with a traditional interrupt-based software approach.

Categories and Subject Descriptors
D.3.4 [Programming Languages]: Processors—code generation/compilers/optimization; D.4.1 [Operating Systems]: Process Management—concurrency/multitasking/scheduling/threads; D.4.7 [Operating Systems]: Organization and Design—real-time systems and embedded systems
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. LCTES’04, June 11–13, 2004, Washington, DC, USA. Copyright 2004 ACM 1-58113-806-7/04/0006 ...$5.00.
General Terms
Algorithms, Design, Experimentation

Keywords
Asynchronous software thread integration, hardware to software migration, fine-grain concurrency, software-implemented communication protocol controllers, J1850
1. INTRODUCTION

The problem of efficiently allocating a number of concurrent, potentially real-time tasks onto one or a few processors is important to the fields of real-time and multiprocessor scheduling, hardware/software co-synthesis and compilation. Dynamically scheduling and then performing each context switch consumes time, reducing system throughput. However, embedded systems designers can sometimes take advantage of application information to statically schedule or even eliminate context switches. This reduces the instruction throughput required, allowing the use of a slower processor clock or even a less complex and less expensive processor. Slower processors have a lower demand for instruction and data throughput, potentially averting the need for memory hierarchies, superscalar instruction issue and branch prediction. Using a processor without such features simplifies system design, performance prediction and analysis. Lower clock rates lead to benefits such as reduced power and energy consumption, reduced electromagnetic emissions and susceptibility, easier hardware design and debugging, and simplified power supplies and distribution networks. Software thread integration (STI) is a compiler method which interleaves functions from separate program threads, creating implicitly multithreaded functions which provide low-cost concurrency on generic hardware.
One example of an embedded application subsystem with frequent context switches which can be reduced significantly is a network protocol controller. This paper presents a general software structure for communication protocol controllers, shows how STI can be modified into asynchronous STI (ASTI), which allows independent progress among threads, extends ASTI for protocol controller software, and then applies ASTI to a bridge for a common automotive protocol.
1.1 STI and ASTI

Software thread integration (STI) creates a single implicitly multithreaded function from multiple functions [1–4]. STI is useful because it enables the scheduling of multiple processes on a uniprocessor while eliminating most of the context switching overhead. This overhead becomes prohibitively large when implementing threads with fine-grain concurrency (on the order of cycles or tens of cycles). To cross the boundary and schedule one thread's instructions within another we create integrated threads using compiler techniques such as code motion, control-flow transformations, register reallocation, static timing analysis and time-based instruction scheduling. We consider two types of threads. Primary threads contain primary functions, some of which are temporally fragmented and contain timing constraints (release times and deadlines) on specific instructions. These constraints result in idle time when executed on a sufficiently fast processor. There may also be function- or task-level timing constraints. Secondary threads contain functions with no instruction-level timing constraints and hence are not fragmented. The idle time in the fragmented primary functions can be reclaimed by integrating them with other fragmented primary functions and secondary thread functions. Implementations of non-trivial protocols have multiple real-time events within each bit time, as shown in Figure 1a. Typical activities include sensing the bus during arbitration, sampling multiple times per bit for reliability through voting and detecting bus transitions for clock recovery (resynchronization). These events fragment the available idle time. Such a software implementation requires a mechanism for sharing the processor among the primary thread (implementing the protocol) and the secondary thread (implementing other code, such as an interface to the host device, or smart node functions).
This mechanism must transfer control quickly and at the correct time; inefficiencies in either regard will incur significant overhead and limit performance. Existing STI work relies upon the primary threads having large amounts of coarse-grain idle time (defined as being significantly longer than two context-switches) in which any part of the secondary thread can run. Existing STI works by selecting suitable functions of the secondary threads which can be deferred and run synchronously with suitable functions of the primary threads. The remainder of the secondary threads are run within the primary threads' coarse grain idle time. The secondary threads must accumulate deferrable work (e.g. enqueue it) for later processing by the integrated threads. Software-implemented communication protocol controllers (SWICPCs) do not fit this model. As seen in Figure 1a, the primary thread has little or no coarse grain idle time when transmitting or receiving a message. During periods of 100% utilization, the communication bus will force the processor to run its primary thread. Integrated threads will
Figure 1: Traditional method for sharing processor incurs two context switches per idle time fragment. STI eliminates most context switches and also recovers finer grain idle time. (a: timeline of original primary thread execution, with subroutine calls among Executive, ReceiveMessage and ReceiveBit, idle time between reading the bus, sampling three times and voting, resynchronization sampling, and checking for errors, saving the bit and updating the CRC; b: idle time recovered with coroutine calls between the primary and secondary threads; c: idle time recovered with coroutine calls and integration — ASTI primary thread and integrated secondary thread.)
execute if data is available, but eventually will run out of work, and the secondary threads will not proceed until the bus utilization falls below 100% (leading to coarse-grain idle time in the primary thread). It is not practical to integrate all of the secondary threads with the primary thread using this model for two reasons. First, the primary functions with idle time would need to be integrated with every function in the secondary threads. Second, the idle time within the primary functions is very short, so the secondary threads’ work would not fit without extensive modification of either primary or secondary code. Both of these issues raise the complexity of integration to an unmanageable level. This paper presents our solution to this design challenge. We use context switches (specifically coroutine calls) to allow independent and asynchronous progress for the primary and secondary threads. We mitigate the high cost of context switching by reducing the number of these switches through software thread integration. Portions of the code from fragmented primary functions are integrated and duplicated throughout the secondary threads, ensuring that a context switch to the secondary thread will still result in the proper primary code executing at the correct times. In addition, coroutine calls are inserted into the secondary threads to ensure that they yield control back to the primary thread before the idle time expires. These methods eliminate most of the context switches needed to share the processor, enabling recovery of finer-grain idle time for use by the secondary threads. The two benefits of these recovered compute cycles are improved performance of the secondary thread and reduced minimum clock speed for the microprocessor. The former allows more work to be done, while the latter enables the use of a lower cost processor or a more power-efficient operating frequency. 
These benefits enable embedded system designers to use existing processors more efficiently. These techniques are especially useful for implementing communication protocols in software, as designers can implement their code methodically without resorting to ad hoc solutions which are brittle and difficult to change.
1.2 Related Work

Both compiler and design automation researchers have attempted to reclaim idle time by interleaving code of multiple threads. Gupta [5] introduces interleaving of busy-idle profiles for reclaiming idle time but relies upon context switching. A fine-grain coroutine technique called software multithreading uses a compiler to insert coroutine call instructions immediately after long latency instructions, with the goal of increasing performance rather than providing increased concurrency [6]. Another technique called interval scheduling [7] enables static scheduling of operations with delay intervals, and guarantees that all fine-grained max/min timing constraints for all combinations of actual delay values are met. This scheme avoids a run-time executive for scheduling. The design automation community has investigated compiling concurrent descriptions (e.g. Esterel, Petri nets, SDL, Verilog, flow graphs, communicating sequential processes) of systems into software which runs efficiently on a generic processor despite context switching and scheduling overhead. Event-driven approaches require a dispatcher, allowing dynamic execution at the expense of context-switching overhead. VULCAN [8] uses either a coroutine call or a large switch statement to perform the switching. Compiling Esterel to C requires generating context-switching code [9], as does compiling code for event-driven Verilog simulation [10]. This overhead has been reduced with quasi-static or compile-time scheduling [11–17], in which a model (e.g. Petri net) of a concurrent system's specification is converted into a set of concurrent tasks, which in turn are dynamically scheduled (e.g. by a real-time operating system). For example, Picasso [15] provides a theoretic approach which begins with Petri nets and converts them to control-flow graph segments which follow Petri net firing rules. These fragments are generated in C and then compiled for the target machine.
Ptolemy [13] uses quasi-static scheduling of dataflow graphs for multiprocessors. Lin [15] converts C-like code into Petri nets, then joins concurrent processes and extracts acyclic fragments to produce control-flow graph fragments in C, which can then be compiled and optimized.
1.3 Paper Overview

This paper is organized as follows. Section 2 presents our design space target. Section 3 introduces our new ASTI methods. Section 4 demonstrates their use on J1850, a standard automotive embedded network protocol. Section 5 analyzes the resulting system's performance and code size. Section 6 draws conclusions and points toward future work.
2. DESIGN SPACE TARGET
Although ASTI is not tied to a specific class of application, we focus upon software implementations of communication protocol controllers because of the abundance of ad hoc solutions in industry, which indicates the need for a systematic approach.
2.1 Implementing Network Protocol Controllers in Software

Specialized network protocols abound for embedded systems: they exist for land vehicles (CAN, LIN, J1850, TCN), aircraft (1553, DATAC, ARINC-429), process control and factory automation (Profibus, Foundation Fieldbus, SERCOS), building automation (BACNet, LON, DALI) and sensor networks. These applications use buses with speeds
Figure 2: Three types of hardware protocol controllers, and their replacement with a generic MCU. (a: I/O expander; b: discrete protocol controller with MCU; c: MCU with on-board protocol controller; d: generic MCU with ASTI SW protocol controller.)
ranging from under 1 kbps to over 1 Mbps, rather than with high-speed networks which need expensive cables, shielding and transceivers. These embedded network applications can benefit from protocol optimizations for which hardware versions are either unavailable or otherwise unsuitable due to cost, time to market or functionality. For example, SmartDust motes [18] benefit from a custom multihop wireless network communication protocol which reduces power and energy consumption. Tuning the protocol in software is much easier than in hardware. Harsh environments (high radiation or temperature) may also preclude or complicate the use of existing ICs. Obsolescence can also lead to a software implementation. For these reasons there are many software implementations of communication protocols (SWICPs) which reach down to the bit level, eliminating the need for custom hardware. For example, there are infrared protocols for remote controls, software UARTs, modems and protocols such as I2C and RDS, a radio-based data broadcast service [19–26].
2.2 Hardware Protocol Controllers
Hardware protocol controllers are commonly available in three forms. Figure 2a shows the simple “I/O expander” (or “Serially Linked I/O”) device, which connects discrete analog and digital inputs and outputs to a network [27–30]. This device can act upon messages commanding certain output values, and can reply to requests for the state of the inputs. It can also automatically generate a message when specific input conditions occur. Some may even provide local control loop closure. Figure 2b shows a discrete protocol controller, which acts as a bridge between the communication bus and the system's microcontroller, using either a serial (e.g. UART) or parallel link. Figure 2c shows a microcontroller with an on-board protocol controller; this is the most powerful option. Figure 2d shows the architecture of the system based on ASTI. The ASTI techniques are most suitable for replacing the I/O expander and stand-alone protocol controller. This is because integration complexity rises with application code size and the secondary threads have reduced real-time performance. However, a simple application could use ASTI to replace a microcontroller containing an integrated protocol controller with a generic (less expensive) microcontroller.
2.3 Microcontrollers as Protocol Controllers
The processing needed per bit for common embedded network protocols is typically quite small, ranging from 10 to 200 instruction cycles [3, 26, 31–33]. Given the means to achieve adequate concurrency, a simple and inexpensive microcontroller running at a clock frequency of tens to hundreds of times the network bit rate could implement these protocols and perform other processing as well. 8- and 16-bit microcontrollers with clock rates of 1 MHz to 100 MHz are quite inexpensive, with prices ranging from $0.25 (US) to $5.00 (US). ASTI can make it possible to use these MCUs efficiently enough to be SWICPCs, eliminating the need for faster processors and their determinism-sapping microarchitectural features such as caches, superscalar execution and branch prediction. For example, in 2001 75% of the 8 billion microprocessors sold were four- and eight-bit units, all of which lack these features [34]. These microcontrollers are low in cost in comparison with discrete network interface ICs or network processors, offering cost savings on top of flexibility.
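As a rough feasibility check, the cycle budget implied by these numbers can be computed directly. The sketch below is illustrative; the 8 MHz clock and the 10.4 kbps rate (J1850 VPW's nominal bit rate) in the usage example are example values, not measurements from this paper.

```c
/* Rough feasibility check: integer cycles available per bit at a
   given clock and bus rate, and the percentage of each bit time
   left idle after the per-bit protocol work. */
long cycles_per_bit(long f_clock_hz, long f_bus_bps) {
    return f_clock_hz / f_bus_bps;
}

long idle_percent_per_bit(long f_clock_hz, long f_bus_bps,
                          long protocol_cycles) {
    long total = cycles_per_bit(f_clock_hz, f_bus_bps);
    if (protocol_cycles >= total)
        return 0;                  /* protocol saturates the processor */
    return 100 * (total - protocol_cycles) / total;
}
```

For example, an 8 MHz clock and a 10.4 kbps bus leave 769 cycles per bit; with 200 cycles of protocol work per bit, roughly three quarters of each bit time is idle and potentially reclaimable.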
2.4 Software Architectures

Figure 3: Timing details for bit-level primary function. (a: primary function timeline and timing characteristics — First_Sample, Second_Sample, Third_Sample and Resynch_Sample, the cocall to the secondary and the cocall back to the primary, TCS, TSegment and TSegmentIdle, with the intervening primary code removed for integration; b: integration directives:)

TOLERANCE 5 us
CLOCK_FREQUENCY 8 MHz
PROCEDURE RecBit INTO InterfaceFSM USING_COROUTINES
BLOCK Second_Sample AT 25 us
BLOCK Third_Sample AT 35 us
BLOCK Resynch_Sample AT 65 us
END
The baseline software implementation presented here uses a single thread (the primary thread) to perform the protocol work, using functions at three levels. The top level executive function implements a finite state machine to monitor an idle bus, send a message or receive one. The middle level consists of send and receive message functions, which deal with message fields (e.g. sending an identifier, format header, data and CRC). The lowest level deals with sending and receiving individual bits, and also samples the bus multiple times per bit for voting, arbitration and resynchronization. Other bit-level activities for other protocols might include bit-(de)stuffing and updating a running CRC. The secondary thread contains an infinite loop which can handle events or behave as a cyclic executive. This generic structure supports polling and a run-to-completion task model, allowing a variety of approaches.
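A minimal sketch of this three-level structure is shown below. All names are illustrative and the bus-access and timing details are elided; the point is the layering of an executive finite state machine, byte/message assembly, and bit-level voting.

```c
#include <stdint.h>

/* Bit level: majority vote over three bus samples to reject noise. */
int vote3(int s1, int s2, int s3) {
    return (s1 + s2 + s3) >= 2;
}

/* Message level: shift eight voted bits into a byte, MSB first.
   Field handling and CRC updates are omitted. */
uint8_t assemble_byte(const int bits[8]) {
    uint8_t b = 0;
    for (int i = 0; i < 8; i++)
        b = (uint8_t)((b << 1) | (bits[i] & 1));
    return b;
}

/* Top level: executive finite state machine, advanced one step per
   call. Monitors an idle bus, receives or sends a message. */
enum exec_state { EX_IDLE, EX_RECEIVE, EX_SEND };

enum exec_state exec_step(enum exec_state s, int bus_active, int tx_pending) {
    switch (s) {
    case EX_IDLE:    return bus_active ? EX_RECEIVE
                                       : (tx_pending ? EX_SEND : EX_IDLE);
    case EX_RECEIVE: return bus_active ? EX_RECEIVE : EX_IDLE;
    case EX_SEND:    return tx_pending ? EX_SEND : EX_IDLE;
    }
    return EX_IDLE;
}
```

In a real SWICPC each of these levels carries the instruction-level timing constraints described in Section 3; the sketch shows only the call structure.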
3. ASTI METHODS
3.1 Overview

ASTI enhances traditional methods for implementing communication protocols in software by reducing the number of context switches needed (illustrated in Figure 1c). We consider a system with one primary thread and one or more secondary threads. The primary thread has two types of real-time constraints, while we leave real-time analysis of secondary threads for future work. The primary thread's timing constraints are met with static scheduling (STI and ASTI) rather than dynamic scheduling (interrupt, kernel) to reduce run-time overhead. Function-level constraints are attached to calls to certain functions, resulting in call-scheduled primary functions (CSPFs); for example the Receive Bit function must be called at the start of each bit time during message reception. Instruction-level constraints are attached to certain instructions within a function, relative to the start of the function, resulting in fragmented primary functions (FPFs); for example Receive Bit must sample the bus several times and then vote to eliminate noise. In general, input and output operations (resulting in IN and OUT instructions) lead to instruction-level timing constraints. Currently the user marks these in the C source with inlined assembly code labels. These labels are then referenced and given timing requirements in the integration directives file. Functions with both function- and instruction-level constraints are denoted as call-scheduled fragmented primary functions (CSFPFs). Primary functions without either of the constraints are called unscheduled primary functions (UPFs). As the processor speed rises, these timing constraints lead to the appearance of idle time which may be used by the secondary threads. However, the idle time is fragmented; each transition between the primary and secondary threads requires a context switch, consuming valuable processor time. ASTI integrates groups of primary instructions into multiple locations in the secondary threads, reducing the number of context switches needed. This section presents the general ASTI methods as well as SWICPC-specific extensions. ASTI code transformations remove intervening primary code which fragments the idle time in order to create longer time fragments in which the secondary thread can run. Each time an integrated fragmented primary function is called, it uses a coroutine call to run a portion of the secondary thread (a segment) as well. This ensures progress for the secondary thread. The coroutine calls are responsible for saving the context of one thread before switching to the other during the idle time slots. The secondary thread is modified so it always executes a coroutine call back to the primary thread just before the end of this newly enlarged idle time. Using previously-introduced methods [2, 4, 35], the intervening primary code is integrated into each segment of the secondary thread such that it executes at the correct times regardless of the secondary thread's progress.
Conditionals in the primary code and secondary thread segments are padded to eliminate timing jitter, the code is statically analyzed for timing behavior, and then the primary code regions are inserted at the appropriate times. Integration involves control-dependence graph traversal, and transformations are cumulative. Non-looping primary code regions are handled individually. Moving a region into a conditional requires replicating it into both sides, while entering a loop requires either guarding the execution with a conditional which triggers on a specific loop iteration or else splitting and peeling the loop. Looping primary function regions are unrolled and treated as non-looping regions unless they overlap with secondary function loops. In the latter case the overlapping iterations of the two loops are unrolled as needed to match the secondary function loop body work to available idle time in the primary function loop body. This increases efficiency. We assume the processor clock runs at fClock and instructions take a predictable number of cycles. For software-implemented communication protocols we define the bus bit rate as fBus. One bit lasts TBit = 1/fBus or an integral multiple thereof (as later for J1850). The bit-level functions are the only fragmented primary functions in the system, and they must be called periodically (once per bit) during message transmission and reception. We target applications with only one hard real-time thread (the primary thread, used for the communication protocol), although recent extensions to STI [1] support multiple hard-real-time primary threads. During message transmission or reception, any other (secondary) threads will only run in the available idle time, slowing them to at best TIdlePerBit/TBit of their original speed. (Note that this is an improvement over the non-integrated case, where threads may be completely blocked.) We leave performance analysis for these threads for future work.
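The coroutine (cocall) mechanism described above can be sketched on a workstation with POSIX ucontext, standing in for the few-instruction hand-written register save/restore a real 8- or 16-bit MCU cocall would use. The names and the trace encoding below are invented for illustration: per "bit", the primary thread does its protocol work and then cocalls the secondary thread, which runs one segment and yields back.

```c
#include <ucontext.h>

/* Trace of which thread ran: 1 = primary bit-level work,
   2 = one secondary segment. Global so behavior is observable. */
int asti_trace[16];
int asti_trace_len;

static ucontext_t pri_ctx, sec_ctx;
static char sec_stack[16384];

/* Secondary thread: do one segment of work, then cocall back to
   the primary before the idle time expires. It only progresses
   when the primary yields to it. */
static void secondary_thread(void) {
    for (;;) {
        asti_trace[asti_trace_len++] = 2;
        swapcontext(&sec_ctx, &pri_ctx);   /* cocall back to primary */
    }
}

/* Primary thread: per bit, do protocol work, then use the enlarged
   idle time to cocall the secondary thread. Returns trace length. */
int run_bits(int bits) {
    if (bits > 8) bits = 8;                /* keep trace in bounds */
    getcontext(&sec_ctx);
    sec_ctx.uc_stack.ss_sp = sec_stack;
    sec_ctx.uc_stack.ss_size = sizeof sec_stack;
    sec_ctx.uc_link = &pri_ctx;
    makecontext(&sec_ctx, secondary_thread, 0);
    asti_trace_len = 0;
    for (int b = 0; b < bits; b++) {
        asti_trace[asti_trace_len++] = 1;  /* bit-level protocol work */
        swapcontext(&pri_ctx, &sec_ctx);   /* cocall to secondary */
    }
    return asti_trace_len;
}
```

Running three "bits" yields the alternating trace 1, 2, 1, 2, 1, 2: each cocall pair costs two context switches, which is exactly the overhead ASTI then reduces by integrating the intervening primary code directly into the secondary segments.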
3.2 ASTI Procedure

• Perform static timing analysis of fragmented primary functions and then regularize timing by padding all execution paths through conditionals with nops to equal durations. Loops with unknown iteration counts are not supported with current methods, although we believe this limitation can be removed for many cases, as in our previous work. For SWICPCs, this has not been a significant constraint to date.
• We may extend the duration of a fragmented primary function in order to reclaim idle time more effectively. Consider a system with a fragmented primary function Fj which runs at most as frequently as every τ(Fj), and is invoked as a thread by a kernel or is called simply as a function by an existing thread. Fj requires at most TWCET(Fj) of processing time but also contains Tidle(Fj) of idle time. There is other primary thread processing required which takes at most Σi≠j TWCET(Fi), and which must be executed before the next invocation of Fj. Thus TCoarseIdle = τ(Fj) − Σi≠j TWCET(Fi) − TWCET(Fj) − Tidle(Fj) is the coarse-grain idle time, which is processing time available for secondary threads. However, to use this time for such work, two context switches must be executed (to switch to the secondary thread and then back). This overhead is a significant bottleneck when seeking efficient code. Ultimately the overhead becomes larger than the available processing time, eliminating the possibility of reclaiming it for the secondary threads. However, by delaying the other primary thread processing Σi≠j TWCET(Fi), TCoarseIdle is moved up and joined with the fragmented primary function Fj. This makes the idle time contiguous (after integration) and eliminates two context switches, improving throughput. This also simplifies integration, reducing the number of cases to consider.
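The expression for TCoarseIdle can be made concrete with a small helper. This is a sketch; all cycle counts in the usage are invented.

```c
/* T_CoarseIdle = tau(Fj) - sum_{i != j} T_WCET(Fi)
                - T_WCET(Fj) - T_idle(Fj).
   All times are in processor cycles. */
long coarse_idle(long tau_fj, const long *other_wcet, int n_other,
                 long wcet_fj, long idle_fj) {
    long sum_other = 0;
    for (int i = 0; i < n_other; i++)
        sum_other += other_wcet[i];        /* sum over i != j */
    return tau_fj - sum_other - wcet_fj - idle_fj;
}

/* Reclaiming the coarse idle time via context switching only pays
   off when it exceeds the cost of the two switches. */
int worth_switching(long t_coarse_idle, long t_cs) {
    return t_coarse_idle > 2 * t_cs;
}
```

For instance, with τ(Fj) = 1000 cycles, other primary work of 100 and 150 cycles, TWCET(Fj) = 300 and Tidle(Fj) = 200, the coarse idle time is 250 cycles; it is only worth recovering by context switching if TCS is under 125 cycles.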
3.2.1 Analyze and Transform CSPF- and CSFPF-Calling Functions

The functions which call CSPFs and CSFPFs must schedule those calls to occur at the required times (e.g. in a SWICPC, bit-level function calls must occur at multiples of TBit). In addition, there may be processing between the calls, reducing the time available for the secondary thread.

• Perform tree-based static timing analysis [36] on CSPF- and CSFPF-calling functions, and pad timing variability in conditionals to equalize path durations.

• Find maximum time TMaxInterCallProcessing between calls to CSPFs and CSFPFs. Pad so each call is separated by this amount in order to standardize the amount of idle time available for each fragmented primary function. We have investigated methods to reduce TMaxInterCallProcessing for SWICPCs through code motion [37] but do not apply them in the current work.

• Note that loops with unknown iteration counts are not allowed between calls to CSPFs and CSFPFs, as they lead to an unknown TMaxInterCallProcessing. However, such loops are allowed if they contain one or more calls to the bit-level functions.
3.2.2 Analyze and Transform Fragmented Primary Functions

The fragmented primary functions must be examined to determine quantities of work and idle time.
• Create a timeline of the processor activity (e.g. Figure 3a) for each of the fragmented primary functions. In the case of SWICPCs, the bit-rate and processor clock frequency must be defined. Gaps between instructions with timing constraints form idle time fragments, Tidle(F, i). As in our previous work, we use integration directives in a control file (Figure 3b) to specify the desired start time (as a delay from procedure entry) and tolerable error for each basic block holding a real-time instruction. Our thread-integrating compiler back-end (Thrint) reads these directives and uses them to guide integration.

• Find the first and last fragments of idle time (a and b) in the fragmented primary function long enough for a single cocall to execute in each (a = min(i) | Tidle(F, i) > TCS, b = max(i) | Tidle(F, i) > TCS).

• The fragmented primary functions are modified by first removing the code between idle time fragments a and b and saving it for later replication into the secondary thread (as shown in Figure 3a).
• A coroutine call is inserted at the beginning of the newly enlarged idle time in the fragmented primary functions to transfer control to the secondary thread.
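Locating fragments a and b might look like the following sketch, where idle fragment durations are in cycles and the values used in the tests are illustrative.

```c
/* a = min(i) with T_idle(F, i) > T_CS; returns -1 if none qualifies. */
int first_long_fragment(const long *t_idle, int n, long t_cs) {
    for (int i = 0; i < n; i++)
        if (t_idle[i] > t_cs)
            return i;
    return -1;
}

/* b = max(i) with T_idle(F, i) > T_CS; returns -1 if none qualifies. */
int last_long_fragment(const long *t_idle, int n, long t_cs) {
    for (int i = n - 1; i >= 0; i--)
        if (t_idle[i] > t_cs)
            return i;
    return -1;
}
```

With fragments {5, 40, 10, 12, 50, 5} and TCS = 20 cycles, a = 1 and b = 4; the short fragments between them need not individually hold a cocall, since their time is recovered by integration.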
3.2.3 Determine Cocall Period

The granularity and distribution of idle time fragments determines how much secondary thread processing will be performed per cocall. For SWICPC, we also consider the timing information from the message-level and fragmented primary functions, the desired processor clock rate and network bit rate.
TSegmentIdle = Σ(i=a..b) Tidle(F, i) − 2 × TCS
The segment idle time TSegmentIdle is the amount of time available for processing the secondary thread on each coroutine call, given that TCS is the time for a context switch. It includes all idle time fragments from a through b minus two context switches. Note that it also includes fragments shorter than TCS , allowing fine-grain idle time to be used. Only fragments a and b must be at least as long as TCS . In a later step, the secondary thread will be sliced into segments of duration TSegmentIdle and each segment will be perforated with primary thread code and a coroutine call.
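A direct transcription of this formula (fragment durations illustrative):

```c
/* T_SegmentIdle: sum of idle fragments a..b minus two context
   switches. Fragments between a and b may be shorter than T_CS;
   their time is still recovered. */
long segment_idle(const long *t_idle, int a, int b, long t_cs) {
    long sum = 0;
    for (int i = a; i <= b; i++)
        sum += t_idle[i];
    return sum - 2 * t_cs;
}
```

For the example fragments {5, 40, 10, 12, 50, 5} with a = 1, b = 4 and TCS = 20 cycles, TSegmentIdle = 40 + 10 + 12 + 50 − 40 = 72 cycles of secondary work per cocall.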
3.2.4 Integrate Fragmented Primary Code into Secondary Thread

The secondary thread is next modified to execute the intervening primary code at the correct times and also yield control back to the primary thread TSegment after being cocalled.

• Pad away timing variations in the secondary thread loop body.

• Pad blocking loop bodies to last exactly TSegmentIdle per iteration. These polling loops are common for waiting for data to arrive at a UART, for example. Padding the loop bodies ensures that only one iteration executes per segment.

• Integrate a coroutine call every TSegmentIdle.

• Integrate the primary thread code extracted from the fragmented primary function into each segment of the secondary thread using standard software thread integration techniques and the algorithms in Figure 4.

The result of this process is illustrated in Figure 1c. For reasons of expediency, the methods presented here do not support function calls in the secondary thread. This is because multiple calls to a single function may occur at different times relative to the start of the segment, requiring integration of the same primary code at different locations. One solution is to inline all function calls in the secondary thread, leading to potential code size explosion. An alternative is to delay such calls so they occur at the same time relative to the start of the segment, leading to reduced throughput. We have developed methods [38] for automatically selecting between function inlining and creating function clones which service clusters of function calls but do not apply them in this current work.
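The slicing of the secondary thread into TSegmentIdle-sized segments can be approximated by the following sketch, which counts how many cocall segments a straight-line sequence of secondary basic blocks will span. Cycle costs are invented; the real tool operates on the control-dependence graph and pads with nops.

```c
/* Count cocall segments of t_segment_idle cycles needed by a
   straight-line sequence of secondary basic blocks, assuming a
   block is deferred to the next segment when it does not fit. */
int count_segments(const long *block_cycles, int n, long t_segment_idle) {
    long used = 0;
    int segments = 1;
    for (int i = 0; i < n; i++) {
        if (used + block_cycles[i] > t_segment_idle) {
            segments++;            /* cocall here; start a new segment */
            used = 0;
        }
        used += block_cycles[i];
    }
    return segments;
}
```

Three 30-cycle blocks with a 72-cycle segment budget span two segments: two blocks fit in the first segment, the third is deferred past the cocall.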
3.2.5 Handling Multiple Fragmented Primary Functions and Other Issues

Supporting multiple fragmented primary functions requires a method to switch between the primary functions and still
Copy_and_Integrate_Subgraph(target_offset, master_code) {
  /* Method called on a node to integrate in the master_code subgraph
     at target_offset cycles from the beginning of this node */
  start_t = end_t = 0
  cur_node = this node
  while !Done && cur_node
    end_t += worst case duration of cur_node
    if target_offset > end_t           // go to next node
      start_t += worst case duration of cur_node
      cur_node = cur_node->successor
    else
      if target_offset == start_t
        insert copy of master_code before cur_node
      else if target_offset == end_t
        insert copy of master_code after cur_node
      else                             // need to transform cur_node, integrate into it
        switch type of cur_node
          CODE:
            split cur_node at target_offset - start_t
            insert copy of master_code after cur_node
          PREDICATE:
            call this method recursively on 1st true and 1st false child
          LOOP:
            split loop at target_offset - start_t
            peel middle iteration of loop
            call this method recursively on first node of peeled iteration
      Done = true
}

Integrate_Asynchronous_Threads( ) {
  /* Method called to integrate multiple copies of primary code and
     cocall into secondary thread */
  interv_primary = extract intervening code from primary function
  cocall_21 = load cocall code (from sec. to pri.) from file
  secondary_start = find first child node of infinite loop in secondary function
  offset_time = start time of secondary_start
  pad loop to multiple of cocall period
  while (offset_time < end of infinite loop's first iteration)
    secondary_start->Copy_and_Integrate_Subgraph(offset_time, interv_primary)
    secondary_start->Copy_and_Integrate_Subgraph(offset_time, cocall_21)
    offset_time += TSegmentIdle
}
Figure 4: Copy and Integrate Subgraph method integrates a subgraph based on a time offset from a given node. Integrate Asynchronous Threads integrates intervening primary code and cocalls into the secondary thread every TSegmentIdle .
make correct progress in the secondary thread, executing the correct secondary instructions the correct number of times. There are two methods: each fragmented primary function can be integrated with its own copy of the secondary thread, or else all of the primaries can be integrated with a single instance of the secondary thread, using predicates to guard the execution of each function's integrated code, cocalls and padding. These solutions are not described further here; details appear in [39]. In this paper we use the first method. It is possible to extend these methods to support multiple primary real-time threads, but this and the required timing analysis are left for future work. We have developed other supporting methods which are not presented here due to space constraints, but appear elsewhere [39]. Coroutine calls and intervening primary thread code can be guarded with predicates to enable the use of a single copy of the host interface function rather than the multiple copies used here, reducing code expansion. Methods for receiver clock resynchronization based upon detecting bus transitions during message reception are introduced to allow greater tolerance when integrating.
[Figure 5 shows the internal architecture: the bridge MCU runs the ASTI software, with the primary thread (J1850) driving the J1850 bus through digital input and output lines, and the secondary thread (interface) communicating with the system MCU through message queues and a UART.]
Figure 5: ASTI code on a generic microcontroller provides J1850 protocol functionality to system microcontroller.
[Table 1 lists each ASTI step with its automation status (manual vs. automatic): parse assembly code and form CDGs; perform static timing analysis on all functions; pad timing jitter in primary thread functions; analyze and extract intervening primary code from fragmented primary functions; analyze secondary thread and pad timing jitter (all code but blocking loops; blocking loops); integrate cocalls and intervening fragmented primary code (all code but blocking loops; blocking loops); generate final integrated assembly file.]
Table 1: Automation status of asynchronous STI steps
4. EXPERIMENTS
We have written C code to implement an RS232-J1850 communication bridge with transmit, receive and in-frame response support and message queues to support inter-thread communication. Figure 5 presents the internal software architecture of the J1850 bridge. We use the architecture of Figure 2d (including the optional system MCU) to replace that of Figure 2b. The protocol thread runs on a generic microcontroller to replace the protocol controller IC. This thread is integrated with another “host interface” thread which communicates with the system MCU through a UART. We have integrated the send bit and receive bit primary thread functions with the host processor interface function. We have tested these functions through simulation to ensure their correctness and timing accuracy.
4.1 J1850 Protocol
J1850 is a standard automotive communication network [40]. There are two versions: 41.6 Kb/s Pulse Width Modulation (PWM) and 10.4 Kb/s Variable Pulse Width (VPW). The latter is more prevalent and is used here. VPW communicates via time-dependent symbols. Each symbol has a single transition, can be at either the ACTIVE or PASSIVE level, and has one of two lengths, 64 µs or 128 µs (tNOM at the 10.4 Kb/s baud rate), depending on the encoding of the previous bit. The J1850 protocol uses carrier-sense multiple-access with lossless collision resolution (CSMA/CR) arbitration. When a transmitting node sends a passive symbol but sees an active symbol, it determines that a higher-priority message is being transmitted by another node; having lost arbitration for the bus, it must stop transmitting and continue to function as a receiver. The node(s) that won arbitration continue to transmit, checking each bit of the current message and dropping out when necessary until just one node is left. This checking is done for every bit. A message has a maximum length of 12 bytes, plus framing symbols. A CRC byte is used to detect errors during communication.
4.2 Experimental Method

Thread     Function   C Source Size   Compiled (.text) Size
Primary    send_msg   33 lines        458 bytes
Primary    rec_msg    40 lines        966 bytes
Primary    send_bit   43 lines        146 bytes
Primary    rec_bit    13 lines        76 bytes
Secondary  host_int   180 lines       857 bytes

Table 2: J1850 Source Code Sizes

[Figure 6 timelines: Send_bit executes 74 cycles of code, sampling the bus to verify transmission, with idle time introduced to execute a return as late as possible, within 328 cycles (for a 64 µs signal). Receive_bit has idle slots of 158, 36 and 239 cycles around the 1st, 2nd and 3rd bus samples, again with idle time introduced to execute a return as late as possible.]
Figure 6: Send bit and Receive bit timelines show idle time in J1850 implementation

We target the AVR architecture, which is load-store, has an eight-bit native data size, a register file of 32 general-purpose registers, on-chip single-cycle program and two-cycle data memories (up to 128 kbytes and 4 kbytes respectively) and no caches. We target a generic low-performance 8 MHz AVR microcontroller which features a UART (e.g. the Atmega103). External circuitry is not simulated due to its simplicity; the J1850 interface consists of a wired-OR transistor driver for output and a comparator for input, while the serial interface requires merely an RS232 level-shifter IC. The protocol's C source code is compiled to assembly code with AVR-GCC v3.0 [41] at -O1 optimization (at higher levels of optimization GCC creates unstructured code, which our tools do not yet support automatically). Our research compiler Thrint performs static timing analysis of assembly code, padding, and most of the integration. As shown in Table 1, most of the code transformations needed for integration are performed automatically by Thrint. Code is simulated using AVR Studio v3.52 [42]. Bit-, message- and control-level functions were written in C for J1850 and are described in Table 2. This code implements protocol functionality but is not initially scheduled for proper timing (e.g. with padding or timers). The bit-level functions of the send and receive routines of J1850 access the bus multiple times while sending or receiving a bit. While transmitting a bit, the send_bit function accesses the bus twice, once to transmit the bit and then to check for arbitration. The rec_bit function samples the bus every 64 µs. In order to sample the value on the bus correctly, it samples the bus thrice and then votes using the
[Figure 7 plots the cycles available for the secondary thread against message size (2 to 12 bytes) for STI Send, ISR Send, STI Receive and ISR Receive.]
Figure 7: With ASTI, secondary thread makes more progress during message activity than with interrupts

[Figure 8 plots code size in bytes for the discrete and integrated versions, broken down into Host_Interface, Send_Message, Send_Bit, Receive_Message, Receive_Bit, intervening primary send and receive code, cocalls, and send and receive padding.]
Figure 8: Code expansion after integration using padding loops
samples received. This multiple sampling fragments the idle time. The timelines for the send and receive bit functions of J1850 are shown in Figure 6. The secondary thread consists of a function (host_int) which handles bridge-to-host-processor communication over a serial link connected to a UART. The thread receives and responds to various commands from the host processor (enqueue message to transmit, dequeue received message, query queue status, and flush queues). For simplicity, registers for an inactive thread are stored on that thread's stack. During a coroutine call, all 32 data registers and the status register of the calling thread are pushed on its stack and those of the other thread are popped from the corresponding stack, so the coroutine call lasts 152 cycles. For the send routine, the first idle time slot is less than 152 cycles, so it cannot be utilized. For the receive routine, the first idle time slot lasts 158 cycles and the last idle slot lasts 239 cycles. Hence, by removing the intervening primary code, the entire idle time within the receive bit function can be used. Performance could be improved significantly by partitioning some or all of the register file, reducing context-switching time by four cycles per register dedicated to a thread. This would also allow finer-grain idle time to be recovered. As the AVR ISA features 32 registers, register pressure is not expected to reduce performance significantly. We have examined the sensitivity of execution time to register pressure for graphics rendering primitives compiled for the AVR and in general found it to be acceptable [1]. Processors with smaller register sets will display more sensitivity. In order to account for clock differences between the sender and receiver, resynchronization is required during reception. It is performed dynamically by either extending or shortening the padding, depending upon the time of the input bus signal transition.
5. ANALYSIS OF RESULTS

5.1 Performance Improvement

ASTI improves secondary thread performance compared to traditional approaches because it is better able to use primary thread idle time. We compare performance to an interrupt-based approach in which a context switch takes 151 cycles, nearly the same as the 153 cycles of the context switch of a coroutine call. (Either of these times could be improved by partitioning the register file or by using a processor with multiple register banks.) The integrated code does not need to perform as many context switches and can use shorter idle time slots, enabling the secondary thread to make progress even during periods of 100% bus utilization. We compare the number of CPU cycles available to the secondary thread as a function of message size in Figure 7. These figures are calculated based upon static timing analysis of the code in question. We present the average case, as an actual bit's duration may be 64 or 128 µs. Progress through the secondary thread is much faster using the new STI methods than with the interrupt-based approach. The integrated code enables the processor to spend approximately twice as much time on the secondary thread while transmitting. The case of reception is even more striking. Because the largest idle time fragment in receive_bit lasts only 239 cycles, there is not enough time to switch to and return from a secondary thread, so no secondary thread progress can be made using an interrupt-based approach. The integrated code is able to dedicate time to the secondary thread, guaranteeing progress even during 100% message reception.
5.2 Verification Techniques
To ensure that the code transformations are correct, the integrated code is simulated with AVR Studio [42] and log files (which record output port signals) are generated. The simulated microcontroller clock speed is 8 MHz. These log files are then compared, transition by transition, with known good log files created from simulations using the original (non-integrated) code. The timing of this code has been configured to match the J1850 specification exactly and represents what a dedicated hardware J1850 controller would send or receive. The transmission and reception of 10 random J1850 messages were simulated. In each case the messages were correctly transmitted, received and decoded, and the timing jitter was within acceptable limits, as defined in the J1850 specification.
5.3 Code Expansion

Memory is a limited resource in low-end embedded microcontrollers, so we wish to minimize the amount used. ASTI
leads to code expansion, mainly due to nop padding and code replication. As seen in Figure 8, the integrated code size for send or receive increases by less than 50% (compared with the original pre-integrated code), and within this, padding accounts for most of the expansion. The total code size for the system using integration also rises due to the additional copy of the secondary thread. Before integration, 2503 bytes are required for the code (ignoring padding to schedule real-time instructions); after integration, 4878 bytes are used. This code expansion, although significant, is acceptable given the performance increase for the secondary thread and the resulting ability to drop the system's clock rate by at least a factor of two.
6. CONCLUSIONS AND FUTURE WORK
Software implementations of protocol controllers introduce periods of processor inactivity, or idle time, within the threads implementing the controller. Due to the multiple real-time constraints imposed on the threads, this idle time is fragmented. It can be utilized by enabling another thread to execute during it, but current techniques for context switching take too much time to perform this multitasking efficiently. This paper introduces a scheme called asynchronous software thread integration for efficiently utilizing the idle time by minimizing the number of context switches required per bit. Coroutine calls are used to perform the context switching. The key idea introduced is the removal of intervening primary code, which not only exposes more idle time for secondary thread work but also requires that only two context switches be performed between transmission or reception of bits. The intervening primary code is inserted at the appropriate instants within the host interface thread. Thus the scheme can be used to extract even fine-grain idle time and use it for host interface thread work. A standard automotive network communication protocol (J1850) is implemented using ASTI on an 8 MHz, 8-bit microcontroller. The resulting code, when simulated, executes the secondary thread much faster (at least 100% faster) than an interrupt-based approach. Several application-independent improvements to ASTI are possible. We are developing methods to support efficient function calls in the secondary thread through cloning and inlining [43]. Using register file partitioning instead of saving registers on the stack would reduce the number of cycles required to perform a coroutine call, improving performance and enabling the recovery of finer-grain idle time. Total code size could be reduced by integrating all fragmented primary functions into a single secondary thread.
Performance estimates for the secondary thread could be derived statically, enabling prediction of throughput and response times for real-time systems. The coroutine calls could be extended to feature simple scheduling mechanisms, allowing support of multiple secondary threads. ASTI methods could be evaluated on processors with architectural support for faster context switches or built-in scheduling. Various improvements specific to the application of communication protocols can also be made. We have worked in the direction of equalizing inter-bit-call times through code motion rather than padding to improve performance [37]. Methods for automatically inserting data clock recovery code would also be useful. Rapid prototyping of application-specific communication protocols could leverage ASTI and a library of components ready to integrate.
7. REFERENCES

[1] B. Welch, S. Kanaujia, A. Seetharam, D. Thirumalai, and A. G. Dean, “Extending STI for demanding hard-real-time systems,” in Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems. ACM Press, November 2003, pp. 41–50.
[2] A. G. Dean, “Compiling for concurrency: Planning and performing software thread integration,” in Proceedings of the 23rd IEEE Real-Time Systems Symposium, Austin, TX, Dec 2002.
[3] A. G. Dean and R. R. Grzybowski, “A high-temperature embedded network interface using software thread integration,” in Second Workshop on Compiler and Architectural Support for Embedded Systems, Washington, DC, October 1999.
[4] A. G. Dean and J. P. Shen, “Techniques for software thread integration in real-time embedded systems,” in Proceedings of the 19th Symposium on Real-Time Systems, Madrid, Spain, December 1998, pp. 322–333.
[5] R. Gupta and M. Spezialetti, “Busy-idle profiles and compact task graphs: Compile-time support for interleaved and overlapped scheduling of real-time tasks,” in 15th IEEE Real-Time Systems Symposium, 1994.
[6] C. J. Beckmann, “Hardware and Software for Functional and Fine Grain Parallelism,” Ph.D. dissertation, University of Illinois at Urbana-Champaign, April 1993. [Online]. Available: citeseer.nj.nec.com/beckmann93hardware.html
[7] P. Chou and G. Borriello, “Interval scheduling: Fine grained code scheduling for embedded systems,” in Proceedings of the Design Automation Conference, June 1995, pp. 462–467.
[8] R. K. Gupta and G. De Micheli, “A co-synthesis approach to embedded system design automation,” Des. Autom. Embedded Syst., vol. 1, no. 1-2, pp. 69–120, 1996.
[9] S. A. Edwards, “Compiling Esterel into sequential code,” in Design Automation Conference, 2000, pp. 322–327. [Online]. Available: citeseer.nj.nec.com/edwards00compiling.html
[10] R. S. French, M. S. Lam, J. R. Levitt, and K.
Olukotun, “A general method for compiling event-driven simulations,” in Design Automation Conference, 1995, pp. 151–156. [Online]. Available: citeseer.nj.nec.com/french95general.html
[11] E. A. Lee, “Recurrences, iteration, and conditionals in statically scheduled block diagram languages,” in VLSI Signal Processing III, R. W. Brodersen and H. S. Moscovitz, Eds. IEEE Press, 1988, pp. 330–340.
[12] C. Loeffler, A. Lightenberg, H. Bheda, and G. Moschytz, “Hierarchical scheduling systems for parallel architectures,” in Proceedings of Euco, September 1988.
[13] S. Ha and E. Lee, “Compile-time scheduling of dynamic constructs in dataflow program graphs,” 1997. [Online]. Available: citeseer.ist.psu.edu/ha97compiletime.html
[14] M. Sgroi, L. Lavagno, Y. Watanabe, and A. L. Sangiovanni-Vincentelli, “Quasi-static scheduling of embedded software using equal conflict nets,” in
ICATPN, 1999, pp. 208–227. [Online]. Available: citeseer.nj.nec.com/sgroi95quasistatic.html
[15] B. Lin, “Efficient compilation of process-based concurrent programs without run-time scheduling,” in Proceedings of the Conference on Design, Automation and Test in Europe. IEEE Computer Society, 1998, pp. 211–217.
[16] E. A. Lee and D. G. Messerschmitt, “Static scheduling of synchronous data flow graphs for digital signal processing,” IEEE Transactions on Computers, January 1987.
[17] J. Cortadella, A. Kondratyev, L. Lavagno, C. Passerone, and Y. Watanabe, “Quasi-static scheduling of independent tasks for reactive systems,” Design Automation Conference, June 2000. [Online]. Available: citeseer.nj.nec.com/541753.html
[18] J. Hill, R. Szewczyk, A. Woo, S. Hollar, D. E. Culler, and K. S. J. Pister, “System architecture directions for networked sensors,” in Architectural Support for Programming Languages and Operating Systems, 2000, pp. 93–104. [Online]. Available: citeseer.nj.nec.com/382595.html
[19] M. Embacher, “Replacing dedicated protocol controllers with code-efficient and configurable microcontrollers – low-speed CAN network applications,” National Semiconductor, Aug. 1996.
[20] S. George, “HC05 software-driven asynchronous serial communication techniques using the MC68HC705J1A,” Motorola Semiconductor.
[21] G. Goodhue, “A software duplex UART for the 751/752,” Philips Semiconductors.
[22] T. F. Herbert, “Integrating a soft modem,” Embedded Systems Programming, vol. 12, no. 3, pp. 62–74, Mar. 1999.
[23] S. Holland, “Low-cost software Bell-202 modem,” Circuit Cellar, no. 107, pp. 12–19, June 1999.
[24] N. Naufel, “Interfacing the 68HC05C5 SIOP to an I2C peripheral,” Motorola Semiconductor.
[25] T. Breslin, “Application note: 68HC05K0 infra-red remote control,” Motorola Semiconductor, 1997.
[26] AVR308: Software LIN Slave, Atmel Corporation, May 2002.
[27] P82C150 CAN Serial Linked I/O Device (SLIO) with Digital and Analog Port Functions Data Sheet, Philips Semiconductors, Jun 1996.
[28] MCP2502X/5X CAN I/O Expander Family Data Sheet, Microchip Technology, Inc., Aug 2000.
[29] MIC74 2-Wire Serial I/O Expander and Fan Controller, Micrel, Inc., Aug 2000.
[30] PCF8574 Remote 8-bit I/O Expander for I2C-bus, Philips Semiconductors, Nov 2002.
[31] AVR304: Half Duplex Interrupt Driven Software UART, Atmel Corporation, Aug 1997.
[32] AVR320: Software SPI Master, Atmel Corporation, May 2002.
[33] AVR410: RC5 IR Remote Control Receiver, Atmel Corporation, May 2002.
[34] E. Nisley, “Rising tides,” Dr. Dobb’s Journal, vol. 346, Mar 2003.
[35] A. G. Dean, “Software thread integration for hardware to software migration,” Ph.D. dissertation, Carnegie Mellon University, February 2000.
[36] S. Malik, M. Martonosi, and Y.-T. S. Li, “Static timing analysis of embedded software,” in Proceedings of the 34th Conference on Design Automation (DAC-97). NY: ACM Press, June 1997, pp. 147–152.
[37] S. Vangara, “Code motion techniques for software thread integration for bit-banged communication protocols,” Master’s thesis, North Carolina State University, May 2003.
[38] V. Asokan and A. Dean, “Relaxing control flow constraints in ASTI,” Center for Embedded Systems Research, North Carolina State University, Tech. Rep. CESR-TR-03-3, Jul 2003.
[39] N. J. Kumar, “STI concepts for bit-bang communication protocols,” Master’s thesis, North Carolina State University, Jan 2003.
[40] SAE J1850 Class B Data Communication Network Interface, Society of Automotive Engineers, 1992.
[41] avr-gcc 3.2. [Online]. Available: http://www.avrfreaks.net/AVRGCC/index.php
[42] AVR Studio 3.52, Atmel Corporation. [Online]. Available: ftp://www.atmel.com/pub/atmel/astudio3.exe
[43] V. Asokan, “Relaxing control flow constraints in ASTI,” Master’s thesis, North Carolina State University, Jul 2003.