A Compiler-Based Approach to Protocol Optimization

Claude Castelluccia, Philipp Hoschka

INRIA Centre de Sophia Antipolis, 2004 Route des Lucioles, BP-93, 06902 Sophia Antipolis Cedex, FRANCE. e-mail: ccastel|[email protected]

Second Workshop on High Performance Communication Subsystems, Mystic, Connecticut, U.S.A., August 1995.

1 Introduction

To arrive at an end-system protocol with sufficient performance for high-speed network communication, researchers have followed two different approaches:

manual code optimization (e.g. [Jac88], [Cla89]): With this approach, a set of well-known code optimisation techniques such as branch prediction (also referred to as header prediction) and function inlining are applied manually to a monolithic implementation of an existing protocol.

protocol configuration (e.g. [Box92], [Hos92], [OMa92], [Vog93], [Zit93]): This approach aims at optimising the end-system protocol for a particular set of parameters (often referred to as quality of service parameters). These parameters depend on the application (e.g. bulk data transfer vs. RPC), the end-system hardware (e.g. multi-processor vs. RISC workstation) and the underlying network technology (e.g. high-speed network vs. mobile network). A protocol configuration is assembled from a set of predefined primitive modules or protocol building blocks.

The performance of a protocol constructed by protocol configuration can be further improved using the optimization techniques usually applied to monolithic protocol implementations. However, manually optimising an automatically configured protocol implementation is often impractical, both because of the time and effort required and because of the potentially high number of alternative protocol configurations. In this paper, we describe a method that automates the application of code optimizations to protocol code implemented using the protocol configuration approach.

The rest of this paper is structured as follows: Section 2 presents the way in which Esterel code is translated into C code, and discusses the code optimizations that can be applied during this translation. We motivate the requirement for determining the execution frequency of different parts in an Esterel specification. Section 3 explains how these execution frequencies can be calculated from the equivalent Markov model of an Esterel specification. Section 4 demonstrates the use of this approach on an Esterel module implementing the TCP retransmission timer mechanism.

2 Protocol Configuration using Esterel

The work presented in this paper took place within the Hipparch project ([Cas94], [Dio95]). In this project, the Esterel language [Ber92] is used for describing protocol building blocks. A good metaphor for a protocol configuration described in Esterel is the functional block diagram used to describe a hardware card. A protocol configuration consists of a set of Esterel modules (analogous to chips) which are connected by a set of named signals (analogous to wires). Each module is characterised by a set of input signals and a set of output signals. Moreover, the protocol configuration as a whole contains a set of external signals. These are signals connecting the protocol module to its environment. Examples are the signal indicating that a packet was received from the network, or the signal indicating a request for data transfer from the application.

Each Esterel module in the protocol configuration contains a description of how the module reacts to a change in its input signal values. At each moment in time, an Esterel module is in a certain state. When an input signal I is received, the module executes the sequence of statements associated with the combination of the input signal I and the module's current state V. After executing the statements, the module assumes a new state V', not necessarily different from V. Note that, as in the case of a hardware board, several modules of an Esterel protocol configuration may be active at the same time, i.e. an Esterel specification allows parallel execution.

The set of Esterel modules making up a protocol configuration can be translated into executable code by means of an Esterel compiler. This compiler translates the set of Esterel modules of a particular protocol configuration into C code. The code generated by the currently existing Esterel compiler is neither very fast nor very compact [Cas94]. This is a feature the compiler shares with many similar tools that generate executable code starting from a protocol defined in a formal description technique (FDT). However, we believe that these performance problems are often not inherent to the use of FDTs per se, but an outgrowth of the more theoretical orientation of this area of research. It is only recently that work on building code optimisers for FDT translators has been started, and first results are encouraging ([Hof93], [Leu94]).

Experience with manual optimization of monolithic protocol implementations has shown that two optimization techniques are worthwhile to integrate into the Esterel compiler:

implementation of a fast path: this optimization concerns the order in which the different paths through the protocol configuration are sequentialised in the C code. For example, it is advantageous to map code sections that are often executed together into adjacent memory addresses [Pet90]. This increases the spatial locality of the code and reduces memory access time on modern cache-based RISC CPUs [Mos95]. However, applying this optimization requires information on how often each of the paths will be executed, and which nodes on the path will be executed together.

function inlining: the Esterel implementation of a protocol building block may contain calls to auxiliary C functions. Expanding these functions inline can lead to a considerable performance increase. However, function inlining must be applied in a controlled way in order to avoid code size explosion. This is particularly important when generating the protocol code for a mobile system such as a Personal Digital Assistant (PDA), since these systems often have severe memory restrictions. Thus, only functions that will be used frequently at run time should be inlined.

Thus, the application of both optimization techniques requires information on the execution frequency of the different parts in an Esterel specification. The next section addresses the question of how this can be accomplished.
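The fast-path item above depends on knowing which blocks follow each other most often. As a rough illustration, the Python sketch below orders basic blocks so that each block is followed by its most frequently taken successor. This is a simplified greedy ordering written for this example, not the positioning algorithm of [Pet90] and not the Esterel compiler's behaviour; the function name `greedy_layout`, the block names and the arc frequencies are all hypothetical.

```python
def greedy_layout(entry, arcs):
    """Order basic blocks so that hot successors are placed adjacently.

    arcs maps (src, dst) to an estimated traversal frequency
    (hypothetical input; in the paper's setting it would come from
    the frequency prediction step).
    """
    blocks = {entry}
    for src, dst in arcs:
        blocks.update((src, dst))

    # Total frequency touching each block, used to pick a restart point.
    heat = {block: 0.0 for block in blocks}
    for (src, dst), freq in arcs.items():
        heat[src] += freq
        heat[dst] += freq

    layout, placed = [], set()
    current = entry
    while current is not None:
        layout.append(current)
        placed.add(current)
        # Hottest arc leaving the current block towards an unplaced block.
        successors = [(freq, dst) for (src, dst), freq in arcs.items()
                      if src == current and dst not in placed]
        if successors:
            current = max(successors)[1]
        else:
            # Chain ended: restart from the hottest block not yet placed.
            remaining = sorted(block for block in blocks if block not in placed)
            current = max(remaining, key=lambda b: heat[b]) if remaining else None
    return layout


# Assumed example: a check that takes the fast branch 95 times out of 100.
example_arcs = {
    ("entry", "check"): 100,
    ("check", "fast"): 95,
    ("check", "slow"): 5,
    ("fast", "exit"): 95,
    ("slow", "exit"): 5,
}
print(greedy_layout("entry", example_arcs))
# -> ['entry', 'check', 'fast', 'exit', 'slow']
```

In this toy layout the rarely executed `slow` block is moved out of the hot sequence, which is the effect the fast-path optimization aims for.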

3 Frequency Prediction using Markov Analysis

In the general case, the exact number of times a certain part in a program is executed can be determined once the probability of taking each branch in the program is known [Knu73]. This observation has led to the idea of applying the Markov analysis technique to the flow graph of a program ([Ram65], [Tri82], [Wag94]). A program flow graph consists of a set of nodes called basic blocks that are connected by arcs referred to as branches [Aho86]. When using Markov analysis, the flow graph of a program is regarded as a finite Markov chain where each basic block in the flow graph is mapped onto a state (not to be confused with the states of an Esterel module) of the Markov chain, and each branch is mapped onto a transition of the Markov chain. The probability of taking a branch is then equivalent to the transition probability of the corresponding transition in the Markov chain.

Let n be the number of basic blocks in the flow graph, and let $p_{ij}$ be the probability that control passes from basic block i to basic block j. It can be shown [Tri82] that the number of times $V_i$ that each basic block i is visited can be calculated by solving the following system of n linear equations:

$$V_j = \delta_{1j} + \sum_{k=1}^{n} V_k \, p_{kj}$$

where $\delta_{ij} = 1$ for i = j, and 0 otherwise.

Markov analysis requires that the probability of traversing each transition is known. In other words, the probability of traversing each branch in the program flow graph must be determined. This is a generalisation of branch prediction, which only determines the most probable outcome of a branch. With our approach, branch prediction for protocol code is done at two different points in time. First, the writer of the building blocks defines the probabilities for branches leading to code sections. However, for some branches the execution frequency depends on conditions that are only known when the building block is used in a particular protocol configuration. An example in the TCP protocol is that different code paths are executed when the protocol is used by a client or by a server in a bulk data transfer [Cla89].

When the frequency of a branch depends on a discrete-valued decision variable, the designer of the protocol library first defines several different branch predictions, one for each case. Then, a symbolic configuration variable is introduced with one value for each case. This variable can be set by the user of the protocol library when configuring a particular protocol. In the example analysed by [Cla89], the variable could have the values "Bulk_Server" or "Bulk_Client". Such a variable is easily introduced by adding a new keyword to the interface specification language. In some cases, it may even be possible to derive the necessary information by analysing the interface specification (e.g. by estimating the size of the data structures exchanged by analysing the type specifications).
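To make the visit-count equations from earlier in this section concrete, consider a hypothetical three-block flow graph (not taken from the paper): block 1 is the entry and always passes control to block 2; block 2 is a loop body that repeats with an assumed probability $p_{22} = 0.9$ and otherwise passes control to block 3, which ends the program. Under these assumptions the system reads:

```latex
\begin{align*}
V_1 &= 1 \\
V_2 &= V_1 \, p_{12} + V_2 \, p_{22} = 1 + 0.9 \, V_2
      \quad\Rightarrow\quad V_2 = \frac{1}{1 - 0.9} = 10 \\
V_3 &= V_2 \, p_{23} = 10 \times 0.1 = 1
\end{align*}
```

The loop body is therefore expected to execute ten times per run, while the entry and exit blocks execute once each. A branch predictor that only records the most probable outcome would report "the loop is taken" but not this factor of ten.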

4 Example: TCP Retransmission Timer Control

In this section, we illustrate our approach on the example of the Esterel module handling the management of the TCP retransmission timer. This module originates from a prototype specification of the complete TCP protocol in Esterel [Cas94a] (see Appendix for the Esterel source code of the retransmission management module). It reacts to the reception of three external events:

N_S: a signal indicating that the application requests the transmission of a data packet. Its value is the next sequence number to be sent.

N_A: a signal indicating the receipt of an acknowledgement. Its value is the acknowledged sequence number.

ALARM: a signal indicating a packet loss.

In an Esterel program, a control transfer can occur in two different ways: first, each if statement introduces two arcs, one leading to the basic block that is executed when the if test is true, and one leading to the basic block that is executed when the if test is false. Second, each await in the Esterel program has one arc for each external event defined for the protocol configuration. The arcs will be referred to as branches in the following.

The Esterel code shown in the Appendix contains the static branch predictions we made for the retransmission management module, and gives justifications for the estimates we made. For example, we assume that packet losses are an infrequent event: on average the network loses 5% of the packets. The probability for the ALARM signal is set accordingly. Moreover, we assume that the timer module is used by a server in an application using bulk data transfer. Thus, a high number of packets will be sent during each application operation (e.g. a file transfer). Determining the branch probabilities requires a thorough understanding of the algorithm used by the protocol, and of the situations in which the protocol is used. It is thus relatively straightforward for the expert writing the library building block, but close to impossible for an application programmer, if only for reasons of time constraints.

Figure 1 shows the Markov chain for the TCP retransmission timer management module. The transition matrix Q of this Markov chain is defined as follows:

$$
Q = \left[ \begin{array}{ccccccccccccc}
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1.0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0.95 & 0 & 0.05 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0.1 & 0 & 0.9 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0.9 & 0.1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0.99 & 0.01 & 0 \\
0 & 0 & 0 & 1.0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1.0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0.35 & 0 & 0.65 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1.0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{array} \right]
$$
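As a cross-check on this matrix, the visit-count equations of Section 3 can be solved numerically for the chain. The following is a minimal Python sketch rather than the authors' implementation: the function name `visit_counts`, the dictionary encoding of the non-zero entries of Q (row i, column j, i.e. state i to state j) and the use of plain Gauss-Jordan elimination are assumptions made for illustration.

```python
def visit_counts(transitions, n, start=1):
    """Solve V_j = delta_{start,j} + sum_k V_k * p_kj for V_1..V_n.

    transitions maps (k, j) to p_kj using 1-based state numbers.
    Uses Gauss-Jordan elimination on (I - Q^T) V = e_start.
    """
    # Augmented matrix: n rows, n coefficient columns plus right-hand side.
    a = [[0.0] * (n + 1) for _ in range(n)]
    for j in range(n):
        a[j][j] = 1.0
    for (k, j), p in transitions.items():
        a[j - 1][k - 1] -= p
    a[start - 1][n] = 1.0

    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("singular system: chain has no absorbing exit")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col and a[r][col] != 0.0:
                factor = a[r][col] / a[col][col]
                for c in range(col, n + 1):
                    a[r][c] -= factor * a[col][c]
    return [a[i][n] / a[i][i] for i in range(n)]


# Non-zero entries of Q as read from the matrix (state i -> state j).
Q_ENTRIES = {
    (1, 2): 1.0, (2, 3): 1.0, (3, 4): 1.0,
    (4, 5): 0.95, (4, 7): 0.05,
    (5, 4): 0.1, (5, 6): 0.9,
    (6, 9): 0.9, (6, 10): 0.1,
    (7, 8): 1.0,
    (8, 11): 0.99, (8, 12): 0.01,
    (9, 4): 1.0, (10, 2): 1.0,
    (11, 11): 0.35, (11, 13): 0.65,
    (13, 4): 1.0,
}

V = visit_counts(Q_ENTRIES, 13)
print([round(v) for v in V])
# -> [1, 172, 172, 2000, 1900, 1710, 100, 100, 1539, 171, 152, 1, 99]
```

Rounded to integers, the result agrees with the visit-count vector quoted in the text for states 1 to 13. A dense solver is adequate for 13 states; for larger configurations the sparse or graph-based alternatives mentioned in the text would be the natural replacement.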

In this matrix, the entry in row i and column j denotes the probability of a transition from state i to state j in Figure 1. For practical protocol configurations, the resulting matrix can become rather large. Several approaches to this problem are possible: the use of an algorithm for solving sparse matrices, or the use of the graph-based control flow analysis algorithm developed in the area of compiler construction (e.g. [Aho86], p. 660). With a graph-based algorithm, the frequency prediction can be stopped when the frequency of a building block drops below a given threshold.

Figure 1: Markov Chain (state1 is the initial state, state14 the absorbing state)

Calculating the number of visits in each state by setting up the system of linear equations described in the previous section results in the vector V=(1, 172, 172, 2000, 1900, 1710, 100, 100, 1539, 171, 152, 1, 99). We observe that some states are visited much more frequently than others. States 4 and 5 are the most frequently visited states, with a probability of 24.6% and 23.4%. This information can be used for code optimization.

Let us assume that we choose to optimise our protocol code by inlining some of the calls to the function SET_ALARM that appear in state3, state9 and state10. Let us also assume that the speed gain achieved per function inlining is G time_units and the code increase cost is C size_units. The gain $G_i$ achieved by inlining the function calls SET_ALARM in state i is then equal to $N_i \times V_i \times G$. Its cost is equal to $N_i \times C$ ($N_i$ is the number of calls to SET_ALARM in state i; it is always equal to 1 in our example). States with a high ratio $R_i = G_i / (N_i \times C)$ are the most interesting candidates for inlining. A heuristic that could be used to optimize the code speed for a given code size constraint is to inline the functions in the order of descending $R_i$ until the code size limit is reached. Table 1 gives the value of $R_i$ for each state containing a call to function SET_ALARM. We see that the function SET_ALARM of state9 is the most interesting candidate for inlining. In fact, by applying inline expansion to state9, we achieve 81% of the maximum time saving with only 50% of the maximum code size cost (the maximum gain and cost result from inlining all function calls).

With a request/response application, the result of this profitability analysis is very different. This is because in many cases, only a single data packet will be sent per application operation. Therefore it becomes more profitable to inline the function SET_ALARM in state10. With a request/response application, inlining in state9 achieves only 5% of the maximum achievable gain, whereas inlining in state10 achieves 47% of the maximal speedup. Both alternatives require 50% of the maximum code size cost.

State     Visit Count   Total Gain   R_i
state3    78            78 × G       78 × G/C
state9    694           694 × G      694 × G/C
state10   77            77 × G       77 × G/C

Table 1: Gain/Cost ratio for inlining candidates
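The inlining heuristic described in this section (rank candidates by their gain/cost ratio, then inline in descending order until the size limit is reached) can be sketched in Python as follows. The function names, the `(visit_count, number_of_calls)` tuple layout and the unit values chosen for G, C and the size budget are illustrative assumptions; only the visit counts are taken from Table 1.

```python
def rank_inlining_candidates(candidates, gain_per_call, cost_per_call):
    """Return (ratio, state, gain, cost) tuples sorted by descending ratio.

    candidates maps a state name to (visit_count, number_of_calls);
    every candidate is assumed to contain at least one call.
    """
    rows = []
    for state, (visits, calls) in candidates.items():
        gain = calls * visits * gain_per_call   # G_i = N_i * V_i * G
        cost = calls * cost_per_call            # N_i * C
        rows.append((gain / cost, state, gain, cost))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows


def choose_inlining(candidates, gain_per_call, cost_per_call, size_budget):
    """Inline in order of descending ratio until the size budget is reached."""
    chosen, used = [], 0.0
    for _ratio, state, _gain, cost in rank_inlining_candidates(
            candidates, gain_per_call, cost_per_call):
        if used + cost > size_budget:
            break
        chosen.append(state)
        used += cost
    return chosen


# Visit counts from Table 1; one SET_ALARM call per state.
TABLE_1 = {"state3": (78, 1), "state9": (694, 1), "state10": (77, 1)}

# Assumed units: G = 1 time unit, C = 1 size unit, room for one inlined call.
selected = choose_inlining(TABLE_1, gain_per_call=1.0, cost_per_call=1.0,
                           size_budget=1.0)
total_gain = sum(visits * calls for visits, calls in TABLE_1.values())
share = sum(TABLE_1[s][0] * TABLE_1[s][1] for s in selected) / total_gain
print(selected, round(share, 3))
# -> ['state9'] 0.817
```

With room for a single inlined call, the sketch picks state9 and captures roughly 81% of the total achievable gain, in line with the figure quoted in the text. Because G and C are the same for every candidate here, the ranking depends only on the visit counts.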

5 Conclusion

Optimising protocol code usually requires identifying its most frequently executed parts. In this paper, we showed how this task can be automated using a method based on Markov analysis. This identification is performed by a Markov analysis of the frequencies of the different external events able to activate the protocol automaton (such as application data, network packets or timers) and of the outcome probabilities of each module's internal test nodes. We are currently extending our implementation of the Esterel compiler to allow more experimentation with this automatic frequency prediction approach.

References

[Abb92] Abbott, M. and L. Peterson. A Language-Based Approach to Protocol Implementation. ACM SIGCOMM '92, pp. 27-38, 1992.
[Bal93] Ball, T. and J. Larus. Branch Prediction For Free. Technical Report 1137, Computer Sciences Department, University of Wisconsin-Madison, February 1993.
[Ber92] Berry, G. and G. Gonthier. The Esterel Synchronous Programming Language: Design, Semantics, Implementation. Journal of Science of Computer Programming, 19(2), pp. 87-152, 1992.
[Box92] Box, D., D. Schmidt and T. Suda. ADAPTIVE: an Object-Oriented Framework for Flexible and Adaptive Communication Protocols. Proceedings of the Fourth IFIP Conference on High Performance Networking, December 1992.
[Cas94a] Castelluccia, C. and W. Dabbous. Modular Communication Subsystem Implementation using a Synchronous Approach. Usenix Symposium on High Speed Networking, Oakland, August 1994.
[Cas94b] Castelluccia, C., I. Chrisment, W. Dabbous, C. Diot, C. Huitema, E. Siegel and R. De Simone. Tailored Protocol Development Using Esterel. Technical Report 2374, INRIA Sophia-Antipolis, October 1994.
[Cas94c] Castelluccia, C. A Modular and Efficient Framework for Integrated Layer Processing. Internal working document, INRIA, November 1994.
[Cha91] Chang, P., S. Mahlke and W. Hwu. Using Profile Information to Assist Classic Code Optimizations. Software-Practice and Experience, 21(12), pp. 1301-1321.
[Cla89] Clark, D., V. Jacobson, J. Romkey and H. Salwen. An Analysis of TCP Processing Overhead. IEEE Communications Magazine, June 1989, pp. 23-29.
[Dio95] Diot, C., R. de Simone and C. Huitema. Communication Protocols Development using Esterel. To be published in Journal of High Speed Networks.
[Hof93] Hofman, B. and W. Effelsberg. Efficient Implementation of Estelle Specification. Technical Report, Universität Mannheim, Germany.
[Hos92] Hoschka, P. Towards Tailoring Protocols to Application Specific Requirements. IEEE Infocom '93, pp. 647-653.
[Hos93] Hoschka, P. and C. Huitema. Control Flow Analysis for Automatic Fast Path Prediction. Second IEEE Workshop on High Performance Communication Subsystems, 1993.
[Jac88] Jacobson, V. 4BSD TCP Header Prediction. CCR, Vol. 20, No. 2, April 1990.
[Leu94] Leu, S. and P. Oechslin. A Formal Approach to Optimized Parallel Protocol Implementation. TR 94-003, Institut für Informatik, Bern, April 1994.
[Mon94] Montz, A., D. Mosberger, S. O'Malley, L. Peterson, T. Proebsting and J. Hartman. Scout: A Communications-Oriented Operating System. Technical Report TR-94-20, Computer Sciences Department, University of Arizona, Tucson, June 1994.
[Mos95] Mosberger, D., S. O'Malley and L. Peterson. Protocol Latency: MIPS and Reality. Technical Report TR-95-02, University of Arizona, 1995.
[OMa92] O'Malley, S. and L. Peterson. A Dynamic Network Architecture. ACM Transactions on Computer Systems, 10(2), pp. 110-143.
[Pet90] Pettis, K. and R. Hansen. Profile Guided Code Positioning. Proceedings of the ACM SIGPLAN '90 Conference on Programming Language Design and Implementation, June 1990.
[Ram65] Ramamoorthy, C. Discrete Markov Analysis of Computer Programs. ACM National Conference, pp. 386-392.
[Tri82] Trivedi, K. Probability and Statistics with Reliability, Queuing and Computer Science Applications. Englewood Cliffs: Prentice Hall, 1982.
[Vog93] Vogt, M., B. Plattner, T. Plagemann and T. Walter. A Run-Time Environment for DaCaPo. Proceedings of the International Network Conference (INET), BFC-1 to BFC-9, 1993.
[Wag94] Wagner, T., V. Maverick, S. Graham and M. Harrison. Accurate Static Estimators for Program Optimization. ACM SIGPLAN '94 Conference on Programming Language Design and Implementation, pp. 85-96, 1994.
[Wu94] Wu, J. and J. Larus. Static Branch Frequency and Program Profile Analysis. Proceedings of the 27th Annual International Symposium on Microarchitecture, San Jose, California, November 1994.
[Zit93] Zitterbart, M., B. Stiller and A. Tantawy. A Model for Flexible High-Performance Communication Subsystems. IEEE Journal on Selected Areas in Communication, 11(4), pp. 507-518, 1993.

A Appendix

Figure 2 shows the Esterel specification of the management of the TCP retransmission timer used as an example in the paper. For simplification, variable declarations are not shown.

All statements resulting in branches in the flow graph corresponding to this module are annotated with the probability of being taken. For a case statement, the value gives the probability that the input signal of the case statement will be observed. For an if statement, the value gives the probability that the condition of the statement is true. The branch predictions in this module were derived as follows:

The number of packet losses due to damaged packets is usually low ( 1%) [Jac88], and so is the number of packet losses due to congestion if the slow start algorithm is used (about 5%). Thus, in 95% of the cases an N_A signal will be received in the second case statement, and in 5% of the cases an ALARM signal will be received.

The probability that a connection has to be closed because the maximum threshold of retransmission attempts was exceeded is negligible (0.01%).

The number of packets sent per application operation Pkt_N cannot be determined beforehand. It depends on a number of factors that are only known once the application and the environment using the TIMER_HANDLER module are known. For example, when the TCP protocol is used for a bulk data transfer application (e.g. ftp) the number of packets sent per file transfer is high. In contrast, when using TCP for a request/response protocol (e.g. telnet), usually one packet will be sent for each character typed. More generally, the number of packets sent per application operation results from dividing the data volume V sent in one application operation by the Maximal Transfer Unit (MTU) of the network connection if an MTU discovery module is installed, and by the default value of 512 bytes otherwise. Thus, a symbolic variable Pkt_N must be introduced whose value is determined by the compiler when the exact environment in which the TIMER_HANDLER module is used is known.
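The rule for Pkt_N in the last item can be written out as a short Python sketch. The function names are hypothetical, and rounding the quotient up to a whole packet (with a minimum of one packet) is an assumption added here; the text only says that the data volume is divided by the MTU or by the 512-byte default.

```python
import math


def packets_per_operation(data_volume_bytes, mtu_bytes=None,
                          default_segment_bytes=512):
    """Estimate Pkt_N for one application operation.

    mtu_bytes is the discovered MTU if an MTU discovery module is
    installed, otherwise None, in which case the 512-byte default applies.
    Rounding up to a whole packet is an assumption of this sketch.
    """
    unit = mtu_bytes if mtu_bytes is not None else default_segment_bytes
    return max(1, math.ceil(data_volume_bytes / unit))


def unacked_branch_probability(pkt_n):
    """Probability (Pkt_N - 1) / Pkt_N that unacknowledged packets remain."""
    return (pkt_n - 1) / pkt_n


# Assumed bulk transfer of 5120 bytes without MTU discovery.
pkt_n = packets_per_operation(5120)
print(pkt_n, unacked_branch_probability(pkt_n))
# -> 10 0.9
```

Under these assumptions, ten packets per operation give a branch probability of 0.9, while a single-packet request/response exchange gives 0, so the "still unacknowledged packets" branch is never predicted to be taken.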


module TIMER_HANDLER:
input N_S     (integer),   /* value of current sequence number */
      N_A     (integer),   /* value of acknowledged sequence number */
      ALARM   (integer),   /* time-out of retransmission timer */
      Rxt_Cur (integer);   /* value of current rtt estimate */
...
loop
  trap NOTIMER in
    await N_S do P(1.0)
      emit SET_ALARM(?Rxt_Cur);   /* set retransmission timer to current rtt estimate */
      N_Clock := ?N_S;            /* store sequence number of ack for next rtt estimate */
      loop
        trap RETRANSMIT in
          await
            case N_A do P(0.95)
              if (?N_A >= N_Clock) P(0.9) then
                if (?N_S > ?N_A) P((Pkt_N-1)/Pkt_N) then   /* there are still unacknowledged packets */
                  emit SET_ALARM(?Rxt_Cur);   /* set retransmission timer to current rtt estimate */
                  N_Clock := ?N_S;            /* store sequence number of ack for next rtt estimate */
                else
                  emit SET_ALARM(-1);         /* all packets acknowledged -> reset timer */
                  exit NOTIMER;
                end
              end
            case ALARM do P(0.05)
              trap CLOSE_TCP in
                emit Retrans_Phase(1);        /* activate retransmission phase */
                emit Slow_Start;              /* activate slow start module */
                t_rxtshift := t_rxtshift+1;
                if (t_rxtshift >= TCP_MAXRXTSHIFT) P(0.01) then   /* bound on maximum number of retransmissions exceeded */
                  t_rxtshift := 0;
                  exit CLOSE_TCP;             /* close the connection */
                end;
                await N_S do
                  /* recalculate retransmission time using exp. back-off */
                  Rexmt := (SHIFT_RIGHT(?T_Srtt, TCP_RTT_SHIFT) + ?T_Rttvar) * TCP_BACKOFF(t_rxtshift);
                  Rexmt := TCPT_RANGESET(Rexmt, TCPTV_MIN , TCPTV_MAX);
                  emit SET_ALARM(Rexmt);      /* set timer */
                end;
              handle CLOSE_TCP do nothing;
  end;   /* loop */
end      /* await */
end.     /* module TIMER_HANDLER */

Figure 2: Esterel Specification of TCP Timer Management
