RTC: A Real-time Communication Middleware on Top of RTAI-Linux

Tales Heimfarth, Marcelo Götz, Franz J. Rammig
University of Paderborn, Heinz Nixdorf Institut, Germany
{tales, mgoetz, [email protected]}

Flávio R. Wagner
Federal University of Rio Grande do Sul, Informatics Department, Brazil
[email protected]

Abstract

This paper describes RTC, a partially object-oriented middleware inspired by the ISO/OSI standard that implements a complete real-time communication platform on a cluster running under real-time Linux. Media Access Control (MAC) is implemented by means of a modified TDMA protocol, and at the application level a channel-oriented communication is provided. The RTC platform guarantees a bandwidth for each channel, which makes it attractive for multimedia applications, currently the main application considered. In addition, RTC supports non-real-time traffic using, for instance, TCP/IP. The platform has two principal components: the communication protocols, implemented as modules of the Linux/RTAI operating system, and the user-space API (Application Program Interface), implemented in an object-oriented manner and based on the LXRT feature of RTAI. Furthermore, the non-real-time capabilities are supported by a software layer in the Linux kernel. Currently, SCI is used as the underlying communication network. SCI has been selected due to its low latency, low jitter and high bandwidth [6]. RTC, however, has been designed to support other network technologies as well. Included in the paper are performance evaluation results that demonstrate the real-time properties of RTC.

1 Introduction

Current real-time extensions to Linux offer rather limited support for cluster computing. Performance-demanding applications, on the other hand, can be served efficiently by appropriate computer clusters. This, however, implies the necessity of a middleware on top of the RTOS in order to handle such a cluster. Multimedia applications may serve as a typical example.

1 This work is partially funded by the EU project EVENTS [2] (IST1999-21125).


In such an application, there is usually a necessity to deal with a huge amount of data. Moreover, when DSP (Digital Signal Processing) algorithms must be applied to that data, computational costs increase as well. Nowadays, there are platforms that can provide high computational performance relying on a distributed environment [1]. On those systems, in order to meet the communication requirements, particular network architectures and protocols guarantee message-delivery delay bounds. However, the development of high-performance clusters with real-time communication support is still an open research area. One of the highest-bandwidth, lowest-latency network protocols currently in existence is the IEEE 1596-1992 Scalable Coherent Interface (SCI) standard [4]. The protocol calls for link speeds of 1 GB/s and sub-microsecond latencies over distances of tens of meters. However, the base SCI protocol does not attempt to address real-time issues. Some proposals are based on the IEEE P1596.6 standard, which adds priority-based, real-time capabilities to SCI [5]. Nevertheless, those solutions are applied at the level of the SCI boards, which means that a new SCI board architecture must be implemented. This paper presents RTC, a middleware that aims at providing real-time communication services in a distributed environment with high computational performance. RTC is based on a cluster operated under RTAI Linux as Real-Time Operating System (RTOS). RTC provides a software solution, instead of a hardware one, covering a whole set of layers, from hardware management and MAC (Media Access Control) up to the API (Application Program Interface). These layers are implemented by means of reusable components. For the MAC layer, RTC uses a slightly modified TDMA protocol, although this can be easily changed due to the layered and component-based implementation of the platform. At the application level, a channel-oriented approach [9] with bandwidth reservation is used, which is very suitable for multimedia applications. Currently RTC supports SCI. However, due to the modular structure of RTC, other communication networks, e.g. InfiniBand [10][11],

may be supported as well by simply exchanging some modules. The remainder of this paper is organized as follows. Relevant issues of the SCI standard have been introduced above. Section 2 details the architecture of the RTC platform. The evaluation tests in an SCI-based environment are described in Section 3. Finally, Section 4 draws conclusions and introduces future work.

2 The RTC Architecture

[Figure 1. Overview of the RTC platform architecture. User space: non-RT application; LXRT real-time user-space application with object-oriented API and object-oriented Proxy Management. Kernel space: non-RT driver, User-API Proxy, API, Link Layer, MAC Layer, Packet List Management, Connection Management, RTAI, SCI + node hardware.]

2.1 Overview

In addition to the basic services of a middleware for distributed real-time computing, the RTC architecture deals with the problems of message priority inversion and media contention. In an SCI network, for example, these problems are dealt with at a higher level than in SCI/RT [5]. The main idea is to implement the control of the media access, i.e., of the shared memory segments in this case, in software. The component-based RTC communication platform consists of a set of layers that add the necessary services to a real-time operating system. For this, an operating system that can be modified and offers real-time characteristics was needed; therefore, Linux with the RTAI real-time extension [3] was used. The whole RTC platform is implemented as a set of hierarchical layers (Linux/RTAI modules [8]), each one corresponding to a specific functionality of the communication services. Figure 1 shows an overview of the RTC platform architecture. The organization of the layers is inspired by the ISO/OSI standard and will be detailed in the next subsections. With this architecture, it is easy to modify the RTC functionality by changing one of its components.

Besides the real-time functionality, the platform also offers the possibility to send non-real-time messages. Of course, those messages will always have lower priority than the real-time ones. In the current version of the platform, a network driver for Linux that implements non-real-time communication has been implemented. With this facility, standard Linux connections with other nodes of the network can be made using, for example, normal TCP/IP.

In RTC, the communication paradigm provided by the hardware, which can be shared memory (in the case of SCI), is hidden by the platform. From the user's point of view, the communication paradigm is based on channels (connection-oriented) that connect two processes allocated to different nodes of the network. In the RTC platform, each channel is unidirectional and defined by:

- Source node: the node that sends the messages
- Destination node: the node that receives the messages
- Channel identification: used for internal management
- Bandwidth: the reserved bandwidth for this channel
- Smallest and largest messages: used to estimate the overhead of the messages

In its current implementation, the system must be configured statically. This means that all channels that will be used are defined off-line, and the set-up of their connections is made during the initialization of the system. During the definition phase (off-line), the whole system is checked in order to confirm the availability of the resources needed to fulfil the application requirements. A sketch of such a static channel description is given below.
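As an illustration only, such a static channel table could be expressed as a C structure like the following. The field names and the initializer are hypothetical and not taken from the RTC sources; the example values mirror the allocation shown later in Table 1 (Section 2.5.1).

    /*
     * Hypothetical off-line channel description, as the configuration tool
     * could emit it. Names and layout are illustrative only.
     */
    struct rtc_channel_cfg {
        int id;             /* channel identification (unique)           */
        int source_node;    /* node that sends on this channel           */
        int dest_node;      /* node that receives on this channel        */
        int slot;           /* TDMA slot assigned by the off-line tool   */
        int bandwidth;      /* reserved bandwidth, bytes per TDMA round  */
        int min_msg_size;   /* smallest message, for overhead estimation */
        int max_msg_size;   /* largest message, for overhead estimation  */
    };

    /* Example values taken from Table 1 (node 1, slot capacity 300 bytes/round). */
    static const struct rtc_channel_cfg channels[] = {
        { .id = 3, .source_node = 1, .dest_node = 5, .slot = 2, .bandwidth = 100 },
        { .id = 5, .source_node = 1, .dest_node = 3, .slot = 2, .bandwidth = 180 },
        { .id = 8, .source_node = 1, .dest_node = 4, .slot = 3, .bandwidth = 100 },
    };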

2.2 Connection Management

If the communication paradigm of the network is shared memory, each process at each node that needs to communicate must allocate a dedicated memory segment. After that, this communication segment has to be made accessible to the node it wants to establish a connection with. These activities are handled by the Connection Management module. Moreover, this module provides the other modules with the corresponding addresses of the offered chunks of memory. This module is a standard Linux one and does not have real-time constraints, since it is relevant only in the initialization phase of the system. In such a network, reading and writing on remote memories may have different latencies. Therefore, in order to obtain optimal performance, it must be guaranteed that a process accesses a remote segment only using low-latency methods. The recognition of a packet arrival is handled by other means and will be explained in the next subsections.
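The set-up sequence could look roughly as in the sketch below. The helper functions are hypothetical stand-ins for the SCI driver calls actually used; the sketch only illustrates the pairing of a locally exported segment (read side) with a mapped remote segment (write side).

    /* Hypothetical sketch of the connection set-up performed by the
       Connection Management module; not the actual RTC or SCI driver code. */
    struct rtc_connection {
        void *local_segment;    /* the peer writes here; we only read locally  */
        void *remote_segment;   /* mapped segment of the peer; we write here   */
    };

    /* Hypothetical wrappers around the SCI driver. */
    extern void *sci_create_and_export_segment(int segment_id, unsigned size);
    extern void *sci_map_remote_segment(int peer_node, int segment_id, unsigned size);

    int rtc_connect(struct rtc_connection *c, int peer_node,
                    int local_id, int remote_id, unsigned size)
    {
        /* Local segment: received data is read locally, the low-latency path. */
        c->local_segment = sci_create_and_export_segment(local_id, size);

        /* Remote segment: written during this node's TDMA slot. */
        c->remote_segment = sci_map_remote_segment(peer_node, remote_id, size);

        return (c->local_segment && c->remote_segment) ? 0 : -1;
    }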

2.3 MAC (Media Access Control)

The MAC module is the central one in the RTC platform. It controls access to the shared communication media, which is the shared memory in this case. The main goal of this layer is to establish a discipline for accessing the shared memory segments. The main tasks of this module are:

- Initialization of the RTAI operation (real-time system initialization)
- Access control to the shared media
- Clock synchronization among the SCI network nodes

As MAC protocol, the well-known TDMA (Time Division Multiple Access) protocol was chosen. Well-known advantages of TDMA as a MAC protocol are that it allows a static allocation of the bandwidth and results in a small transmission jitter. In the TDMA approach, the bandwidth is statically divided into a fixed number of so-called slots. This means that, during a fixed amount of time (time-slot), only a single node has access to the communication media. The number of slots does not depend on the number of nodes, since a node can use more than one slot. After all nodes have used their respective slots, the whole cycle is repeated. A complete cycle, in which all nodes have access to their slots, is known as a TDMA round (or just round). In the current RTC implementation, it is assumed that all rounds are equal.

A global clock synchronization between all nodes in a TDMA system is necessary. As the TDMA protocol is based on time multiplexing, there must be a way to synchronize the clocks of the nodes. In the RTC implementation, the end of the round is chosen as the synchronization barrier. At the end of the round, each node waits for a signal from a master node (which is one of the nodes of the network) to start a new round. It is important to notice that the absolute system time of each node is not synchronized by this method; only the internal time counts of the MAC modules in the various nodes are synchronized.

As explained before, at each slot time just one node should access the communication media. However, this access represents a write operation on a remote memory segment (where the destination node is located). To receive a packet, the destination node may use any of the following slots (used for write accesses by the other nodes). Thus, the media discipline must be applied only to write accesses. This method can be implemented because write operations are performed on remote nodes, whereas read operations are local. In a situation where one node receives packets from all other nodes, it should provide separate memory segments, one for each sender node, in order to be able to read packets from all of them.
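A slave node's round, including the end-of-round barrier and the 100 µs sleep discussed in the evaluation (Section 3.1), could be organized roughly as follows. This is only a sketch, assuming RTAI primitives such as rt_sleep() and nano2count(); the mac_* helpers and the master's flag in shared memory are hypothetical.

    #include <rtai.h>
    #include <rtai_sched.h>

    #define SLOTS_PER_ROUND  4
    #define SLOT_TIME_NS     2500000   /* 2.5 ms per slot, as in the evaluation set-up */
    #define POLL_SLEEP_NS    100000    /* 100 us sleep inside the synchronization loop */

    extern volatile int *round_start;       /* hypothetical: flag the master writes into
                                               a segment exported by this node          */
    extern void mac_send_slot(void);        /* hypothetical: write this node's packets  */
    extern void mac_poll_receive(int slot); /* hypothetical: read packets of this slot  */

    static void tdma_round(int my_slot)
    {
        int slot;

        for (slot = 0; slot < SLOTS_PER_ROUND; slot++) {
            if (slot == my_slot)
                mac_send_slot();            /* only write accesses are disciplined      */
            else
                mac_poll_receive(slot);     /* reads are local and may occur at any time */
            rt_sleep(nano2count(SLOT_TIME_NS));
        }

        /* End-of-round barrier: wait for the master's signal. The sleep keeps the
           node from busy-waiting (and from being blocked on a master fault), at the
           price of a detection delay that dominates the drift seen in Section 3.1. */
        while (!*round_start)
            rt_sleep(nano2count(POLL_SLEEP_NS));
        *round_start = 0;
    }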

2.4 Packet List Management

The next module in the hierarchy is the Packet List Management. The main goal of this layer is to manage the lists of packets to be sent or received by the system. The necessity of maintaining multiple lists of packets makes a centralized list management necessary. Since the RTC platform also supports non-real-time traffic, interaction between the standard Linux kernel and RTAI is needed. A special device driver developed as part of the platform is responsible for this interaction. Some segments of this driver run at interrupt time and access the lists of packets. Any access to these lists has to be protected against race conditions. As the Linux kernel does not allow the use of semaphores inside interrupts, the technique of disabling interrupts is used to achieve mutual exclusion inside the list code. The critical sections are kept very small to minimize jitter effects on the real-time system.
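A list operation protected in this way might look as follows. The sketch uses the Linux 2.2-era save_flags()/cli()/restore_flags() idiom for disabling interrupts; the list structure and function names are hypothetical and not the actual RTC code.

    #include <asm/system.h>   /* save_flags(), cli(), restore_flags() */

    struct rtc_packet {
        struct rtc_packet *next;
        int channel;
        int size;
        char data[0];
    };

    struct rtc_packet_list {
        struct rtc_packet *head;
        struct rtc_packet *tail;
    };

    /* Append a packet; callable both from interrupt context (non-RT driver)
       and from the RT send path, hence the interrupt disabling. */
    void rtc_list_enqueue(struct rtc_packet_list *list, struct rtc_packet *pkt)
    {
        unsigned long flags;

        save_flags(flags);
        cli();                      /* critical section kept as short as possible
                                       to limit jitter on the real-time side */
        pkt->next = NULL;
        if (list->tail)
            list->tail->next = pkt;
        else
            list->head = pkt;
        list->tail = pkt;
        restore_flags(flags);
    }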

2.5 Link Layer

The basic unit of the transmission process in RTC is the packet, which contains a message to be transported from application to application. The main goal of the Link Layer module is the correct utilization of the bandwidth of each slot: it determines how the data are fragmented for transmission and assembled back at the receiving node. The main tasks of this module are:

- Channel management
- Sending packets on the communication channel
- Receiving packets sent by other nodes
- Error detection

This layer receives packets from the API module (see the next subsection) and puts them on the communication media. When a packet is completely received, it is passed to the API module. The exchange of packets between the link layer and the API module is performed using input and output queues. Each queue belongs to a channel and is used either for input or for output. In order to access these lists, primitives offered by the Packet List Management layer are used.

2.5.1 Sending messages

A channel is defined as a unidirectional connection between two processes on different nodes of the network. If there are N processes on the same node connected with processes on other nodes of the network, N channels will be configured for this node. Therefore, the platform must support more than one channel within a single TDMA slot.

Table 1. Example of allocation of channels to slots (node 1; total capacity of a slot: 300 bytes/round)

  Channel  Slot  Bandwidth limit    Origin  Destination
  3        2     100 bytes/round    1       5
  5        2     180 bytes/round    1       3
                 (slot 2 total: 280 bytes/round)
  8        3     100 bytes/round    1       4
                 (slot 3 total: 100 bytes/round)

If the bandwidth of one slot is not sufficient to support all channels of a node, the system is configured in such a way that two or more slots are assigned to this node. It is important to notice that this decision is taken off-line. Table 1 shows a hypothetical system configuration. In this example, each slot can carry 300 bytes per round. For node 1, there are three sender channels, whose specifications are also shown in Table 1. Due to their bandwidth requirements, these channels cannot be allocated to a single slot. Two slots will therefore be made available so that node 1 can send its messages on the three channels.

Since a slot may need to support more than one channel, a scheduling method is necessary that allocates the channels of each node to the available slots. The scheduling algorithm used in RTC is a modification of Weighted Round-Robin [7]. Channels are ordered by their identifications (unique for each one) and allocated under the constraint that a slot may only contain channels belonging to the same source node. The available time-slot is divided among the channels proportionally to their bandwidths. Allocation is performed off-line by the configuration tool, which also takes into account the overhead caused by the header added by the platform to the original message. By doing so, a more accurate utilization of the system can be achieved.

It can happen that the API needs to send a message whose size is larger than the allowed bandwidth. In this situation, the link layer has the responsibility to divide the message into parts so that it can be sent. This strategy allows a better utilization of the available bandwidth.
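For illustration, the proportional split of a slot's per-round byte budget among its channels could be computed as in the sketch below. This is a simplified view of the modified Weighted Round-Robin; the header-overhead correction performed by the actual configuration tool is omitted, and the function name is hypothetical.

    #define SLOT_CAPACITY 300   /* bytes per round, as in the example of Table 1 */

    /* Per-round byte budget of channel i among the n channels sharing one slot,
       proportional to the reserved bandwidths bw[] (simplified sketch). */
    static int channel_budget(const int *bw, int n, int i)
    {
        int total = 0;
        int k;

        for (k = 0; k < n; k++)
            total += bw[k];

        return (SLOT_CAPACITY * bw[i]) / total;
    }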

2.5.2 Receiving messages

The link layer is also responsible for receiving packets from other nodes. Each node exports a memory segment to every other node on the network, and these segments are used to receive the packets. Since write operations on these areas are made in a given order, the link layer must know the correct time in the TDMA round at which to read them. As the configuration tool already applies a scheduling policy to order the writing of messages, it also generates a table for the link layer that specifies the correct time points to read them.

This module also reconstructs messages that were divided at the source node. As the link layer divides messages coming from the API, it must be able to reassemble them. To achieve this, it makes use of particular control bytes in the message header that identify sub-messages belonging to an original message from the API. This task is completely hidden by this layer, which means that the API only reads and writes complete messages. In the same way, the lower layers (below the link layer in the hierarchy) deal only with packets, which can correspond to complete messages or to message fragments.
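The control information mentioned above could, for instance, take the form of a small per-packet header such as the one sketched below; the concrete fields of the RTC header are not given in the paper, so these names are purely illustrative.

    /* Hypothetical per-packet header used to split and reassemble messages. */
    struct rtc_packet_header {
        unsigned short channel;      /* channel identification                    */
        unsigned short msg_id;       /* identifies the original API message       */
        unsigned short frag_index;   /* position of this fragment in the message  */
        unsigned char  last_frag;    /* set on the final fragment                 */
        unsigned short payload_len;  /* bytes of user data in this packet         */
    };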

2.6 API

The goal of this layer is to provide the kernel application with a set of high-level communication primitives and to hide all internal details of the RTC procedures. It also wraps the messages from the application, transforming them into packets containing the required header information, to be sent to the layer below. The main tasks of the API layer are:

- Provide communication primitives to the application
- Wrap and unwrap the messages to and from the link layer
- Register communication tables in the link layer

The available communication primitives allow the application to send and receive messages of variable size in an asynchronous way. To send a message, two functions are provided:

    void rtc_channel_send(int channel, void *msg, int msg_size)
    void rtc_channel_send_high_priority(int channel, void *msg, int msg_size)

The first function requests to send through a channel a message that may contain anything from a simple string to a complex data type, stored in the memory area pointed to by msg, with a size of msg_size. The message is wrapped and queued in a FIFO manner. The second function has the same behaviour as the first one, but the message is scheduled at the top of the queue, so that it has a higher priority than the other messages already in the queue. Both functions are non-blocking. To receive messages, the API layer makes two primitives available:

    int rtc_channel_receive(int channel, void *msg, int msg_size)
    int rtc_channel_receive_if(int channel, void *msg, int msg_size)

These primitives are used to receive a message in a blocking and a non-blocking manner, respectively. The caller asks to receive a message of size msg_size from a channel and to copy it to the memory area pointed to by msg. The function returns the number of bytes that could be read.

The API layer is also responsible for registering the tables generated by the configuration tool. These tables are used by the link layer to manage the read and write operations. This task is automatically performed in the initialization phase of the API layer, and the application does not need to know about it.
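To illustrate these primitives, the following is a minimal usage sketch. The channel identifier and buffer size are illustrative, the channel is assumed to have been configured off-line, and the prototypes described above are repeated here as extern declarations so that the fragment is self-contained.

    #include <string.h>

    #define CHANNEL_ID 3

    /* RTC primitives as described above (reached through the user-space proxy). */
    extern void rtc_channel_send(int channel, void *msg, int msg_size);
    extern int  rtc_channel_receive(int channel, void *msg, int msg_size);

    void sender_task(void)
    {
        char frame[1024];

        memset(frame, 0xAB, sizeof(frame));     /* e.g. one block of media data */
        rtc_channel_send(CHANNEL_ID, frame, sizeof(frame));   /* non-blocking */
    }

    void receiver_task(void)
    {
        char frame[1024];
        int received;

        /* Blocking receive: returns the number of bytes actually read. */
        received = rtc_channel_receive(CHANNEL_ID, frame, sizeof(frame));
        (void)received;
    }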

The next two subsections describe two further modules built on top of the API, which can be seen as an extension of the API module that offers the RTC communication services to the real-time user application. Together with the facilities of LXRT (an RTAI module), the application can then be developed using object-oriented concepts.

2.7 Proxy Management

The Proxy Management is responsible for creating a new kernel task for each application running in user space. The RT user application calls the Proxy Management through the object-oriented user API, which is a library linked with the application. The Proxy Management module also assigns a corresponding Application Proxy to the caller task running in real-time user space. This Proxy will, in turn, mediate the communication between the user application and the RTC kernel modules.

2.8 User-API Proxy

The aim of this module is to call the kernel API on behalf of the RT user application. Each Proxy stays in a sleeping state until the corresponding application requests an RTC service. Whenever an RT user-space application task requests a communication service, the Proxy Management wakes up the User-API Proxy, which calls the kernel API on behalf of the user application. The result of the requested call is sent back to the application. In other words, the Proxy provides a bridge between user space and kernel space, supporting API calls from an RT application running in user space.

2.9 Object-Oriented User-API

The User-API is a library that is linked with each application. Its purpose is to provide an API to the real-time user application in an object-oriented manner. When an application calls a send procedure on the API, one of two actions is performed, depending on the destination:

- Destination on the same node: the User-API handles the communication using the internal communication services provided by RTAI.
- Destination on another node: the User-API calls the Proxy Management in order to access the RTC services.

The receive procedure is analogous.
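The dispatch just described could be sketched as below. The actual library is object oriented; the sketch is written in C for consistency with the other listings, and the helper names are hypothetical.

    /* Hypothetical helpers used only for illustration. */
    extern int  destination_is_local(int channel);            /* look up channel table   */
    extern void rtai_local_send(int channel, void *m, int n); /* RTAI IPC, same node     */
    extern void proxy_rtc_send(int channel, void *m, int n);  /* via Proxy Management    */

    void user_api_send(int channel, void *msg, int msg_size)
    {
        if (destination_is_local(channel))
            /* Both processes run on the same node: use RTAI's internal
               communication services directly. */
            rtai_local_send(channel, msg, msg_size);
        else
            /* Remote destination: go through the Proxy Management and the
               User-API Proxy to reach the RTC kernel modules. */
            proxy_rtc_send(channel, msg, msg_size);
    }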

3 Platform Evaluation

The following evaluation was made on an SCI-based cluster with three nodes.

3.1 Media Access Control evaluation

The MAC layer implements some of the most important features of the RTC platform, corresponding to its real-time properties. Two main measurements have been performed regarding the MAC layer. The first one was intended to evaluate the correct synchronization between different nodes at the end of each TDMA round, which represents the synchronization barrier. Another important aspect evaluated in the experiments was the minimal TDMA round time that can be achieved. The lower it is, the better; this matters for applications that require more accuracy in determining the end-to-end time to send or receive a message. It can happen that the application decides to send a message after its allocated time slot in the current round; in this case, the system will send the message only one round later. However, if the application decides to send the message before the beginning of its allocated slot, the time to send the message will be lower. It is important to notice that the platform is aimed at multimedia applications, which have diverse requirements.

An SCI network composed of three nodes was configured and built. In order to monitor all MAC activities on the same time base, which is necessary to compare different machines, external hardware was used. Moreover, the evaluation of the platform should be as little intrusive as possible. This external hardware consisted of a PC, not included in the SCI network, which samples data from its parallel port. On each node of the SCI network, the MAC module was modified to change the state of an output pin of the respective parallel port each time it begins a new TDMA slot. The test platform was composed of three 400 MHz Pentium II systems with 64 MB of RAM. As operating system, Linux 2.2.18 with RTAI 1.6 was used. The nodes were connected with Scali's PCI-SCI adapter cards, based on the LC2 SCI Link Controller. The MAC layer was configured with 4 slots and round times of 40 ms, 10 ms, and 1 ms (the minimal round time achieved).

To evaluate the MAC layer, the configuration of channels was not necessary. For each configured round time, the external hardware collected 1,000,000 samples during 1.5 seconds (one sample every 1.5 µs). The same experiment allowed the measurement of the synchronization drift between the MAC layers of any two different nodes. Figure 2 illustrates the clock difference between node 1 and node 3, which presented the worst results. It could be observed that the drift has a periodic behaviour. In our examples, two nodes had a maximum difference of around 108 µs.

In the current implementation, there is a sleep routine in the MAC reception routine. When the master sends the synchronization signal, all slaves wait for this signal in a sampling loop. A sleep was placed at the beginning of this loop to assure that, in the case of a master fault, the node will not be blocked. A side effect is an increase in the detection time of the synchronization signal. The sleep time used in this prototype was 100 µs and, as can be seen, it is the main contributor to the observed drift between the clocks. The sleeping time at the end of each round has a great influence on the drift value. The same measurement, taken when the sleep was removed from the synchronization routine, can be seen in Figure 3. The minimum and maximum difference between the clocks were in this case 6 µs and 24 µs, respectively. These values come from the SCI latency (about 5 µs) plus additional computing time of the platform.

[Figure 2. Difference between the clocks of Node 0 and Node 2 with 100 µs of sleep time; clock drift (ms) over the slot index.]

[Figure 3. Difference between the clocks of Node 0 and Node 2 without sleep; clock drift (ms) over the slot index.]

3.2 Evaluation of the bandwidth reservation

At the user level, the communication method implemented by RTC is channel-oriented. Each channel has a configurable bandwidth that must be guaranteed by the system. In order to evaluate this functionality, an experiment used the same MAC layer configuration as in the tests above, with a channel connecting two nodes. For this channel, a bandwidth of 300 Kb/s was specified. Two measurements were performed.

The first one evaluated the reservation of the configured bandwidth. The source node is requested to send a large amount of data at a given time, using an API function. Each packet carries 1 Kb of data. As expected, the destination node received all data from the channel at a rate obeying the reserved bandwidth.

The aim of the second test was the evaluation of the time between the placement of a packet in the shared memory by the sender and the reception of the same packet by the receiver. For this purpose, two time-stamps were added to the packets at those moments. This measured delay is part of the end-to-end delay. Figure 4 shows the results of this experiment. We obtained a maximum delay of 2.57 ms and a minimum of 2.52 ms. Each slot in this experiment lasted 2.5 ms. As in normal operation the reception slot always follows the send slot, the measured delays were in the expected range. The small jitter results from the MAC and link layer computation times.

[Figure 4. Delay between the placement of a packet in the media and its reception; delay (ms) over the received message index.]

In order to evaluate the system under a high network load, the system was configured with more channels. All the slots were completely filled with communication, consuming all the bandwidth and increasing the load on the network. A large number of non-real-time packets was also being transmitted by the platform. On this highly loaded network, the same tests as above were repeated, and the results obtained were exactly the same. This demonstrates that the real-time guarantees of the RTC platform are not influenced by a heavy network load.

4 Final Remarks

This paper presented RTC, a platform for real-time communication on a cluster. Differently from other solutions that deal with the problems of media contention and priority inversion of messages at the hardware level, RTC proposes a solution at the software level. The platform is composed of a set of Linux/RTAI modules, hierarchically related as layers. One of these layers is the MAC, which is responsible for disciplining the access to the media; the clock-driven TDMA algorithm was used, which also avoids a potential priority inversion problem. Based on the MAC, further layers were added to achieve a complete communication architecture. This architecture offers primitives to send and receive real-time messages and also transmits IP packets from the Linux kernel.

The platform also offers a tool with which the user can statically configure the system. This tool generates information that is used in the initialization phase of the system and implements a scheduling algorithm that allocates the requested channels to the available TDMA slots.

The evaluation tests performed on the RTC platform have shown that it can guarantee channels with fixed bandwidth, to be used in real-time systems such as distributed multimedia applications. The maximum end-to-end delay is influenced by the cycle size, which is limited by the granularity of the real-time clocks and by the synchronization process. An important part of this delay was evaluated and showed a small jitter. Moreover, these values and the requested bandwidth per channel are guaranteed even in a highly loaded network. It could be observed from the analysis that the jitter is mostly influenced by the synchronization method implemented in the MAC layer.

As future work, some improvements of the platform are planned. The processor resources used by the communication layers shall be evaluated in order to know how much CPU time remains available to execute the application tasks. The channels are currently defined statically, and their set-up is made during the initialization phase; it would be interesting if the system could open and close channels at run-time, which would increase the effective use of the network. Moreover, the API will be extended to offer more high-level communication primitives.

References

[1] K. Arvind, K. Ramamritham, and J. A. Stankovic. A Local Area Network Architecture for Communication in Distributed Real-Time Systems. Tech. Report UM-CS-1991-004, Dept. of Computer Science, Univ. of Massachusetts at Amherst, 1991.
[2] Homepage of the European Project EVENTS. http://www.eptrom.es/projects/events, 2001.
[3] Homepage of the RTAI project. http://www.rtai.org/, 2000.
[4] IEEE Std 1596-1992. IEEE Standard for Scalable Coherent Interface (SCI). Piscataway, NJ: IEEE Service Center, 1993.
[5] R. Lachenmaier and T. Stretch. A Draft Proposal for an SCI/RT Protocol Using Directed Flow Control Symbols. White paper, IEEE P1596.6 Working Group, Nov. 1996.
[6] S. Lankes, M. Pfeiffer, and T. Bemmerl. Design and Implementation of an SCI-Based Real-Time CORBA. Lehrstuhl für Betriebssysteme, RWTH Aachen. ISORC 2001.
[7] J. W. S. Liu. Real-Time Systems. Prentice Hall, 2000.
[8] A. Rubini. Linux Device Drivers. O'Reilly & Associates, 1998.
[9] A. Mittal, G. Manimaran, and C. Siva Ram Murthy. Dynamic real-time channel establishment in multiple access bus networks. Computer Communications, 2002.
[10] Compaq, Microsoft, and Intel. Virtual Interface Architecture Specification, Version 1.0. Technical report, December 1997.
[11] InfiniBand Trade Association. InfiniBand Architecture Specification, Release 1.0. http://www.infinibandta.org.