A Prototype Implementation of an SCI-ATM Bridge for Interconnection of Local Area Multi Processors
Tarik Čičić
Department of Informatics, University of Oslo
e-mail:
[email protected] May 13, 1997
Acknowledgments

I would like to thank my supervisors, professor Stein Gjessing and graduate research fellow Haakon Bryhni, for all their assistance with writing this thesis. Without Bryhni's supervision, this thesis would never have looked like it does. Thanks to the SCI group at the Department of Informatics at the University of Oslo for interesting and educational discussions. Special thanks to graduate research fellow Knut Omang for helping me with all my questions about SCI programming. Thanks to Kjell Gjære (Telenor R&D) and Kjetil Otter Olsen (USIT) for practical help with ATM and for enabling the wide area tests. Thanks to professor Olaf Owe for the suggestions he gave me and for reading the drafts. Thanks to my wife Janne for the support and the patience she showed. Finally, thanks to my daughter Emina. She did not cry too much during her first year of life, and she rewarded me with a smile whenever times were hard.

Tarik Čičić
Oslo, May 13, 1997
Contents

1 Introduction   1
  1.1 ATM-SCI Connectivity   2
  1.2 Problem Statement   3
  1.3 Method   3
  1.4 Organization of this Thesis   5

2 Background   7
  2.1 A Brief Overview of the SCI and ATM Standards   7
    2.1.1 Scalable Coherent Interface – SCI   7
    2.1.2 ATM – Asynchronous Transfer Mode   9
  2.2 Laboratory   12
    2.2.1 Hardware   12
    2.2.2 Programming Interface   13

3 Basic Measurements of Network Performance   17
  3.1 Performance Measurements   17
    3.1.1 Definitions   17
    3.1.2 Methodology   19
  3.2 SCI Performance   19
    3.2.1 Throughput   19
    3.2.2 Latency   22
  3.3 Performance of ATM network   24
    3.3.1 Throughput   24
    3.3.2 Throughput Using TCP/IP over ATM   29
    3.3.3 Latency   30
  3.4 An Original Measurement Method for Symmetric Systems   32
  3.5 Summary   35

4 Implementing a Simple Bridge   37
  4.1 The Algorithm: from an Idea to the Formal Proof   37
  4.2 Performance   43
  4.3 Summary   46

5 The SCI-ATM Bridge Emulator   47
  5.1 Requirements   47
  5.2 Architecture   48
  5.3 Interconnection Model   50
  5.4 Design of the Bridge   51
  5.5 Why Implement a Dedicated ATM Protocol?   52

6 The Bridge – Program Description   55
  6.1 Data Model   55
  6.2 Execution Flow   57
  6.3 Communication Protocols   60
    6.3.1 Bridge Request Protocol   61
    6.3.2 Tarik's ATM Protocol   62
  6.4 Error Control   64
    6.4.1 The Life of a Data Packet   64
    6.4.2 When a Packet is Lost ...   65
    6.4.3 ... and When an Acknowledgment is Lost   66
  6.5 Instrumentation   67
  6.6 Inter-thread Signaling   67

7 The Bridge – Performance Analysis   69
  7.1 Total Producer/Consumer Performance   69
  7.2 A Detailed Time Usage Study   71
  7.3 Time Consumption in the ATM Threads   77
    7.3.1 Time Consumption in the ATM Sender   78
    7.3.2 Time Consumption in the ATM Receiver   79
  7.4 Throughput with Multiple Data Transfers   80
    7.4.1 Unidirectional Data-flow   80
    7.4.2 Bidirectional Data-flow   81
    7.4.3 Throughput with Four Simultaneous Transmissions   82
  7.5 Performance Measurements for Longer Distances   82
    7.5.1 ATM Performance   82
    7.5.2 Bridge Performance   83
  7.6 Summary   84

8 The Bridge – Analytical Performance Model   85
  8.1 Throughput   86
    8.1.1 Tarik's ATM Protocol   86
    8.1.2 TCP/IP transmission over ATM   90
  8.2 Latency   92
  8.3 Summary   93

9 Towards a Hardware Bridge Implementation   95
  9.1 Hardware Model   95
  9.2 Throughput Expectations   97
  9.3 Latency Expectations   103
  9.4 Shared Memory Mapping   104
  9.5 Summary   105

10 Conclusions   107
  10.1 General Conclusions   107
  10.2 Critique of Method Used   109
  10.3 Future Work   110

A About the Bridge Emulator   111
  A.1 Hardware Requirements   111
  A.2 Software Requirements   111
    A.2.1 Execution   111
    A.2.2 Compilation   112
  A.3 Usage   112
    A.3.1 The Programs   112
  A.4 Program Options   113
List of Figures

1.1 Three implementation models   4
2.1 SCI node model   8
2.2 SCI Request packet format   9
2.3 ATM cell format   10
2.4 AAL5 packet format   11
2.5 SCI-ATM laboratory   12
2.6 Slib, usage illustration   14
2.7 FORE's API, usage illustration   15
3.1 Latency measurements, diagram   18
3.2 SCI throughput: small buffers   20
3.3 SCI throughput: large buffers   21
3.4 SCI throughput: multiple threads   22
3.5 ATM throughput: low bandwidth   24
3.6 ATM throughput: medium bandwidth   25
3.7 ATM throughput: high bandwidth   25
3.8 ATM throughput: simplex connection, max QoS   27
3.9 ATM throughput: duplex connection, max QoS   27
3.10 ATM packet-discarding rate, simplex   28
3.11 ATM packet-discarding rate, duplex   28
3.12 ATM producer rate control   29
3.13 Throughput with TCP/IP over ATM   30
3.14 ATM latency: shorter packets   31
3.15 ATM latency: longer packets   32
4.1 Simple SCI-ATM bridge: Basic program structure   38
4.2 Simple bridge, correctness proof   42
4.3 Simple bridge throughput: low bandwidth requirements   44
4.4 Simple bridge throughput: high bandwidth requirements   44
4.5 Simple bridge: expected throughput   45
4.6 Simple bridge: achieved throughput   45
5.1 Message passing bridge architecture   48
5.2 Shared memory bridge architecture   49
5.3 Interconnecting two SCI rings by ATM link   50
5.4 The bridge – basic functionality   51
5.5 TAP – TCP comparison   53
6.1 Thread Hierarchy   56
6.2 Shared data structures   56
6.3 The Bridge: Execution Flow Chart   59
6.4 Communication protocols   61
6.5 Time sequence diagram   62
6.6 TAP: packet layout   63
6.7 Error recovery   66
7.1 Data-flow scheme, full   69
7.2 Data-flow scheme, bridge-bridge   71
7.3 Performance study: events of interest   72
7.4 Sequence of events in bridge A   73
7.5 Sequence of events in bridge B   73
7.6 Sequence of events in bridge A, detail   74
7.7 Sequence of events in bridge B, detail   74
7.8 Buffering time, for all packets   76
7.9 ATM time, for all packets   76
7.10 Total bridge/bridge latency, for all packets   77
7.11 Timing points in the ATM threads   78
7.12 Multiple data transfers, unidirectional   81
7.13 Multiple data transfers, bidirectional   81
8.1 TAP/TCP, throughput comparison   94
9.1 Layers, software and hardware bridge   95
9.2 Hardware bridge model   96
9.3 Hardware model expectations, example 1   102
9.4 Hardware model expectations, example 2   103
A.1 Usage example   112
List of Tables

1.1 Implementation methods, comparison   4
3.1 SCI latency   23
3.2 ATM latency   35
6.1 inProcess Data Structure   57
6.2 outProcess Data Structure   58
7.1 Bridge throughput for different packet sizes   70
7.2 Acknowledgment statistics, single data transfer   75
7.3 Statistics for a typical session   75
7.4 ATM sender: time usage statistics   79
7.5 ATM receiver: time usage statistics   79
7.6 Acknowledgment statistics, multiple data transfers   81
7.7 ATM latency measurements — longer distances   83
7.8 Bridge throughput — longer distances   84
8.1 Analytical model: parameters overview   88
8.2 Analytical model: expected latency   93
9.1 Hardware bridge: variables   99
9.2 Hardware bridge: parameters   99
9.3 Hardware bridge: expected latency   104
Chapter 1
Introduction

The need for faster computing systems seems to be ever increasing. Unfortunately, the physical limits posed on every particular hardware solution are rigidly fixed, and attempts to come as near as possible to those limits increase costs significantly. This is why improving the performance of computer systems by building stronger and stronger computers seems to lose ground. An alternative way is to combine (preferably existing) workstations into clusters and make them join forces on solving a computational problem [NOW 97].

There are several problems with computer clusters. One of them is that, traditionally, processors communicate with memory, peripherals and each other through a system bus. The system bus is a broadcast medium where all traffic is visible to all nodes, even though usually only one node is interested in it. Only one node can "talk" at a time, and the bus obviously becomes a bottleneck when the number of nodes increases. This problem is usually referred to as the scalability problem.

Another problem related to computer clusters is latency, i.e. the delay between a request for data and the receipt of the data. The latency can sometimes be avoided by using local copies of data in cache memory. The problem with caching is that every cached copy of the data must be kept consistent when the data is modified; inconsistencies may lead to undesired program behavior. This problem is called the cache coherence problem.

A solution to these and other related problems has been suggested through the concept of the Local Area MultiProcessor – LAMP [Gustavson, Li 95]. The specific technology that has been developed for implementing LAMP is the Scalable Coherent Interface – SCI [1]. While solving the clustering problems, SCI poses another problem, namely locality. An SCI network cannot stretch over long distances, and we need another interconnection method to make two remote SCI clusters communicate. Conventional interconnection methods, such as TCP/IP programming over the existing wide area networks, are not of much help — they are too slow and cannot support the SCI-specific services, such as shared memory. We obviously need an interconnect technology well suited for overcoming long physical distances, and as fast as possible. The most suitable technology for this purpose today is Asynchronous Transfer Mode – ATM [2].

[1] The SCI technology is briefly presented in chapter 2 of this thesis. A beautiful introduction to LAMP and SCI is given in [Gustavson, Li 95]. A concise article-formed introduction to SCI is given in [Gustavson, Li 96]. The full SCI specification is given in [IEEE 92].
[2] Chapter 2 of this thesis shortly presents the ATM technology. An introduction to ATM is given in section 10.4 of [Halsall 95]. The complete specification of the ATM user-network interface is given in [UNI 94].
ATM is a wide area network technology, thereby solving the distance problem. It operates at high bandwidth, though somewhat lower than SCI. ATM enjoys a major advantage compared to most competing technologies: ATM connections guarantee "Quality of Service" (QoS) attributes, making ATM suitable for interconnection purposes. Nevertheless, many difficulties follow from SCI-ATM interconnectivity; some of them will be revealed in this thesis.
1.1 ATM-SCI Connectivity

Why may somebody wish to interconnect SCI and ATM? A basic reason is that SCI covers the needs for fast networking on distances ranging from very short to local area networks (LAN), and ATM is suitable for networks ranging from LAN to the wide area networks (WAN). Interconnecting ATM and SCI successfully would mean that, using only two basic technologies, we get a high-speed networking system for both local and wide area networks. We isolate several more specific applications:

• ATM interconnection of computer clusters based on SCI in order to solve complex calculations which do not fit a single computer or computer cluster. In this way computers can cooperate efficiently even when they are physically located at different sites.

• Access to a clustered parallel server (e.g. HTTP, NFS, media server). Scalable support for heterogeneous media types including isochronous data streams like video and sound is a challenge for both the network and the cluster. One of the most input/output intensive applications today is handling the exponential growth in WWW traffic. Scalability is an important requirement for server design, where SCI has promising properties. Connecting such clusters to the rest of the world seems to be a well suited application for ATM.

• Using SCI as a switching technology for ATM is discussed in the literature [Gustavson, Li 96]. Its main advantage is switch scalability. On the other hand, it requires a very efficient SCI-ATM interconnection for each port of the SCI-based switch. Thus, primarily because of the high cost/performance ratio, we do not expect this idea to be realized soon.

Both ATM and SCI represent modern and fast communication technologies, but connecting ATM and SCI is not an easy task. Simple connectivity and block data transfer is not that complicated to implement, but transparent connection of two SCI systems over ATM, or using SCI as a switching technology for ATM, is hard. The reasons are many. SCI and ATM address different needs. As an example, SCI supports shared memory and even cache coherence, which is not a goal of ATM, and ATM provides no hardware (or other) support for it. ATM's 48-byte payload size is a poor match to the typical 64-byte SCI payload, and this must be handled by the interface. Different error handling and flow control are major technical challenges. Since each SCI transaction has a request and a response part (described in section 2.1.1), SCI is vulnerable to latency. ATM has significantly higher latencies, and even worse, they are not handled in the ATM protocols. Some independent research projects show that increased latency often means much worse application performance [NOW 97].
What does this mean for us? Because of these difficulties, we will primarily approach the SCI-ATM interface using the traditional I/O model. At the end of this thesis, we will also discuss problems related to transparent SCI over ATM.
1.2 Problem Statement
The major problem this thesis is concerned with is the interconnection of SCI based clusters by means of ATM. We want to show that the interconnection of SCI clusters over ATM is possible, and to gain knowledge which will help us to estimate to what extent the interconnection is implementable. In order to encompass the interconnection problem, we will concentrate on a number of subproblems such as:

• the possibilities and limitations of the existing hardware,
• prototype implementation, design of the SCI-ATM bridge algorithm, and communication protocol design,
• prototype analysis,
• implications for future hardware implementations.

We will focus on implementing a working prototype of the SCI-ATM bridge for interconnection of computer clusters. Studying the implementation will help us to understand the system behavior, and how the various parameters (buffer size, chosen protocol, inter-bridge distance etc.) influence it. Furthermore, we will use the gathered knowledge to propose a basic architecture and to approximate the performance of a future hardware implementation.
1.3 Method
Since operational SCI and ATM hardware is available, implementing a hardware SCI-ATM bridge would be the ultimate proof of the concepts of SCI-ATM interconnection. However, high speed hardware design is not possible for us in terms of cost and complexity. To our knowledge, two projects are pursuing a hardware solution [NAVY 97], [Kure, Moldeklev 94], but none is commercially available yet [Gustavson, Li 96].

Another method is simulating an SCI-ATM bridge in software. All the items, events and processes which characterize an SCI-ATM interconnection may be simulated in discrete event time on a software model. One project, led by Haakon Bryhni, takes this approach, and is currently going on at the Department of Informatics at the University of Oslo.

A third method, situated in between the pure software and hardware approaches, is implementing an SCI-ATM bridge emulator. We choose the word "emulator" [3] to denote a method of using existing hardware to the maximal extent, using software interventions only when necessary. An overview of pros and cons for all three methods is given in table 1.1. Figure 1.1 shows which parts of the SCI-ATM bridge would be implemented in software and which in hardware, depending on which of the three methods is used.

[3] to emulate: "1a: to strive to equal or excel 1b: Imitate 2: to equal or approach equality with - em.u.la.tor n" (Source: Webster's dictionary of the English language).
Figure 1.1: Software/hardware approach to three possible bridge implementation models
Method                  | Positive sides                                     | Negative sides
Hardware Implementation | performance (throughput + latency)                 | costs, design time, flexibility, errors hard to fix
Emulation               | costs, design time, close to reality               | flexibility (limited to laboratory configurations)
Simulation              | flexibility, costs, large configurations possible  | detached from reality, modeling inaccuracies, not real time

Table 1.1: Three possible implementation methods, comparison
Since a hardware implementation is beyond our means and the time available, the choice is between implementing a simulator or an emulator. We choose to implement an SCI-ATM bridge prototype using the emulator method for the following reasons: By choosing the emulator method we come closer to a realistic model of the SCI-ATM bridge. The process of fighting with the current technology helps us to learn the required functionality of a future hardware implementation, and to understand the realities of the processes involved in moving data across the different interconnections. The model developed in this process will benefit both hardware design projects and simulator projects. We will provide an operational prototype to prove that the interconnection methods we will propose are implementable in reality. By providing rich instrumentation of the prototype we will extract functional behavior and performance characteristics of a typical system, gaining insight into problems related to the ATM-SCI interconnection. Furthermore, the instrumentation provides a real life evaluation of the quantitative properties of the "atomic" bridge operations. Based on the achieved results and general knowledge we will establish parameterized interconnection models, and predict the functionality and performance of possible hardware implementations.
1.4 Organization of this Thesis
Chapters 2 through 4 provide background. Chapter 2 gives a short general introduction to the SCI and ATM standards, and presents the laboratory environment where the practical work is performed. In chapter 3 the performance analysis of the available SCI and ATM hardware is presented. In chapter 4 I introduce a simple SCI-ATM bridge algorithm, and analyze its correctness. Chapters 5 through 8 present the SCI-ATM emulator. In chapter 5 I present the emulator model by stating the model requirements and discussing the architecture. In chapter 6 I introduce the programming solution for the bridge emulator. The bridge performance analysis is given in chapter 7, and the analytical model of the bridge performance is constructed in chapter 8. In chapter 9 I suggest a hardware bridge solution, and implement its analytical performance model. In chapter 10 I make a summary of the performed work, discuss the achieved results, and suggest some interesting further research.
Chapter 2
Background

2.1 A Brief Overview of the SCI and ATM Standards

2.1.1 Scalable Coherent Interface – SCI
SCI is a point-to-point unidirectional network solution [1] which includes devices and software support for services ranging from simple message passing to coherent cache memory sharing [2]. For a detailed study of the SCI standard the reader is encouraged to read the "IEEE Standard for Scalable Coherent Interface" [IEEE 92], where all details are thoroughly discussed. In the sequel I give some basic information about the SCI standard, divided into two parts. The first tries to give an idea of what SCI looks like physically, and the other gives some basics about the higher abstraction layers.

Physical layer

SCI is intended to connect computers over shorter physical distances, typically up to 10 m when using copper links. Fiber-optic links are supported, increasing node distances. SCI links are of the unidirectional point-to-point type, thereby solving the bus bottleneck problem. Each node has at least one incoming and one outgoing connection, but it is possible to connect several interfaces at a single node, achieving a simple switching mechanism. The simplest network model is a ring connecting two or more nodes, but switching gives the possibility of more sophisticated topologies.

An SCI system is based on a 64-bit address architecture, where 16 bits are used as the node identifier, and the remaining 48 bits are used within each node. It is therefore possible to scale up to 65536 (2^16) nodes, with a 256 TBytes (2^48) address space for each node.

A fixed addressing model enables SCI to perform basic packet routing at the physical level, preventing the nodes from being bottlenecks for the high-bandwidth links. A simple header comparison picks packets addressed to the local node and puts them in a waiting queue, the input FIFO. Other packets are sent further on, but a bypass FIFO is needed in order to prevent collision with a packet possibly coming from the node's output FIFO. Figure 2.1 shows the basic SCI node model.

[1] "Point-to-point" means that only two computers share a single link. "Unidirectional" means that on a single physical link all data flows in one direction.
[2] Cache coherence is not supported in the SCI implementation we are currently using. However, memory sharing is supported.
Figure 2.1: SCI node model

Logical layer

SCI is described in terms of a distributed shared-memory model with cache coherence [3], because that is the most complex service SCI provides. Simpler services (memory sharing, message passing) are also provided, and the services may be freely mixed with each other.

Transactions [4] are performed by sending packets from the output queue in one node to the input queue in another. A packet consists of a sequence of 16-bit symbols. It contains a header (containing address, command and status information), data (optional) and a check symbol. On its way to the target node it may pass through intermediate nodes, as previously explained. Transactions consist of two subactions, a request and a response subaction, at the "requester" and "responder" nodes respectively. Each subaction consists of two packet transmissions, a send and an echo packet. Hence a typical transaction involves four packet transfers.

Figure 2.2 shows the SCI request-packet format. It is shown as an example; a discussion of the various other packet formats is given in [IEEE 92]. The targetID field specifies the node identification number (as explained on page 7), and the command field specifies the type of packet (read00, writesb etc.) and also the length of the data field. SourceID identifies the requester node. Control is used for flow control and is excluded from the cyclic redundancy code calculation, which is given in CRC. The 48 bits of address offset select a specific memory or register location and are typically dependent on (and interpreted by) the responder. The possible presence of the ext field, an extended header, is signaled by a bit in the command field. It is used in cache-coherence transactions; other uses are reserved for future extensions to SCI.

Figure 2.2: Request packet format; grayed fields are of variable size

SCI provides various transaction commands:

• response-expected subaction commands:
  – selected-byte read, write and lock (e.g. "readsb")
  – non-coherent memory read and write for 16, 64, and 256 byte blocks (e.g. "nwrite64")
  – coherent memory control ("mread00"), read and write (e.g. "mwrite16")
  – cache-to-cache control (e.g. "cwrite64")

• no-response subaction commands:
  – move commands (e.g. "smove64", "dmove16")

A detailed discussion of transaction commands is given in [IEEE 92].

[3] To improve performance, computers transfer blocks of data from main memory to fast cache memory. In a multiprocessor system there can be copies of the same data block in many caches at the same time. If one processor modifies the data, all other copies become obsolete. SCI maintains coherence in hardware, so the caches may be used without modifying application software.
[4] In the SCI terminology, an information exchange between two nodes (for instance writing four bytes to a shared memory buffer) is called a "transaction".
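To make the packet layout above more concrete, the following C sketch mirrors the request-packet fields as a plain data structure. It only illustrates the field order and sizes described in the text and in [IEEE 92]; the type and field sizes for a real, variable-length wire packet differ, and the struct name is invented for this example.

#include <stdint.h>

/* Illustrative layout of an SCI request packet (cf. figure 2.2).
   The 64-bit SCI address is split into a 16-bit node identifier
   (targetID) and a 48-bit offset within that node. */
typedef struct {
    uint16_t targetID;          /* destination node identifier            */
    uint16_t command;           /* packet type, e.g. read00, writesb      */
    uint16_t sourceID;          /* requester node identifier              */
    uint16_t control;           /* flow control, excluded from the CRC    */
    uint8_t  addressOffset[6];  /* 48-bit offset within the target node   */
    uint16_t ext;               /* optional extended header               */
    uint8_t  data[256];         /* 0, 16, 64 or 256 bytes of payload      */
    uint16_t crc;               /* 16-bit check symbol                    */
} sci_request_packet;           /* ext and data are optional/variable in
                                   a real packet; fixed here for clarity  */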
2.1.2 ATM – Asynchronous Transfer Mode
ATM is an increasingly popular networking standard, widely expected to take a leading position among computer network technologies. It is designed to carry different types of information (audio, video, data), and is primarily intended as a Wide Area Network (WAN) technology. Due to its effectiveness, and partly due to the fact that installation of long WAN links based on a new technology takes time, ATM is increasingly popular also as a Local Area Network (LAN) technology.

Basic terms and switching principles

ATM is connection oriented — the route that information will follow from one computer to another must be decided during connection start-up, before the actual data transfer takes place. To minimize transmission delay ATM is based on short, fixed-length cells, containing a 5-byte header and a 48-byte payload. Routing information, called the virtual path and virtual channel identifier (VPI/VCI pair) [5], is contained in the header of each cell.

On the way to its destination, a cell typically passes through one or more switches (of course, point-to-point connections are also possible). Cell routing in a switch is a simple task, since it is based on a single look-up operation. Every incoming port on the switch has an associated routing table which determines the outgoing port for each VPI/VCI pair. The VPI/VCI identifiers may be changed in the switch before the cell is sent further.

Each virtual circuit (virtual path/channel pair) can be associated with different traffic characteristics, called Quality of Service (QoS). QoS is usually negotiated during connection establishment, and determines the peak and the mean bandwidth reserved for a virtual circuit, and also the maximal burst length (the longest uninterrupted sequence of data). Signaling on an ATM link is done over a specialized pair of virtual channels, called Signaling Virtual Channels (SVC).

Figure 2.3: ATM cell format

Cell Format

The ATM cell header appears in two forms, one at the user-network segment [6] and one within the network. The difference is in a 4-bit generic flow control (GFC) field, which takes the place of the first four bits of the virtual path identifier. The other fields contained in the header (as shown in figure 2.3) are:

• Virtual Path Identifier (VPI), 8 bits long at the user-network segment, 12 bits within the network,
• Virtual Channel Identifier (VCI), 16 bits long,
• Payload Type Identifier (PTI), three bits, used to flag user data/network control, congestion and service data unit type,
• Cell Loss Priority (CLP), one bit; in the case of heavy load, cells with this bit set are discarded first, and finally
• Header Error Checksum (HEC), which is an 8-bit CRC over the first four header bytes, generated by the physical layer.

[5] There is a hierarchical relation between virtual paths and virtual channels, in the sense that a virtual path contains many virtual channels.
[6] The user-network segment is the part of the ATM network before the local switch / remote concentrator unit performs flow control.
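As a sketch only, the header layout at the user-network segment can be written down as a C bit-field structure. The field widths follow the list above; the struct name and the use of bit-fields are my own choice for illustration (real drivers typically build the header by shifting and masking, since bit-field ordering is compiler dependent).

#include <stdint.h>

/* Illustrative ATM cell header at the user-network segment (5 bytes).
   Bit-field ordering is compiler dependent; this only documents widths. */
typedef struct {
    uint32_t gfc : 4;   /* generic flow control                    */
    uint32_t vpi : 8;   /* virtual path identifier                 */
    uint32_t vci : 16;  /* virtual channel identifier              */
    uint32_t pti : 3;   /* payload type identifier                 */
    uint32_t clp : 1;   /* cell loss priority                      */
    uint8_t  hec;       /* 8-bit header error checksum (CRC)       */
} atm_cell_header;      /* followed by the 48-byte cell payload    */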
Figure 2.4: AAL5 convergence sublayer and segmentation and reassembly PDU format (UU: AAL layer user-to-user identifier, CPI: common path identifier, CRC: cyclic redundancy check)

ATM adaptation layer

Since a sequence of cells is not directly useful in many applications, and in order to simplify the task of application programming, ATM adaptation layers (AAL) have been introduced. Currently five different AALs are defined:

• AAL1, continuous bit rate
• AAL2, aimed at video transmission, specification not completed yet
• AAL3/4, packet transfer service, with a segmentation and reassembly (SAR) header including a CRC [7], sequence number etc.
• AAL5, simple packet transfer service (with a null SAR sublayer)

The ATM adaptation layers are intended to cover different needs for the transmission of audio, video and data, and hence they differ in the timing relationship between source and destination users (for instance, voice packets become obsolete if they time out, and have to be discarded), bit rate constancy and connection mode. AAL1 and AAL2 have a timing relationship, while AAL3/4 and AAL5 do not (AAL3/4 and AAL5 packets may be discarded only due to heavy traffic). AAL1 is the only one having a constant bit rate; AAL2, for instance, is assumed to carry compressed video data, where the same quantity of data may correspond to a different duration of video playback — therefore the AAL2 bit rate is variable. AAL3/4 and AAL5 are regarded as connectionless services in the sense that all routing is done at a lower abstraction level and no separate signaling virtual channel connection is needed, in contrast to AAL1 and AAL2.

An AAL comprises two sublayers, the convergence sublayer (CS) and the segmentation and reassembly sublayer (SAR). The CS provides timing and cell loss recovery (AAL1, AAL2) and cell loss detection (AAL3/4, AAL5). The SAR breaks larger data blocks (up to 65536 bytes) into 48-byte payloads, and reassembles them at the destination. The AALs add extra information (sequence number, CRC, length identifier etc.) to each user data block. The length and content of this information vary for each AAL type. The simplicity and conciseness of AAL5 make it our choice when implementing the different ATM applications described in this report. Figure 2.4 shows the AAL5 convergence sublayer and segmentation and reassembly Protocol Data Units (PDUs).

[7] CRC - Cyclic Redundancy Check, a number calculated from the cell bytes by a polynomial formula. It is used for error detection.
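To illustrate the AAL5 framing just described, the short function below computes how many 48-byte cell payloads a user data block occupies: an 8-byte AAL5 trailer (UU, CPI, length, CRC) is appended and the block is padded so that the total becomes a multiple of 48. This is only a back-of-the-envelope sketch based on the field sizes in figure 2.4; it is not part of FORE's API, and the 9180-byte example size is just a commonly used AAL5 packet length.

/* Number of ATM cells needed to carry `len` bytes of user data over AAL5.
   The CS-PDU is the user data plus a 0-47 byte pad plus the 8-byte trailer,
   i.e. the total is rounded up to a multiple of the 48-byte cell payload. */
#define AAL5_TRAILER   8
#define CELL_PAYLOAD  48

static unsigned aal5_cells(unsigned len)
{
    return (len + AAL5_TRAILER + CELL_PAYLOAD - 1) / CELL_PAYLOAD;
}

/* Example: a 9180-byte packet needs aal5_cells(9180) = 192 cells,
   i.e. 192 * 53 = 10176 bytes on the wire for 9180 bytes of user data. */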
Figure 2.5: SCI-ATM laboratory
2.2 Laboratory

2.2.1 Hardware
The following study is done in a laboratory consisting of four SparcStation 20 workstations, each equipped with a Dolphin Interconnect Solutions SBus-to-SCI adapter (also called "SBus-2"), connected in a four-node cluster by 1 Gbit/s links. Two workstations have a FORE SBA200 SBus ATM adapter, and two have the newer FORE SBA200E, all with 155 Mbit/s links. All ATM adapters are connected to an ASX200WG switch.

The Dolphin SBus-to-SCI adapters [Dolphin 97] have 256 kBytes of on-board RAM, an SBus/SCI address translation circuit and a DMA controller supporting up to 64 kByte block moves from SBus memory to the SCI memory domain. Note that the SBus-2 cards were put into use after the work on this thesis was started; some of the work in chapter 4 was performed with the older SBus-1 cards.

The FORE SBA200 and SBA200E hardware [FORE Ada. 96] includes an on-board 25 MHz i960 microprocessor, specialized circuits for HEC, CRC and AAL3/4 and AAL5 calculations, 256 kBytes of RAM and a DMA circuit supporting block transfers of up to 16 words with a 32-bit data path.

The ASX200WG switch [FORE Sw. 96] has a 2.5 Gbit/s switching fabric, serves up to 24 ports, and has a transit delay of less than 10 µs. It is served by an i960CA switch control processor. It also has an Ethernet interface, and its switch control software is programmable via AMI ("ATM Management Interface"). The topology is shown in figure 2.5.

The major restriction of the SCI hardware used, compared to the SCI standard, is the lack of support for cache coherence. The current implementation accepts 1, 2, 4, 8 and 16 byte selected read/write transactions, and generates 1, 2, 4 and 8 byte ones. The move transaction is limited to 64 bytes. The ATM adapter supports AAL3/4 and AAL5, as well as the so-called AAL-null single cell generator, which may be used to implement other AALs.
2.2.2 Programming Interface
In this work all programming is done in the C language, using the standard Solaris C compiler with default code optimization. The SCI programming interface is based on the Slib function library developed by Knut Omang at the Department of Informatics, University of Oslo [Omang 96]. This library offers a higher level programming environment including a lightweight message passing interface, easy setup and management of shared memory segments, barrier synchronization utilities and support for programs running multiple threads on each node. ATM programming is done using FORE's application programming interface, distributed with the ATM adapter software. The interface contains a set of sockets-like commands enriched with ATM features (Quality of Service, simplex/duplex transfer etc.), and a set of ATM-specific commands (e.g. Permanent Virtual Circuit (PVC) management).

An example of Slib usage is given in figure 2.6. The code presents the core of an SCI producer/consumer program. A simple producer/consumer program (as used intensively in this thesis) passes through the following phases:

1. initialization, including file descriptor assignment with the s_open() command
2. connection establishment by calls to the s_mkread() / s_mkwrite() commands
3. data transfer with a sequence of s_read() / s_write() calls
4. timing result presentation and closing of the connection(s)

ATM producer/consumer programs use a similar algorithm, though the commands are different. An example is given in figure 2.7. In this example a permanent virtual circuit (VPI 0, VCI 77/78) is used. Vpvc, Aal_type, Atm_conn_resource, Atm_info and many other ATM-specific data types are defined in the "fore/types.h" header file. The Atm_conn_resource structure, for instance, is a triple containing the peak bandwidth, mean bandwidth and mean burst for the connection.

The atm_open() command assigns the desired ATM device to a file descriptor. After setting the QoS demands in the Atm_conn_resource structure, the atm_bind_pvc() command is invoked in the sender, and the atm_connect_pvc() command in the receiver. These commands must occur in pairs. Note that the QoS parameter is given by reference; the atm_bind_pvc() and atm_connect_pvc() commands may decrease the desired values if no bandwidth is available (in the current implementation of FORE's API this does not happen; we will return to this later). If any of the given commands fails, a value less than zero is returned, and the atm_error() command may be used to write the error message to the standard output. The atm_send() and atm_recv() commands respectively send and receive a data packet. The packet may be of any length from 4 Bytes to the "maximal packet length"; this value is returned by the atm_open() command.

An alternative method of ATM connection establishment is the use of Berkeley sockets-like commands (atm_gethostbyname(), atm_connect(), atm_listen(), atm_accept()). Contrary to the traditional sockets, these commands are not independent of the underlying network type, but they are handy in the development of more general applications for ATM networks.
...
#include "slib.h"
...
void main(int argc, char** argv)
{
  int fileDes, myRank;
  char *buffer;
  double initTime;
  ...                                     /* Other declarations, initialization */

  myRank = s_rank();                      /* Obtain my SCI rank. Read config file if unknown */

  buffer = (char *)memalign(64, 65536);   /* Allocate 64k buffer. The memory offset must be
                                             a multiple of 64 for DMA to work efficiently */

  fileDes = s_open();                     /* Open new file descriptor */

  if (consumer)
    /* Create a read connection via port 'PORTNO', buffer size set to 64k.
       'TRUE' implies a blocking call. */
    s_mkread (fileDes, PORTNO, 65536, TRUE);
  else
    /* Create a write connection to remote node 'remRank' via port 'PORTNO'.
       Wait for the answer (a corresponding call to s_mkread). */
    s_mkwrite (fileDes, remRank, PORTNO, TRUE);

  initTime = realtime();                  /* Store current time, in seconds
                                             (accurate to microseconds) */
  while (more_data)
    if (consumer) {
      /* Read 64k of data into the buffer. s_read returns the amount of data
         actually read; a negative value means an error has occurred */
      if (s_read (fileDes, buffer, 65536) < 0)
        sciPerror ("s_read failed ");     /* Display error message */
    } else {
      /* Write 64k of data from the buffer. Report if an error occurred. */
      if (s_write (fileDes, buffer, 65536) < 0)
        sciPerror ("s_write failed ");
    }

  usedTime = realtime() - initTime;
  ...
  s_close (fileDes);
}

Figure 2.6: Slib SCI producer/consumer illustration, raw I/O commands
#include <stdlib.h>                       /* malloc */
#include <fcntl.h>                        /* O_RDWR */
#include "fore/types.h"                   /* FORE's ATM-specific data types */

void main(int argc, char** argv)
{
  char *buffer;
  int fd;                                 /* file descriptor */
  Vpvc vpvcOUT = 77, vpvcIN = 78;         /* We want to write data using virtual path 0,
                                             virtual channel 77, and to read from virtual
                                             path 0, virtual channel 78. This virtual path
                                             and these channels must be known to the switch
                                             we are connected to, and reserved as PVCs */
  Aal_type aal = aal_type_5;              /* We want to use AAL5 */
  Atm_conn_resource qos;                  /* Quality of Service */
  Atm_info info;                          /* Info structure, currently containing only
                                             the max packet size */
  ...                                     /* Other declarations */

  /* Assign the ATM device to a file descriptor. The info structure is filled in */
  fd = atm_open("/dev/qaa0", O_RDWR, &info);

  buffer = (char *)malloc(info.mtu);      /* Allocate a buffer of max packet size */

  /* Demand quality of service. If we ask for too much, we may be refused. */
  qos.peak_bandwidth = 155000;            /* kbit/s */
  qos.mean_bandwidth = 50000;             /* kbit/s */
  qos.mean_burst = 80;                    /* kbit */

  if (sender) {
    /* We are the producer in this connection. We bind the file descriptor to the
       outgoing PVC. We also request the AAL type and quality of service */
    if (atm_bind_pvc (fd, vpvcOUT, aal, &qos) < 0)
      atm_error ("atm_bind_pvc");         /* Report error state */
  } else {
    /* We are the consumer. We connect to the incoming PVC, bound by the producer
       at the opposite side. If the result is less than zero, an error occurred. */
    if (atm_connect_pvc (fd, vpvcIN, aal, &qos) < 0)
      atm_error ("atm_connect_pvc");
  }

  while (more_data) {
    if (sender)
      atm_send (fd, buffer, info.mtu);    /* Send a packet of max size */
    else
      atm_recv (fd, buffer, info.mtu);    /* Read a packet into the buffer */
  }
  ...
  atm_close (fd);
}

Figure 2.7: FORE's API, usage illustration
Chapter 3
Basic Measurements of Network Performance

In this chapter I present the results of performance measurements on both SCI and ATM based networks. There are two characteristics we are particularly interested in: latency and throughput. Since different authors do not always use these two terms in the same manner, we guard against misunderstanding by giving definitions of throughput and latency as used and measured in this work. Then I present a series of measurements and show how throughput and latency depend on different communication parameters. Finally, I introduce an original measurement method, constructed during the work on this thesis.
3.1 Performance Measurements

3.1.1 Definitions
We define throughput and latency for a network connection between two computers, named A and B.

Definition 1: The throughput of the network connection from computer A to computer B is defined as a value expressing how much data passes through the connection during the time period from just before the first packet is sent until the last packet is received. Throughput is usually expressed in megabytes per second, [MBytes/s].

Definition 2: Latency is defined as the period of time elapsed between the start of sending a data packet from computer A to computer B, and the end of receiving that data packet on computer B. Latency is usually expressed in microseconds, [µs].

When measuring ATM network performance we need an additional clarification of the definition of throughput. In a situation where a certain percentage of the data packets is lost, we calculate throughput as the (received data length)/(time) ratio. When measuring the performance of a duplex connection, we calculate throughput as the (total data)/(time) ratio, where "total data" is the amount of data sent in both directions.

Let us also briefly discuss the problems of, and solutions for, latency measurement. The following discussion is mostly concerned with ATM latency, but similar reasoning may be applied to SCI message passing.
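As a small worked example of these definitions (with made-up numbers): if a producer sends 1000 packets of 8 kBytes each and the consumer receives 950 of them during a period of 1.0 s, measured from just before the first packet is sent until the last packet is received, then only the received data counts, and the throughput is

    (950 x 8 kBytes) / 1.0 s = 7600 kBytes/s, i.e. about 7.6 MBytes/s.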
Figure 3.1: Latency measurements, diagram
In figure 3.1, the latency is represented by the time t0. t0 is the sum of the packet delay t'2 and the transmission time t'1, and we would also like to know how long each of them is. Because a receive command is typically of a blocking nature [1], we cannot measure t'1 and t'2. What we can do is to measure the time t1, which we will call the "send-time", and to assume that it is equal to the transmission time t'1, which further implies that t2 is equal to t'2 (note that t'1 + t'2 = t1 + t2).

One may wonder how the packet delay and the packet transmission time relate to the standard terms "frame propagation delay" and "frame transmission time". We usually talk about the latter in the link-layer context, while the transport layer is a more proper placement for the packets we analyze here [2]. The packet delay as we see and measure it includes the link propagation delay, but it also includes the DMA setup time and the switch latency (for ATM). The atm_send() command returns before the DMA transfer is finished, making the send time for longer packets shorter than the full frame transmission time. In this thesis we will not analyze the lower abstraction layers; all discussion concerns AAL5 and SCI raw read/write packets.

Let us go back to the latency measurement problem. We cannot measure t0 directly because of the major problem connected with latency measurements: clock synchronization. Typically, the two computers used in our measurements have unsynchronized internal clocks, because they are not initialized at the same moment. We may overcome this problem in two ways: constructing a simple hardware device which would send a "reset" instruction to the processors simultaneously, or using some kind of software synchronization. The software synchronization may be based on, for instance, SCI signaling [3], which we know is very quick. Unfortunately, this is not satisfactory, because the s_barrier() command [Omang 96] unblocks on the different computers within intervals of about 25 µs, which is much longer than the minimum SCI latency. Even worse, it is the operating system that schedules processes, so the awakened process may regain control after an arbitrarily long time, in the range of 1 ms.

To avoid the clock synchronization problem, we would like to register all timing data on one computer. This is, of course, possible, as long as the following symmetry assumption is valid: on a network link between two symmetric computers (with identical hardware, software and load), the average value of the latency and all its sub-parameters (packet delay, transmission time etc.) is the same for a given packet length, regardless of the data flow direction. If we also assume that the "processing time" t3 is negligible compared to the latency t0, we can simply measure the time t4 on computer A, and calculate the latency as l = t4/2. This method of latency measurement is in the literature sometimes called "ping-pong" latency measurement [Bryhni, Omang 96]. At the end of this chapter I will show how the clock synchronization problem can be partly overcome, and apply this original method to packet delay measurement.

[1] This means that it may be invoked a long time before the actual data transmission begins, and does not return before the data is transmitted or an error is detected.
[2] It is difficult to find a correct placement for the ATM layer and the ATM adaptation layers in the standard ISO reference model.
[3] This also applies to ATM latency measurements; recall that all our workstations are in one SCI ring (figure 2.5).
3.1.2 Methodology
We measure throughput by simultaneous invocations of a producer and a consumer program, varying the parameters of interest. The parameters always include the buffer size for a single read/write command, but also the total amount of data to be transferred, the number of cooperating threads, the delay between two simultaneous calls, the connection type on ATM (simplex/duplex, QoS) etc. When measuring ATM network performance, since ATM packet transport is unreliable, we are also interested in how many packets come through to the consumer. We express this value as a percentage (successfully received / total sent x 100%). The timing information is written to a file, which is then transformed and visualized by a math drawing tool. All measurements are done at least three times, and we present a mean value unless stated otherwise.

Latency is measured by two simple programs executing on two different computers. The programs are based on the "ping-pong" method, i.e. program A sends a data packet to program B, and program B returns an answer of equal length to program A. Program A writes the mean latency for a large number (~1000) of packets to a file. The measurement is repeated for different packet sizes, and plotted by a drawing tool.
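A minimal sketch of such a ping-pong measurement over ATM is shown below. It reuses the atm_send()/atm_recv() calls from figure 2.7 and the realtime() timer from figure 2.6; the connection setup is assumed to have been done already, and the packet count, function name and parameter names are chosen only for this illustration.

#define N_PACKETS 1000

/* Ping-pong latency sketch. Computer A measures t4 for each packet and,
   under the symmetry assumption, reports the mean latency l = t4/2.
   Computer B simply echoes every packet back. The result is in seconds. */
double ping_pong_latency(int fd, char *buffer, int size, int i_am_A)
{
  double start, total = 0.0;
  int i;

  for (i = 0; i < N_PACKETS; i++) {
    if (i_am_A) {
      start = realtime();
      atm_send (fd, buffer, size);        /* send the packet to B             */
      atm_recv (fd, buffer, size);        /* wait for B's answer              */
      total += realtime() - start;        /* accumulate t4                    */
    } else {
      atm_recv (fd, buffer, size);        /* receive the packet from A        */
      atm_send (fd, buffer, size);        /* return an answer of equal length */
    }
  }
  return (total / N_PACKETS) / 2.0;       /* mean latency, l = t4/2           */
}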
3.2 SCI Performance

3.2.1 Throughput
We will examine the throughput achieved in our SCI ring. This value depends on the SCI buffer size; the eligible buffer sizes are 64 Bytes – 128 kBytes. The choice of buffer sizes can be explained by the following (the list continues after figure 3.2):

• the DMA memory alignment is 64 Bytes
• smaller amounts of data can be transferred more efficiently by SCI shared memory commands
Figure 3.2: SCI throughput: Buffer size varying 64 Bytes - 8192 Bytes
• the total RAM in our SBus-2 SCI adapter is 256 kBytes; a buffer size of 128 kBytes (and more) means that only one SCI transaction can be outstanding, which decreases the performance significantly, as we shall see later.

We split the measurement interval into two overlapping parts: 64 – 8192 Bytes and 4 – 128 kBytes. The results are shown in figures 3.2 and 3.3 respectively. The results reveal that the maximal achieved speed for a zero-copy [4] block data transfer on the SCI ring is about 20 MBytes/s, when the buffer size is 64 kBytes – 128 kBytes (a 120 kByte buffer gives ~20.67 MBytes/s). The results also show that 75% of the maximum speed is reached at buffer sizes as small as 9 kBytes, which corresponds to the maximum ATM buffer size; this will prove important later. The performance suffers a dramatic fall at a buffer size of 128 kBytes. This is explained by the fact that the SCI adapter has 256 kBytes of buffer space; for a transfer of 128 kBytes of data (plus the protocol data unit including CRC and other information) the adapter is restricted to only one outstanding operation, even though almost 128 kBytes of the buffer space remains unused [Omang 97]. The throughput is decreased by approximately 50%, as expected. The throughput is not satisfactory at lower buffer sizes (1 kByte and less); SCI gives us better tools for data transfer of small packets.
with no memory copying in user space.
21
3.2. SCI PERFORMANCE
22
20
Throughput [MBytes/s]
18
16
14
12
10
8
6
20
40
60 80 Buffer size [kBytes]
100
120
Figure 3.3: SCI throughput: Buffer size varying 4 kBytes - 128 kBytes
One may wonder what happens when more than one producer/consumer pair are active in the SCI ring. We examined the following situations: 1. two and more producer/consumer pairs invoked on the same two nodes 2. one producer/consumer pair invoked with two or more cooperating threads of execution 3. two producer/consumer pairs invoked on two node pairs (making all four nodes in the SCI ring active) We did not notice any significant difference in performance between cases (1) and (2). In contrast to this, case (3) shows that two separate data transmissions on two separate node pairs can coexist almost without fall of performance at all, which means that cumulative throughput (thr1 + thr2) practically reaches 50 MBytes/s! This is maybe not so surprising, but indicates clearly that the bottleneck in SCI communication is not the SCI ring itself but the SBus interface. The experience we gathered says that a significant increase in throughput, around 20%, is achieved. It is not noticed significant difference between cases (1) and (2) stated above, throughput remains just under 26 MBytes/s. Throughput achieved when up to four threads are cooperating on sending data is revealed in figure 3.4. Using even more threads does not improve performance significantly. We notice that more cooperating threads implies more stable throughput value — when using three or four threads no sudden zig-zags are recorded.
22
CHAPTER 3. BASIC MEASUREMENTS OF NETWORK PERFORMANCE
26
24
Throughput [MBytes/s]
22
20
18
16
−−− one thread −x− two threads
14
−o− three threads −*− four threads
12
10
20
40
60 80 Buffer size [kBytes]
100
120
Figure 3.4: SCI throughput: cooperating threads, buffer size 4-128 kBytes
Both performance improvement and throughput stability can be explained by the DMA mechanism used by SCI. Every packet transfer takes two steps: DMA setup and data transfer. While the setup is engaging the computer’s CPU, data transfer is done by DMA alone. When we use only one thread, this must happen in sequence, which reduces the total throughput because the DMA engine is not active permanently. Using more than one thread makes simultaneous DMA setup and data transfer possible, thereby increasing the throughput. However, throughput can not increase infinitely, because at a certain point (26 MBytes/s) the DMA engine works continuously and reaches its limit. The described method for increasing throughput is in literature called DMA setup-time hiding [Omang 97], [Bryhni, Omang 96]. We also notice that deviation in the throughput measurements is small. Recorded data for each separate measurement follows the values plotted in 3.4 quite closely, implying that the zig-zags are not caused by any casual events. The multi thread case is of special importance to us when implementing SCI-ATM bridge.
3.2.2
Latency
When talking about SCI latency, there are at least two values of interest: 1. latency of shared memory operations (also called “programmed I/O”) 2. latency of raw read/write operations
23
3.2. SCI PERFORMANCE Packet Size [Bytes] 4 8 16 64 512 1024 2048 4096 8192
Latency (read -write) [µs] (SS20) 126 131 143 441 428 472 539 659 964
Latency (programmed I/O) [µs] (Ultra-2) 4.0 4.1 12.4 19.4 84.5 158.8 307.5 N.A. N.A.
Table 3.1: SCI latency depending on the packet size
Shared memory operations, such as 8-Byte remote store mwrite08(), are characterized by extremely low latency compared to most of the competitive technologies. Raw read/write has significantly higher latency, it is an interrupt based solution involving the driver and operating system scheduling to activate the process waiting for response from a remote node. On the other hand, the bad side of programmed I/O is that it is most effective when using active waiting, which wastes CPU cycles. In table 3.1 we show raw read/write latency measured in our SCI ring, for some characteristic buffer sizes. Because of current compatibility problems when using SBus-2 adapter on SparcStation 20, we were unable to measure latency of programmed I/O5 . As a compensation we refer to the measurement results published in [Omang 97]. These measurements are performed in SCI cluster based on more advanced UltraSparc-2 computers, and represent ping-pong latencies of consecutive remote stores using optimal size of store commands (4 and 8 Bytes are currently available). We do not expect programmed I/O latencies to increase significantly in SparcStation 20 based ring because programmed I/O is only initialized from the workstation — the commands are handled in the SCI adapter6 . The programmed I/O latency is card-dependent, and we use the same cards in our SCI ring. On the other side, it is in general not wise to give premature conclusions, we would prefer to test a fully compatible adapter. Let us go back to the raw read/write latency. Notice the dramatic latency jump for 64 Bytes buffer size. For comparison, latency measured for 60 Bytes buffers lies at 202µs, and for 68 Bytes at 216µs! An explanation for this may be that buffers with length which is a multiple of 64 Bytes are automatically transfered by DMA, with its costly setup. Buffers with length not divisible by 64 are sent by remote store operations. The read/write latencies for shorter packets remain high compared to the corresponding programmed I/O latencies, because the rest of the raw read/write operation, including operating system interventions, is the same as when sending large blocks.
5 Remote store commands have an abnormal behavior. SS20 platform is currently not supported by the SBus-2 adapters manufacturer. 6 For instance, the latency is similar in the PCI-SCI system [Ryan 96].
24
CHAPTER 3. BASIC MEASUREMENTS OF NETWORK PERFORMANCE
120
Throughput [kbit/s]
110
100
90
80
70
1000
2000
3000
4000 5000 6000 Buffer size [Bytes]
7000
8000
9000
Figure 3.5: ATM throughput: low bandwidth requirements (128kbit/s)
3.3 Performance of ATM network 3.3.1
Throughput
As an introduction to ATM network measurements, we set up three different scenarios and measure achieved throughput: 1. low speed 128kbit/s bandwidth, which corresponds to low demanding applications such as telephony, or teleconferencing with real time data compression mechanisms 2. medium speed 15Mbit/s (=1.875 MBytes/s) bandwidth, corresponding to average quality video data transfer, such as compressed MPEG-2 3. high speed 155Mbit/s (=19.375 MBytes/s) bandwidth, which is maximum available on our link. All measurements are done using AAL5 and simplex data transfer. The quality of service requirements were set to the same value for peak and mean bandwidth and sufficient burst length (e.g. 15000 kbit/s, 15000 kbit/s, 80 kbit in the second case).
25
3.3. PERFORMANCE OF ATM NETWORK
15 14.5 14
Throughput [Mbit/s]
13.5 13 12.5 12 11.5 11 10.5 10
1000
2000
3000
4000 5000 6000 Buffer size [Bytes]
7000
8000
9000
Figure 3.6: ATM throughput: medium bandwidth requirements (15Mbit/s)
16
14
Throughput [MByte/s]
12
10
8
6
4
2
0
1000
2000
3000
4000 5000 6000 Buffer size [bytes]
7000
8000
9000
Figure 3.7: ATM throughput: high bandwidth requirements (155Mbit/s)
26
CHAPTER 3. BASIC MEASUREMENTS OF NETWORK PERFORMANCE
I must remind that, in the current implementation of Fore ATM driver, the QoS settings are not fully respected by the connection establishment mechanism. Only peak bandwidth requirements are respected and also truncated if the requirement exceeds physical limits. However, if several connections with maximum requirements are established simultaneously, they all get their requirements accepted, but the real speed is of course limited, and packet dismissal seems to be somewhat increased. The results are shown in figures 3.5, 3.6 and 3.7. What we see is a bit surprising. Figure 3.5 shows that throughput never exceeds 85 kbits/s, despite that we have demanded 128 kbits/s! This makes only about 66% of demanded bandwidth, even though our demand is quite modest. When demanded QoS is 15Mbit/s, throughput reaches 11.53 Mbit/s, which yields 76.9% of the demand. Throughput is however quite stable, as we expect. It is also important to notice that packet dismissal in both measurements presented so far is zero. When the highest QoS is demanded, achieved throughput varies the most. Picture 3.7 shows that throughput reaches 15.87 MBytes/s for buffer size of 8192 Bytes. This is a mean value for six measurements, and the used delay is 300µs (an explanation of delay and its importance is given latter). This value is 90.4% of the physically possible maximum: MBy tes s Mbbits 155 s
15.87
bits
· 8 By te ·
48 53
= 0.904
(where 8 is the number of bits in a byte, and 48/53 is payload/packet_len ratio for AAL5). When measuring performance of ATM network, there is one more information we need from our producer/consumer system: how many packets are actually coming through our connection. We usually relate this number to the number of packets actually sent. This is very important, since packet dismissal in an ATM network (and, as we will see, particularly when sending shorter packets) may be rather large. This make us focus at another important parameter, besides buffer size, namely delay. Buffer size is the amount of data sent with one call to FORE’s atm_send() API, which also determines packet size. Delay is minimum time elapsed between two successive calls to atm_send()7. Actual time between two atm_send() calls may be longer than the delay parameter, if the atm_send() command itself lasts longer than desired delay. In such a case, active waiting delay loop (figure 3.12) will be skipped. We choose the active waiting mechanism because it is simple and sufficiently accurate for its purpose, a single while loop with the realtime()8 call takes 3 − 5 µs. There is another important parameter for an ATM connection with big impact on the throughput — whether the connection is in simplex or duplex mode. Duplex connection, as the experiments show, decrease throughput9 when compared to simplex connection. Duplex is also less reliable in the sense that unexpected packet loss occurs not only with shorter packets, as in simplex mode, but also when sending longer ones, and despite relatively long sending delay. We show throughput achievements together with percentage of successfully received packets for both simplex and duplex connection in figures 3.8, 3.9, 3.10, 3.11. 7 More proper name for delay may be “rate-based flow control”. It is not classic “wait” command, the send-time is incorporated in it. 8 realtime() command returns system time in seconds and is accurate to 1µs. 9 Always decrease throughput in one direction, but for shorter packets total throughput is increased.
27
3.3. PERFORMANCE OF ATM NETWORK
Throughput [MByte/s]
15
10
5
0 8000
600
6000
500 400
4000
300 200
2000
100 0
Buffer size [bytes]
Delay [us]
Figure 3.8: ATM throughput: simplex connection, max QoS
Throughput [MByte/s]
15
10
5
0 8000
600
6000
500 400
4000
300 200
2000 Buffer size [bytes]
100 0
Delay [us]
Figure 3.9: ATM throughput: duplex connection, max QoS
28
CHAPTER 3. BASIC MEASUREMENTS OF NETWORK PERFORMANCE
100
Get−through [%]
80 60 40 20 0 8000
600
6000
500 400
4000
300 200
2000
100 0
Buffer size [bytes]
Delay [us]
Figure 3.10: Percentage of packets coming to remote side through an ATM simplex connection, max QoS
100
Get−through [%]
80 60 40 20 0 8000
600
6000
500 400
4000
300 200
2000 Buffer size [bytes]
100 0
Delay [us]
Figure 3.11: Percentage of packets coming to remote side through an ATM duplex connection, max QoS
3.3. PERFORMANCE OF ATM NETWORK
29
... delay = 350; /* micro seconds */ for (packet = 0; packet < (DATA_LEN / packetSize); packet++) { sendTime = realtime(); /* time in seconds, accurate to 1e-6 */ if (atm_send(fd, tbuf, packetSize) < 0) { atm_error("atm_send"); exit(1); } while (sendTime + delay/1000000.0 > realtime()); /* active waiting ... we are not in rush */ } /* end packetNo loop */
Figure 3.12: ATM producer rate control, pseudo-code segment
3.3.2
Throughput Using TCP/IP over ATM
This study would not be complete without the throughput measurements for an ATM producer/consumer pair based on TCP/IP protocol. Why would somebody wish to use TCP/IP in an ATM network? The reasons are many. TCP/IP is a well tested and well tuned protocol, offering a superior programming comfort when a reliable data transfer is desired. Most of the existing network software is based on it, and hence may be used with ATM without modifications. Raw ATM is somewhat faster and offers direct control to the programmer, but (as we will see further in this paper) it is not always easy to yield a performance gain from it. I dare give a prognosis that Raw ATM programming is best suited for heavy ATM-dependent and time critical applications such as real time video. As long as we do not demand a certain quality of service, and particularly if we wish a reliable service, TCP/IP is the right choice. We measured throughput in our ATM network using a producer/consumer system based on TCP/IP. First we made measurements with standard TCP window size of 8192 Bytes, and achieved throughput of more than 12 MBytes/s (mean of 12.026 MBytes/s for three measurements, buffer size 11776 Bytes. Some separate measurements reached 12.3 MBytes/s). We wondered what would happen if we increase TCP window size. The measurements were repeated with window size of 16 and 32 kBytes. Maybe in contrast to our expectations, no dramatical boost in throughput was registered, except for shorter packets (~2kBytes). For longer packets we even experience fall of performance! This phenomena may get an explanation based on the experience from this chapter. Let us discuss the following situations, where the packet size is varied in the TCP connection on ATM network: 1. Very short packets (≤ 1kB). Packet loss is so big that TCP’s “Go-backN” error recovery scheme has big problems to re-send all packets and yields poor results no matter how big the window size is. 2. Shorter packets (≈ 2kB). Packet loss is not so big, and bigger window size improves TCP performance significantly. 3. Middle size packets (≈ 8kB but less than the drivers maximum packet size, 9176 Bytes in our case). Packet loss is practically zero, data transfer goes equally fast for all three window sizes.
30
CHAPTER 3. BASIC MEASUREMENTS OF NETWORK PERFORMANCE
14
12
Throughput [MBytes/s]
10
8
6 TCP TCP window window size: size: −o− −o− 88 kB kB 4 −x− −x− 16 16 kB kB −+− −+− 32 32 kb kb 2
0
2000
4000
6000
8000 10000 Buffer size [Bytes]
12000
14000
16000
Figure 3.13: Throughput with TCP/IP over ATM
4. Notice the performance fall for 9216 Bytes packet length. This is the first packet length in our measurement where TCP-packet is longer than maximum packet length supported by the driver. 5. Long packets (≥ 9kB). The throughput varies a lot, probably depending on TCP packet segmentation in ATM-packets. Measurement results are shown in figure 3.13. A general conclusion for all the cases stated above is that bigger window size does not improve throughput significantly. This may be explained by that propagation delay for such a short physical link (size order of 100 meters) is small. We have to conclude also that TCP transport does not charge us very much for the service it offers. We will compare our self-implemented error recovery system with TCP/IP in one of the latter chapters.
3.3.3
Latency
We measure ATM latency as described in section 3.1.2 of this chapter. We share the measurement interval in two parts: 1. shorter packets (4 Bytes - 208 Bytes), covering packet sizes from minimum (one word) to 192 Bytes10 . 208 Bytes packet size is included to 10 192
Bytes is an interesting packet length because 192 is the lowest multiple of 48 and 64, which are respectively the ATM cell size and the most usual SCI transaction length. The 192 Bytes packet length is likely the choice for SCI packet encapsulation in a future transparent SCI over ATM implementation.
31
3.3. PERFORMANCE OF ATM NETWORK
370 360 350
Latency (us)
340 330 320 310 300 290 280
20
40
60
80
100 120 Buffer Size (Bytes)
140
160
180
200
Figure 3.14: ATM latency: shorter packets
emphasize characteristic latency drop. 2. longer packets (512 - 9176 Bytes), addressing packet sizes more suitable for longer data transfer, used intensively throughout this work. The results are shown in figures 3.14 and 3.15 respectively. We see that latency for longer packets increases almost linearly, following approximatively the formula: µs B + 320µs L = 0.115 Byte where L is latency [µs], B is packet size [Bytes]. We see that latency has a per-byte cost of 0.115µs/Byte, and a constant delay of 320µs. The delay is mostly spent on DMA setup, but not entirely, since the lowest latency for 4-Byte packets is approximately 290µs, and DMA setup is expected to take a fixed amount of time regardless the packet size. I stress that result deviation is quite big for latency measurements, particularly for shorter packets. Standard deviation for 10000 packets, lies around 30µs. Some isolated packets had latency up to 3000µs. Latency for the shorter packets is somewhat lower than the approximative formula shows. An interesting observation is also that latency sometimes decreases despite that packet size increases. 208 Bytes packets, for instance, have ≈10µs lower latency than 192 Bytes packets. We can partly explain latency drop by ATM segmentation mechanism: 192 Bytes and 208 Bytes packets both need 5 ATM cells to be transmitted when using AAL5 (recall figure 2.4, page 11). This is also registered for another packet
32
CHAPTER 3. BASIC MEASUREMENTS OF NETWORK PERFORMANCE
1300 1200 1100
Latency (us)
1000 900 800 700 600 500 400 300
1000
2000
3000
4000 5000 6000 Buffer Size (Bytes)
7000
8000
9000
Figure 3.15: ATM latency: longer packets
sizes, but we abstain from giving a general explanation due to measurement data instability.
3.4 An Original Measurement Method for Symetric Systems Recall that in section 3.1.1 on page 18 we abstained from latency measurement method based on system time synchronization on computers A and B because of possibility of synchronization errors. In this section I introduce a method which can be used for measurement of time elapsed between two events which happened on two separate computers — without need for system clock synchronization. The method, which we call “Tarik’s Measurement Method”, is usable on a symetric system (i.e. which conforms with our symmetry assumption, page 19). It is based on registering the time when certain events happened on the two computers measured by their local system clocks, and a simple mathematical calculation. We will apply this method to packet delay measurements, and compare the results with the results from the previous measurements. We can also measure latency, or any other time interval as long as our programs have control and can register their local (processor) time just before and after the interval is elapsed. We give a definition of packet delay: Definition 3 We define packet delay as the period of time elapsed between the end of a packet send command executed on computer A and the end of
3.4. AN ORIGINAL MEASUREMENT METHOD FOR SYMETRIC SYSTEMS 33 the corresponding packet receive command executed on computer B. The time defined here was depicted by t2 in figure 3.1. One may discuss usefulness of measuring of the packet delay, but for us it is interesting since it is determined by two distributed events; it conforms with definition 3. Even more, it is an example of time not directly measurable on a single computer, making “Tarik’s Measurement Method” worth of defining. Remember that more discussion of the problem is given on page 18. The method is based on accurate11 measurement of time-events when 1. sending of a data packet is finished (from computer A), 2. receiving of the packet is finished (in computer B), 3. sending of a equally long answer is finished (from computer B), 4. receiving of the answer is finished (in computer A). Events (1) and (4) are local to the computer A, and events (2) and (3) are local to the computer B. We operate with a symmetric system, where two symmetric programs operate on two equally loaded computers equipped with the same network cards etc. Therefore we expect that differences (4)(3) and (2)-(1) are equal (more precisely, have equal mean value for a big number of measurements). If we had two perfectly synchronized computers, we could calculate mean packet delay for n measurements by: D=
(2, 1) + (4, 3) 2
(3.1)
where D is packet delay, (2, 1) is mean value of time differences between events (2) and (1), and (4, 3) is mean value of time differences between events (4) and (3). The problem is that time measured on computer A is period of time since a certain event happened in the computer A (maybe switching the power on), and is not the same as time on computer B. In the sequel we introduce some semi-formal theory and state and prove a theorem which shows that precise measurement of events (1) to (4) above gives us the possibility to measure packet delay as defined in definition 3. For this occasion we introduce the notion of time-event as a unique identifier of time when an event happened. We introduce also the operator Θ, “absolute time operator”, which transforms time-event space into (usual) linear time space, giving us the absolute time when an event happened. For a given event e1 , Θ(e1 ) is expressed in standard time units, seconds for instance. Notice that we do not know how much Θ(e1 ) is, but we know for instance, assuming that e1 was the timeevent when it was 7:00 PM yesterday, and e2 was the time-event when it was 8:00 PM yesterday, that Θ(e2 ) − Θ(e1 ) = 3600s. Let us finally introduce relative time operator. The relative time operator gives us the time when an event happened relative to a fixed zero time. The number of relative time operators is infinite, since we have infinitely many times we may choose as the zero time. Also relative time operators transform time-event space into linear time space. For a given relative time operator ϕ and its zero time tϕ0 , the following relation holds for each time-event e: ϕ(e) = Θ(e) − Θ(tϕ0), ∀e • time−event e
(3.2)
11 Accurate up to time necessary to read system time and store it in an array, which we know does not exceed 5µs on our hardware.
34
CHAPTER 3. BASIC MEASUREMENTS OF NETWORK PERFORMANCE Now we state the theorem.
Theorem 1 We are given four sets of time-events: P= Q= R= S=
{p1 , p2 , . . . , pn } {q1 , q2 , . . . , qn } {r1 , r2 , . . . , rn } {s1 , s2 , . . . , sn }
which are in absolute time space ordered by Θ(pi ) < Θ(qi ) < Θ(ri ) < Θ(si ) Θ(si ) < Θ(pi+1 )
, ,
∀i • 1 ≤ i ≤ n ∀i • 1 ≤ i < n
Let us assume that time-event sets P and S are registered in a local time space A, and sets Q and R are registered in a local time space B. Time in space A is measured by relative time operator A, and time in space B is measured by relative time operator B. Then the following equation is valid: ! Pn Pn 1 i=1 (Θ(qi ) − Θ(pi )) i=1 (Θ(si ) − Θ(ri )) = + 2 n n ! Pn Pn 1 i=1 (B(qi ) − A(pi )) i=1 (A(si ) − B(ri )) (3.3) + 2 n n Proof. It is not difficult to show that equation (3.3) holds, basing the proof on the relation (3.2). Notice that expression (3.1) is an abbreviated form for the left side of equation (3.3), under assumption that arrays P . . . S defined in the theorem represent our locally registered times, as stated in the “Tarik’s Measurement Method” description. We reason as follows: 1 2
Pn
i=1 (B(qi ) − A(pi ))
=
1 2
n
+
Pn
Pn
i=1 (A(si ) −
n
i=1 (Θ(qi ) − Θ(tB0 ) − Θ(pi )
n Pn
B(ri ))
+ Θ(tA0 ))
i=1 (Θ(si ) −
=
1 2
=
1 2
=
1 2
−nΘ(tB0 ) + nΘ(tA0 ) + n
Pn
!
= +
Θ(tA0 ) − Θ(ri ) + Θ(tB0 )) n
i=1 (Θ(qi ) − Θ(pi ))
!
+
Pn
−nΘ(tA0 ) + nΘ(tB0 ) + i=1 (Θ(si ) − Θ(ri )) n P (Θ(qi ) − Θ(pi )) + −Θ(tB0 ) + Θ(tA0 ) − Θ(tA0 ) + Θ(tB0 ) + n P (Θ(si ) − Θ(ri )) n ! Pn Pn (Θ(si ) − Θ(ri )) i=1 (Θ(qi ) − Θ(pi )) + i=1 n n
what is what we wanted to prove. Equation 3.4 is valid since n X
i=1
(Θ(C)) = nΘ(C)
(3.4)
!
35
3.5. SUMMARY Packet Size [Bytes] 4 64 192 512 1024 2048 4096 8192 9176
Latency [µs] 291 304 358 374 444 543 776 1268 1322
Packet Delay [µs] 228 241 287 290 326 398 559 883 981
Send Time [µs] 67 68 75 84 100 132 217 344 329
P D + ST [µs] 295 309 362 374 426 530 776 1227 1310
Relative Error [%] 1.37 1.64 1.12 0.00 4.05 2.39 0.00 3.23 0.91
Table 3.2: ATM latency depending on the packet size, two methods
when C = const. End of proof. In table 3.2 we show • latency measurements as measured in section 3.3.3 • packet delay as calculated by “Tarik’s Measurement Method” • locally measured average send time (time t1 in figure 3.1), measured independently each for characteristic packet sizes. We also calculated the sum of the send time and packet delay, which should be the same as the latency. In the last column the relative error is shown, calculated as ρ=
|Packet Delay + Send Time − Latency| · 100% Latency
We see that the relative error is very small, in light of the facts that we measure extremely short time intervals and that standard deviation is high. One may wonder why we need “Tarik’s Measurement Method” measurement method. Well, by defining it I did not open a new page in the history of science, and I am aware of its serious limitations. Packet delay, as defined in definition 3, could be calculated by subtracting average send time from average latency. On the other hand, this is an independent method, leading to the same results, so it can be used for verification. It also teaches us how to manually adjust the time-scale of measurement results, when the events are registered locally; we will use this in chapter 7. It is also a nice illustration of how difficult is to present a simple idea on more solid theoretical background. This justifies its place in this thesis.
3.5
Summary
In this chapter we measured performance of the SCI and ATM hardware. Insight in their basic functionality and performance is very important since the whole thesis is about integration of this two basic technologies. We showed that the available SCI bandwidth cannot be fully utilized because of the adapter-to-host bottleneck. We also pointed out the disproportion between the SCI message passing and the SCI programmed I/O latency due to the DMA setup cost.
36
CHAPTER 3. BASIC MEASUREMENTS OF NETWORK PERFORMANCE
We showed that the raw ATM throughput reaches 90% of the maximal bandwidth, which must be characterized as a very good result. The TCP/IP over ATM yields good results as well, up to 70% of the maximal bandwidth. This is very good, given the overhead of the reliable transport mechanism. The biggest goal is, however, that we gathered useful knowledge about SCI and ATM. This knowledge will help us to implement the SCI-ATM bridge, and the measured results give us a starting point for the bridge analysis.
Chapter 4
Implementing a Simple Bridge As the first step towards our desired SCI-ATM bridge implementation, I implement a simple bridging system, basicly a fusion of our producer/consumer programs used in the previous experiments. Things, however, are not always so simple as they may seem at first glimpse. We want the simple bridge to be as effective as possible, to look like a DMA based hardware implementable component, and to have code reusable in further, more advanced implementations. I suggest a solution based on two threads sharing a common data structure. The first thread is a modified SCI consumer, receiving data from a separate SCI producer and writing it to a buffer. The second thread is an ATM producer, reading the buffer and sending data to a separate ATM consumer. The basic structure of the system is shown in figure 4.1. As for simplicity, in this implementation we do not pay attention to an important problem in ATM communication – packet loss. While SCI offers a reliable packet transportation service, ATM gives no guarantee, and a packet traveling from a sender to a receiver may easily be discarded. However, we trust to our experimental data (figure 3.8), and assume that, as long as we send long packets using AAL5 and a simplex connection, all packets are coming through.
4.1
The Algorithm: from an Idea to the Formal Proof
In recent years parallel programming turned from a topic reserved for the experts to an everyday reality for many computer programmers. Now it is hard to imagine how the programmers managed before parallel programming became so common — some problems are so natural to be solved in parallel that it is a really good feeling to have support to do so. Such a problem is our SCI-ATM bridge algorithm. It is a slightly modified ring buffer problem1 , which means that two separate procedures are writing/reading a common data structure, each performing some specific tasks on it. In our case the buffer reading procedure is also sending the data over ATM to a remote host, and the buffer writing procedure is receiving data over SCI from a third host. 1 Ring buffer: data structure consisting of a memory buffer of certain size, two pointers and a flag. The pointers are usually called accordingly to what they are pointing to: First (empty position) and Last (position written, not read yet). The flag is usually called Empty flag, and used to arbitrate the situation when First and Last point to the same position, what could mean either that the buffer is empty or full.
37
38
CHAPTER 4. IMPLEMENTING A SIMPLE BRIDGE
ATM Consumer
Simple SCI-ATM Bridge
SCI Consumer Buffer Writer
ATM
Data Flow
Data Flow
SCI Producer
Shared Data Structure
readHead writeHead
Buffer Reader ATM Producer
to SCI Ring
empty (Bool)
Figure 4.1: Simple SCI-ATM bridge: Basic program structure Alas, this simple problem is not that easy to solve! The problem is that processes share the same data structure and change it without knowing what the other process is doing at the moment. It is easy to imagine a situation where process R (“reader”) attempts to read the buffer without having the pointers and the flag properly set (because “writer” W did not reach to update them yet). We obviously need to synchronize the processes in order to not do so. A few programming tools offer such functionality, the most common is called semaphore2 [Dijkstra 72], which would be used as binary semaphore [Dahl 95] in our case with two concurrent processes. Another usual term is mutex 3 , which is a simplified binary semaphore. We choose to use a mutex in our bridge implementation, as supported by the Solaris threads library [Lewis, Berg 96]. Let us consider a “natural” solution to our problem, written in pseudo-C: #define RINGBUFSIZE 256k /* standard, this is actually parameterized */ typedef struct { int fd; /* ATM file descriptor */ unsigned int totalLen; /* Total amount of data */ unsigned int maxAtmBuf; /* Max packet size for ATM */ 2 A semaphore is a simple data structure generally intended to coordinate access to resources. It is easiest to imagine a semaphore as a non-negative integer associated with (at least) two operations: one to decrease the value of the integer under condition that the integer is greater than zero, and one to increase the value. Eventual invocation of the operation “decrease” when the value of the integer is equal to zero results in process blocking and waiting for some other process to invoke “increase”. The value of the integer is protected and not directly accessible. It is usual however to implement an additional operation on semaphores which shows whether next “decrease” will result in waiting or not. 3 Mutex is a mutual exclusion mechanism similar to binary semaphore, simplified in sense that “decrease” and “increase” operations must occur in pairs and thereby exclude a portion of code from simultaneous execution in more than one process at the time
4.1. THE ALGORITHM: FROM AN IDEA TO THE FORMAL PROOF
39
} ATMdata; typedef struct { ... } SCIdata; /* Similar to ATMdata */ static char *ringBuffer; /* shared data structure */ static u_int first=0, last=0, empty = 1; /* first written not read, last written, Boolean empty */ static mutex_t ringMutex; void * ATMwrite (void * param) { u_int written = 0; /* Bytes already sent on ATM */ u_int canWrite; /* So many bytes can be sent in the next atm_send() */ u_int nowWritten = 0; while (written < param.totalLen) { mutex_lock (&ringMutex); /* calculate how many bytes we can write */ if (empty) canWrite = 0; /* Nothing to send */ else { if (first == last) /* Buffer full - send as much as possible */ canWrite = param->maxAtmBuf; else canWrite = min (param->maxAtmBuf, (last - first + ringBufSize) % ringBufSize); } canWrite = min (canWrite, ringBufSize-first); [A] mutex_unlock (&ringMutex); if ((int)canWrite > 0) { [B] nowWritten = atm_send (atmParam.fd, (char *)((u_int)ringBuffer+first), canWrite); if ((int)nowWritten < 0) atm_error("atm_send"); else { mutex_lock (&ringMutex); written += nowWritten; first += nowWritten; if (first == ringBufSize) first = 0; if (nowWritten>0) empty = (first == last); mutex_unlock (&ringMutex); } } else /* Nothing to write - yield control to other threads */ thr_yield(); [C] } /* end while */ return NULL; } void * SCIread (void * param) { ... } void main (int argc, char** argv) { SCIdata sciData; ATMdata atmData; /* to be used by thread procedures */ readCommandLineArgs(argv); initSCIandATM(); /* Connect SCI producer and ATM consumer
*/
/* Start SCI reading and ATM writing in separate threads */ thread_create (SCIread, (void *)&sciData, ...); /* Solaris API */ thread_create (ATMwrite, (void *)&atmData, ...); /* Solaris API */ }
40
CHAPTER 4. IMPLEMENTING A SIMPLE BRIDGE
To shorten this program code, the SCIread() procedure text is excluded. It is given in another form on page 40. This is the typical execution scenario: 1. the main procedure reads command line parameters given by the user and performs initialization accordingly 2. the main procedure starts two separate threads of execution running SCIread and ATMwrite thread-procedures and waits until they finish. Procedure SCIread() (ATMwrite()) starts a while loop where it 1. enters critical region and checks how many bytes could be read (written) from (to) remote host such that the buffer is not corrupted 2. goes out from the critical region and starts reading (writing) which is the only time consuming action 3. reenters the critical region and updates the pointers/the flag accordingly 4. repeats the loop until all data is read (written). Function atm_send() writes canWrite bytes data from the buffer. It returns number of bytes actually sent. If −1 is returned, it implies that an error has occurred. atm_write can freely use first pointer when invoked, since it changes only inside the function ATMwrite() (for further discussion of this problem see [Dahl 95], page 21-23). Further comments may be needed for lines [A], [B] and [C]. [A] is needed because atm_send() can write only in a continuous memory buffer. [B] implies that no write action should be undertaken if no writing is possible in a moment. It also guards us from invoking atm_send() with zero length of packet to send4 . Instead [C] is executed, which yields control to another unspecified thread of execution (the other thread in our case). [C] is important as an accelerator, it hinders a thread in hopeless circulation when no write action is possible. It is however not the most wise way to save time, since yielding control to another unspecified thread is a time expensive operation. In the next implementation of SCI-ATM bridge, we will solve this problem using inter-process signaling. The code is written so that following invariant is satisfied whenever the threads are outside the critical sections: I : r ead
= w r itten + (last ⊖f ir st) + if !empty and f ir st =last th RBS el 0 fi
(4.1)
where RBS is ring buffer size, the operation “⊖” means “minus modulo RBS”, and f ir st, last and empty are global variables as used in the program code. The r ead and w r itten variables are local to the SCIread() and ATMwrite() respectively, their values correspond to the number of bytes currently read and written. Invariant (4.1) is not sufficiently strong for a formal correctness proof of the whole algorithm, but it helps us to reason about the algorithm. We try to prove that SCIread() procedure is partially correct provided that the rest of the program behaves properly. Let us take a look at a slightly modified variant of SCIread(): 4 When atm_send() is invoked with the third parameter (data length) zero, four (4) is returned, just like four bytes of data have been sent. It is, of course, a bug which has an undesired effect on our program.
4.1. THE ALGORITHM: FROM AN IDEA TO THE FORMAL PROOF
41
proc SCIread (int total) begin int read=0, canRead, nowRead; {I ∧ total > 0 ∧ r ead = 0 ∧ #H=total} while read0); UNLOCK Critical region 2 end {I ∧ #H=total − r ead} od {I ∧ total = r ead ∧ #H=0}} end; I is invariant (4.1), first, last, empty and ringBuffer are global variables as defined in program text on page 38, bufSize is SCI buffer size, RBS is size of the ring buffer. H represents the sequence of data not read yet. Operator # (used as #H ) is the sequence length. LOCK and UNLOCK are critical region boundaries, code included between them is never accessed by more than one thread at a time. Note also that s_read() has different syntax when used in the real program; this high-level reasoning is not really concerned about file descriptors and buffer pointers. It is beyond the scope of this thesis to give an introduction to formal proofs of program correctness, for a detailed study of the problem it is wise to check the literature about Hoare logic, for instance [Dahl 92]. In the sequel we cite the deduction rule for while-sentences, as given in [Dahl 92], and prove correctness of our algorithm. T DW DO :
⊢ P ⇒ I; ⊢ {I ∧ B}S{I}; ⊢ I ∧ ¬B ⇒ Q ⊢ {P }{I}while B do S od{Q}
In short, this means that, given predicates P and Q in front and after a while loop respectively, we know that Q will be logically valid after execution of the loop5 provided that P is logically valid before the loop starts if the three premises ⊢ P ⇒ I; ⊢ {I ∧ B}S{I} and ⊢ I ∧ ¬B ⇒ Q are true. In order to reason about the algorithm we must define what the function s_read() returns and which effect it has on the buffer. Informally we say that s_read(): • reads at most canRead bytes into ring buffer from position last, • in a case of internal error stops execution of whole program immediately - does not waste the data, • returns the number of successfully read bytes. 5 What
loop.
further means that we do not know what happens if the while loop is an infinite
42
1. ⊢ 2. ⊢
CHAPTER 4. IMPLEMENTING A SIMPLE BRIDGE
{I ∧ total = #H > 0 ∧ r ead = 0} while r ead < total do . . . od {I ∧ #H=0}
(T DW DO 2, 3, 4)
I ∧ total>0 ∧ r ead>0 ∧ #H=total ⇒ I ∧ total=r ead∧ #H=0
(true)
3. ⊢
I ∧ ¬(total > r ead) ∧ #H=total −r ead ⇒ I ∧ total = r ead∧ #H=0 3.1 case total = r ead I ∧ t ∧ #H=0 ⇒ I ∧ t ∧ #H=0 3.2 case total < r ead I ∧ t ∧ f ⇒ I ∧ f ∧ #H=0 which is always true in mathematical sense.
4. ⊢
{I ∧ #H=total −r ead ∧ total > r ead} CRIT1; now Read := s r ead(r ingBuf f er , last, canRead); CRIT2 {I ∧ #H=total −r ead} (SEQ 5, 6, 7)
5. ⊢
{I ∧ #H=total −r ead ∧ total > r ead} canRead := if empty th buf Size el if f ir st = last th 0 el min(buf Size, f ir st ⊖ last) fi fi; canRead := min(canRead, RBS −last){I ∧ #H=total −r ead∧ canRead à min(f ir st ⊖ last, RBS −last, buf Size)} 5.1 case empty canRead := min(buf Size, RBS −last) 5.2 case ¬empty canRead := min(f ir st ⊖ last, RBS −last) Implication is obviously true in both cases.
6. ⊢
{I ∧ #H=total −r ead ∧ canRead à min(f ir st⊖ last, RBS −last, buf Size)}now Read := s r ead( r ingBuf f er , last, canRead){I ∧ #H=total −r ead −now Read}
(SRD 8)
7. ⊢
{I ∧ #H=total −r ead−now Read}r ead := r ead + now Read; last := last ⊕ now Read; empty := (now Read > 0) {I ∧ #H=total −r ead} (SEQ 9)
8. ⊢
I ∧ #H=total −r ead ∧ canRead à min(f ir st ⊖ last, RBS −last, buf Size) ⇒ IsnowRead r ead(r ingBuf f er ,last,canRead) ∧ #H=total −r ead(true)
9. ⊢
I ∧ #H=total −r ead−now Read ⇒ empty , last, r ead (I ∧ #H=total −r ead)nowRead>0,last⊕now Read,r ead+now Read
(true)
Figure 4.2: The proof of partial correctness for the SCIread() procedure
43
4.2. PERFORMANCE
Formally it is much more difficult, but I suggest the following ad-hoc Hoare sentence: SRD :
{Psx r ead(buf ,l,cr ) ∧ #H=a ∧ l +cr 1kByte) and shorter inter-bridge distances. The latency is expected to be much lower than in the software model, but still rather high. The possibility for the throughput-latency trade-off is inherited from the software model. In the other words, we still can achieve lower latency by setting lower window size W , an higher throughput on longer distances
106
CHAPTER 9. TOWARDS A HARDWARE BRIDGE IMPLEMENTATION
by setting bigger W . For shorter inter-bridge distances it is advisable to keep W low. It is shown that a major performance improvement can be achieved by making the packet transmission initialization more effective. Finally, I discussed the SCI shared memory mapping over ATM. The conclusion is that, while achieving this service in hardware should not pose a problem, an effective prototype implementation has to wait improved hardware solutions.
Chapter 10
Conclusions This final chapter presents the major conclusions emerged from the performed work, together with critique of the used method and some suggestions for interesting further work.
10.1
General Conclusions
From the performed work it is clear that the SCI-ATM bridge for the stream data traffic is fully implementable. The implemented bridge emulator yields 25% of the theoretically possible throughput; I showed that a corresponding hardware implementation would have had almost full utilization of the ATM link it relies on. During the work on this thesis, a number of interesting observations is made. Some of them the author regards as interesting, hopefully also for some other researchers of high speed network bridges. An attempt to classify the conclusions is made in the sequel. The relationship between the ATM throughput and the inter-packet delay. Chapter 3 (section 3.3.1, page 29) investigated how the ATM throughput depends on the delay between consecutive packet-send calls. It is shown that the throughput achievements can be increased with a careful selection of the inter-packet delay. Thus, traffic shaping is essential for ATM performance. Implementation of the inter-bridge communication protocol. In this thesis I present a self-implemented communication protocol (“TAP”, section 6.3.2, page 62). Despite that its performance did not show to be better than the performance of TCP/IP over ATM (on shorter inter-bridge distances), I showed that a simple, reliable and hardware-implementable ATM protocol can be written for the cluster interconnection purpose. Furthermore, the measurements in section 7.5 show that, when used with bigger window size, TAP utilizes the underlying ATM link very good, also on long physical distances. The relationship between the inter-bridge latency and the sender / receiver time. In order to keep the ATM latency in the inter-bridge communication low, 107
108
CHAPTER 10. CONCLUSIONS It is essential to program so that the ATM receiver needs the same or shorter amount of time to receive a data packet than the ATM sender needs to send it.
Using the variables defined in chapter 8 we would say: tAS ≥ tAR Notice that we talk about the send / receive time excluding the waiting time. The waiting mechanism will always equalize this two times, but with different consequences. Since tAS can never be exactly like tAR , we can distinguish between the following two cases: 1. tAS > tAR In this case, the ATM receiver has to wait a small amount of time for each packet, but this has no significant effect on the latency — the latency is increased only by tAR − tAS . 2. tAS < tAR In this case, the ATM sender sends packets faster than the receiver can accept them. There are two further possibilities: (a) the ATM interface has a buffer where the packets can be stored until the ATM reader reads them, (b) the packets are stored directly in the bridge buffer. In case (2.a), which characterizes the software bridge implementation, the packets are preserved in the first buffer (in the ATM adapter) until the bridge reads them. This double buffering causes a significant latency increase. In case (2.b), which is to expect in any hardware bridge implementation, the ATM packets would often have to be discarded, because the ATM reader would not be ready to read them. This would negatively affect both throughput and latency, because the number of the resent packets would increase significantly. The relationship between the packet sending / receiving initialization time and the packet transmission time. The packet transmission time should be longer than the time which the ATM sender and the ATM receiver need to initiate the data packet sending / receiving. In this way we reach the best throughput, making it dependent only upon the ATM link attributes (bandwidth, propagation delay, etc.; discussion in chapter 8). How the throughput and latency depend on the bridge buffer size. Recall that in chapter 8 we introduced parameter “W ”, which represented the number of data packets storable in the bridge buffer per connection, and which also defined the connection window size (the number of outstanding packets). We showed that increasing W does not necessarily improve the overall performance. On the shorter inter-bridge distances, bigger W does not mean bigger throughput. On the longer distances, W improves the throughput by hiding the effect of the signal propagation delay.
10.2. CRITIQUE OF METHOD USED
109
The latency is likely to increase when W is large. This is because of the increased buffering time, typically when bridging from the fast SCI network to the slower ATM. In every practical system, parameter W should be carefully set to the best value for the expected bridge usage pattern. Minimizing the ATM latency. ATM latency is the biggest obstacle for the ATM-based LAMP interconnection. This problem is discussed in the literature, and research on solving it is in progress [Lin et al. 95]. From my work it can be concluded that The SCI-ATM bridge performance is very sensitive to the latency of the ATM connection. The ATM latency can be decreased by implementing more effective packet transmission initialization, in particular by minimizing the DMA setup time. As showed in chapter 9, minimizing the DMA setup time in the ATM bridge interface boosts the performance for both shorter and longer packet sizes. Minimizing the DMA setup time is necessary for a possible transparent SCI over ATM implementation. A shared memory implementation would eliminate the DMA usage, but its throughput in current implementations is not satisfactory, and it requires further research (discussed in section 9.4). Concurrency. In any network bridge implementation, it is important to implement an effective context switching mechanism, i.e. switching between the parallel multiple sender and receiver processes should take minimal time. Slow Solaris thread management showed to be a major obstacle for higher performance of the software model (section 7.2). Transparent SCI over ATM. Finally, we come to the central question of the SCI-ATM interconnection: can separate SCI rings be interconnected transparently over ATM? After studying the related works ([Gustavson, Li 96] [Bryhni, Kure 97] [Kure, Moldeklev 94]) and after the studies performed in this thesis, it is clear that Interconnecting of distant SCI clusters transparently over ATM cannot provide all SCI services with high performance. Why? Let us go back to chapter 2. Recall that the SCI standard specifies cache coherence between the clustered computers (page 8). Maintaining cache coherence is, clearly, an extremely latency sensitive operation. On the other hand, light waves need a third of a millisecond to travel 100 kilometers, and no matter which hardware we use, no matter how good we program, latency of a single SCI transaction, supposed to execute on a remote SCI cluster transparently connected over ATM on a distance of 50 km, will never drop under 300µs. Can this be considered as a reasonable performance? Hardly for the cache coherence maintenance. But for many other applications, such as stream traffic, this latency is not an obstacle.
10.2
Critique of Method Used
In this thesis we followed the problem solution method stated in section 1.3 to the maximal extent. This was not easy, because of my lacking experience and that the scientific area to which this thesis belong is rather new.
110
CHAPTER 10. CONCLUSIONS
This leaded to that the initial problem statement, including transparent SCI over ATM study, had to be redefined. Fortunately, this happened early and did not cause delay of the thesis. The major problems I met may be classified in two groups: 1. Problems of subjective nature: • programming problems; when experimenting with the software bridge emulator, a sudden crash occasionally happens when many parallel transmissions are on • coarse modeling; the analytical bridge model could be somewhat more detailed, and less relying on the experimental data 2. Problems of objective nature: • relatively few scientific works are done — it was hard to encompass the problem in the beginning • slow Solaris inter-thread signaling. It caused that, when the first software bridge model was finished, it was very difficult to distinguish subjective and objective reasons for the relatively poor performance. Protocol specification for “Tarik’s ATM Protocol” could have been in more detailed, including state transition diagram etc. As for completeness, I could have implemented simple SCI shared memory mapping over ATM. However, because of the imperfectness of the used hardware, this implementation would not enlighten anybody — its small shared memory blocks would make it too limited even for a serious modeling. I have to remind the reader once more that section 7.5 was added to the thesis after the text was concluded — a couple of days before the printing. This is the reason why the results from this section is not discussed elsewhere in the text.
10.3
Future Work
By working on this thesis, the author has gained a basic insight into high speed network computing. Many questions remain unsolved, and call for a study. It would be interesting to further test the presented software solution using long distance ATM links, and check the relationship between performance expectations and reality. Furthermore, when the hardware improvements allow (e.g new PCI/SCI adapter implementation), it would be interesting to extend the software model to support the shared memory mapping. Further studies on transparent SCI over ATM could be performed. A prototype implementation, possibly with a limited range of available services, can be constructed as soon as the SCI hardware with the user-accessible SCI transactions appear. Access to a clustered parallel server is an interesting research area. One of the possible solutions (as discussed in section 1.1) is using an SCI cluster as the server, and accessing it by ATM. I think that this solution deserves to be researched, and I hope to come back to it in my further education.
Appendix A
About the Bridge Emulator This appendix describes the bridge emulator program from a practical point of view. The program code is printed and distributed separately, and can also be downloaded from http://www.ifi.uio.no/∼tarikc/Studies.html The program code includes: • source code for the bridge program • source code for the sample producer/consumer • dependencies • a Gnu-make makefile The program can be compiled and tested if the requirements from sections A.1 and A.2 are satisfied. It is fully possible that the program can be compiled also on other platforms (e.g. Pentium Pro computer equipped with PCI-SCI card), but the author cannot give any guarantees for that.
A.1 Hardware Requirements The bridge program is supposed to run on SparcStation 20 workstations. The workstations must be equipped with the Dolphin SBus-2 SCI adapter, and the SBA 200 FORE ATM adapter. The sample producer / consumer programs can be executed on workstations without the ATM card. It is expected that the program can be run on newer versions of the specified hardware, but no guarantees can be given.
A.2 Software Requirements A.2.1
Execution
The program requires Solaris version 2.5 or newer. The SCI and ATM (“qaa0”) drivers must be installed. If the raw ATM transmission is desired, a Permanent Virtual Circuit (PVC) must exist between the computers where the bridge applications reside. 111
112
APPENDIX A. ABOUT THE BRIDGE EMULATOR Bridge Application
Consumer
WS A1
WS A4
WS B4
WS B1
(node 1)
(node 0)
(node 0)
(node 1)
SCI ring B
SCI ring A
WS A3
WS A2 (node 2)
(node 3)
ATM network
WS B3 (node 3)
WS B2 (node 2)
Producer
Figure A.1: Usage example
A.2.2
Compilation
The code can be compiled with the standard SUN “cc” compiler, using the “gnumake” program. The ATM and SCI object libraries must be visible to the compiler. Furthermore, the “slib” SCI library [Omang 96] requires that the “SLIB_HOME” environment variable is set to the directory where the library resides. It will be a pleasure for the author to help anybody interested in studying / experimenting with the SCI-ATM bridge program.
A.3 Usage A.3.1
The Programs
Once successfully compiled, there are four executable files available: • bridge – main program • consumer – sample SCI consumer • producer – sample SCI producer • killb – tool which helps to stop the program execution1 Now we assume that the bridge is executed on two separate SCI rings2 denoted by A and B, and show the typical usage pattern. In this example we assume that the SCI node ID for the both computers executing the bridge is 0 (“null”)3 , and that the consumer node ID is 1 (in ring A) and the producer node ID is 3 (in ring B). This situation is shown in figure A.1. 1. the bridge program is started on one computer in each cluster. The sequence of starting must be indicated by (“–m”) command-line parameter: 1 The
bridge can be stopped also by pressing CTRL-C. course, the bridge can be executed also on two computers in a single SCI cluster, as long as there is an ATM connection between them. 3 Note that they are in separate rings. 2 Of
113
A.4. PROGRAM OPTIONS WSA4>bridge -m 0 ... WSB4>bridge -m 1 2. the consumer program is started in ring A: WSA2>consumer-n 3 -a 0 -d 10000 3. the producer is started in ring B: WSA1>producer-n 1 -a 0 -d 10000
4. data transmission goes on. The both bridges show the “trace” information: sent packets, received ACKs etc. 5. the data transmission finishes; steps (2) and (3) can be performed again.
A.4 Program Options A range of options can be specified when executing the bridge. The following is a list of available options, where only “–m” is obligatory (the others are optional). If the option needs a VALUE, it is indicated. -m start sequence, (“client”)
VALUE :
0 – started first (“server”), 1 – started second
-q quiet mode, no transmission trace performed -p payload length,
VALUE
64-9152 [Bytes]
-t maximum parallel transmissions,
VALUE
1-255
-b buffer size, number of packets per buffer,
VALUE
1-1023
-i virtual channel in, permanent virtual channel for the raw ATM, incoming connection. (Virtual path is currently always zero) -u virtual channel out, permanent virtual channel for the raw ATM, outgoing -A bandwidth, desired bandwidth -n no error control, suppress the error control -e error ration, percentage of packets to be willingly discarded, 0-100 (more than 50 is not recommended)
VALUE
-o no write file, suppress the timing data writing to the file “times.txt” -T TCP host name, host name of the remote bridge, implies a TCP connection (over ATM or other underlying network) -P TCP port number, port number of the remote bridge,
VALUE :
0-65535
The following options can be given to the producer and consumer programs: -n remote node, obligatory, -d data length,
VALUE
VALUE :
0, 1, 2, ...
1-231 [kBytes]
114 -a bridge node,
APPENDIX A. ABOUT THE BRIDGE EMULATOR VALUES:
0, 1, 2, ...
-p packet length, VALUES 64-9152 [Bytes] (64-65536 for TCP-based interbridge communication). Must be the same as specified for the bridge program. -t transmission ID, bridge
VALUES
0-254, must be less than “-t” value for the
-f output file name, string -l latency measurement ON, measure and write the total producer / consumer latency to file “lat.txt”
Bibliography [Bryhni, Kure 97] Haakon Bryhni and Øivind Kure: SCI-ATM Interconnection, Telenor Research and Development, Technical report, (1997) [Bryhni, Omang 96] Haakon Bryhni and Knut Omang: A Comparison of Network adapter based Technologies for Workstation Clustering, In Proceedings of 11th International Conference on Systems Engineering, Las Vegas, (July 1996) [Dahl 92]
Ole-Johan Dahl: Verifiable Programming, Prentice Hall, (1992)
[Dahl 95]
Ole-Johan Dahl: Parallell Programmering, (N ORWEGIAN ), Compendium 46, Department of Informatics, University of Oslo (August 1995).
[Dijkstra 72] E.W. Dijkstra: Hierarchical Ordering of Sequential Processes, Academic Press (1972). [Dolphin 97] Dolphin Interconnect Solutions: SBus-2 Functional Description, Version 0.27 – preliminary (1997) [FORE Ada. 96] FORE Systems, Inc.: ForeRunnerT M SBA-200 ATM SBus Adapter, User’s Manual, Software version 4.0 (1996) [FORE Sw. 96] FORE Systems, Inc.: ForeRunnerT M ATM Switch, User’s Manual, MANU 0065 - Rev. A, (March 1996) [Gjessing, Kaas 91] Stein Gjessing and Ellen Munthe-Kaas: Formal Specification and Verification of SCI Cache Coherence, Technical report no.158, Department of Informatics, University of Oslo (Nov 1991) [Gustavson, Li 95] David B. Gustavson and Qiang Li: Local-Area Multiprocessor: the Scalable Coherent Interface, SCIzzL, Santa Clara University, Dept. of Computer Eng., ( HTTP :// WWW 1. CERN . CH /RD24/P OST S CRIPT / SCI _ LAMP . PS .Z) (1995) [Gustavson, Li 96] David B. Gustavson and Qiang Li: The Scalable Coherent Interface (SCI), IEEE Communications Magazine, pages 52-63 (August 1996) [Halsall 95]
Fred Halsall: Data Communications, Computer Networks and Open systems, fourth edition, Addison-Wesley (1995)
[Hennessy 94] John L. Hennessy and David A. Patterson: Computer organization and design: the hardware/software interface San Mateo, California, Morgan Kaufmann (1994) 115
116 [IEEE 92]
BIBLIOGRAPHY IEEE Std 1596-1992: IEEE Standard for Scalable Coherent Interface (SCI), (August 1993)
[Kure, Moldeklev 94] Øivind Kure and K. Moldeklev: An ATM network interface for an SCI-based system, INDC ’94, I: IFIP transaction C-23: Information Networks and Data Communication, ed: P. Veiga and D. Khakhar, North Holland (1994) [Lewis, Berg 96] Bil Lewis and Daniel J. Berg: Threads Primer — A Guide to Multithreaded Programming, SunSoft Press, A Prentice Hall Title, Mountain View, California (1996) [Lin et al. 95] Mengjou Lin, Jenwei Hsieh, David H.C. Du, Joseph P. Thomas and James A. MacDonald: Distributed Network Computing over Local ATM Networks, IEEE Journal on Selected Areas in Communications, Vol. 13, No. 4. ( FTP :// FTP . CS . UMN . EDU :/ USERS / DU / PAPERS / ATMPERF . PS ) (May 1995) [Martin et al. 97] Richard P. Martin, Amin M. Vahdat, David E. Culler and Thomas E. Anderson: Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture, ISCA 24, Denver, Co. Also: ( HTTP :// NOW . CS . BERKELEY . EDU /) (June 1997) [NAVY 97]
US Navy and Acorn Networks hardware SCI-ATM bridge project, –no publications available–, ( HTTP :// WWW . ACORN NETWORKS . COM /W HATS N EW /C HIP S PEC . HTML )
[NOW 97]
“Network Of Workstations” project, Berkeley, University of California, ( HTTP :// NOW . CS . BERKELEY . EDU /)
[Omang 96] Knut Omang: Slib: Adding Programming Support to a Cluster of SCI Connected Workstations, Department of Informatics, University of Oslo (1996) [Omang 97] Knut Omang and Bodo Parady: Performance of Low-Cost UltraSparc Multiprocessors Connected by SCI Communication Networks and Distributed Systems Modeling and Simulation, Phoenix, Arizona, ( HTTP :// WWW . IFI . UIO . NO /∼SCI /) (Jan. 1997). [Ryan 96]
Stein Jørgen Ryan, Stein Gjessing and Marius Liaaen: Cluster Communication using a PCI to SCI interface, In Proceedings of IASTED Eighth International Conference on Parallel and Distributed Computing and Systems, Chicago (October 1996)
[UNI 94]
ATM forum: ATM User-Network Interface Specification, version 3.1, Prentice Hall (September 1994)
[Welsh et al. 97] Matt Welsh, Anindya Basu, and Thorsten von Eicken: ATM and Fast Ethernet Network Interfaces for User-level Communication, Proceedings of the Third International Symposium on High Performance Computer Architecture (HPCA), San Antonio, Texas, ( HTTP :// WWW 2. CS . CORNELL . EDU /UN ET / PAPERS . HTML ) (February 1997)