A TECHNIQUE FOR IMPROVING THE SCHEDULING OF NETWORK COMMUNICATING PROCESSES IN MOSIX By
RENGAKRISHNAN SUBRAMANIAN B.E., South Gujarat University, 1998 A REPORT Submitted in partial fulfillment of the requirements for the degree
MASTER OF SCIENCE
Department of Computing and Information Sciences College of Engineering KANSAS STATE UNIVERSITY Manhattan, Kansas December, 2002 Approved by:
Major Professor Dr. Daniel Andresen
ABSTRACT

MOSIX is a software tool for supporting cluster computing. The core of the MOSIX technology is the capability of multiple workstations and servers to work cooperatively as if part of a single system. The primary job of MOSIX is to distribute (and redistribute) the processes you create, transparently, among the MOSIX-enabled workstations and servers, to obtain the best possible performance. When two processes that communicate over the network are created and are then distributed over the network, MOSIX still binds them to the base workstation where they were created for system calls. This means that the communicating processes talk to each other through their base node and not directly. This report examines this communication method and the reasons behind it, and proposes a technique that improves on it without changing the basic architecture. The proposed technique uses IPTables, the firewalling system available in the Linux operating system, to let the processes continue communicating through their base node while improving their performance to nearly what it would be if they communicated directly.
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF EQUATIONS
ACKNOWLEDGEMENTS
DEDICATION
1. INTRODUCTION AND BACKGROUND
   1.1 MOSIX
   1.2 Structure of MOSIX networking communicating processes
   1.3 The Solution
2 APPROACH TOWARDS THE SOLUTION
   2.1 Introduction – The "triangle routing"
   2.2 Reasoning
   2.3 Timing Analysis
   2.4 Architecture
3 IMPLEMENTATION
   3.1 Environment information
   3.2 IPTables
       3.2.1 About
       3.2.2 Netfilter Architecture
       3.2.3 NAT background
       3.2.4 NAT Architecture in IPTables
       3.2.5 NAT example usage
       3.2.6 Performance Evaluation of IPTables
       3.2.7 Importance of performance evaluation
   3.3 IPTables for the problem at hand
       3.3.1 How
       3.3.2 Actual Rules
       3.3.3 How do these rules work?
4 TESTING
   4.1 Purpose
   4.2 Environment
   4.3 Test Procedures
       4.3.1 General
       4.3.2 MOSIX
       4.3.3 IPTables
       4.3.4 Direct communication
5 RESULTS
   5.1 MOSIX
   5.2 IPTables
   5.3 Direct Communication
   5.4 Summary
6 CONCLUSION
   6.1 Observations
   6.2 Inferences
   6.3 Future Work
7 RELATED RESEARCH
8 REFERENCES
LIST OF FIGURES

Figure 1-1: Origin of processes – I
Figure 1-2: Origin of processes – II
Figure 1-3: Communication of processes after migration by MOSIX
Figure 1-4: Before migrating process B
Figure 1-5: After migrating process B
Figure 2-1: Processes A and B
Figure 2-2: Process B is migrated to Node c
Figure 2-3: Microscopic View
Figure 2-4: Time Analysis
Figure 2-5: Flowchart
Figure 3-1: IPTables working flowchart (from [americo02performance])
Figure 3-2: Packet traversing in Netfilter (from [rusty02linuxnat])
Figure 3-3: NAT Architecture, IPTables (from [rusty02linuxnat])
Figure 3-4: How do these rules work? Step 1
Figure 3-5: How do these rules work? Step 2
Figure 3-6: How do these rules work? Step 3
Figure 3-7: How do these rules work? Step 4
Figure 4-1: General Test Procedure
Figure 4-2: MOSIX Test Procedure: Step 1
Figure 4-3: MOSIX Test Procedure: Step 2
Figure 4-4: IPTables Test Procedure
Figure 4-5: Direct Communication Test Procedure
Figure 5-1: Execution Time Comparison Chart
Figure 5-2: Bandwidth Comparison Chart
Figure 5-3: %CPU Utilization Comparison Chart
Figure 5-4: Load Average Comparison Chart
LIST OF TABLES

Table 3-1: Performance evaluation parameters of [americo02performance]
Table 5-1: MOSIX Test Result
Table 5-2: IPTables Test Result
Table 5-3: Direct Communication Test Result
Table 5-4: Comparison of Latency
Table 5-5: Comparison of Bandwidth
Table 5-6: Comparison of CPU Utilization
Table 5-7: Comparison of Load Average
LIST OF EQUATIONS

Equation 2-1: Dissection of time taken by a packet in its process's UHN
Equation 2-2: Time saved by the packet
Equation 3-1: Time taken for processing a TCP packet, 1400 bytes and 10 rules
Equation 3-2: Time saved if packets are redirected at the firewall
Equation 3-3: Recalculated time saved for packets redirected at firewall
ACKNOWLEDGEMENTS

I sincerely thank Prof. Daniel Andresen, my major professor, for giving me encouragement, timely advice, guidance and facilities to complete this project. I also thank him for being flexible, accommodating and patient during the course of this project.
I would like to thank Prof. Masaaki Mizuno and Prof. William H. Hsu for serving on my committee. I would like to thank Prof. Mitchell L. Neilsen for agreeing to serve as proxy during my final examination.
I would like to thank Ms. Delores Winfough for patiently helping me out in understanding the policies of the graduate school.
I thank Mr. Jesse R. Greenwald and Mr. Daniel R. Lang for helping and solving my day-to-day problems with my experiments. I thank Mr. Thomas J. Rothwell for partnering with me during the initial periods of the project.
I would like to thank Mr. Ashish Sharma for help in the benchmark programs. I thank Mr. Sadanand Kota and Mr. Madhusudhan Tera for help with using MOSIX.
DEDICATION
To my parents
1. Introduction and Background

This report discusses the scheduling technique used by MOSIX [1] on processes that communicate over the network, explores the reasons behind that technique, and suggests a new technique that improves the performance of processes communicating over the network. The first section introduces MOSIX and the architecture of network communicating processes in MOSIX. The second section discusses the approach and architecture for solving the problem. The third section discusses the implementation in detail. The fourth section discusses the tests done to evaluate the solution and why those tests were conducted. The fifth section discusses the results of these experiments and, with the help of graphs, the performance improvement of this solution over the native MOSIX scheduling technique. The final section presents the conclusion and comments on future work that could be done with this solution as a base.
1.1 MOSIX

MOSIX is a software tool for supporting cluster computing. It consists of kernel-level, adaptive resource sharing algorithms that are geared for high performance, overhead-free scalability and ease-of-use of a scalable computing cluster. The core of the MOSIX technology is the capability of multiple workstations and servers (nodes [2]) to work cooperatively as if part of a single system.

[1] MOSIX stands for Multicomputer Operating System for UnIX.
[2] A MOSIX-enabled workstation or server is called a 'node' hereafter.

The algorithms of MOSIX are designed to
respond to variations in the resource usage among the nodes by migrating processes from one node to another, preemptively and transparently, for load-balancing and to prevent memory depletion at any node. MOSIX is scalable and attempts to improve the overall performance by dynamic distribution and redistribution of the workload and the resources among the nodes of a computing cluster of any size. MOSIX conveniently supports a multi-user time-sharing environment for the execution of both sequential and parallel tasks. [barak99scalable]
MOSIX can transform a Linux cluster of x86-based workstations and servers to run almost like an SMP. The main purpose of MOSIX is that when you create (one or more) processes in your login node, MOSIX will distribute (and redistribute) your processes (transparently) among the nodes, to obtain the best possible performance. The core of MOSIX is a set of adaptive management algorithms that continuously monitor the activities of the processes vs. the available resources, in order to respond to uneven resource distribution and to take advantage of the best available resources. [mosix02web]
The algorithms of MOSIX use preemptive process migration to provide:
• Automatic work distribution - for parallel processing or to migrate processes from slower to faster nodes.
• Load balancing - for even work distribution.
• Migration of processes from a node that runs out of main memory, to avoid swapping or thrashing.
• Migration of an intensive I/O process to a file server.
• Migration of parallel I/O processes from a client node to file servers.
1.2 Structure of MOSIX networking communicating processes

MOSIX supports preemptive (completely transparent) process migration (PPM). After a migration, a process continues to interact with its environment regardless of its location. To implement the PPM, the migrating process is divided into two contexts: the user context, which can be migrated, and the system context, which is UHN [3] dependent and may not be migrated. [barak99scalable]

The user context, called the remote, contains the program code, stack, data, memory maps and registers of the process. The remote encapsulates the process when it is running at the user level. The system context, called the deputy, contains a description of the resources the process is attached to, and a kernel stack for the execution of system code on behalf of the process. The deputy encapsulates the process when it is running in the kernel. It holds the site-dependent part of the system context of the process; hence it must remain in the UHN of the process. While the process can migrate many times between different nodes, the deputy is never migrated. [barak99scalable]

The processes that MOSIX migrates for automatic work distribution and load balancing also include processes that communicate over the network, or processes that reside in the same workstation and communicate with each other. The structure of processes explained above can be seen pictorially in the specific scenarios below.
There are at least two scenarios that arise while exploring the origin of processes that communicate over the network.
[3] Unique Home Node: the node where the process is created.
Scenario a) The communicating processes originated in the same node and have now either been migrated to a different node or not been migrated.
Scenario b) The communicating processes originated in different nodes and have now been migrated to a different node or not been migrated.
Figure 1-1: Origin of processes – I (processes A and B both originate in Node a, one node in the cluster; Node b is another node)

Figure 1-2: Origin of processes – II (process A originates in Node a; process B originates in Node b)
In Figure 1-1, the communicating processes (A and B) originate from Node a and are migrated by MOSIX to any of the nodes in the cluster. Let us say process B has been migrated to Node b. In such a case, MOSIX still binds process B to its node of origin (which is Node a) and routes all the communicating packets from B to A through Node a itself, which is obvious.
In Figure 1-2, the communicating process A can originate from Node a and its counterpart process B can originate from Node b, and either can be migrated to any of the nodes in the cluster. Let us say process B is now migrated to Node a by MOSIX. But now, since Node b is the node of origin for B, all communicating packets from B to A traverse through Node b, instead of B communicating directly with A (which is in the same node).
Hence, the communicating processes look like Figure 1-3:
Figure 1-3: Communication of processes after migration by MOSIX (processes A and B both on Node a; the B-to-A packets travel out to Node b and back)
Let us discuss another variant of Scenario b. In this case, processes A and B have their UHNs as Nodes a and b respectively. Process B is later moved to Node c. In such a case, the new communication picture looks like the following set of figures.
Figure 1-4: Before migrating process B (process A on Node a; process B on Node b; Node c empty)

Figure 1-5: After migrating process B (process A on Node a; process B now on Node c; Node b remains B's UHN)
From the figures above, it is seen that process B, though moved to Node c, communicates with its counterpart process A through its UHN (which is still Node b). Process B should have communicated with its counterpart directly instead of going through its UHN. The underlying problem now becomes very obvious: this redirection method of MOSIX increases latency and in many cases causes inefficiency in the whole system. It needs to be rectified so that B contacts A directly, instead of traveling through its UHN.

1.3 The Solution

The technique discussed in this report achieves better performance for the communicating processes in terms of decreased latency, increased bandwidth, and better load averages on the nodes for various configurations.
2 Approach towards the solution

2.1 Introduction – The "triangle routing"

As discussed in the previous section, the problem with these processes is their binding to their UHN. Let us take the following example to discuss the approach.
Figure 2-1: Processes A and B (process A on Node a; process B on Node b)
Processes A and B are two processes that are communicating over the network. Now, MOSIX migrates process B to a different Node c.
Figure 2-2: Process B is migrated to Node c (process A on Node a; process B now on Node c; Node b is B's UHN)

Now, the communications between processes A and B happen through Node b. As discussed in previous sections, the system context (the deputy) of process B still resides in Node b. This means that the MOSIX code in Node b identifies the packets from Node a that are meant for process B and notes that process B now resides in Node c. So, after identifying a packet from Node a, MOSIX in Node b redirects it to process B in Node c.
2.2 Reasoning

Let us now break the communication into parts and work through the various layers that each packet from process A goes through while communicating with process B. Process A resides in user space. It communicates with the kernel to get hold of a socket. (Let us assume that process A is a client seeking service from process B; this example will be used in the rest of the paper.) The kernel provides a socket to the user-space process A. The process then identifies that it has to go through the TCP/IP stack of the kernel. So, it goes through the TCP/IP stack and then through the firewalling system, down to the lower layers like the device drivers, MAC and the physical layers.

At the receiving end, process B has already opened a port and is waiting for a connection through a socket from any other process that needs its service. When A requests a service from B, A's packet goes out of the socket, through the kernel, through the TCP/IP stack, adding header after header at each layer, through the firewalling system, and through the lower layers to the lower layers of Node b (the UHN of process B). The lower layers of Node b identify the packet from Node a and take it up through the firewalling system and through the TCP/IP stack, stripping off the headers one by one, up to the kernel / user space border to MOSIX. Now, MOSIX looks at the packet, decides that process B no longer resides in Node b, and redirects the packet to its new destination, Node c (the node where process B now resides). The packet again takes the same path as it took in Node a, going down to the physical layer and contacting the lower layers of Node c. The same process happens in Node c, and the packet from process A on Node a reaches its destination, process B, at Node c. The above communication can be redrawn with a microscopic view as shown below.
Figure 2-3: Microscopic View (the packet's path through user space, MOSIX, the kernel / socket layer, TCP, the firewall and the lower layers of Nodes a, b and c)
It is observed that this extra path the packet takes in Node b increases the latency of the packet, because the packet is taking a roundabout route to its destination. Moreover, this roundabout route consumes a chunk of the bandwidth available between Nodes a & b and Nodes b & c. It also eats up time in MOSIX, which must decide what to do with each such packet coming from A for B in Node b. On top of that, this can happen for many processes from Node a communicating with many other processes in Node c whose UHN is Node b. Since MOSIX is installed on all nodes of the cluster, the MOSIX scheduler on every node will spend considerable time on every such redirection that reaches that node, for each packet. This naturally decreases the performance of the whole system.

Now, suppose there were a method to redirect these packets arriving from Node a at Node b towards Node c at a lower layer: a layer that can filter such packets, that does not consume much time deciding the fate of a packet, and where we can do a network address translation on the incoming packet and redirect it towards Node c. That would drastically reduce the load that Node b takes to handle redirection. It would also decrease the latency of the packet and eventually increase the bandwidth of the whole system. How can all this be achieved? Which of the lower layers can do all this with ease? Naturally, the answer lies in the firewalling system. The firewall can intercept packets and filter them; it can identify the headers of a packet and its destination; it can do a network address translation and redirect the packet to a different destination. It resides at a very low level in the networking layers and can therefore be used effectively without removing and adding headers all the way up and down the network layers.
2.3 Timing Analysis

The basic purpose of the analysis given above was to dissect the problem into smaller segments and understand where there is delay and where the architecture could be improved. In this section, the timing of the solution of the problem at hand with the help of the firewalling system is examined in detail.
The time taken by each packet to travel from process A in Node a to process B in Node c in the discussion above can be divided into the following parts:
• Time taken by the packet to travel from Node a to the lower layers in Node b
• Time taken by the packet to travel from the lower layers of Node b to MOSIX in Node b
• Time taken by the packet to travel from MOSIX in Node b back to the lower layers in Node b
• Time taken to travel from the lower layers in Node b to process B

In the above dissection, we are most concerned about the time taken by the packet to travel from the lower layers of Node b to MOSIX and back from MOSIX to the lower layers. It is in this particular interval that the firewalling technique will do the redirection. So, let us break this interval down in more detail (refer to Figure 2-4):
Total time taken by the packet to travel from the physical (lower) layer of Node b to MOSIX and back to the physical layer
= T1 + T2 + T3 + T4 + T5 + T6 + T7 + T8 + T9

where:
T1 = time for the packet to travel from the physical layer to the firewall
T2 = time to travel from the firewall to the TCP/IP layer
T3 = time to travel from the TCP/IP layer to the socket layer
T4 = time to travel from the socket layer to MOSIX
T5 = time taken by MOSIX to decide on the fate of the packet
T6 = time to travel from MOSIX back to the kernel / socket layer
T7 = time to travel from the socket layer to the TCP layer
T8 = time to travel from the TCP layer to the firewall layer
T9 = time to travel from the firewall to the physical layer

Equation 2-1: Dissection of time taken by a packet in its process's UHN
Figure 2-4: Time Analysis (the intervals T1 through T9 across the lower layers, firewall, TCP/IP layer, kernel / socket layer and MOSIX in Node b)
As proposed earlier, the aim is to keep the packet from traveling all the way up to MOSIX by intercepting it at the firewall layer. In that case, the amount of time saved by intercepting the packet at the firewall layer is:

Time saved = (T2 + T3 + T4 + T5 + T6 + T7 + T8) - (time taken at the firewall layer to match and redirect the packet)

Equation 2-2: Time saved by the packet

It is not necessary to calculate T3 or T4 or any timing above the firewall layer for our purpose. This is because the tests done later in this report will give the summation of these times, (T3 + T4 + T5 + T6 + T7). Hence, the rest of this paper will concentrate upon the firewalling techniques (how to intercept the desired packets, how to do a network address translation, and how to redirect them to the right node, Node c, where they should eventually go) instead of the detailed timing of packets above the firewall layer.
2.4 Architecture

The primary aim of the solution is to intercept the packet and do a corresponding network address translation. The step-by-step procedure can be seen in the following flowchart.
Figure 2-5: Flowchart (write rules; wait for a packet to arrive; intercept the packet; check the header for source and destination addresses; verify the rules against the packet headers; on a match, do a network address translation and send the packet to the physical layer; otherwise, send the packet to the upper layers)
The architecture is straightforward. A set of rules is first written to intercept the necessary packets and do the corresponding network address translation on them. The firewall waits for a packet to arrive. As soon as a packet arrives, the firewall intercepts it and checks it against the set of rules already written. If a packet matches a rule or rule set, the packet meets the fate written in the rule: in this case, the packet is stopped from going up the network layers and is directed down to the physical layer towards its new destination. If the packet does not match the rule set, it is sent up the network layers. The firewall then waits for the next packet to arrive.
3 Implementation

3.1 Environment information

As mentioned earlier in this report, the working environment for MOSIX is Linux on x86 platforms. MOSIX is available in two parts. The first part is the MOSIX core itself, which is applied as a patch to the Linux kernel. It then requires the kernel to be compiled with MOSIX enabled as a configuration option; the Linux box can then be rebooted for use with MOSIX. The second part is a set of system administrator tools for MOSIX, such as manual migration of processes using their PIDs [4], enabling or disabling auto-migration, a MOSIX process monitor tool, etc., which can be downloaded separately and installed. The latest version of MOSIX available while running the tests for this report was MOSIX 1.8.0 for Linux kernel version 2.4.19. Hence, the choice of implementation environment in this report is restricted to Linux. Red Hat and Debian Linux distributions were used. Care was taken regarding the version of gcc.

[4] PID: process ID.
The MOSIX distribution page specifies: "Note: for now do not use distributions that use gcc-3.2, such as RedHat-8 or Slackware-9. gcc-3.2 is unsuitable for compiling the kernel". [mosix02web]

The Linux OS has firewall support built into its kernel, and this firewall system has evolved continuously. As mentioned above, the working environment is Linux kernel version 2.4.19. The firewalling system built into Linux 2.4.x kernels is called IPTables. It is the redesigned and heavily improved successor of the previous IPChains (for 2.2.x kernels) and IPFwadm (for 2.0.x kernels) systems. In this section, IPTables will be explained in detail, along with how IPTables has been used for our purposes. IPTables is also called netfilter.
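Since MOSIX ships as a kernel patch, its installation follows the usual 2.4.x kernel build procedure. The following is a minimal sketch of that flow; the patch file name and paths are hypothetical placeholders, not taken from the MOSIX documentation:

## Sketch: building a MOSIX-enabled 2.4.19 kernel (hypothetical file names)
# cd /usr/src/linux-2.4.19
# patch -p1 < /path/to/MOSIX-1.8.0.patch     # apply the MOSIX kernel patch
# make menuconfig                            # enable the MOSIX configuration option
# make dep && make bzImage && make modules && make modules_install
## install the new kernel image, update the boot loader, and reboot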
3.2 IPTables

3.2.1 About [5]

IPTables is a generic table structure for the definition of rule sets. Each rule within an IP table consists of a number of classifiers (matches) and one connected action (target).

[5] This section talks about IPTables in general. Most of the information in this section is taken from the documentation of IPTables, either as-is or slightly modified for the purpose of this paper. The source for this information is the documentation section of www.netfilter.org, whose author is Rusty Russell [rusty02linuxnetfilter], [rusty02linuxnat]. One more source of information is [americo02performance]. I do not take any credit for the information in the following sections about IPTables, except section 3.2.7.
Netfilter is a set of hooks inside the Linux 2.4.x kernel's network stack, which allows kernel modules to register callback functions called every time a network packet traverses one of those hooks.
The main features of the netfilter system are:
• Stateful packet filtering (connection tracking)
• All kinds of network address translation
• Flexible and extensible infrastructure

Netfilter, IPTables, the connection tracking subsystem and the Network Address Translation subsystem together build the whole framework.
Basically, rules are instructions with pre-defined characteristics to match on a packet. When a match is found, the firewall makes a decision on how to handle that packet. Each rule is evaluated in order until a match is found. A rule is set like this:

iptables [-t table] command [match] [-j target]

There are 3 default policies: INPUT – to check the headers of incoming packets, OUTPUT – for outgoing packets/connections, and FORWARD – if the machine is used as a router (e.g. as a Network Address Translator). Each policy has its own set of rules.
Let us take the following examples:

# iptables -P INPUT ACCEPT

# iptables -A INPUT -p tcp --dport 23 -j DROP

(-P: policy; -A: append; -p: protocol; --dport: destination port; -j: jump)

The first rule states that the firewall system will allow any packet from any network to come in. The second rule states that, among packets carried under the TCP protocol, only those that come in with destination port 23 are matched and are then dropped. This rule is appended to the INPUT policy.
Figure 3-1: IPTables working flowchart (from [americo02performance])
As seen in the figure above, IPTables has a set of policies, namely the INPUT, OUTPUT and FORWARD policies, meant for incoming packets, outgoing packets and packets meant for a third machine, respectively. For each of these policies, a set of chains can be created. These chains can be matched against packets by protocol, IP address, input/output interface, MAC address, etc. After matching packets in these policies, the fate of a packet can be decided: whether it should be accepted, dropped, rejected, queued or returned.

3.2.2 Netfilter Architecture

In more detail, Netfilter is a series of hooks at various points in a protocol stack (at this stage, IPv4, IPv6 and DECnet). The (idealized) IPv4 traversal diagram looks like the following:
Figure 3-2: Packet traversing in Netfilter (from [rusty02linuxnat]); incoming packets pass hook [1], a routing decision, then hook [2] (local delivery) or hook [3] (forwarding), and finally hook [4]; locally generated packets pass hook [5] and a routing decision

On the left is where packets come in: having passed the simple sanity checks (i.e., not truncated, IP checksum OK, not a promiscuous receive), they are passed to the Netfilter framework's NF_IP_PRE_ROUTING [1] hook.
Next they enter the routing code, which decides whether the packet is destined for another interface, or a local process. The routing code may drop packets that are unroutable. If it's destined for the box itself, the Netfilter framework is called again for the NF_IP_LOCAL_IN [2] hook, before being passed to the process (if any). If it's destined to pass to another interface instead, the Netfilter framework is called for the NF_IP_FORWARD [3] hook. The packet then passes a final Netfilter hook, the NF_IP_POST_ROUTING [4] hook, before being put on the wire again. The NF_IP_LOCAL_OUT [5] hook is called for packets that are created locally. Here you can see that routing occurs after this hook is called: in fact, the routing code is called first (to figure out the source IP address and some IP options).
3.2.3 NAT background

There is more to IPTables than just accepting and dropping packets. This section discusses Network Address Translation (NAT) in IPTables. Normally, packets on a network travel from their source (such as your home computer) to their destination (such as www.gnumonks.org) through many different links. None of these links really alters the packet: they just send it onward. If one of these links were to do NAT, it would alter the source or destination of the packet as it passes through, which is not how the system was designed to work. Usually the link doing NAT will remember how it mangled a packet, and when a reply packet passes through the other way, it will do the reverse mangling on that reply packet, so everything works.

Some of the most common uses of NAT can be divided into three categories:
• Most ISPs give you a single IP address when you dial up to them. You can send out packets with any source address you want, but only replies to packets with this source IP address will return to you. If you want to use multiple different machines (such as a home network) to connect to the Internet through this one link, you'll need NAT. This is commonly known as 'masquerading' in the Linux world.
• Sometimes you want to change where packets heading into your network will go. Frequently this is because you have only one IP address, but you want people to be able to get into the boxes behind the one with the 'real' IP address. If you rewrite the destination of incoming packets, you can manage this. This type of NAT is called port-forwarding.
• Sometimes you want to pretend that each packet which passes through your Linux box is destined for a program on the Linux box itself. This is used to make transparent proxies: a proxy is a program which stands between your network and the outside world, shuffling communication between the two. The transparent part is because your network won't even know it's talking to a proxy, unless of course, the proxy doesn't work.

This report will look at a method combining the various usages of IPTables mentioned above in section 3.3.
3.2.4 NAT Architecture in IPTables

In IPTables, NAT is divided into two different types: Source NAT (SNAT) and Destination NAT (DNAT). Source NAT is when you alter the source address of the first packet: i.e. you are changing where the connection is coming from. Source NAT is always done post-routing, just before the packet goes out onto the wire. Masquerading is a specialized form of SNAT. Destination NAT is when you alter the destination address of the first packet: i.e. you are changing where the connection is going to. Destination NAT is always done before routing, when the packet first comes off the wire. Port forwarding, load sharing, and transparent proxying are all forms of DNAT.
In IPTables, we need to create NAT rules which tell the kernel what connections to change, and how to change them. To do this, we use the IPTables tool to alter the NAT table by specifying the '-t nat' option. The '-t' option in IPTables specifies the table that should be used. In section 3.2.1, we used the default table of IPTables, called filter. For doing NAT, we will use the 'nat' table.
The table of NAT rules contains three lists called ‘chains’: each rule is examined in order until one matches. The chains are called PREROUTING (for Destination NAT, as packets first come in), and POSTROUTING (for Source NAT, as packets leave). And the third is called OUTPUT. The OUTPUT chain will be discussed later.
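The chains of the 'nat' table, and any rules added to them, can be inspected with the standard listing command:

## List the PREROUTING, POSTROUTING and OUTPUT chains of the nat table
# iptables -t nat -L -n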
Figure 3-3: NAT Architecture, IPTables (from [rusty02linuxnat]); packets pass the PREROUTING chain (DNAT), a routing decision and local processes, and leave through the POSTROUTING chain (SNAT)
The IPTables NAT can be best described with the help of the diagram above. At each of the points above, when a packet passes we look up what connection it is associated with. If it's a new connection, we look up the corresponding chain in the NAT table to see what to do with it. The answer it gives will apply to all future packets on that connection.
3.2.5 NAT example usage
IPTables takes a number of standard options, described below. All the double-dash options can be abbreviated, as long as IPTables can still tell them apart from the other possible options.
The most important option here is the table selection option, ‘-t’. For all NAT operations, we will want to use ‘-t nat’ for the NAT table. The second most important option to use is ‘-A’ to append a new rule at the end of the chain (e.g. ‘-A POSTROUTING’), or ‘-I’ to insert one at the beginning (e.g. ‘-I PREROUTING’).
We can specify the source ('-s' or '--source') and destination ('-d' or '--destination') of the packets we want to NAT. These options can be followed by a single IP address (e.g. 192.168.1.1), a name (e.g. www.gnumonks.org), or a network address (e.g. 192.168.1.0/24 or 192.168.1.0/255.255.255.0). If we omit the source address option, then any source address will do. If we omit the destination address option, then any destination address will do.
We can specify the incoming ('-i' or '--in-interface') or outgoing ('-o' or '--out-interface') interface to match, but which one we can specify depends on which chain we are putting the rule into: at PREROUTING we can only select the incoming interface, and at POSTROUTING we can only select the outgoing interface. If we use the wrong one, IPTables will give an error.
We can also indicate a specific protocol ('-p' or '--protocol'), such as TCP or UDP; only packets of this protocol will match the rule. The main reason for specifying a protocol of TCP or UDP is that it then allows extra options: specifically the '--source-port' and '--destination-port' options (abbreviated as '--sport' and '--dport').

These options allow us to specify that only packets with a certain source and destination port will match the rule. This is useful for redirecting web requests (TCP port 80 or 8080) while leaving other packets alone.

These options must follow the '-p' option (which has a side-effect of loading the shared library extension for that protocol). We can use port numbers, or a name from the /etc/services file.
We want to do Source NAT: change the source address of connections to something different. This is done in the POSTROUTING chain, just before the packet is finally sent out; this is an important detail, since it means that anything else on the Linux box itself (routing, packet filtering) will see the packet unchanged. It also means that the '-o' (outgoing interface) option can be used.
Source NAT is specified using ‘-j SNAT’, and the ‘--to-source’ option specifies an IP address, a range of IP addresses, and an optional port or range of ports (for UDP and TCP protocols only).
## Change source addresses to 1.2.3.4.
# iptables -t nat -A POSTROUTING -o eth0 -j SNAT --to 1.2.3.4
## Change source addresses to 1.2.3.4, 1.2.3.5 or 1.2.3.6
# iptables -t nat -A POSTROUTING -o eth0 -j SNAT --to 1.2.3.4-1.2.3.6
## Change source addresses to 1.2.3.4, ports 1-1023
# iptables -t nat -A POSTROUTING -p tcp -o eth0 -j SNAT --to 1.2.3.4:1-1023
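Masquerading, the specialized form of SNAT mentioned in section 3.2.4, has its own target and takes no '--to-source' address, since the address is taken from the outgoing interface. A one-line sketch for a hypothetical dial-up interface:

## Masquerade everything going out through ppp0
# iptables -t nat -A POSTROUTING -o ppp0 -j MASQUERADE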
Destination NAT is done in the PREROUTING chain, just as the packet comes in; this means that anything else on the Linux box itself (routing, packet filtering) will see the packet going to its 'real' destination. It also means that the '-i' (incoming interface) option can be used.
Destination NAT is specified using ‘-j DNAT’, and the ‘--to-destination’ option specifies an IP address, a range of IP addresses, and an optional port or range of ports (for UDP and TCP protocols only).
## Change destination addresses to 5.6.7.8
# iptables -t nat -A PREROUTING -i eth0 -j DNAT --to 5.6.7.8

## Change destination addresses to 5.6.7.8, 5.6.7.9 or 5.6.7.10.
# iptables -t nat -A PREROUTING -i eth0 -j DNAT --to 5.6.7.8-5.6.7.10

## Change destination addresses of web traffic to 5.6.7.8, port 8080.
# iptables -t nat -A PREROUTING -p tcp --dport 80 -i eth0 -j DNAT --to 5.6.7.8:8080
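Transparent proxying, the third NAT use described in section 3.2.3, has a dedicated shorthand: the REDIRECT target, a form of DNAT that points packets at the local machine itself. A sketch, assuming a hypothetical proxy listening on local port 3128:

## Send incoming web traffic to a local transparent proxy on port 3128
# iptables -t nat -A PREROUTING -p tcp --dport 80 -i eth1 -j REDIRECT --to-port 3128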
Though there is much more to the NAT of IPTables, the explanation above is sufficient for the rest of this paper and for section 3.3.

3.2.6 Performance Evaluation of IPTables [6]

In this section, we will discuss the performance evaluation done on IPTables by Américo J. Melara of California Polytechnic State University, San Luis Obispo [americo02performance].
That thesis tests the firewall's performance with the help of the following parameters.

[6] All material in this section is taken from [americo02performance].
Parameter                    | Values
Transmission protocol        | TCP | UDP
Type of filtering/matching   | TCP, IP, MAC | UDP, IP, MAC
INPUT policy                 | ACCEPT & DROP | DROP
Connection speed             | 100 Mbps
Payload size                 | 64 & 1400 bytes
Number of rules              | No firewall, 10, 40, 100

Table 3-1: Performance evaluation parameters of [americo02performance]

In short, the performance test runs permutations and combinations of the various parameters specified above and quotes its results as follows.
(a) The payload size impacts the performance before and after the firewall, but not the firewall itself.
(b) The INPUT policy does not affect the performance of the firewall.
(c) The firewall is affected only by the type of filtering/matching and the number of rules.
(d) The time to process a packet from the start time to the socket layer (refer to section 2.2) is affected by the parameters in (c) and also by the payload size.
The test is done by keeping timestamps at various points in the network processing layers of the system (refer to section 2.2):
• Start time = T2 - T1
• Firewall = (T3 - T1) - (T2 - T1) = T3 - T2
• TCP layer = (T4 - T1) - (T3 - T1) = T4 - T3
• Socket layer = (T5 - T1) - (T4 - T1) = T5 - T4
• Total processing time = T5 - T1
These processing times are calculated for various combinations of the parameters specified above, and the results of these performance tests are explained and plotted in graphs in that paper. The combination that is interesting for this report is the test on TCP packets with 10, 40 & 100 firewall rules.

3.2.7 Importance of performance evaluation

Sections 2.2 and 2.3 discussed the reasoning behind the solution and the timing analysis. The timing analysis splits the traversal of a network packet all the way from the lower layers to user space and shows how the packet's time is distributed across the various layers. The layers of interest to us are the lower layers, the firewall layer, the TCP layer and the socket layer. If the timings for these layers are known, the amount of time spent by the firewall layer on a set of rules can be found.

As an example, let us take the following parameters from the performance test [americo02performance]: a TCP packet of 1400 bytes, with 10 firewall rules. According to the test results, the following is true:

Total time taken for processing this packet
= (time from the lower layers to the firewall layer: 11.94 µsec)
+ (time at the firewall layer: 8.59 µsec)
+ (time at the TCP layer: 24.22 µsec)
+ (time at the socket layer: 2.9 µsec)
= 47.65 µsec

Equation 3-1: Time taken for processing a TCP packet, 1400 bytes and 10 rules
Referring back to sections 2.2 and 2.3, we can add to the above time the amount of time spent by MOSIX deciding on the fate of the packet. Hence, the amount of time that can be saved if the packets are redirected at the firewall layer is:

Time saved = (time at the TCP layer: 24.22 µsec) + (time at the socket layer: 2.9 µsec) + (time taken by MOSIX to decide on the fate of the packet: M µsec)

Equation 3-2: Time saved if packets are redirected at the firewall

However, in this case, the packet also has to travel back down the network layers, spending a similar amount of time having its headers added back on the way down. So, the above time can be recalculated as:

Recalculated time = 2 × { (time at the TCP layer: 24.22 µsec) + (time at the socket layer: 2.9 µsec) } + (time taken by MOSIX to decide on the fate of the packet: M µsec)

Equation 3-3: Recalculated time saved for packets redirected at firewall
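Putting Equations 3-1 through 3-3 in compact form (with M denoting the MOSIX decision time, which these tests do not isolate):

\begin{aligned}
T_{\mathrm{total}} &= 11.94 + 8.59 + 24.22 + 2.9 = 47.65\ \mu\mathrm{sec} \\
T_{\mathrm{saved}} &= 2\,(T_{\mathrm{tcp}} + T_{\mathrm{socket}}) + M = 2\,(24.22 + 2.9) + M = (54.24 + M)\ \mu\mathrm{sec}
\end{aligned}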
This gives us a clear idea of how much time can be saved by redirecting packets at the firewall layer.
3.3 IPTables for the problem at hand

3.3.1 How

The problem at hand requires a firewall that can identify packets, filter them using a matching technique, and then redirect them to another machine at the firewall layer itself. From our discussion of IPTables and of the performance test on IPTables, it is clear that IPTables can do what is required: it can filter incoming packets, change the destination address of an intercepted packet, and send it back down towards its new destination.
3.3.2 Actual Rules

Referring back to the NAT architecture [section 3.2.4], we can set up the following rules for identifying and redirecting a packet.

Rule 1: The first rule catches the incoming packet (by matching its IP address and port number) at the PREROUTING chain of the 'nat' table. After filtering out such a packet, it is redirected to its new destination using '-j DNAT'. The rule looks like the following:

# iptables -t nat -A PREROUTING -s $CLIENT -d $FIREWALL_SYSTEM -p tcp \
    --dport $SERVER_PORT -i eth0 -j DNAT --to-destination $NEW_DESTINATION

(where -s: source, -d: destination, --dport: destination port)

Rule 2: The packet now goes through the POSTROUTING chain of the firewall. At this point, the packet has to go to its new destination with its source address changed to this system (where the firewall resides). Only if this source address is changed will the new destination reply back to this system; otherwise, it would contact the source system directly, from where the packet came. This rule uses the POSTROUTING chain and the SNAT option of IPTables.

# iptables -t nat -A POSTROUTING -s $CLIENT -p tcp \
    --dport $SERVER_PORT -o eth0 -j SNAT --to-source $FIREWALL_SYSTEM

Rule 3: When the new destination replies back to the firewall system, the packet has to be redirected to the original source. This is another DNAT, which completes the cycle.

# iptables -t nat -A PREROUTING -s $SERVER -d $FIREWALL_SYSTEM -p tcp \
    --sport $SERVER_PORT -i eth0 -j DNAT --to-destination $CLIENT

3.3.3 How do these rules work?

These rules can be represented pictorially as shown in the figures below.
Figure 3-4: How do these rules work? Step 1
• Node a, Process A: thinks process B is in Node b; however, process B is in Node c; does not know the location of process B.
• Node b, IPTables rule set: gets a packet from process A for process B; knows process B is in Node c; does DNAT on packets from A; does SNAT before sending them to Node c.
Figure 3-5: How do these rules work? Step 2
• Node a, Process A: awaits a response from Node b; thinks process B is in Node b.
• Node b, IPTables rule set: waits for a response from Node c.
• Node c, Process B: thinks process A is in Node b; replies back to Node b; does not know the correct location of process A.
Figure 3-6: How do these rules work? Step 3
• Node a, Process A: awaits a response from Node b; thinks process B is in Node b.
• Node b, IPTables rule set: gets the response from process B; does DNAT on the packet from B, translating it towards Node a; sends the packet to process A on Node a.
• Node c, Process B: awaits the next packet from Node b; thinks process A is in Node b.
Figure 3-7: How do these rules work? Step 4
• Node a, Process A: gets the response from Node b; thinks it is from process B in Node b; begins sending the next packet to Node b.
• Node b, IPTables rule set: waits for the next packet from Node a.
• Node c, Process B: awaits the next packet from Node b; thinks process A is in Node b.
4 Testing

4.1 Purpose

The primary purpose of the test is to compare the effect of using the IPTables rules on Node b (refer to section 3.3.3) against the MOSIX network communication technique and against direct communication of processes between Node a and Node c. The test measures the total execution time / latency, the bandwidth taken by the processes, the load average, and the percentage CPU utilization of the respective systems for:
a) MOSIX communication
b) IPTables communication
c) Direct communication
4.2 Environment

The nodes used for the testing environment had the following configuration:
• Pentium 4 CPU
• 1.6 GHz processor speed
• Intel EtherExpress network card
• 100 Mbps LAN
• Two Red Hat 7.2 Linux boxes with kernel 2.4.19
• One Debian Linux box with kernel 2.4.18
• All nodes connected to the same LAN switch
4.3 Test Procedures

4.3.1 General

The architecture maintained during the tests is exactly the architecture explained in section 3.3.3, as follows.

Figure 4-1: General Test Procedure (Node a and Node c communicate through Node b)
A server-client communicating pair was created to satisfy the test purpose. The pair communicates using variable parameters. Some of the parameters used in creating this server-client pair were:
• Buffer size for each send / receive.
• The total amount of data to transmit; in other words, the total number of iterations for which data would be sent. This parameter was used instead of specifying the time for which data should be sent, because the purpose of the test is to measure the time of execution, not to specify it.
• The number of such communicating pairs.
• The port number on which the communication service would run.
The server starts first and waits on a port number. The client contacts the server on this port number on the server's machine, and a connection is established between them. The server then starts pumping data to the client according to the parameters specified above. At the end of the data transfer, the server sends an end-signal to close the connection. The client prints out the time taken for execution in seconds and microseconds. Tests were first designed for different sizes of data transfer as well as different numbers of communicating pairs. However, as seen in section 3.2.6, the size of the data does not really affect the performance of the system, so the tests varied the number of communicating pairs.
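The server-client pair used here was a custom benchmark program. Its behavior can be roughly approximated with standard tools; the following sketch uses dd and the traditional netcat (flags vary between netcat versions, and '-q 0' may be needed for the sender to close the connection at end of data), with a hypothetical port, data size and host name:

## On the server node: pump 400 MB to the first client that connects
# dd if=/dev/zero bs=1M count=400 | nc -l -p 5000

## On the client node: receive the data and report the elapsed time
# time nc $SERVER_HOST 5000 > /dev/null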
4.3.2 MOSIX

For testing under MOSIX, a scenario has to be created where a process is migrated from its UHN to another node, so that the "triangular" route of communication happens at the UHN. The following steps were taken to create this scenario.

Figure 4-2: MOSIX Test Procedure: Step 1
• Node a: empty for now.
• Node b: server is created here; server waits on a port number.
• Node c: client will reside here; client is not yet created.
Figure 4-3: MOSIX Test Procedure: Step 2
• Node a: server is now here, but it goes to Node b for system calls; Node b is its UHN.
• Node b: server is migrated from here manually to Node a using the MOSIX admin tools; all processes think the server is still in Node b.
• Node c: client is created; contacts the server in Node b; is unaware that the server is in Node a.
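The manual migration in Step 2 uses the MOSIX administrator tools mentioned in section 3.1. A sketch of the idea, assuming the 'migrate' utility from those tools, with a hypothetical server PID (1234) and a hypothetical MOSIX node number for Node a (1):

## On Node b: push the server process to Node a
# migrate 1234 1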
When more than one communicating pair was created, all the servers were migrated to Node a.

4.3.3 IPTables

For testing the IPTables procedure, the server was created in Node a and the client was created in Node c. Node b is where all the IPTables rules mentioned in section 3.3.2 reside. When the client contacts Node b, the request is forwarded to Node a, and the server thinks that Node b is requesting service. When it returns a reply to the request, Node b forwards the reply to the client in Node c. Thus, the connection cycle is established and the transfer of data occurs through Node b.
Figure 4-4: IPTables Test Procedure
• Node a: server is created here; it gets requests from Node b (which are in reality from Node c); it replies back to Node b.
• Node b: IPTables rules are written here; forwards packets from Node a to Node c and vice versa.
• Node c: client is created here; contacts Node b requesting service.

When more than one communicating pair was created, more rules were added in Node b to cater to each pair. As seen in section 3.3.2, three rules are required for one communicating pair, so for every additional communicating pair an extra set of three rules needs to be written.

4.3.4 Direct communication

Ideally, if the MOSIX network communicating processes were migrated, they should have contacted each other directly, instead of using the communication technique discussed in section 1.2. This test was conducted to find the actual performance (latency, bandwidth, and load average of the two systems on which the communicating processes reside) so that it can be compared with the MOSIX method and the IPTables method.
Figure 4-5: Direct Communication Test Procedure
• Node a: server is created here; it gets requests from Node c and replies back to Node c.
• Node b: not involved in this test.
• Node c: client is created here; contacts Node a directly, requesting service.
5 Results

5.1 MOSIX

As mentioned in the previous section, tests were conducted for an increasing number of communicating pairs. The results noted were: the total execution time for the communicating processes to finish the data transfer [7], the bandwidth occupied, the load average on the MOSIX UHN while the processes were communicating (in the test above, the UHN is Node b), and the percentage system CPU utilization [8] on the UHN.

[7] The amount of data transferred is a parameter given to the test. In these tests, it was 400 MB.
[8] System CPU percentage is the amount of CPU used by the kernel. Since MOSIX is in the kernel, system CPU is noted.
No. of communicating pairs   Time to complete transfer* (s)   Bandwidth (Mbps)   % system CPU utilization   Load average (1.00 = full)
         3                             203.73                      15.71                  58.8                      0.52
         6                             275.56                      11.61                  85.7                      1.34
         9                             390.28                       8.19                  85.0                      1.73
        12                             513.64                       6.23                  87.3                      1.83
        15                             640.91                       4.99                  87.5                      1.80
        25                            1063.21                       3.01                  90.0                      3.55
        50                            2130.70                       1.50                  90.0                      4.42

Table 5-1: MOSIX Test Result
* The data in this table are averages over the number of connections; please refer to the appendix for the complete data.
The MOSIX test results show increasing load average and percentage CPU utilization as the number of communicating pairs grows. A more detailed comparison can be made once the results of the other two tests have been presented.
5.2 IPTables
Similar test results are shown here for the IPTables rule set. The percentage CPU utilization and load average are measured on the node that has the rule set written, which, according to the previous section, is Node b.

Total no. of connections   Time to complete transfer (s)   Bandwidth (Mbps)   % system CPU utilization   Load average (1.00 = full)
        3                            109.79                      29.15                  27.1                      0.02
        6                            219.58                      14.57                  27.1                      0.01
        9                            328.89                       9.73                  25.5                      0.01
       12                            437.77                       7.31                  27.5                      0.02
       15                            552.14                       5.79                  27.9                      0.01
       25                            913.89                       3.50                  24.5                      0.02
       50                           1840.99                       1.74                  28.0                      0.02

Table 5-2: IPTables Test Result
5.3 Direct Communication
For the direct communication test, there is no need to measure the load average and percentage CPU utilization, because no middle system exists. The total execution time and bandwidth are noted and shown in the table below.

Total no. of connections   Time to complete transfer (s)   Bandwidth (Mbps)
        3                            103.51                      30.92
        6                            212.61                      15.05
        9                            316.60                      10.11
       12                            424.55                       7.54
       15                            529.11                       6.05
       25                            882.72                       3.63
       50                           1746.64                       1.83

Table 5-3: Direct Communication Test Result
5.4 Summary
The tables shown above are summarized below for comparison on the basis of latency, bandwidth, percentage CPU utilization, and load average.
Latency (sec)

No. of end-to-end connections   MOSIX     IPTABLES   NORMAL (direct)
        3                        203.73    109.79     103.51
        6                        275.56    219.58     212.61
        9                        390.28    328.89     316.60
       12                        513.64    437.77     424.55
       15                        640.92    552.14     529.11
       25                       1063.21    913.89     882.72
       50                       2130.70   1840.99    1746.64

Table 5-4: Comparison of Latency
Bandwidth (Mbps)

No. of end-to-end connections   MOSIX   IPTABLES   NORMAL (direct)
        3                       15.71    29.15      30.92
        6                       11.61    14.57      15.05
        9                        8.19     9.73      10.11
       12                        6.23     7.31       7.54
       15                        4.99     5.79       6.05
       25                        3.01     3.50       3.63
       50                        1.50     1.74       1.83

Table 5-5: Comparison of Bandwidth
% CPU utilization

No. of end-to-end connections   MOSIX   IPTABLES
        3                        58.8    27.1
        6                        85.7    27.1
        9                        85.0    25.5
       12                        87.3    27.5
       15                        87.5    27.9
       25                        90.0    24.5
       50                        90.0    28.0

Table 5-6: Comparison of CPU Utilization
Load average

No. of end-to-end connections   MOSIX   IPTABLES
        3                        0.52    0.02
        6                        1.34    0.01
        9                        1.73    0.01
       12                        1.83    0.02
       15                        1.80    0.01
       25                        3.55    0.02
       50                        4.42    0.02

Table 5-7: Comparison of Load Average
Figure 5-1: Execution Time Comparison Chart (time in seconds vs. number of connections; series: mosix, iptables, direct)
Figure 5-2: Bandwidth Comparison Chart (bandwidth in Mbps vs. number of connections; series: mosix, iptables, direct)
Figure 5-3: %CPU Utilization Comparison Chart (%CPU utilization vs. number of connections; series: mosix, iptables)
Figure 5-4: Load Average Comparison Chart (load average vs. number of connections; series: mosix, iptables)
6 Conclusion
6.1 Observations
• From the graph and table comparing latency, it is clear that the total execution time for IPTables is very close to that of direct communication, while MOSIX shows a large difference in total execution time. On average, MOSIX takes 33% more execution time than direct communication, while IPTables takes only 4% more; MOSIX takes 28% more execution time than IPTables. (A worked example of these averages follows this list.)
• The bandwidth comparison chart and table show that the bandwidth achieved by MOSIX is considerably lower than that of IPTables and direct communication; on average, it is 20% less than IPTables. However, as the number of end-to-end communications increases, the bandwidth difference between the three methods narrows.
• While the bandwidth graphs converge, the load average and CPU utilization show a drastic difference. The CPU utilization and load average on the IPTables system are considerably lower than on the MOSIX system, which is nearly saturated. On average, MOSIX incurs 212% more CPU utilization and at least 138 times the load average of IPTables.
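These averages follow from Table 5-4. As a worked example for the 33% figure, the overhead of MOSIX relative to direct communication is computed per row and then averaged over the seven connection counts:

\[ \frac{1}{7} \sum_{n} \frac{T_{\mathrm{MOSIX}}(n) - T_{\mathrm{direct}}(n)}{T_{\mathrm{direct}}(n)} = \frac{0.97 + 0.30 + 0.23 + 0.21 + 0.21 + 0.20 + 0.22}{7} \approx 0.33 \]

The same computation against the IPTables column gives approximately 0.04, and MOSIX against IPTables gives approximately 0.28.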
6.2 Inferences
The observations made in the previous section show that using MOSIX to manually schedule the network communicating processes has actually slowed down their execution. An interesting point here is that even if
MOSIX had auto-migrated these network communicating processes, the MOSIX system would still have carried a huge percentage CPU load. The IPTables test, on the other hand, has shown that the two communicating processes take far less execution time, while adding almost no CPU utilization or load to the system that handles the redirection rules.
The observations and inferences make it very clear that the MOSIX method of structuring the network communicating processes is time-consuming and resource-intensive. However, if the same structure is used for a pair of communicating processes with the IPTables rule set defined, the cost effectiveness and resource efficiency are greatly improved: performance is nearly as if the communicating processes were connected directly rather than routed through a middle system.
6.3 Future Work
Naturally, integrating the IPTables methodology inside MOSIX would make MOSIX more efficient. In such a case, the basic structure of MOSIX is not changed, i.e. MOSIX maintains its UHN, remote and deputy concept, but still improves performance. This integration could be made possible by a step-wise approach. In a broad sense, these steps could be (a sketch of steps b and d is given at the end of this section):
a) Identify the double-redirection created by migrating the process from its UHN.
b) Create the IPTables rule set on the fly using an API / library.
c) Have the library sit on every MOSIX workstation and manage the creation of new rule sets.
d) Remove the rules after the processes are done communicating.
There are many limitations associated with NAT itself; these are described in more detail in [hain00architectural]. There could be another approach to the whole situation using an IPTables rule set: if there were a way to redirect locally generated packets by doing a local DNAT on them, instead of doing it on a middle system, this problem could be solved. However, from the IPTables documentation [rusty02linuxnat]: "The NAT code allows you to insert DNAT rules in the OUTPUT chain, but this is not fully supported in 2.4 (it can be, but it requires a new configuration option, some testing, and a fair bit of coding, so unless someone contracts Rusty to write it, I wouldn't expect it soon). The current limitation is that you can only change the destination to the local machine (e.g. `-j DNAT --to 127.0.0.1'), not to any other machine, otherwise the replies won't be translated correctly." Enabling DNAT on locally generated packets could thus be a possible piece of future work on IPTables that would provide an efficient solution to the problem. On the downside, there are some inherent drawbacks in the NAT system itself, which are discussed in detail in [hain00architectural, holdrege01protocol, sebue02network]. Since MOSIX works on Linux on x86 platforms, these NAT problems do not come into the picture.
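As a minimal sketch of steps (b) and (d) above, not an implementation from this report, the per-pair rule lifecycle could be wrapped as follows; the function names, the MY_IP variable, and the choice of shelling out to the iptables binary are all hypothetical:

    # Hypothetical per-pair rule management on the UHN (Node b in the tests above).
    MY_IP=192.168.0.2      # address of the redirecting node -- placeholder value

    add_pair() {           # usage: add_pair <client-ip> <server-ip> <port>
        iptables -t nat -A PREROUTING -p tcp -s "$1" --dport "$3" \
            -j DNAT --to-destination "$2:$3"
        iptables -t nat -A POSTROUTING -p tcp -d "$2" --dport "$3" \
            -j SNAT --to-source "$MY_IP"
        iptables -A FORWARD -p tcp -d "$2" --dport "$3" -j ACCEPT
    }

    del_pair() {           # same arguments; -D deletes the matching rule
        iptables -t nat -D PREROUTING -p tcp -s "$1" --dport "$3" \
            -j DNAT --to-destination "$2:$3"
        iptables -t nat -D POSTROUTING -p tcp -d "$2" --dport "$3" \
            -j SNAT --to-source "$MY_IP"
        iptables -D FORWARD -p tcp -d "$2" --dport "$3" -j ACCEPT
    }

An integrated implementation would presumably drive the netfilter interfaces (e.g. libiptc) directly rather than shelling out, but the rule lifecycle would be the same.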
7 Related Research
A variety of different approaches have been taken to resolve the problem discussed in section 1.2 of this report. These approaches can be classified into two categories: one
that addresses the problem of NAT at the source, and another that addresses the problem of socket migration. We discuss various research efforts related to each approach below.
Mobile Communication with Virtual Network Address Translation (VNAT) [gong02mobile] is an architecture that allows transparent migration of end-to-end live network connections associated with various computation units; such a unit can be a single process, a group of processes, or an entire host. VNAT virtualizes the network connections perceived by transport protocols so that the identification of a network connection is decoupled from stationary hosts. Virtual connections are then remapped onto physical connections carried on the physical network using network address translation.
However, VNAT is tailored specifically for the ZAP project [steven02design].
MIGSOCK [bryan02migsock] is a project at the Carnegie Mellon University Information Networking Institute that implements the migration of TCP sockets in the Linux operating system. MIGSOCK provides a kernel module that re-implements TCP to make migration possible. The implementation requires modifications (patches) to the kernel files and a migration option made available to user applications; the remainder of the functionality lives in the kernel module, which can be loaded on demand. This seems like a good patch that could be applied to MOSIX to eradicate the problem discussed in section 1.2. However, the source code for this software was available only on request from the authors; e-mail requests were sent without any response. Also, this software has not yet been integrated with MOSIX.
[alex00end] presents an architecture that allows suspending and resuming TCP connections. However, it does not support migration of TCP connections in which both endpoints move simultaneously.
MSOCKS [david98msocks] presents an architecture called Transport Layer Mobility that allows mobile nodes not only to change their point of attachment to the Internet, but also to control which network interfaces are used for the different kinds of data leaving from and arriving at the mobile node. MSOCKS implements this transport-layer mobility scheme using a split-connection proxy architecture and a new technique called TCP Splice, which gives split-connection proxy systems the same end-to-end semantics as normal TCP connections. However, MSOCKS handles a mobile client and a stationary server, so it does not match the problem in section 1.2 well.
The MOSIX web page [mosix02web] mentions socket migration as an ongoing project.
8 References

[alex00end] Alex C. Snoeren and Hari Balakrishnan, An End-to-End Approach to Host Mobility, Proceedings of the 6th International Conference on Mobile Computing and Networking (MobiCom '00), Boston, MA, August 2000.

[americo02performance] Américo J. Melara, Performance Analysis of the Linux Firewall in a Host, Master's Thesis, California Polytechnic State University, San Luis Obispo, June 2002.

[barak98mosix] Barak A. and La'adan O., The MOSIX Multicomputer Operating System for High Performance Cluster Computing, Journal of Future Generation Computer Systems, Vol. 13, No. 4-5, pp. 361-372, March 1998.

[barak99scalable] Barak A., La'adan O. and Shiloh A., Scalable Cluster Computing with MOSIX for Linux, Proc. Linux Expo '99, pp. 95-100, Raleigh, NC, May 1999.

[bryan02migsock] Bryan Kuntz and Karthik Rajan, MIGSOCK: Migratable TCP Socket in Linux, Master's Thesis, Carnegie Mellon University, Information Networking Institute, February 2002.

[david98msocks] David A. Maltz and Pravin Bhagwat, MSOCKS: An Architecture for Transport Layer Mobility, Proceedings of IEEE INFOCOM '98, San Francisco, CA, 1998.

[gong02mobile] Gong Su and Jason Nieh, Mobile Communication with Virtual Network Address Translation, Technical Report CUCS-003-02, Department of Computer Science, Columbia University, February 2002.

[hain00architectural] T. Hain, Architectural Implications of NAT, RFC 2993, IETF, November 2000.

[holdrege01protocol] M. Holdrege and P. Srisuresh, Protocol Complications with the IP Network Address Translator, RFC 3027, IETF, January 2001.

[mosix02web] The MOSIX web site, http://www.mosix.org

[rusty02linuxnat] Rusty Russell, Linux 2.4 NAT HOWTO, Linux Netfilter Core Team, http://www.netfilter.org/documentation/HOWTO/NATHOWTO.html, January 2002.

[rusty02linuxnetfilter] Rusty Russell and Harald Welte, Linux netfilter Hacking HOWTO, Linux Netfilter Core Team, http://www.netfilter.org/documentation/HOWTO//netfilter-hacking-HOWTO.html, July 2002.

[sebue02network] D. Senie, Network Address Translator (NAT)-Friendly Application Design Guidelines, RFC 3235, IETF, January 2002.

[steven02design] Steven Osman, Dinesh Subhraveti, Gong Su, and Jason Nieh, The Design and Implementation of Zap: A System for Migrating Computing Environments, Proceedings of the Fifth Symposium on Operating Systems Design and Implementation (OSDI 2002), Boston, MA, December 9-11, 2002.