CSE-402 E Distributed Operating System
UNIT-1
Introduction to Distributed Systems:
Why do we develop distributed systems?
• availability of powerful yet cheap microprocessors (PCs, workstations)
• continuing advances in communication technology
What is a distributed system?
A distributed system is a collection of independent computers that appear to the users of the system as a single system. Examples:
• Network of workstations
• Distributed manufacturing system (e.g., automated assembly line)
• Network of branch office computers
Advantages of Distributed Systems over Centralized Systems: •
Economics: a collection of microprocessors offer a better price/performance than mainframes. Low price/performance ratio: cost effective way to increase computing power.
•
Speed: a distributed system may have more total computing power than a mainframe. Ex. 10,000 CPU chips, each running at 50 MIPS. Not possible to build 500,000 MIPS single processor since it would require 0.002 nsec instruction cycle. Enhanced performance through load distributing.
•
Inherent distribution: Some applications are inherently distributed. Ex. a supermarket chain.
•
Reliability: If one machine crashes, the system as a whole can still survive. Higher availability and improved reliability.
•
Incremental growth: Computing power can be added in small increments. Modular expandability World Institute Of Technology 8km milestone ,Sohna Palwal Road , NH-71 B ,Sohna , Gurgaon ,Haryana. Website : www.wit.net.in
• Another driving force: the existence of large numbers of personal computers, and the need for people to collaborate and share information.
Advantages of Distributed Systems over Independent PCs:
• Data sharing: allow many users to access a common database
• Resource sharing: expensive peripherals like color printers
• Communication: enhance human-to-human communication, e.g., email, chat
• Flexibility: spread the workload over the available machines
Disadvantages of Distributed Systems:
• Software: difficult to develop software for distributed systems
• Network: saturation, lossy transmissions
• Security: easy access also applies to secret data
Hardware Concepts: MIMD (Multiple-Instruction Multiple-Data)
Tightly Coupled versus Loosely Coupled:
• Tightly coupled systems (multiprocessors)
  o shared memory
  o intermachine delay short, data rate high
• Loosely coupled systems (multicomputers)
  o private memory
  o intermachine delay long, data rate low
Bus versus Switched MIMD:
• Bus: a single network, backplane, bus, cable, or other medium that connects all the machines. E.g., cable TV.
• Switched: individual wires from machine to machine, with many different wiring patterns in use.
Multiprocessors (shared memory)
– Bus
– Switched
Multicomputers (private memory)
– Bus
– Switched
Bus-based Multiprocessors:
• bus-based multiprocessors
• cache memory
• hit rate
• cache coherence
• write-through cache: propagate writes immediately
• snoopy cache: monitor the bus to notice when a cache entry becomes obsolete
Switched Multiprocessors:
• for connecting large numbers (say over 64) of processors
• crossbar switch: n^2 switch points
• omega network: built from 2x2 switches; for n CPUs and n memories, log n switching stages, each with n/2 switches
• in total, (n log n)/2 switches
• delay problem: e.g., with n = 1024 there are 10 switching stages from CPU to memory and 10 back, a total of 20 switching stages per access. At 100 MIPS (10 nsec instruction execution time), the switches must therefore switch in 10/20 = 0.5 nsec.
• NUMA (Non-Uniform Memory Access): the placement of program and data matters
• building a large, tightly-coupled, shared-memory multiprocessor is possible, but difficult and expensive
Multicomputers:
Bus-Based Multicomputers
• easy to build
• communication volume much smaller than in a shared-memory multiprocessor
• relatively slow-speed LAN (10-100 Mbps, compared to 300 Mbps and up for a backplane bus)
Switched Multicomputers
• interconnection networks, e.g., grid, hypercube
• hypercube: n-dimensional cube
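To make the hypercube concrete: each of the 2^n nodes is labeled with an n-bit number, and two nodes are wired together exactly when their labels differ in one bit. A small sketch of that neighbor rule (illustrative only, not from the course text):

```python
def hypercube_neighbors(node: int, dimensions: int) -> list:
    """A node's neighbors differ from it in exactly one bit of the label."""
    return [node ^ (1 << bit) for bit in range(dimensions)]

# In a 3-dimensional cube (8 nodes), node 000 connects to 001, 010 and 100.
assert hypercube_neighbors(0b000, 3) == [0b001, 0b010, 0b100]
```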
Software Concepts:
• Software is more important to users than hardware
• Three types:
  1. Network Operating Systems
  2. (True) Distributed Systems
  3. Multiprocessor Time Sharing
Network Operating Systems:
• loosely-coupled software on loosely-coupled hardware
• a network of workstations connected by a LAN
• each machine has a high degree of autonomy
  o rlogin machine
  o rcp machine1:file1 machine2:file2
• file servers: the client and server model
• clients mount directories on file servers
• best-known network OS: Sun's NFS (Network File System) for shared file systems
• a few system-wide requirements: the format and meaning of all the messages exchanged
NFS: NFS Architecture
• Servers export directories
• Clients mount exported directories
NFS Protocols
• one for handling mounting
• one for read/write: no open/close; stateless
NFS Implementation
(True) Distributed Systems: tightly-coupled software on loosely-coupled hardware; provide a single-system image or a virtual uniprocessor
• a single, global interprocess communication mechanism, process management, and file system; the same system call interface everywhere
Ideal definition: "A distributed system runs on a collection of computers that do not have shared memory, yet looks like a single computer to its users."
Multiprocessor Operating Systems:
• tightly-coupled software on tightly-coupled hardware
• examples: high-performance servers
• shared memory
• single run queue
• traditional file system as on a single-processor system: central block cache
Design Issues of Distributed Systems:
• Transparency
• Flexibility
• Reliability
• Performance
• Scalability
1. Transparency
How to achieve the single-system image, i.e., how to make a collection of computers appear as a single computer. Hiding all the distribution from the users as well as the application programs can be achieved at two levels:
a. hide the distribution from users
b. at a lower level, make the system look transparent to programs.
Both levels require uniform interfaces, such as access to files and communication.
Types of transparency:
– Location Transparency: users cannot tell where hardware and software resources such as CPUs, printers, files, and databases are located.
– Migration Transparency: resources must be free to move from one location to another without their names changing. E.g., /usr/lee vs. /central/usr/lee.
– Replication Transparency: the OS can make additional copies of files and resources without users noticing.
– Concurrency Transparency: users are not aware of the existence of other users. The system must allow multiple users to concurrently access the same resource, e.g., with lock and unlock for mutual exclusion.
– Parallelism Transparency: automatic use of parallelism without having to program it explicitly. The holy grail for distributed and parallel system designers.
Users do not always want complete transparency: e.g., they may want to know that the fancy printer is 1000 miles away.
2. Flexibility
Make it easier to change.
Monolithic kernel: system calls are trapped and executed by the kernel; all system calls are served by the kernel (e.g., UNIX).
Microkernel: provides minimal services:
1) IPC
2) some memory management
3) some low-level process management and scheduling
4) low-level I/O
E.g., Mach can support multiple file systems and multiple system interfaces.
3. Reliability
A distributed system should be more reliable than a single system. Example: with 3 machines, each up with probability 0.95 (down with probability 0.05), the probability that at least one is up is 1 - 0.05^3 = 0.999875.
a. Availability: the fraction of time the system is usable. Redundancy improves it.
b. Need to maintain consistency.
c. Need to be secure.
d. Fault tolerance: need to mask failures and recover from errors.
4. Performance
Without a gain here, why bother with distributed systems? Performance is lost due to communication delays:
a. fine-grain parallelism: high degree of interaction
b. coarse-grain parallelism
Performance is also lost by making the system fault tolerant.
5. Scalability
Systems grow with time or become obsolete. Techniques that require resources linear in the size of the system are not scalable; e.g., broadcast-based queries won't work for large distributed systems.
Examples of bottlenecks:
a. Centralized components: a single mail server
b. Centralized tables: a single URL address book
c. Centralized algorithms: routing based on complete information
Communication Networks:
Computers are connected through a communication network.
• Wide Area Networks (WAN) connect computers spread over a wide geographic area.
  – point-to-point or store-and-forward: data is transferred between computers through a series of switches
  – switch: a special-purpose computer responsible for routing data (to avoid network congestion)
  – data can be lost due to switch crashes, communication link failures, limited buffers at switches, transmission errors, etc.
• Packet Switching versus Circuit Switching
  i) circuit switching: a dedicated path between a source and a destination, e.g., a telephone connection. Wastes bandwidth (bandwidth = the amount of data transmitted in a given time period).
  ii) packet switching: a message is broken into packets, which are routed independently. Better network utilization, but disassembly and reassembly overheads.
The ISO OSI Reference Model
Local Area Networks (LAN)
Layered protocols:
• A layered protocol architecture provides a conceptual framework for dividing the complex task of exchanging information between remote hosts into simpler tasks.
• Each protocol layer has a narrowly defined responsibility.
• A protocol layer provides a standard interface to the next higher protocol layer. Consequently, it hides the details of the underlying physical network infrastructure.
• Benefit: the same user-level (application) program can be used over very diverse communication networks. Example: the same WWW browser can be used whether you are connected to the Internet via a LAN or a dial-up line.
ATM Networks:
We are moving toward a digital revolution in which every aspect of life will be touched by computer networks: reading books, watching films, collecting photographs, paying bills, or buying real estate. Computer systems are very complex and are capable of linking millions of computers and billions of telephone lines. The clicking mouse is more powerful and swift than many other means of communication, and a valiant explorer in an isolated corner of the earth can contact his or her family with the help of a satellite or a cellular telephone. How are these marvelous webs of interconnectivity built up? What lies at the foundation of a computer network, and why do networks work the way they do? Here we study the underlying principles of the Internet, telephone networks, and asynchronous transfer mode networks. ATM networks build on experience with the telephone network and the Internet to construct an integrated-services network that gives end-to-end quality of service. This goal leads to a distinctive set of design decisions that have influenced both networking research and commercial networking products. Do not confuse this ATM with the Automatic Teller Machine used by banks to
transact money.
Asynchronous Transfer Mode (ATM), a dedicated connection-switching technology, uses digital signaling to organize data into 53-byte cells and to transmit them over a physical medium. Each cell is processed asynchronously relative to other adjacent cells, and is queued before being multiplexed over the transmission path. ATM is designed so that it can be implemented easily in hardware rather than software; its processing is therefore faster, and higher switch speeds are possible. An ATM network may reach a speed of 10 Gbps; the pre-defined bit rates are 155.520 Mbps and 622.080 Mbps. ATM is a vital part of broadband ISDN (BISDN), along with SONET (Synchronous Optical Network) and several other technologies.
Types of ATM Network Connections:
ATM connections between endpoints are distinguished by their Quality of Service parameters and the formats of their addressing schemes. The type of an ATM connection is determined by ATM signaling. The signaling components are situated at the end stations and at the ATM switches. The creation, management, and termination of SVCs (switched virtual circuits) are handled by the signaling layer of ATM's software. UNI (User-Network Interface) is the ATM standard wire protocol used by the signaling software; the manner in which one ATM switch signals another ATM switch is covered by a second signaling standard known as NNI (Network-to-Network Interface). ATM connections are of two types:
o Point-to-point connections
o Point-to-multipoint connections
Point-to-Point Connection: If an ATM-aware process wants to connect to another process on some other network, it needs to establish an SVC, which it requests from the signaling software. The signaling software sends the request to create an SVC, via the ATM adapter and the reserved signaling VC, to the ATM switch. The ATM switches keep forwarding the request until it reaches its destination. Considering the ATM address for the connection and the switch's internal network database (routing table), each ATM switch determines which switch to propagate the request to next. Each switch also determines whether the service category and Quality of Service needs of the request can be met. If the requested virtual circuit is supported by all the switches, the end
station of the destination receives a packet carrying the VC number. From then on, the ATM-aware process can interact directly with the destination by sending packets to the VPI/VCI that identifies the specified VC.
Point-to-Multipoint Connection: Contrary to a LAN environment, ATM has no innate capability to broadcast or multicast packets; it is a connection-oriented medium. To compensate, the sending node could open a VC to every destination and send a copy of the data on each virtual circuit, but this is highly inefficient. Point-to-multipoint connections are an efficient way to achieve the goal. Such a connection links a single source endpoint, called the root node, to multiple destination endpoints, called leaves. Where the connection splits into two or more branches, the ATM switches copy cells to the multiple destinations. The process is unidirectional: the root can transmit to the leaves, but the leaves cannot transmit to the root or to each other on the same connection; leaf-to-root and leaf-to-leaf transmission each need their own connection. The reason for this limitation is AAL5's simplicity and its inability to interleave cells from multiple payloads on a single connection. Although complex, such connections play a vital role in boosting networks, and ATM is changing the way we communicate.
Client-Server Model:
The client-server paradigm is the most prevalent model for distributed computing protocols. It is the basis of all distributed computing paradigms at a higher level of abstraction. It is service-oriented, and employs a request-response protocol. A server process, running on a server host, provides access to a service. A client process, running on a client host, accesses the service via the server process. The interaction of the processes proceeds according to a protocol.
[Figure: a client process on the client host sends a service request over the network to a server process on the server host, which provides the service.]
Client-server applications and services: An application based on the client-server paradigm is a client-server application. On the Internet, many services are client-server applications, and these services are often known by the protocol that the application implements. Well-known Internet services include HTTP, FTP, DNS, finger, gopher, etc. User applications may also be built using the client-server paradigm.
Remote Procedure Calls and Group Communication:
Remote Procedure Call (RPC) provides a different paradigm for accessing network services. Instead of accessing remote services by sending and receiving messages, a client invokes services by making a local procedure call. The local procedure hides the details of the network communication. When making a remote procedure call:
1. The calling environment is suspended, procedure parameters are transferred across the network to the environment where the procedure is to execute, and the procedure is executed there.
2. When the procedure finishes and produces its results, the results are transferred back to the calling environment, where execution resumes as if returning from a regular procedure call.
The main goal of RPC is to hide the existence of the network from a program. As a result, RPC doesn't quite fit into the OSI model:
1. The message-passing nature of network communication is hidden from the user. The user doesn't first open a connection, read and write data, and then close the connection. Indeed, a client often does not even know it is using the network!
2. RPC often omits many of the protocol layers to improve performance. Even a small performance improvement is important because a program may invoke RPCs often. For example, on (diskless) Sun workstations, every file access is made via an RPC.
RPC is especially well suited for client-server (e.g., query-response) interaction in which the flow of control alternates between the caller and callee. Conceptually, the client and server do not both execute at the same time. Instead, the thread of execution jumps from the caller to the callee and then back again.
RPC Steps
The following steps take place during an RPC:
1. A client invokes a client stub procedure, passing parameters in the usual way. The client stub resides within the client's own address space.
2. The client stub marshalls the parameters into a message. Marshalling includes converting the representation of the parameters into a standard format, and copying each parameter into the message.
3. The client stub passes the message to the transport layer, which sends it to the remote server machine.
4. On the server, the transport layer passes the message to a server stub, which demarshalls the parameters and calls the desired server routine using the regular procedure call mechanism.
5.
When the server procedure completes, it returns to the server stub (e.g., via a normal procedure call return), which marshalls the return values into a message. The server stub then hands the message to the transport layer.
6. The transport layer sends the result message back to the client transport layer, which hands the message back to the client stub.
7. The client stub demarshalls the return parameters and execution returns to the caller.
RPC Issues
Issues that must be addressed:
Marshalling: Parameters must be marshalled into a standard representation. Parameters consist of simple types (e.g., integers) and compound types (e.g., C structures or Pascal records). Moreover, because each type has its own representation, the types of the various parameters must be known to the modules that actually do the conversion. For example, 4 bytes of characters would be uninterpreted, while a 4-byte integer may need the order of its bytes reversed.
Semantics: Call-by-reference is not possible: the client and server don't share an address space, so addresses referenced by the server would correspond to data residing in the client's address space. One approach is to simulate call-by-reference using copy-restore. In copy-restore, call-by-reference parameters are handled by sending a copy of the referenced data structure to the server, and on return replacing the client's copy with the one modified by the server. However, copy-restore doesn't work in all cases. For instance, if the same argument is passed twice, two copies will be made, and references through one parameter only change one of the copies.
Binding: How does the client know whom to call, and where the service resides? The most flexible solution is to use dynamic binding and find the server at run time when the RPC is first made. The first time the client stub is invoked, it contacts a name server to determine the transport address at which the server resides.
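The stub-and-marshalling pipeline of steps 1-7 can be sketched compactly. The following is a toy illustration, not a real RPC system: the server address, the add operation, and the wire format (two 32-bit big-endian integers in, one out) are all assumptions made for this example.

```python
import socket
import struct

SERVER_ADDR = ("localhost", 9000)        # assumed server location

def add(a, b):
    """Client stub: to the caller this looks like a local procedure."""
    request = struct.pack("!ii", a, b)   # marshal into a standard format
    with socket.create_connection(SERVER_ADDR) as sock:
        sock.sendall(request)            # hand off to the transport layer
        reply = sock.recv(4)             # result message from the server
    (result,) = struct.unpack("!i", reply)
    return result                        # demarshal and return to the caller

def serve_once():
    """Server stub: demarshal, call the real procedure, marshal the result."""
    with socket.create_server(SERVER_ADDR) as listener:
        conn, _ = listener.accept()
        with conn:
            a, b = struct.unpack("!ii", conn.recv(8))
            conn.sendall(struct.pack("!i", a + b))
```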
Transport protocol: What transport protocol should be used?
Exception handling: How are errors handled?
Binding
We'll examine one solution to the above issues by considering the approach taken by Birrell and Nelson. Binding consists of two parts:
Naming refers to what service the client wants to use. In B&N's system, remote procedures are named through interfaces. An interface uniquely identifies a particular service, describing the types and numbers of its arguments. It is similar in purpose to a type definition in programming languages. For example, a "phone" service interface might specify a single string argument that returns a character-string phone number.
Locating refers to finding the transport address at which the server actually resides. Once we have the transport address of the service, we can send messages directly to the server.
In B&N's system, a server having a service to offer exports an interface for it. Exporting an interface registers it with the system so that clients can use it. A client must import an (exported) interface before communication can begin. The export and import operations are analogous to those found in object-oriented systems.
Interface names consist of two parts:
1. A unique type specifies the interface (service) provided. Type is a high-level specification, such as "mail" or "file access".
2. An instance specifies a particular server offering a type (e.g., "file access on wpi").
Name Server
B&N's RPC system was developed as part of a distributed system called Grapevine. Grapevine was developed at Xerox by the same research group that developed the Ethernet. Among other things, Grapevine provides a distributed, replicated database, implemented by servers residing at various locations around the internet. Clients can query, add new entries to, or modify existing entries in the database. The Grapevine database maps character-string keys to entries called RNames. There are two types of entries:
Individual: A single instance of a service. Each server registers the transport address at which its service can be accessed, and every instance of an interface is registered as an individual entry. Individual entries map instances to their corresponding transport addresses.
Group: The type of an interface, which consists of a list of individual RNames. Group entries contain RNames that point to servers providing the service having that group name; they map a type (interface) to the set of individual entries providing that service. For example, if A and B both offered file access, the group entry "file access" would consist of two individual RNames, one each for A's and B's servers.
Semantics of RPC
Unlike normal procedure calls, many things can go wrong with RPC. Normally, a client will send a request, and the server will execute the request and then return a response to the client. What are appropriate semantics for server or network failures? Possibilities:
1. Just hang forever waiting for the reply that will never come. This models a regular procedure call: if a normal procedure goes into an infinite loop, the caller never finds out. Of course, few users will like such semantics.
2.
Time out and raise an exception or report failure to the client. Of course, finding an appropriate timer value is difficult: if the remote procedure takes a long time to execute, a timer might time out too quickly.
3. Time out and retransmit the request.
While the last possibility seems the most reasonable, it may lead to problems. Suppose that:
1. The client transmits a request; the server executes it, but then crashes before sending a response. If we don't get a response, is there any way of knowing whether the server acted on the request?
2. The server restarts, and the client retransmits the request. What happens? Now the server will reject the retransmission, because the supplied unique identifier no longer matches that in the server's export table. At this point, the client can decide to rebind to a new server and retry, or it can give up.
3. Suppose the client rebinds to another server, retransmits the request, and gets a response. How many times will the request have been executed? At least once, and possibly twice. We have no way of knowing.
Operations that can safely be executed twice are called idempotent. For example, fetching the current time and date, or retrieving a particular page of a file. Is deducting $10,000 from an account idempotent? No: one can only deduct the money once. Likewise, deleting a file is not idempotent. If the delete request is executed twice, the first attempt will be successful, while the second attempt produces a "nonexistent file" error.
RPC Semantics
While implementing RPC, B&N determined that the semantics of RPCs could be categorized in various ways:
Exactly once: The most desirable kind of semantics, where every call is carried out exactly once, no more and no less. Unfortunately, such semantics cannot be achieved at low cost: if the client transmits a request and the server crashes, the client has no way of knowing whether the server received and processed the request before crashing.
At most once: When control returns to the caller, the operation will have been executed no more than once. What happens if the server crashes? The client will be notified of the error, but will have no way of knowing whether or not the operation was performed.
At least once: The client just keeps retransmitting the request until it gets the desired response. On return to the caller, the operation will have been performed at least one time, but possibly multiple times.
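At-least-once semantics is essentially a retransmit-until-reply loop on the client side, and is only safe for idempotent operations. A minimal sketch, assuming a connected datagram socket and an arbitrary 2-second timeout:

```python
import socket

def call_at_least_once(sock, request, timeout=2.0):
    """Retransmit until a reply arrives (at-least-once semantics).

    The server may execute the request more than once, so the operation
    must be idempotent: "read page 5" is safe, "deduct $10,000" is not.
    """
    sock.settimeout(timeout)
    while True:
        sock.send(request)
        try:
            return sock.recv(4096)   # the reply doubles as the acknowledgment
        except socket.timeout:
            continue                 # assume the packet was lost; retransmit
```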
Transport Protocols for RPC
Can we implement RPC on top of an existing transport protocol such as TCP? Yes. However, reliable stream protocols are designed for a different purpose: high throughput. The cost of setting up and terminating a connection is insignificant in comparison to the amount of data exchanged, and most of the elapsed time is spent sending data. With RPC, low latency is more important than high throughput: if applications are going to use RPC much like they use regular procedures (i.e., over and over again), performance is crucial. RPC can be characterized as a specific instance of transaction-oriented communication, where:
• A transaction consists of a single request and a single response.
• A transaction is initiated when a client sends a request and terminated by the server's response.
How many TCP packets would be required for a single request-response transaction? A minimum of 5 packets: 3 for the initial handshake, plus 2 for the FIN and FIN-ACK (assuming that we can piggyback data and a FIN on the third packet of the 3-way handshake).
A transaction-oriented transport protocol should efficiently handle the following cases:
1. Transactions in which both the request and response messages fit in a single packet. The response can serve as an acknowledgment, and the client handles the case of lost packets by retransmitting the original request.
2. Large multi-packet request and response messages, where the data does not necessarily fit in a single packet. For instance, some systems use RPC to fetch pages of a file from a file server. A single-packet request would specify the file name, the starting position of the data desired, and the number of bytes to be read. The response may consist of several pages (e.g., 8K bytes) of data.
Group Communication
Three types of communication:
1. unicast: point-to-point
2. broadcast: one-to-all
3. multicast: one-to-some (a group)
Multicast is the most general and can subsume the other two. How is it supported?
• multiple unicasts
• broadcast, with each machine filtering
• hardware directly (Ethernet has 2^23 multicast addresses)
Design Issues
• Closed versus open groups: can a nonmember send to the group?
• Peer groups versus a central coordinator (there may be a hybrid where one member of a peer group temporarily takes over coordination)
• Group membership: joining and leaving a group; centralized vs. distributed
• Group addressing: a distributed game (temporary addressing); a set of name servers (well-known group address); predicate addressing, where a predicate is evaluated by the receiver to decide whether or not it should actually receive the message. Compare to my work of using the query to actually compute a multicast address.
• Send/Receive primitives: RPC does not work so naturally. How to deal with multiple replies? We may not know how many replies to expect.
• Reliability: atomicity/atomic broadcast, i.e., the message gets to all members of the group or to none.
• Message ordering: all nodes see messages in the same order.
• Overlapping groups: synchronization between groups.
• Scalability: do not depend on a single LAN, for example.
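As a concrete illustration of one-to-some communication, here is a minimal IP-multicast sender and receiver in Python; the group address 224.1.1.1 and port 5007 are arbitrary choices for the example:

```python
import socket
import struct

GROUP, PORT = "224.1.1.1", 5007      # assumed multicast group and port

def send(message: bytes) -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    sock.sendto(message, (GROUP, PORT))   # one send reaches the whole group

def receive() -> bytes:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    # Joining the group tells the kernel (and NIC) to accept these frames.
    mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    data, _sender = sock.recvfrom(1024)
    return data
```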
Systems: ISIS, a research project from Cornell, is a toolkit for building distributed applications. MBONE is the Internet multicast backbone.
Middleware and Distributed Operating Systems:
Middleware is a class of software technologies designed to help manage the complexity and heterogeneity inherent in distributed systems. It is defined as a layer of software above the operating system but below the application program that provides a common programming abstraction across a distributed system. In doing so, it provides a higher-level building block for programmers than Application Programming Interfaces (APIs) such as sockets that are provided by the operating system. This significantly reduces the burden on application programmers by relieving them of tedious and error-prone low-level network programming. Middleware is sometimes informally called "plumbing" because it connects parts of a distributed application with data pipes and then passes data between them.
Middleware frameworks are designed to mask some of the kinds of heterogeneity that programmers of distributed systems must deal with. They always mask heterogeneity of networks and hardware. Most middleware frameworks also mask heterogeneity of operating systems or programming languages, or both. A few, such as CORBA, also mask heterogeneity among vendor implementations of the same middleware standard. Finally, programming abstractions offered by middleware can provide transparency with respect to distribution in one or more of the following dimensions: location, concurrency, replication, failures, and mobility.
The classical definition of an operating system is "the software that makes the hardware usable." Similarly, middleware can be considered the software that makes a distributed system programmable. Just as a bare computer without an operating system could be programmed only with great difficulty, programming a distributed system is in general much more difficult without middleware, especially when heterogeneous operation is required. Likewise, it is possible to program an application in assembly language or even machine code, but most programmers find it far more productive to use high-level languages, and the resulting code is of course also portable.
Distributed Systems: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
UNIT-2
Synchronization in Distributed Systems:
• Synchronization: the coordination of actions between processes.
• Processes are usually asynchronous (they operate without regard to events in other processes).
• Sometimes processes need to cooperate/synchronize
  – for mutual exclusion
  – for event ordering (was message x from process P sent before or after message y from process Q?)
• Synchronization in centralized systems is primarily accomplished through shared memory
  – event ordering is clear because all events are timed by the same clock
• Synchronization in distributed systems is harder
  – no shared memory
  – no common clock
Clock synchronization:
Clock synchronization deals with understanding the temporal ordering of events produced by concurrent processes. It is useful for synchronizing senders and receivers of messages, controlling joint activity, and serializing concurrent access to shared objects. The goal is that multiple unrelated processes running on different machines should be in agreement with, and be able to make consistent decisions about, the ordering of events in a system. For these kinds of events, we introduce the concept of a logical clock, one where the clock need not have any bearing on the time of day but rather is able to create event sequence numbers that can be used for comparing sets of events, such as messages, within a distributed system. Another aspect of clock synchronization deals with synchronizing time-of-day clocks among groups of machines. In this case, we want to ensure that all machines can report the same time, regardless of how imprecise their clocks may be or what the network latencies are between the machines.
A consistent view of time:
The common-sense reaction to making time-based decisions is to rely upon a time-of-day clock. Most computers have them, and it would seem a simple matter to attach a time-of-day timestamp to any message or other event whose time we need to mark and possibly compare with the time of other events. This method is known as global time ordering. There are a couple of problems with this approach. The first is that we have no assurance that clocks on different machines are synchronized. If machine A generates a message at 4:15:00 and machine B generates a message at 4:15:20, it is quite possible that machine B's message was generated prior to that of machine A, if B's clock was over 20 seconds too fast. Even if we synchronize periodically, it is quite possible (even likely) that the clocks run at different speeds and drift apart to report different times. The second problem is that two events on two different systems may actually occur at exactly the same time (to the precision of the clock, at least) and thus be tagged with identical timestamps. If we have algorithms that compare messages to pick one over another, and rely on them coming up with the same answer on all systems, we have a problem, as there will be no unique way to select one message over another consistently.
Logical clocks:
Let's again consider cases that involve assigning sequence numbers ("timestamps") to events upon which all cooperating processes can agree. What matters in these cases is not the time of
day at which the event occurred but that all processes can agree on the order in which related events occur. Our interest is in getting event sequence numbers that make sense system-wide. These clocks are called logical clocks. If we can do this across all events in the system, we have something called total ordering: every event is assigned a timestamp, and every such timestamp is unique. However, we don't always need total ordering: if processes do not interact, then we don't care when their events occur. If we only care about assigning timestamps to related (causal) events, then we have something known as partial ordering.
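Lamport's logical clock rules make this concrete: increment a local counter on every event, stamp outgoing messages with it, and on receipt advance the counter past the incoming stamp. A minimal sketch (the class and method names are illustrative):

```python
class LamportClock:
    """A logical clock: respects causal order, ignores wall-clock time."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1               # every local event advances the clock
        return self.time

    def send_event(self):
        return self.local_event()    # stamp the outgoing message with this

    def receive_event(self, msg_timestamp):
        # Jump past the sender's stamp so the receive orders after the send.
        self.time = max(self.time, msg_timestamp) + 1
        return self.time
```

Timestamps alone give only a partial order; appending the process ID as a tie-breaker, i.e., comparing (timestamp, pid) pairs, turns it into a total order.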
Mutual Exclusion:
• Processes communicate only through messages: no shared memory or clock.
• Processes must expect unpredictable message delays.
• Processes coordinate access to shared resources (printer, file, etc.) that should only be used in a mutually exclusive manner.
Example Use - A Variation of the Readers/Writers Problem:
• Consider a system where a file server is replicated at several sites in the network: users access the most readily available version.
• Multiple readers are okay.
• For replica consistency, updates should be done at all sites at the same time.
• One way: enforce mutual exclusion on a write broadcast, i.e., no writes anywhere until previous writes complete. (This may not be the best way to solve the problem.)
Example: Overlapped Access to a Shared File:
• Airline reservation systems maintain records of available seats.
• Suppose two people buy the same seat, because each checks and finds the seat available, then each buys the seat.
• Overlapped accesses generate different results than serial accesses: a race condition.
Election Algorithms:
For simplicity, we assume the following:
• Processes each have a unique, positive identifier
  – processor ID
  – IP number
• All processes know all other process identifiers.
• The process with the highest-valued identifier is duly elected coordinator.
Many distributed systems require a single process to act as coordinator (for various reasons):
– the time server in Berkeley's algorithm
– the coordinator in the two-phase commit protocol
– the master process in distributed computations
– the master database server
• The coordinator may fail, so the distributed group of processes must execute an election algorithm to determine a new coordinator process.
Goal of Election Algorithms: The overriding goal of all election algorithms is to have all the processes in a group agree on a coordinator.
Bully: "the biggest guy in town wins".
The "Bully" Election Algorithm (1)
Assumes:
• Reliable message delivery (but processes may crash)
• The system is synchronous (timeouts are used to detect a process failure)
• Each process knows which processes have higher identifiers and can communicate with them
The "Bully" Election Algorithm (2)
When any process, P, notices that the coordinator is no longer responding, it initiates an election:
• P sends an ELECTION message to all processes with higher id numbers.
• If no one responds, P wins the election and becomes coordinator.
• If a higher process responds, it takes over. Process P's job is done.
The "Bully" Election Algorithm (3)
• At any moment, a process can receive an ELECTION message from one of its lower-numbered colleagues.
• The receiver sends an OK back to the sender and conducts its own election.
• Eventually only the bully process remains. The bully announces victory to all processes in the distributed group.
The "Bully" Election Algorithm (4)
• When a process "notices" that the current coordinator is no longer responding (e.g., 4 deduces that 7 is down), it sends out an ELECTION message to all higher-numbered processes.
• If none responds, it (i.e., 4) becomes the coordinator (sending out a COORDINATOR message to all other processes informing them of this change of coordinator).
• If a higher-numbered process responds to the ELECTION message with an OK message, the election is cancelled and the higher-up process starts its own election.
The "Bully" Election Algorithm (5)
• 6 wins the election.
• When the original coordinator (i.e., 7) comes back on-line, it simply sends out a COORDINATOR message, as it is the highest-numbered process (and it knows it).
• Simply put: the process with the highest-numbered identifier bullies all others into submission.
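In outline, the bully algorithm is only a few lines. A sketch, where send() and wait_for_ok() are hypothetical helpers over the message layer and all_ids is the known set of process identifiers:

```python
def hold_election(my_id, all_ids, send, wait_for_ok):
    """Bully election, run when the coordinator stops responding."""
    higher = [p for p in all_ids if p > my_id]
    for p in higher:
        send(p, "ELECTION")                  # challenge every bigger process
    if higher and wait_for_ok(timeout=2.0):
        return                               # someone bigger took over
    for p in all_ids:                        # no one answered: I win
        if p != my_id:
            send(p, ("COORDINATOR", my_id))
```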
Ring Algorithm:
• The ring algorithm assumes that the processes are arranged in a logical ring and that each process knows the order of the ring.
• Processes are able to "skip" faulty systems: instead of sending to process j, send to j + 1.
• Faulty systems are those that don't respond in a fixed amount of time.
• P thinks the coordinator has crashed, so it builds an ELECTION message which contains its own ID number.
• P sends the message to its first live successor.
• Each process adds its own number and forwards the message to the next.
• It is OK to have two elections at once.
• When the message returns to P, it sees its own process ID in the list and knows that the circuit is complete.
• P then circulates a COORDINATOR message with the new highest number.
• Here, both 2 and 5 elect 6: [5,6,0,1,2,3,4] [2,3,4,5,6,0,1]
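A sketch of the ring version, assuming a hypothetical send_to_next() helper that already skips dead successors:

```python
def start_election(my_id, send_to_next):
    # Begin circulating an ELECTION message containing only our own id.
    send_to_next(("ELECTION", [my_id]))

def on_message(my_id, msg, send_to_next, set_coordinator):
    kind, payload = msg
    if kind == "ELECTION":
        if my_id in payload:
            # The message made a full circuit: announce the highest id.
            send_to_next(("COORDINATOR", max(payload)))
        else:
            send_to_next(("ELECTION", payload + [my_id]))
    else:  # COORDINATOR (a full version forwards it once around the ring)
        set_coordinator(payload)
```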
Atomic Transaction:
Transaction:
• A sequence of operations that perform a single logical function
• Examples:
  • Withdrawing money from your account
  • Making an airline reservation
  • Making a credit-card purchase
  • Registering for a course
• Usually used in the context of databases
Atomic Transaction:
• A transaction that happens completely or not at all
  • No partial results
  • Ex.: a cash machine hands you cash and deducts the amount from your account
Example:
• An airline confirms your reservation and
  – reduces the number of free seats
  – charges your credit card
  – (sometimes) increases the number of meals loaded on the flight
Fundamental principles – A C I D:
• Atomicity: to the outside world, the transaction happens indivisibly
• Consistency: the transaction preserves system invariants
• Isolation: transactions do not interfere with each other
• Durability: once a transaction "commits," the changes are permanent
Programming in a Transaction System:
• Begin_transaction: mark the start of a transaction
• End_transaction: mark the end of a transaction and try to "commit"
• Abort_transaction: terminate the transaction and restore the old values
• Read: read data from a file, table, etc., on behalf of the transaction
• Write: write data to a file, table, etc., on behalf of the transaction
• As a matter of practice, separate transactions are handled in separate threads or processes
• The isolation property means that two concurrent transactions are serialized
  • I.e., they run in some indeterminate order with respect to each other
Nested Transactions:
• One or more transactions inside another transaction
• May individually commit, but may need to be undone
• Example: planning a trip involving three flights
  • The reservation for each flight "commits" individually
  • Each must be undone if the entire trip cannot commit
Tools for Implementing Atomic Transactions (single system):
• Stable storage
  • i.e., writes to disk are performed "atomically"
• Log file
  • i.e., record actions in a log before "committing" them
  • the log is kept in stable storage
• Locking protocols
  • serialize Read and Write operations on the same data by separate transactions
• Begin_transaction
  • Place a begin entry in the log
• Write
  • Write the updated data to the log
• Abort_transaction
  • Place an abort entry in the log
• End_transaction (i.e., commit)
  • Place a commit entry in the log
  • Copy the logged data to the files
  • Place a done entry in the log
• Crash recovery: search the log
  • If a begin entry, look for matching entries
  • If done, do nothing (all files have been updated)
  • If abort, undo any permanent changes that the transaction may have made
  • If commit but not done, copy the updated blocks from the log to the files, then add a done entry
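The logging scheme above can be sketched as a toy write-ahead log. In-memory lists and dicts stand in for the log file and the data files; a real implementation must also force each log record to stable storage (fsync) before proceeding:

```python
log = []      # stands in for the log file on stable storage
files = {}    # stands in for the real data files

def write(txn, key, value):
    log.append(("write", txn, key, value))    # log before touching files

def commit(txn):
    log.append(("commit", txn))
    for op, t, *rest in list(log):            # copy logged data to files
        if op == "write" and t == txn:
            key, value = rest
            files[key] = value
    log.append(("done", txn))

def recover():
    committed = {t for op, t, *rest in log if op == "commit"}
    done = {t for op, t, *rest in log if op == "done"}
    for op, t, *rest in list(log):            # redo committed-but-not-done
        if op == "write" and t in committed - done:
            key, value = rest
            files[key] = value
    for t in committed - done:
        log.append(("done", t))
```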
Distributed Atomic Transactions:
• Atomic transactions that span multiple sites and/or systems
• Same semantics as atomic transactions on a single system
  • ACID
• Failure modes
  • Crash or other failure of one site or system
  • Network failure or partition
  • Byzantine failures
General Solution - Two-Phase Commit:
• One site is elected coordinator of the transaction T
  • See the election algorithms above
• Phase 1: When the coordinator is ready to commit the transaction
  • Place a Prepare(T) state in the log on stable storage
  • Send a Vote_request(T) message to all other participants
  • Wait for replies
• Phase 2: Coordinator
  – If any participant replies Abort(T)
    • Place the Abort(T) state in the log on stable storage
    • Send a Global_Abort(T) message to all participants
    • Locally abort transaction T
  – If all participants reply Ready_to_commit(T)
    • Place the Commit(T) state in the log on stable storage
    • Send a Global_Commit(T) message to all participants
    • Proceed to commit the transaction locally
• Phase 1: Participant gets Vote_request(T) from the coordinator
  • Place an Abort(T) or Ready(T) state in the local log
  • Reply with an Abort(T) or Ready_to_commit(T) message to the coordinator
  • If in the Abort(T) state, locally abort the transaction
• Phase 2: Participant
  • Wait for a Global_Abort(T) or Global_Commit(T) message from the coordinator
  • Place the Abort(T) or Commit(T) state in the local log
  • Abort or commit locally per the message
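Putting the coordinator's side together as a sketch; log(), send(), and collect_votes() are hypothetical stand-ins for stable storage and the message layer:

```python
def coordinate_2pc(txn, participants, log, send, collect_votes):
    # Phase 1: persist intent, then ask every participant to vote.
    log("Prepare", txn)
    for p in participants:
        send(p, ("Vote_request", txn))
    votes = collect_votes(participants, timeout=2.0)  # timeout counts as Abort

    # Phase 2: decide, persist the decision, then announce it to everyone.
    if all(v == "Ready_to_commit" for v in votes):
        log("Commit", txn)
        decision = "Global_Commit"
    else:
        log("Abort", txn)
        decision = "Global_Abort"
    for p in participants:
        send(p, (decision, txn))
    return decision == "Global_Commit"
```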
Two-Phase Commit States:
[Figure: the state machines of the coordinator and of a participant in two-phase commit.]
Failure Recovery - Two-Phase Commit:
• Failure modes (from the coordinator's point of view)
  – Own crash
  – Wait state: no response from some participant to the Vote_request message
• Failure modes (from the participant's point of view)
  – Own crash
  – Ready state: no Global_Abort(T) or Global_Commit(T) message from the coordinator
Lack of Response to the Coordinator's Vote_request(T) Message:
• E.g.,
  – participant crash
  – network failure
• Timeout is considered equivalent to Abort
  – Place the Abort(T) state in the log on stable storage
  – Send a Global_Abort(T) message to all participants
  – Locally abort transaction T
Coordinator Crash:
• Inspect the log
• If in the Abort or Commit state
  – Resend the corresponding message
  – Take the corresponding local action
• If in the Prepare state, either
  – Resend Vote_request(T) to all other participants and wait for their responses; or
  – Unilaterally abort the transaction
    • I.e., put Abort(T) in own log on stable storage
    • Send a Global_Abort(T) message to all participants
• If nothing is in the log, abort the transaction as above
No Response to a Participant's Ready_to_commit(T) Message:
• Re-contact the coordinator, ask what to do
• If unable to contact the coordinator, contact the other participants and ask if they know
• If any other participant is in the Abort or Commit state
  • Take the equivalent action
• Otherwise, wait for the coordinator to restart!
  – Participants are blocked, unable to go forward or back
  – Frozen in the Ready state!
Participant Crash:
• Inspect the local log
  – Commit state:
    • Redo/replay the transaction
  – Abort state:
    • Undo/abort the transaction
  – No records about T:
    • Same as local_abort(T)
  – Ready state:
    • Same as no response to the Ready_to_commit(T) message
Three-Phase Commit:
• A minor variation of two-phase commit
• Widely quoted in the literature
• Rarely implemented
  • Because indefinite blocking due to coordinator failures doesn't happen very often in real life!
• There is no state from which a transition can be made directly to either the Commit or the Abort state.
• There is no state in which it is impossible to make a final decision and from which a transition to the Commit state can be made.
• The coordinator sends Vote_request (as before).
• If all participants respond affirmatively,
  • Put the Precommit state into the log on stable storage
  • Send out a Prepare_to_Commit message to all
• After all participants acknowledge,
  • Put the Commit state in the log
  • Send out Global_Commit
• Coordinator blocked in the Ready state
  • Safe to abort the transaction
• Coordinator blocked in the Precommit state
  • Safe to issue Global_Commit
  • Any crashed or partitioned participants will commit when recovered
• Participant blocked in the Precommit state
  • Contact the others
  • Collectively decide to commit
• Participant blocked in the Ready state
  • Contact the others
  • If any is in Abort, then abort the transaction
  • If any is in Precommit, then move to the Precommit state
Deadlock in Distributed Systems:
The same conditions for deadlock in uniprocessors apply to distributed systems. Unfortunately, as in many other aspects of distributed systems, deadlocks are harder to detect, avoid, and prevent. There are four strategies for dealing with distributed deadlocks:
1. Ignorance: ignore the problem (this is the most common approach).
2. Detection: let deadlocks occur, detect them, and then deal with them.
3. Prevention: make deadlocks impossible.
4. Avoidance: choose resource allocation carefully so that deadlocks will not occur.
The last of these, deadlock avoidance through careful resource allocation, requires the ability to predict precisely the resources that will be needed and the times at which they will be needed; this is difficult and not practical in real systems. The first is trivially simple. We will focus on the middle two approaches.
In a conventional system, the operating system is the component that is responsible for resource allocation and is the one to detect deadlocks. Deadlocks are resolved by killing a process, which, of course, could create unhappiness for the owner of that process. In distributed systems employing a transaction model, things are a bit different: transactions are designed to withstand being aborted and, as such, it is perfectly reasonable to abort one or more transactions to break a deadlock. Hopefully, the transaction can be restarted later without creating another deadlock.
Centralized deadlock detection
Centralized deadlock detection attempts to imitate the nondistributed algorithm through a central coordinator. Each machine is responsible for maintaining a resource graph for its processes and resources. A central coordinator maintains the resource utilization graph for the entire system: this graph is the union of the individual graphs. If the coordinator detects a cycle, it kills off one process to break the deadlock.
In the non-distributed case, all the information on resource usage lives on one system and the graph may be constructed on that system. In the distributed case, the individual subgraphs have to be propagated to the central coordinator. A message can be sent each time an arc is added or deleted; if optimization is needed, a list of added or deleted arcs can be sent periodically to reduce the overall number of messages.
Here is an example. Suppose machine A has a process P0, which holds resource S and wants resource R, which is held by P1. Another machine, machine B, has a process P2, which is holding resource T and wants resource S. Both of these machines send their local graphs to the central coordinator, which maintains the union. All is well: there are no cycles and hence no deadlock. Now two events occur: process P1 releases resource R and asks machine B for resource T. Two messages are sent to the coordinator:
message 1 (from machine A): "releasing R"
message 2 (from machine B): "waiting for T"
This should cause no problems (there is no deadlock). However, if message 2 arrives first, the coordinator would construct a graph containing a cycle and detect a deadlock. Such a condition is known as false deadlock. A way to fix this is to use Lamport's algorithm to impose a global time ordering on all machines. Alternatively, if the coordinator suspects deadlock, it can send a reliable message to every machine asking whether it has any release messages. Each machine will then respond with either a release message or a negative acknowledgement to acknowledge receipt of the message.
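The coordinator's cycle check on the union graph is an ordinary depth-first search. A minimal sketch, where wait_for maps each node of the global wait-for graph to the nodes it is waiting on (the node names are illustrative):

```python
def has_cycle(wait_for):
    """DFS over the global wait-for graph; a back edge means deadlock."""
    visited, on_stack = set(), set()

    def dfs(node):
        visited.add(node)
        on_stack.add(node)
        for nxt in wait_for.get(node, ()):
            if nxt in on_stack:               # back edge: cycle found
                return True
            if nxt not in visited and dfs(nxt):
                return True
        on_stack.discard(node)
        return False

    return any(dfs(n) for n in list(wait_for) if n not in visited)

# P1 waits for P2 and P2 waits for P1: deadlock.
assert has_cycle({"P1": ["P2"], "P2": ["P1"]})
```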
Distributed deadlock detection
An algorithm for detecting deadlocks in a distributed system was proposed by Chandy, Misra, and Haas in 1983. It allows processes to request multiple resources at once (this speeds up
the growing phase). Some processes may wait for resources, either local or remote, and cross-machine arcs make looking for cycles (detecting deadlock) hard.
The algorithm works this way: when a process has to wait for a resource, a probe message is sent to the process holding the resource. The probe message contains three components: the process that just blocked, the process sending the message, and the destination. Initially, the first two components are the same. When a process receives the probe, it checks whether it itself is waiting on any resource. If so, it updates the sender and destination fields of the message and forwards it to the holder of that resource; if it is waiting on multiple resources, a message is sent to each process holding one of them. This continues as long as processes are waiting for resources. If the originator gets a probe back and sees its own process number in the blocked field of the message, it knows that a cycle has been traversed and a deadlock exists. In this case some process (transaction) will have to die: the sender may choose to commit suicide, or a ring election algorithm may be used to select an alternate victim (e.g., the youngest process, the oldest process, ...). A sketch of the probe mechanism follows below.
Distributed deadlock prevention
An alternative to detecting deadlocks is to design the system so that deadlock is impossible. One way of accomplishing this is to give every transaction a global timestamp (such that no two transactions get the same timestamp). When one process is about to block waiting for a resource that another process is using, a check is made to see which of the two has the older timestamp, and priority is given to the older process. If a younger process is using the resource, then the older process (which wants the resource) waits. If an older process is holding the resource, the younger process (which wants the resource) kills itself. This forces the resource utilization graph to be directed from older to younger processes, making cycles impossible. This scheme is known as the wait-die algorithm.
An alternative method of avoiding resource request cycles is to have an old process preempt (kill) a younger process that holds a resource, while a younger process that wants a resource held by an older one waits until the old process is done. In this case the graph flows from young to old and cycles are again impossible. This variant is called the wound-wait algorithm.
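The probe mechanics can be made concrete with a small sketch. This is illustrative only: the waiting_on map and process names are invented, and the visited set stands in for the fact that real processes remember probes they have already forwarded:

# Probe message = (initiator, sender, destination); a probe that finds
# its way back to the initiator proves a cycle, i.e. a deadlock.
def probe(initiator, dest, waiting_on, visited=None):
    visited = set() if visited is None else visited
    if dest == initiator:
        return True                    # probe returned: deadlock
    if dest in visited:
        return False                   # already chased through here
    visited.add(dest)
    # A waiting destination forwards the probe to every process holding
    # a resource it wants; a running process simply absorbs it.
    return any(probe(initiator, holder, waiting_on, visited)
               for holder in waiting_on.get(dest, ()))

# P1 waits on P2, P2 waits on P3, P3 waits on P1: a cycle.
waiting_on = {"P1": ["P2"], "P2": ["P3"], "P3": ["P1"]}
print(probe("P1", "P2", waiting_on))   # True: some transaction must die

UNIT-3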
Processes and Processors in distributed systems:
In most traditional operating systems, each process has an address space and a single thread of control. It is often desirable to have multiple threads of control sharing one address space but running in quasi-parallel.
Introduction to threads
A thread is a lightweight process. The analogy: a thread is to a process as a process is to a machine.
• Each thread runs strictly sequentially and has its own program counter and stack to keep track of where it is.
• Threads share the CPU just as processes do: first one thread runs, then another.
• Threads can create child threads and can block waiting for system calls to complete.
• All threads have exactly the same address space. They share the code section, the data section, and OS resources (open files and signals). They share the same global variables, and one thread can read, write, or even completely wipe out another thread's stack.
• Threads can be in any one of several states: running, blocked, ready, or terminated.
• There is no protection between threads, because (1) it is not possible, and (2) it should not be necessary: a process is always owned by a single user, who has created multiple threads so that they can cooperate, not fight.
Advantages of using threads:
1. They are useful for clients: if a client wants a file to be replicated on multiple servers, it can have one thread talk to each server.
2. They can handle signals, such as interrupts from the keyboard: instead of letting the signal interrupt the process, one thread is dedicated full time to waiting for signals.
3. Producer-consumer problems are easier to implement using threads, because threads can share a common buffer (see the sketch after this list).
4. It is possible for threads in a single address space to run in parallel, on different CPUs.
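As a minimal sketch of point 3, the following uses Python's standard threading and queue modules; the buffer size, item values, and the None sentinel are arbitrary choices for the example:

import threading, queue

buffer = queue.Queue(maxsize=8)          # the shared buffer

def producer():
    for item in range(5):
        buffer.put(item)                 # blocks if the buffer is full
    buffer.put(None)                     # sentinel: no more items

def consumer():
    while (item := buffer.get()) is not None:
        print("consumed", item)

threads = [threading.Thread(target=producer),
           threading.Thread(target=consumer)]
for t in threads: t.start()
for t in threads: t.join()

Because both threads live in one address space, no message passing or shared-memory setup is needed; the queue object is simply a global variable visible to both.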
Design Issues for Thread Packages
A set of primitives (e.g., library calls) available to the user relating to threads is called a thread package.
Static threads: the choice of how many threads there will be is made when the program is written or compiled, and each thread is allocated a fixed stack. This approach is simple but inflexible.
Dynamic threads: threads can be created and destroyed on the fly during execution.
System Models:
The workstation model: the system consists of workstations scattered throughout a building or campus and connected by a high-speed LAN.
• Workstations that have local disks are called diskful workstations; otherwise they are diskless workstations.
• If the workstations are diskless, the file system must be implemented by one or more remote file servers.
• Diskless workstations are cheaper.
• It is easier to install a new release of a program on a few servers than on hundreds of machines; backup and hardware maintenance are also simpler.
• A diskless workstation has no disk, and hence no fan and no disk noise.
• Diskless workstations provide symmetry and flexibility: you can use any machine and access your files, because all the files are on the server.
• Advantages: low cost, easy hardware and software maintenance, symmetry and flexibility. Disadvantages: heavy network usage; file servers may become bottlenecks.
The disks in diskful workstations are used in one of four ways:
1. Paging and temporary files (e.g., temporary files generated by compiler passes).
Advantage: reduces the network load compared with the diskless case. Disadvantage: higher cost due to the large number of disks needed.
2. Paging, temporary files, and system binaries (binary executable programs such as compilers, text editors, and electronic mail handlers). Advantage: reduces the network load even more. Disadvantage: higher cost, plus the additional complexity of updating the binaries.
3. Paging, temporary files, system binaries, and file caching (download the file from the server and cache it on the local disk; modifications can be made locally and written back; the problem is cache coherence). Advantage: still lower network load, and reduced load on the file servers as well. Disadvantage: higher cost and cache consistency problems.
4. A complete local file system (low network traffic, but sharing is difficult). Advantage: hardly any network load; eliminates the need for file servers. Disadvantage: loss of transparency.
The processor pool model
A processor pool is a rack full of CPUs in the machine room, which can be dynamically allocated to users on demand. Why a processor pool? Model each processor as a queueing system with input (arrival) rate v and processing rate u; its mean response time is T = 1/(u - v). If there are n such processors, each with input rate v and processing rate u, and we pool them into a single system, the combined input rate is nv and the combined processing rate is nu, so the mean response time becomes T' = 1/(nu - nv) = T/n.
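A short worked example, with assumed numbers: let each of n = 10 processors serve u = 10 jobs/sec with jobs arriving at v = 8 jobs/sec. Kept separate, each machine gives T = 1/(10 - 8) = 0.5 sec; pooled, T' = 1/(100 - 80) = 0.05 sec. The same hardware answers n = 10 times faster on average, which is the argument for the pool.
A hybrid model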
A possible compromise is to provide each user with a personal workstation and to have a processor pool in addition. With the hybrid model, even if you cannot get any processor from the processor pool, you at least have your workstation to do the work on.
Processor Allocation: determining which process is assigned to which processor, also called load distribution. There are two categories:
Static load distribution: non-migratory; once allocated, a process cannot move, no matter how overloaded its machine becomes.
Dynamic load distribution: migratory; a process can move even after execution has started, but the algorithms are more complex.
The goals of allocation:
Maximize CPU utilization.
Minimize the mean response time, or minimize the response ratio. The response ratio is the amount of time it takes to run a process on some machine, divided by how long it would take on some unloaded benchmark processor; e.g., a 1-sec job that takes 5 sec has a ratio of 5/1.
Design issues for processor allocation algorithms:
• Deterministic versus heuristic algorithms
• Centralized versus distributed algorithms
• Optimal versus suboptimal algorithms
• Local versus global algorithms
• Sender-initiated versus receiver-initiated algorithms
How do we measure whether a processor is overloaded or underloaded?
1. Count the processes on the machine? Not accurate, because even when the machine is idle some daemons are running.
2. Count only the running or ready-to-run processes? Not accurate either, because some daemons just wake up, check whether there is anything to run, and go back to sleep; that puts only a small load on the system.
3. Measure the fraction of time the CPU is busy, using timer interrupts? Still not fully accurate, because a busy CPU sometimes disables interrupts.
Scheduling in Distributed Systems:
Scheduling parallel applications in a distributed environment, such as a cluster of workstations, remains an important and unsolved problem. One of the main research issues is how to effectively exploit idle resources while timesharing the system fairly among the processes. Local scheduling, where each workstation independently schedules its own processes, is an attractive time-sharing option because of its ease of construction, scalability, fault tolerance, and so on. Meanwhile, coordinated scheduling of parallel jobs across the nodes of a multiprocessor (coscheduling) is also indispensable in a distributed system: without coscheduling, the processes constituting a parallel job may suffer high communication latencies because of processor thrashing. With coordinated scheduling across cooperating processes, each local scheduler can make independent decisions that tend to schedule the processes of a parallel application in a coordinated manner across processors, fully exploiting the computing resources of the distributed system.
Overview of the Scheduling Mechanism
Before going into these approaches, let us see how a distributed system runs a job across the whole system, and what role the scheduler plays. In general, job scheduling is composed of at least two inter-dependent steps: the allocation of processes to workstations (space-sharing) and the scheduling of those processes over time (time-sharing); several optional complementary steps can further improve performance. When a job is submitted to the system, job placement is done first, i.e., deciding which workstations will run the job cooperatively (space-sharing). Along with the job submission, a description of the job's attributes is submitted to specify its resource requirements, such as memory size, expected CPU time, deadline, etc. Meanwhile, the system always maintains an information table, either distributed or centralized, recording the current resource status of each workstation, e.g., CPU load, free memory size, etc. A matchmaking framework then finds the most suitable set of workstations to meet the requirements of the job, as sketched below.
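A minimal matchmaking sketch, in which the job description, the resource table, and the placement policy (least-loaded first) are all invented for illustration:

workstations = {                       # the system's resource table
    "ws1": {"free_mem": 512,  "cpu_load": 0.9},
    "ws2": {"free_mem": 2048, "cpu_load": 0.2},
    "ws3": {"free_mem": 1024, "cpu_load": 0.4},
}

job = {"mem_required": 768, "processes": 2}   # submitted job attributes

def match(job, workstations):
    # Space-sharing placement: keep only workstations that satisfy the
    # job's memory requirement, then prefer the least-loaded ones.
    candidates = [name for name, ws in workstations.items()
                  if ws["free_mem"] >= job["mem_required"]]
    candidates.sort(key=lambda n: workstations[n]["cpu_load"])
    return candidates[:job["processes"]]

print(match(job, workstations))        # ['ws2', 'ws3']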
The job is then decomposed into smaller components, i.e., processes, which are distributed to the assigned workstations. On each individual workstation, the local scheduler allocates timeslices to its process based on some policy, so as to meet scheduling requirements such as response time and fairness. The decomposed components may require synchronization among themselves; for example, if process A requires input from process B to proceed, then A blocks until the input from B arrives. Coordinated scheduling is therefore needed to minimize the time spent waiting for messages from other processes.
Besides the mechanisms mentioned above, process migration is introduced to improve load sharing by allowing a process to run on the most suitable workstation. During the lifetime of a job the resource status of the system keeps changing, and it may become preferable for a process to run on another, more suitable workstation. For example, a workstation may become lightly loaded when it finishes its assigned job, so processes on a heavily loaded workstation can be migrated onto it, which may let the job finish earlier and improve the overall performance of the system.
Space-sharing approaches achieve a shorter interactive response time but probably also a smaller throughput; time-sharing approaches, on the contrary, achieve a higher throughput but lengthen the response time. A good approach should therefore be a mixed one, utilizing both space-sharing and time-sharing, complemented by coscheduling and process migration. Here we will only discuss local scheduling and coscheduling, i.e., how to get the best performance after the set of workstations has been assigned to a job.
Properties of a Good Scheduler
Much research is being conducted to develop good scheduling approaches for a set of distributed hosts. The activities vary widely in a number of dimensions, e.g., support for heterogeneous resources, placement objective function(s), scalability, coscheduling methods, and assumptions about system configuration. Based on the experience accumulated in these activities, it is believed that a good scheduler should have the following properties:
General purpose: a scheduling approach should make few assumptions about, and place few restrictions on, the types of applications that can be executed. Interactive jobs, distributed and parallel applications, and non-interactive batch jobs should all be supported with good performance. This property is straightforward to state but to some extent difficult to achieve, because different kinds of jobs have different attributes and their requirements may conflict. For example, a real-time job requiring short response times prefers space-sharing scheduling, while a non-interactive batch job requiring high throughput may prefer time-sharing scheduling. To achieve generality, a tradeoff may have to be made.
Efficiency: this has two meanings. First, the scheduler should improve the performance of the scheduled jobs as much as possible; second, the scheduling should incur reasonably low overhead, so that it does not negate those benefits.
Fairness: sharing resources among users raises new challenges in guaranteeing that each user obtains his or her fair share when demand is heavy. In a distributed system this problem can be exacerbated, to the point where one user consumes the entire system.
Dynamic: the algorithms employed to decide where to process a task should respond to load changes and exploit the full extent of the resources available.
Transparency: the behavior and result of a task's execution should not be affected by the host(s) on which it executes. In particular, there should be no difference between local and remote execution. No user effort should be required in deciding where to execute a task or in initiating remote execution; a user should not even be aware of remote processing, except perhaps through better performance. Furthermore, applications should not have to be changed greatly: it is undesirable to have to modify application programs in order to execute them in the system.
UNIT-4
Distributed file systems:
• A Distributed File System (DFS) is simply a classical model of a file system (as discussed before) distributed across multiple machines. The purpose is to promote the sharing of dispersed files.
• This is an area of active research interest today.
• The resources on a particular machine are local to itself; resources on other machines are remote.
• A file system provides a service for clients. The server interface is the normal set of file operations: create, read, etc., on files.
Clients, servers, and storage are dispersed across machines. Configuration and implementation may vary:
a) Servers may run on dedicated machines, or
b) servers and clients can be on the same machines;
c) the OS itself can be distributed (with the file system a part of that distribution), or
d) a distribution layer can be interposed between a conventional OS and the file system.
Clients should view a DFS the same way they would a centralized FS; the distribution is hidden at a lower level. Performance is concerned with throughput and response time.
DFS implementation
Major issues common to most distributed file systems are:
* User interface
* Naming
* Interprocess communication
* File access mechanism
* File consistency
* Synchronization
* Locking
* Atomic transactions
* Error recovery
User interface:
The DFS presents users with two command interfaces: the shell interface and the database interface.
The shell interface: The shell interface is a Unix-like command interface that allows manipulation of files (unrestricted streams of bytes) within the hierarchical logical file structure. In general terms, a command is expressed as
command arg1 [arg2, arg3 ... argn]
Two modes of operation are supported in this interface: default mode and explicit mode. The default mode is the normal mode of operation, in which the system makes use of a version number
associated with each file to access the most up-to-date file image. This mode provides complete location transparency, relieving the user of having to remember where files are located. To afford more flexibility over file manipulation, the explicit mode of operation allows users to override this version number mechanism and specify, on the command line, the site at which a file image is located. Explicit mode is specified by:
command arg1@site1 [arg2@site2, arg3@site3 ...]
The system can be queried at any time to find out where the file images are located. This mode is useful if the file replication mechanism is to be avoided. Files created in default mode are created at more than one randomly selected site, whereas in explicit mode the file is created only at the location specified by the user. The explicit mode can also be used to copy images from one location to another. This is useful in the event that file images become inconsistent because one or more sites crashed while the file was being updated. Under normal circumstances the system attempts to make the images consistent during the next file access, but this can also be done manually using the explicit mode of operation.
The database interface: The database interface is also a command interface, much like the shell interface, and allows access to special files called database files. Database files created in this interface are typed DFS files, with all access taking place one record at a time. Commands to create, read, write, remove, and rewind a database file are supported. This interface has no concept of current working directories or of different modes of operation: default mode is the only mode, and all database files are created in the root directory. In addition, an atomic transaction mechanism has been implemented that allows a series of reads and writes to be treated as one atomic action. Transactions are bracketed by 'begin trans' and 'end trans' or 'abort' system calls. All writes to the database are made to a temporary file and are written to the database only in the event of an 'end trans' call; they are discarded otherwise. The transaction mechanism is enforced in this interface, and it is considered an error if the 'begin trans' system call (the 'bt' command) does not precede each database session. The transaction has to be restarted if more operations on the database are to follow.
File name mapping:
An identifier, or name, is a string of symbols, usually bits or characters, used to designate or refer to an object. Names are used for a wide variety of purposes, such as referencing, locating, allocating, error controlling, synchronizing, and sharing objects or resources, and they exist in different forms at all levels of any system [WATS 85]. If an identified object is to be manipulated or accessed, the identifier must be 'mapped', using an appropriate mapping function and context, into another identifier and ultimately into the object. A name service provides the convenience of a runtime mapping from string names to data. An important use of such a service is to determine address information, for example mapping a host name to an IP address. Performing this kind of address lookup at run time enables the name server to provide a level of indirection that is crucial to the efficient management of distributed systems; without such a runtime mapping, changes to the system topology would require programs with hardwired addresses to be recompiled, severely restricting the scope for systems to evolve. These name-to-data mappings may be provided by a logically centralized service, e.g., Clearinghouse [OPEN 83], or by a replicated database, e.g., Yellow Pages [WALS 85]. The DFS follows the second approach for file name mapping.
To attain location transparency, one must avoid having the resource location be part of the resource name. This is done by dynamically mapping global logical user-defined file names to multiple physical file names. The mapping of logical to physical file names is maintained in a regular replicated DFS file called the name server. Since this file contains system information, it is typed and contains only fixed-length records (one record for each file created within the system). Each record in the name server maintains information about a DFS file: the file type (whether it is a file or a directory), the number of images, the file's logical name, the locations of the file images, and the version numbers of the file images. Since all access to any file must go through the CSS, the file descriptors obtained at the US and the CSS after opening a file uniquely specify the file's global low-level name and are used for most of the low-level communication about open files. MAXCOPY images of the name server are created at system boot time at potential CSS locations. Since access to the name server is only through the CSS, each potential CSS knows the locations of the name server images. Reads from the name server may take place from any image with the highest version number. However, since the name server is heavily used, writes are made to the local image when possible, avoiding network traffic and speeding up access. An error during a write results in failure of the system call, and no attempt is made to contact any other image. This enables the CSS to always store the most recently updated image and to keep it consistent.
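The shape of a name server record, and the logical-to-physical lookup it enables, can be sketched as follows. The field names and the lookup policy are assumptions based on the description above, not the actual record layout:

from dataclasses import dataclass

@dataclass
class NameServerRecord:
    logical_name: str
    file_type: str                  # "file" or "directory"
    image_sites: list               # locations of the file images
    versions: list                  # version number of each image

name_server = {}                    # logical name -> record

def lookup(logical_name):
    # Map a global logical name to its physical images, returning only
    # the sites holding the highest version number (the freshest copies).
    rec = name_server[logical_name]
    newest = max(rec.versions)
    return [site for site, v in zip(rec.image_sites, rec.versions)
            if v == newest]

name_server["/docs/report"] = NameServerRecord(
    "/docs/report", "file", ["siteA", "siteB", "siteC"], [3, 3, 2])
print(lookup("/docs/report"))       # ['siteA', 'siteB']: siteC is stale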
Other operations on the name server include deleting records. In addition to the name server, the CSS maintains a cache of the most recently read record in main memory, for fast access and for the event that the name server becomes inaccessible after a file has been opened. This cache takes the form of a shared memory segment, since it will be referenced by future calls. On an open, once the name server has been read and the record cached, no further services of the name server are required until the file is closed; the calls that refer to this cache are the ones associated with locking and transactions. On a close, the cache is deallocated, the version number is incremented (if the file was opened for writing), and the record is written back to the database.
Since the name server is a regular replicated file, it may fall out of consistency due to site failure. Consistency of the name server is maintained by a 'watchdog' program running in the background, which checks the version vector associated with the name server at regular intervals. The first record in the name server database stores version number information about the remaining name server images, and this record is read whenever the name server must be made consistent. If the versions are found to be inconsistent, a copy of the local file image is sent out to the image sites with lower version numbers and consistency is restored.
Interprocess communication:
Machines converse in the DFS by sending messages using an underlying protocol for reliable transmission. The DFS makes use of the Transport Level Interface (TLI) [AT&T 7] mechanism of the Transmission Control Protocol / Internet Protocol (TCP/IP) suite of protocols [POST 81a], [POST 81b], implemented on the AT&T 3B1 machines, to allow processes on different machines to communicate with each other. The TLI layer provides the basic service of reliable end-to-end data transfer needed by applications. The kernel structure consists of three parts: the transport interface layer, the protocol layer, and the device layer, as shown in Figure 5.3.
[Figure 5.3: The transport level interface]
The transport interface layer provides the system call interface between application programs and the lower layers. The protocol layer contains the protocol modules used for communication, and the device layer contains the device drivers that control the network devices. All network messages in the system require an acknowledgement response message from the serving site. The response message, in addition to telling the requesting site that the call was successful, also returns any additional information produced as a result of the call. Moreover, extensive file transfer is carried out by the system in keeping the name server and user files consistent. The connection mode of communication, being circuit oriented, is particularly attractive for this kind of data stream interaction, enabling data to be transmitted over the
connection in a reliable and sequenced manner. Hence the connection-mode service of the transport level interface was chosen for the DFS, so that higher-level system routines are assured of reliable communication.
Every host that wishes to offer service to users (remote or local) has a process server (PS) through which all services must be requested [TANE 81]. Whenever the process server is idle, it listens on a well-known address. Potential users of any service begin by establishing a connection with the process server. Once the connection has been established, the user sends the PS a message telling it which program it wishes to run (either the CSS or the SS). The process server then chooses an idle address, spawns a new process (passing it the parameters of the call), terminates the connection, and goes back to listening on its well-known address. The new process executes the CSS or SS program, executes the appropriate system call, and sends its reply back to the calling process.
Addresses in the Internet domain are composed of two parts: the host network address (identifying the machine) and a port number, a sixteen-bit integer that allows multiple simultaneous conversations to occur on the same machine-to-machine link. Together these two parts are commonly referred to as a socket, and they uniquely identify an end point for communication. In the current prototype implementation, all process servers are assigned one unique port number, which is hardcoded in the program; in an operational implementation, the requesting site (US) would look up the port number of the process server in a (Unix) system database, usually found in /etc/services. All requesting sites can also read a (DFS) system file to find out the current location of the CSS. This string name is used to query the Unix system, yielding the network address of the process server. The two parts of the Internet address are then known, which enables the requesting site to construct the complete address of the process server and establish a connection.
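The process-server loop can be sketched with ordinary TCP sockets. The port number and the request handling are illustrative assumptions, and a thread stands in for the new process the DFS actually spawns:

import socket, threading

WELL_KNOWN_PORT = 5000               # hardcoded, as in the prototype

def handle(conn):
    request = conn.recv(512)         # which program to run (CSS or SS)
    conn.sendall(b"ack: " + request) # stand-in for executing the call
    conn.close()

def process_server():
    ps = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    ps.bind(("", WELL_KNOWN_PORT))
    ps.listen()
    while True:                      # back to the well-known address
        conn, addr = ps.accept()
        # The DFS forks a new process here; a thread keeps the sketch short.
        threading.Thread(target=handle, args=(conn,)).start()

# process_server()  # runs forever; call from a main program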
Communication sequence
To obtain service, the requesting process packages up a request message on behalf of the user, sends it to the process server, and then blocks, awaiting the outcome, which is communicated in the opposite direction by means of a response message. This one-to-one request-response interchange is the fundamental access mechanism. All send and receive messages are of fixed length, with each message containing control information that specifies the type of message and the related system call parameters. These message formats [BACH 86] are shown in the figure.
[Figure: Message formats]
The request message consists of a token that specifies the logical site for which the message is destined (CSS or SS), a token that specifies the system call the process server should execute on behalf of the client, the parameters of the system call, and environmental data such as the user id, current directory, user host's address, etc. The remainder of the message contains space for 512 bytes of user data. The responding site waits for requests from the requesting site. When it receives a request, it decodes the message, determines which system call it should invoke, executes the system call, and encodes the results of the call into a response message. The response message contains the return values to be passed to the calling process as a result of the system call, an error code that tells the requesting site whether the call completed successfully, a signal number that passes the requesting site additional information on the call, and a fixed-length (512-byte) data array. The requesting site then decodes the response and returns the results to the user.
File access:
All files which are to be read or written, locked or unlocked, must be opened before any operations can proceed. Since there are three logical sites taking part during file access, state information on the call needs to be maintained at all three sites. In fact this is one of the inherent problems associated with the integrated model of distributed systems: error recovery becomes all the more difficult, since this state information needs to be reconstructed in the event of site failure. The data structures at the US and CSS keeping track of open files are very similar to those used by Unix. In Unix, four levels of data structures are used to record information on open files. Each
application process refers to open files using a small non-negative integer called a file descriptor. The file descriptor is used as an index into the first data structure, the user file descriptor table. The user file descriptor table maps process file descriptors into entries in a global file table, which is the second level. The global file table holds the open mode (read or write) and a pointer to a table of inodes, which is the focus of all activity relating to the file; multiple file table entries can refer to the same inode. The inode stores all file attributes, including page pointers, file protection modes, directory link counts, etc. The final data structure is the buffer cache, which stores modified pages. The file system data structures in Unix are shown in the figure below.
The user file descriptor and global file tables are implemented similarly in the DFS. At the US the file table points to a table of 'unodes', while at the CSS the file table points to a table of 'cnodes'; both tables store information relevant to their logical site. The file descriptors obtained at the US and the CSS uniquely specify the file and are used as the low-level name of the open file for most internal references to it.
To read a file, users can issue the 'open' command at the shell interface level. The shell packages up the request and sends it to the US. The US allocates a slot in its file descriptor table, its file table, and its unode table, thus obtaining a local file descriptor (lfd). A message is then sent to the CSS, and the US blocks until it receives a response from the CSS. The CSS location is obtained by examining a system file (csslist) which contains a list of potential CSS locations; the first entry is the current CSS. If unode information is already available at the US (because the file was recently opened), the last modification time of the file is included in the message to the CSS. If the file has not been modified since it was last accessed, the cached copy of the file is used for subsequent reads.
[Figure: Data structures in Unix]
Just as at the US, the CSS first allocates slots in its file descriptor table, file table, and cnode table, thus obtaining another file descriptor for the file (cfd). It then reads the name server and obtains the locations of the file images. The version vector associated with the file is examined, and if any inconsistency is detected an attempt is made to make the images consistent. A message is then sent to each image site requesting the SS to open the file, including the pair in its message. The SS maps the logical file name to the pair, checks whether the file is locked in a conflicting mode by any other process, and replies appropriately to the CSS. On receipt of replies from all the image sites, the CSS completes the information in its cnode table and returns a list of successfully opened image sites to the US. The US in turn completes the information in its unode and returns the local file descriptor to the user; this file descriptor is used in all future references to the file. The open protocol is sketched below.
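A minimal sketch of the US -> CSS -> SS open exchange; the function names, the lfd value, and the lock-check detail are all invented for illustration, and only the shape of the protocol follows the text:

def ss_open(site, logical_name, locked_files):
    # SS: refuse if some other process holds a conflicting lock here.
    return logical_name not in locked_files.get(site, set())

def css_open(logical_name, name_server, locked_files):
    # CSS: read the name server, then ask every image site to open.
    sites = name_server[logical_name]            # image locations
    return [s for s in sites if ss_open(s, logical_name, locked_files)]

def us_open(logical_name, name_server, locked_files):
    # US: allocate a local file descriptor, then contact the CSS.
    opened_sites = css_open(logical_name, name_server, locked_files)
    if not opened_sites:
        raise OSError("open failed: no image site available")
    return {"lfd": 3, "sites": opened_sites}     # lfd value illustrative

name_server = {"/docs/report": ["siteA", "siteB"]}
# siteB holds a conflicting lock, so only siteA is opened:
print(us_open("/docs/report", name_server, {"siteB": {"/docs/report"}}))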
After the file is opened, the user can issue read and write commands. These are directed to any one of the opened images; reads and writes refer to the site vector to obtain site locations. The read protocol is shown in Figure 5.8. In the event of a site crashing while the US is reading or writing, the US simply locates an alternate SS (provided one exists) and restarts the operation. On a write, the SS that was successfully written to is marked as the primary site, and all subsequent reads are made only from the primary site. After one of the sites has been designated as the primary, the system call fails if the primary fails.
Once the user is done issuing reads and writes, the 'close' command is sent to the primary site.
The close call deallocates the data structures at the SS, and the request is then sent to the CSS. The CSS deallocates its data structures and forwards the request to the remaining file images. The 'create' command works very much like the open call, except that newly created files are made temporary by attaching a ".tmp" extension to their physical names. Temporary files left behind by processes that create files and then crash can be cleaned up by a garbage collection routine as part of the system administrator's duties. All writes to a newly created file are committed on the close call.
File consistency:
Since there are several images of the same file, it is important that these images remain consistent across site failures. Inconsistencies among the images arise when one or more image sites crash after the file has been opened for update, or during the update; it is up to the system to keep the images consistent. For the purposes of the consistency algorithm, a file may be in one of two states: if the CSS accessing a particular (possibly replicated) file set is in contact with at least a majority of the file images, the files are said to be in the STABLE state; when the CSS is in contact with less than a majority of the images, the images it cannot contact are said to be in the RECLUSE state.
File image consistency in the DFS is maintained by a simple version number mechanism consisting of an <original version number, current version number> tuple. The original version number keeps track of partitioned updates. It is initialized to zero at file creation time, and a value of zero indicates that either no updating has taken place while the file was in the RECLUSE state, or any partitioned updates have been reconciled or resolved with a majority of the file copies. In the event of a partition, if the original number is zero, it is set equal to the current version number. The current version number of every DFS file is initialized to one at file creation time and is incremented every time the file is updated. Thus each file image has its own version pair, with the name server keeping track of the version vector of each file. On each file access (open), the name server is first read and the version vector of the file is obtained; the system then opens the images with the highest version number, presenting the user with the most up-to-date version of the file.
The consistency algorithm is illustrated by an example. Assume that a file is triplicated, so that any two copies form a majority, and denote the copies by A, B, and C. After one update, the version vector associated with the three images is A <0,2>, B <0,2>, and C <0,2>. Now assume C is detached and goes into the RECLUSE state. Two cases arise:
i. If A is updated, this results in a version vector of A <0,3>, B <0,3>, and C <0,2>. This presents no conflict, and C can be made consistent when it comes back online using the file images at A or B.
ii. If both A and C are updated independently of each other, this results in a version vector of A <0,3>, B <0,3>, and C <2,3>. Since the original version number of C is greater than zero but less than the current version number of A and B, both partitions have been updated and a conflict arises. Such conflicts must then be resolved by the user.
The original version number is unused in the current version of the implementation, since it is assumed that the network does not partition into independently operating subnetworks.
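The conflict rule in case (ii) fits in a few lines. A minimal sketch, in which the tuples and the function name are illustrative:

def partitioned_conflict(recluse, majority):
    # recluse/majority are (original, current) pairs. A conflict exists
    # when the detached copy was updated in isolation (original > 0)
    # but has missed updates made by the majority (original < current
    # version of the majority).
    orig, _ = recluse
    _, maj_current = majority
    return 0 < orig < maj_current

print(partitioned_conflict((0, 2), (0, 3)))   # False: case (i), reconcilable
print(partitioned_conflict((2, 3), (0, 3)))   # True: case (ii), conflict

Resolution of inconsistency:
An attempt is made to resolve any inconsistency in a user's files at the time of access. When the file is opened for reading or writing, the current version number associated with the file is checked. If any inconsistency exists, a copy of the image with the highest version number is sent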
to the remaining images. If any of the images are unreachable, the system will attempt to make the files consistent on the next access. The same mechanism is used to keep directory files consistent, since they are treated in the same way as regular files; directory files are made consistent during the 'ls' command. The name server is also a replicated DFS file and likewise has to be kept consistent. Since it is a system file, and since a consistent name server is crucial to the correct working of the system in the event of CSS failure, it is made consistent in the background, independent of user access: the 'watchdog' process checks the version vector associated with the name server at regular intervals and sends a copy of the highest-version-numbered image to the other sites.
Synchronization:
Synchronization in the face of multi-user operation is handled in the DFS at two levels: the system level and the application level.
System level synchronization: Each US executes only one command at a time, so the issue of synchronization does not arise there. There could, however, be more than one user attempting to get service from the US at the same time. The shell and database interfaces communicate with the US through a named pipe (FIFO), and the semantics of the Unix pipe mechanism take care of synchronization here by queuing simultaneous requests at the writing end of the pipe. The shell/dbms attempts to open the pipe for writing a finite number of times in non-blocking mode; if the open fails, either because the US is busy servicing other client requests or because the pipe is full, the call returns with an error. Thus these interfaces never block indefinitely waiting for service. Once a write has been made to the pipe, however, the calling process blocks on a read waiting for the US to return; the shell/dbms includes its return address in its message to the US so that the US knows to whom it must send its reply. An alarm signal (currently set at 120 secs) prevents the calling process from waiting indefinitely in the event of SS or CSS crashes.
Since more than one user could potentially be accessing the CSS at the same time, the process server forks off a CSS process for each request it receives and goes back to listening on its well-known address. Since some of the system calls need to refer to state information produced by a previous call, all data structures take the form of shared memory segments, protected by semaphores. The semaphore protecting a segment must be obtained before the data structure can be accessed; processes block on a queue until the semaphore can be obtained, again relying on an alarm signal to prevent deadlock or indefinite waiting caused by a process acquiring a semaphore and then crashing. Once a slot has been obtained in the data structure, it is marked as 'used' until the 'close' call is issued and the data structure is deallocated, so subsequent accesses to the data structure by different processes do not disturb the contents of the slot while it is in use. At the SS, state information is substantially less than at the CSS,
since the SS only keeps track of objects that are local to it, and hence its data is simply written to a file. Here also, since there could be multiple SS processes running concurrently, the data structures are protected by semaphores, as at the CSS.
Application level synchronization: A synchronization mechanism is also required by user-level processes, since more than one user could be accessing a file or database at the same time. The DFS provides a distributed implementation of file locking, with file-level granularity, to allow concurrent access to files. In standard Unix, multiple processes are allowed to have the same file open concurrently. These processes can issue read and write system calls, with the system only guaranteeing that each successive operation sees the effects of the ones that precede it; in other words, a read on a file could see intermediate data if another process is writing to the same file at the same time. This lack of synchronization is implemented in the DFS as the default locking policy and is called Unix mode. Users can issue 'shared' or 'exclusive' lock commands on files after opening them. The semantics of the locking mechanism allow multiple readers or a single writer. Although the locking may appear to be advisory (i.e., checked only during the lock system call), it is in reality checked on all reads and writes, to prevent processes from accessing a file when some other process has set an exclusive lock on it or when a previously set lock has timed out. While locking at the shell interface is left to the user, it is mandatory at the dbms interface level, with the system automatically locking files when they are opened, transparently to the user. In the database environment, the atomic transaction mechanism built on top of this concurrency control mechanism (locking) provides failure atomicity for database files.
Locking:
The DFS provides a distributed implementation of file locking with file-level granularity to allow concurrent access to files [LEVI 87]. Locking resolves contention among multiple processes wishing to access the same file at the same time. The mechanism that provides concurrency control in the DFS is referred to as the distributed lock manager. The lock manager offers the means by which a process may obtain access to and control over an object so as to block other processes that wish to use the object in an incompatible way. An instance of the lock manager exists on every network node, and each lock manager controls concurrency for the objects local to it. Each lock manager keeps its own lock database recording all locks on local objects, with the CSS keeping track of locks on replicated object images. The lock manager facility supports two locking modes: SHARED locks (multiple readers) and EXCLUSIVE locks (single writer). Other operations include UNLOCK and operations for inquiring about a particular lock on an object currently locked at a particular node. Special commands at the shell interface allow users to set shared or exclusive locks on files; in the database interface, locking is automatic and is the platform on which the transaction mechanism is built.
The lock synchronization rules [WALK 83B] are shown below. The lock manager enforces compatible use of an object by refusing to grant conflicting lock requests. Locks are either granted immediately or refused; processes never wait for locks to become available, so there is no possibility of deadlock (though indefinite postponement is possible). While this may be suited to human-timescale locking uses such as file editing, it may not be ideally suited to database types of transactions.

Lock synchronization rules (access granted to Process B, given Process A's lock):

                          Process A
Process B        Unix mode     Shared lock   Exclusive lock
Unix mode        read/write    read          no access
Shared lock      read          read          no access
Exclusive lock   no access     no access     no access

To acquire a lock on a previously opened file, the US sends a lock request to the primary SS. The SS consults its lock manager as to whether the lock can be granted. If the file is already locked in a conflicting mode by another process, it denies the lock request and replies to the US. Otherwise it sets the lock on the local object and forwards the lock request to the CSS. The CSS obtains the locations of the remaining images by reading the cached copy of the name server record associated with the file; it then sends the request to the remaining sites and awaits their responses. If all sites agree to grant the lock request, the CSS grants the request and replies appropriately to the SS. If, however, any of the sites refuses the request, the lock request is denied and a response is sent back to the SS; the SS then unlocks the file it had previously locked and replies to the US. The figure shows the locking protocol structure.
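The compatibility table maps directly onto a grant/refuse check. A minimal sketch (the mode names and the COMPAT table are transcriptions of the rules above; the function name is invented):

NO_ACCESS, READ, READ_WRITE = "no access", "read", "read/write"

COMPAT = {   # (lock held by A, lock requested by B) -> what B may do
    ("unix", "unix"): READ_WRITE,
    ("unix", "shared"): READ,          ("shared", "unix"): READ,
    ("shared", "shared"): READ,
    ("unix", "exclusive"): NO_ACCESS,  ("exclusive", "unix"): NO_ACCESS,
    ("shared", "exclusive"): NO_ACCESS, ("exclusive", "shared"): NO_ACCESS,
    ("exclusive", "exclusive"): NO_ACCESS,
}

def grant(existing, requested):
    # Grant immediately or refuse: requests never wait, so no deadlock
    # is possible (indefinite postponement is).
    return COMPAT[(existing, requested)] != NO_ACCESS

print(grant("shared", "shared"))      # True: multiple readers allowed
print(grant("shared", "exclusive"))   # False: single-writer rule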
Each SS that stores a copy of the file keeps track of open/locked files by means of a 'lock list' table. Each record in the table holds the file pointer (which keeps track of the number of bytes read or written), the physical file name, and an array of 'lock nodes'. An array of lock nodes is needed because multiple processes may have the same file locked at the same time. Each process that wishes to lock the file has its uid added to a lock node, provided there is no conflict in the lock mode; when the process unlocks the file, its uid is removed from the lock node. An existing lock can be upgraded from a shared lock to an exclusive lock, provided the user's uid matches the one holding the shared lock and no other processes have shared locks on the file. A file could also be permanently locked by a process that locks the file and then crashes. To
prevent this from happening, all locks are timestamped, and if a lock's timer expires after a given interval, the file is automatically unlocked and the next reference to the file returns an error message.
Atomic transactions:
The DFS is not by itself a database system; it provides no built-in access mechanisms for files other than random access to sequences of bytes. Since the DFS is intended to support applications dealing with databases, it provides mechanisms that such applications can use, the most important of which is the atomic property for file actions: a sequence of reads and writes on some set of files (not necessarily residing at the same site) can be performed as an indivisible atomic operation. Such an action is called an atomic transaction. Atomic transactions are useful in distributed systems as a means of providing operations on a collection of computers even in the face of adverse circumstances, such as concurrent access to the data involved in the actions and crashes of some of the computers involved due to hardware failures [LAMP 85].
Consider the problem of crash recovery in a database storage system constructed from a number of independent computers. The portion of the system running on some individual computer may crash and then be restarted by some crash recovery procedure. This may result in the loss of some information present just before the crash, which in turn may leave the information permanently stored in the system in an inconsistent state. Atomic transactions are meant for just these kinds of problems and enable the stored data to remain consistent.
A transaction is a sequence of read and write commands sent by a client to the file system; the write commands may depend on the results of previous read commands in the same transaction. The system guarantees that, after recovery from a system crash, for each transaction either all of the write commands will have been executed or none will have been. In addition, transactions appear indivisible with respect to other transactions that may be executing concurrently, i.e., there exists some serial order of execution that would give the same results. This defines the atomic property of transactions.
A transaction can be made atomic by performing it in two phases: first, record the information necessary to do the writes in a set of intentions, without changing the data stored by the system; second, when the transaction is committed, do the actual writes, changing the stored data. If a crash occurs after the transaction commits but before all the changes to the stored data have been made, the second phase is restarted. The restart is repeated as often as necessary to ensure that all the changes have been made.
The atomic transaction mechanism centers around the concept of an atomic restartable action. An atomic action A is characterized by:
procedure A =
    begin
        save_state;
        R;
        reset;
    end;
If R is a restartable action, then A is an atomic action, as the following case analysis shows. If A crashes before save_state, nothing has been done. If A crashes after reset, R has been done and will not be redone, because the saved state has been reset. If A crashes between save_state and reset, A will resume after save_state and restart R, ensuring that A completes successfully.
Error recovery:
The philosophy behind the error recovery protocols is to take action only when needed. In other words, no state information about crashed sites is maintained, so service-requesting sites find out that a site has crashed only after explicitly trying to send it a message and receiving no response. While this may not be the best approach to error recovery in distributed systems, it is certainly one of the simplest; it costs some loss of transparency but considerably eases the design of the protocols. Error recovery depends on which logical site has crashed and on the actions that were under way at the time of the failure.
At the US, if it is discovered during the course of an open that the CSS has crashed, a CSS recovery procedure is run which elects another CSS, and the call continues. If, however, the file has already been opened and the CSS subsequently crashes, calls that depend on the previously opened file will fail, because as part of the CSS recovery procedure all files opened with the crashed CSS are closed automatically. If a call discovers that an SS has crashed, an alternate site is obtained from the file's site vector and the call continues transparently to the user.
At the SS, a distinction has to be made as to whether the crashed site is the US or the CSS; in either case, all open files that belong to the crashed site are unlocked and closed. No CSS recovery procedure is run at the SS, since the SS is assumed to be a passive repository for user files and takes no action on objects other than those local to it.
At the CSS, if the crashed site is an SS, an alternate site is located (if one is present) and the call continues. In the case of an open, the system checks the sites where the file's images exist; in the case of a create, the CSS reads a system file to obtain a list of potential SSs, polls these sites, and creates the file at the sites that are up. If the crashed site is the US, all open files that belong to that US are unlocked and closed, which entails sending messages to the SSs holding the file images. All opens at the CSS are timestamped: if too much time elapses after the open with no other operation, the file is automatically closed. This is necessary since sites could conceivably open a file and then crash, leaving the file permanently open. Closing of open files needs to be done at regular intervals by the system administrator. All locks at the SS are similarly timestamped; provided they are not retained locks, they will be unlocked after the timer expires. Retained
E-mail :
[email protected]
CSE-402 E Distributed Operating System
locks result from a transaction locking a file and then being unable to commit the transaction. The retained flag is only reset after the transaction has been committed. The complete error recovery scenario is summarized in Table. One problem faced is the problem of detecting workstation crashes since the transport provider has no way of detecting this case. While the transport provider can detect the case of the process server being down, processes rely on an alarm signal to prevent them from waiting indefinitely if a machine crashes. When the alarm goes off at any logical site, it is assumed that the responding site is down (though it may just be a very slow server) and an error recovery procedure is run. An optimum value for the alarm is chosen depending on average network response characteristics.
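A minimal sketch of the timestamped-lock idea (illustrative names, not the actual protocol code): each lock records when it was taken, and a periodic sweep unlocks anything that has outlived its interval, unless it is a retained lock, which survives until its transaction commits:

#include <stdio.h>
#include <time.h>

#define LOCK_TIMEOUT 30        /* seconds before an ordinary lock expires */

struct file_lock {
    int    locked;
    int    retained;           /* set while a transaction is uncommitted */
    time_t stamp;              /* when the lock was taken */
};

/* Periodic sweep: unlock every expired lock that is not retained. */
void sweep(struct file_lock *locks, int n, time_t now)
{
    for (int i = 0; i < n; i++) {
        if (locks[i].locked && !locks[i].retained &&
            now - locks[i].stamp > LOCK_TIMEOUT) {
            locks[i].locked = 0;   /* next reference returns an error */
        }
    }
}

int main(void)
{
    struct file_lock locks[2] = {
        { 1, 0, 0 },   /* ordinary lock taken at t=0: will expire  */
        { 1, 1, 0 },   /* retained lock: survives until commit     */
    };
    sweep(locks, 2, 60);
    printf("lock0=%d lock1=%d\n", locks[0].locked, locks[1].locked);
    return 0;
}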
Distributed Shared Memory:
Distributed Shared Memory (DSM), in computer architecture, is a form of memory architecture in which the physically separate memories can be addressed as one logically shared address space. Here, the term shared does not mean that there is a single centralized memory; shared essentially means that the address space is shared (the same physical address on two processors refers to the same location in memory). Alternatively, in computer science it is known as distributed global address space (DGAS), a concept that refers to a wide class of software and hardware implementations in which each node of a cluster has access to shared memory in addition to each node's non-shared private memory.
Software DSM systems can be implemented in an operating system or as a programming library. Software DSM systems implemented in the operating system can be thought of as extensions of the underlying virtual memory architecture. Such systems are transparent to the developer, which means that the underlying distributed memory is completely hidden from the users. In contrast, software DSM systems implemented at the library or language level are not transparent, and developers usually have to program differently. However, these systems offer a more portable approach to DSM system implementation.

Software DSM systems also have the flexibility to organize the shared memory region in different ways. The page-based approach organizes shared memory into pages of fixed size. In contrast, the object-based approach organizes the shared memory region as an abstract space for storing shareable objects of variable sizes. Another commonly seen implementation uses a tuple space, in which the unit of sharing is a tuple. Shared memory architecture may involve separating memory into shared parts distributed amongst nodes and main memory, or distributing all memory between nodes. A coherence protocol, chosen in accordance with a consistency model, maintains memory coherence.

In computing, shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or to avoid redundant copies. Depending on context, programs may run on a single processor or on multiple separate processors. Using memory for communication inside a single program, for example among its multiple threads, is generally not referred to as shared memory.

Consistency model: In computer science, consistency models are used in distributed systems such as distributed shared memory systems or distributed data stores (such as filesystems, databases, optimistic replication systems or Web caching). The system supports a given model if operations on memory follow specific rules. The data consistency model specifies a contract between programmer and system, wherein the system guarantees that if the programmer follows the rules, memory will be consistent and the results of memory operations will be predictable. High-level languages, such as C, C++ and Java, partially maintain the contract by translating memory operations into low-level operations in a way that preserves memory semantics. To hold to the contract, compilers may reorder some memory instructions, and library calls such as pthread_mutex_lock() encapsulate the required synchronization. Verifying sequential consistency is undecidable in general, even for finite-state cache-coherence protocols. Consistency models define rules for the apparent order and visibility of updates; they form a continuum with tradeoffs.
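To make the contract concrete, here is a small illustrative C program (not from the source text): two threads follow the rules by bracketing every access to a shared counter with pthread_mutex_lock()/pthread_mutex_unlock(), and the system in return guarantees a predictable final value:

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                 /* shared data */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

/* Each thread increments the counter 100000 times.  The lock/unlock
 * pair is the "rule" the programmer follows; in return, the memory
 * system guarantees that no increments are lost or reordered. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&m);
        counter++;
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* always 200000 */
    return 0;
}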
Page-based distributed shared memory:
Basic Design
The idea behind DSM is simple: try to emulate the cache of a multiprocessor using the MMU and operating system software. In a DSM system, the address space is divided up into chunks, with the chunks being spread over all the processors in the system. When a processor references an address that is not local, a trap occurs, and the DSM software fetches the chunk containing the address and restarts the faulting instruction, which now completes successfully.
Replication
One improvement to the basic system that can improve performance considerably is to replicate chunks that are read-only: read-only constants, or other read-only data structures. Another possibility is to replicate not only read-only chunks but all chunks. As long as only reads are being done, there is effectively no difference between replicating a read-only chunk and replicating a read-write chunk. However, if a replicated chunk is suddenly modified, inconsistent copies come into existence. The inconsistency is prevented by using some consistency protocol.
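The following sketch (illustrative names; a real implementation hooks the MMU through the operating system's fault path) shows the basic-design loop in miniature: a fault on a non-local page fetches it from its current home, maps it, and lets the faulting access retry:

#include <stdio.h>
#include <string.h>

#define PAGE_SIZE  4096
#define NUM_PAGES  64

static char local_copy[NUM_PAGES][PAGE_SIZE]; /* locally mapped pages */
static int  present[NUM_PAGES];               /* is the page local?   */
static int  owner[NUM_PAGES];                 /* which node holds it  */

/* Stand-in for a network fetch of one page from its owner. */
static void fetch_from(int node, int page, char *buf)
{
    memset(buf, 0, PAGE_SIZE);  /* pretend data arrived from 'node' */
    printf("fetched page %d from node %d\n", page, node);
}

/* Called from the page-fault trap when 'page' is not local. */
void dsm_fault(int page)
{
    fetch_from(owner[page], page, local_copy[page]);
    present[page] = 1;          /* map it; the access restarts */
}

/* Every access goes through this check, as the MMU would. */
char read_byte(int page, int offset)
{
    if (!present[page])
        dsm_fault(page);
    return local_copy[page][offset];
}

int main(void)
{
    owner[3] = 2;               /* page 3 currently lives at node 2 */
    printf("%d\n", read_byte(3, 0));
    return 0;
}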
Finding the Owner
The simplest solution for finding the owner is to do a broadcast, asking for the owner of the specified page to respond. An optimization is not just to ask who the owner is, but also to tell whether the sender wants to read or write, and to say whether it needs a copy of the page. The owner can then send a single message transferring ownership, and the page as well if needed. Broadcasting has the disadvantage of interrupting each processor, forcing it to inspect the request packet. For all the processors except the owner's, handling the interrupt is essentially wasted time. Broadcasting can also use up considerable bandwidth, depending on the hardware.

There are several other possibilities as well. In one of these, a process is designated as the page manager. It is the job of the manager to keep track of who owns each page. When a process, P, wants to read a page it does not have or to write a page it does not own, it sends a message to the page manager telling which operation it wants to perform and on which page. The manager then sends back a message telling the ownership, as required. A problem with this protocol is the potentially heavy load on the page manager from handling all the incoming requests. This problem can be reduced by having multiple page managers instead of just one.

Another possible algorithm is to have each process keep track of the probable owner of each page. Requests for ownership are sent to the probable owner, which forwards them if ownership has changed. If ownership has changed several times, the request message will also have to be forwarded several times. At the start of execution, and every n times ownership changes, the location of the new owner should be broadcast, to allow all processors to update their tables of probable owners.

Finding the Copies
Another important detail is how all the copies are found when they must be invalidated. Again, two possibilities present themselves. The first is to broadcast a message giving the page number and asking all processors holding the page to invalidate it. This approach works only if broadcast messages are totally reliable and can never be lost. The second possibility is to have the owner or page manager maintain a list, or copyset, telling which processors hold which pages. When a page must be invalidated, the old owner, new owner or page manager sends a message to each processor holding the page and waits for an acknowledgment. When each message has been acknowledged, the invalidation is complete.
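A sketch of the probable-owner idea under simplifying assumptions (a single address space stands in for the network, one page, illustrative names): each node records a guess at the owner, a request follows the chain of guesses until it reaches the true owner, and the requester updates its guess afterwards, shortening the chain:

#include <stdio.h>

#define NODES 5

static int prob_owner[NODES];  /* each node's guess at the owner of page 0 */

/* Follow the chain of probable owners for page 0 starting at 'start'.
 * Each hop stands for one forwarded request message. */
int find_owner(int start)
{
    int hops = 0, n = start;
    while (prob_owner[n] != n) {   /* not the owner itself: forward */
        n = prob_owner[n];
        hops++;
    }
    prob_owner[start] = n;         /* shorten the chain for next time */
    printf("found owner %d after %d hops\n", n, hops);
    return n;
}

int main(void)
{
    /* Ownership of page 0 migrated along 0 -> 1 -> 2 -> 3 -> 4,
     * and each node's guess lags one step behind. */
    for (int i = 0; i < NODES - 1; i++) prob_owner[i] = i + 1;
    prob_owner[NODES - 1] = NODES - 1;  /* node 4 is the owner */
    find_owner(0);   /* follows the whole chain: 4 hops */
    find_owner(0);   /* guess was updated: 1 hop */
    return 0;
}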
Page Replacement
In a DSM system, as in any system using virtual memory, it can happen that a page is needed but there is no free page frame in memory to hold it. When this situation occurs, a page must be evicted from memory to make room for the needed page. Two subproblems immediately arise: which page to evict, and where to put it. To a large extent, the choice of which page to evict can be made using traditional virtual memory algorithms, such as some approximation to the least recently used algorithm. As with conventional algorithms, it is worth keeping track of which pages are clean and which are dirty.

In the context of DSM, a replicated page that another process owns is always a prime candidate for eviction, because it is known that another copy exists; consequently, the page does not have to be saved anywhere. If a directory scheme is being used to keep track of copies, however, the owner or page manager must be informed of this decision. If pages are located by broadcasting, the page can just be discarded. The second-best choice is a replicated page that the evicting process owns. It is sufficient to pass ownership to one of the other copies by informing that process, the page manager, or both, depending on the implementation. The page itself need not be transferred, which results in a smaller message.

Shared Variable Distributed Shared Memory
Page-based DSM takes a normal linear address space and allows the pages to migrate dynamically over the network on demand. A more structured approach is to share only certain variables and data structures that are needed by more than one process. In this way, the problem changes from how to do paging over the network to how to maintain a potentially replicated, distributed database consisting of the shared variables. Different techniques are applicable here, and these often lead to major performance improvements. Using shared variables that are individually managed also provides considerable opportunity to eliminate false sharing: if it is possible to update one variable without affecting other variables, then the physical layout of the variables on the pages is less important. The most important example of such a system is Munin.
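To illustrate false sharing (an illustrative example, not from the source): two unrelated variables that happen to lie in the same page force the coherence protocol to bounce that page between nodes even though the processes never touch each other's data; giving each variable its own unit, as shared-variable DSM does by managing variables rather than pages, avoids this:

#include <stdio.h>
#include <stddef.h>

#define UNIT 4096   /* the unit of sharing: a page in page-based DSM */

/* Falsely shared: x and y are unrelated, but they are placed
 * adjacently and so land in the same page.  A write to x by one
 * node invalidates the copy of y held by another node. */
struct falsely_shared {
    int x;
    int y;
};

/* Individually managed: padding gives each variable its own unit,
 * so updates to x never disturb y. */
struct padded {
    int  x;
    char pad[UNIT - sizeof(int)];
    int  y;
};

int main(void)
{
    printf("unpadded offset of y: %zu\n", offsetof(struct falsely_shared, y));
    printf("padded   offset of y: %zu\n", offsetof(struct padded, y));
    return 0;
}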
UNIT-5 MACH CASE STUDY
Introduction
The Mach project [Accetta et al. 1986, Loepere 1991, Boykin et al. 1993] was based at Carnegie-Mellon University in the USA until 1994. Its development into a real-time kernel continued there [Lee et al. 1996], and groups at the University of Utah and the Open Software Foundation continued its development. The Mach project was the successor to two other projects, RIG [Rashid 1986] and Accent [Rashid and Robertson 1981, Rashid 1985, Fitzgerald and Rashid 1986]. RIG was developed at the University of Rochester in the 1970s, and Accent was developed at Carnegie-Mellon during the first half of the 1980s. In contrast to its RIG and Accent predecessors, the Mach project never set out to develop a complete distributed operating system. Instead, the Mach kernel was developed to provide direct compatibility with BSD UNIX. It was designed to provide advanced kernel facilities that would complement those of UNIX and allow a UNIX implementation to be spread across a network of multiprocessor and single-processor computers. From the beginning, the designers' intention was for much of UNIX to be implemented as user-level processes.

Despite these intentions, Mach version 2.5, the first of the two major releases, included all the UNIX compatibility code inside the kernel itself. It ran on SUN-3s, the IBM RT PC, multiprocessor and uniprocessor VAX systems, and the Encore Multimax and Sequent multiprocessors, among other computers. From 1989, Mach 2.5 was incorporated as the base technology for OSF/1, the Open Software Foundation's rival to System V Release 4 as the industry-standard version of UNIX. An older version of Mach was used as a basis for the operating system for the NeXT workstation. The UNIX code was removed from the version 3.0 Mach kernel, however, and it is this version that we describe. Most recently, Mach 3.0 is the basis of the implementation of MkLinux, a variant of the Linux operating system running on Power Macintosh computers [Morin 1997]. The version 3.0 Mach kernel also runs on Intel x86-based PCs. It ran on the DECstation 3100 and 5000 series computers, some Motorola 88000-based computers and SUN SPARCStations; ports were undertaken for IBM's RS6000, Hewlett-Packard's Precision Architecture and Digital Equipment Corporation's Alpha.

Version 3.0 Mach is a basis for building user-level emulations of operating systems, database systems, language run-time systems and other items of system software that we call subsystems. The emulation of conventional operating systems makes it possible to run existing binaries developed for them. In addition, new applications for these conventional operating systems can be developed. At the same time, middleware and applications that take advantage of the benefits of distribution can be developed, and the implementations of the conventional operating systems themselves can be distributed. Two important issues arise for operating system emulations. First, distributed emulations cannot be entirely accurate, because of the new failure modes that arise with distribution. Second, the question is still open of whether acceptable performance levels can be achieved for widespread use.
Design goals and chief design features
The main Mach design goals and features are as follows:

Multiprocessor operation: Mach was designed to execute on a shared-memory multiprocessor, so that both kernel threads and user-mode threads could be executed by any processor. Mach provides a multi-threaded model of user processes, with execution environments called tasks. Threads are pre-emptively scheduled, whether they belong to the same task or to different tasks, to allow for parallel execution on a shared-memory multiprocessor.

Transparent extension to network operation: In order to allow for distributed programs that extend transparently between uniprocessors and multiprocessors across a network, Mach has adopted a location-independent communication model involving ports as destinations. The Mach kernel, however, is designed to be entirely unaware of networks: the design relies totally on user-level network server processes to ferry messages transparently across the network. This is a controversial design decision, given the costs of context switching; however, it allows complete flexibility in the control of network communication policy.

User-level servers: Mach implements an object-based model in which resources are managed either by the kernel or by dynamically loaded servers. Originally only user-level servers were allowed, but later Mach was adapted to accommodate servers within the kernel's address space. As we have mentioned, a primary aim was for most UNIX facilities to be implemented at user level, while providing binary compatibility with existing UNIX. With the exception of some kernel-managed resources, resources are accessed uniformly by message passing, however they are managed. To every resource there corresponds a port managed by a server. The Mach Interface Generator (MIG) was developed to generate RPC stubs that hide message-based accesses at the language level [Draves et al. 1989].
Operating system emulation: To support the binary-level emulation of UNIX and other operating systems, Mach allows for the transparent redirection of operating system calls to emulation library calls and thence to user-level operating system servers – a technique known as trampolining. It also includes a facility that allows exceptions, such as address space violations arising in application tasks, to be handled by servers.

Flexible virtual memory implementation: Much effort was put into providing virtual memory enhancements that would equip Mach for UNIX emulation and for supporting other subsystems. This included taking a flexible approach to the layout of a process's address space. Mach supports a large, sparse process address space, potentially containing many regions. Both messages and open files, for example, can appear as virtual memory regions. Regions can be private to a task, shared between tasks, or copied from regions in other tasks. The design includes the use of memory-mapping techniques, notably copy-on-write, to avoid copying data when, for example, messages are passed between tasks. Finally, Mach was designed to allow servers, rather than the kernel itself, to implement backing storage for virtual memory pages. Regions can be mapped to data managed by servers called external pagers. Mapped data can reside in any generalized abstraction of a memory resource, such as distributed shared memory, as well as in files.

Portability: Mach was designed to be portable to a variety of hardware platforms. For this reason, machine-dependent code was isolated as far as possible. In particular, the virtual memory code was divided between machine-independent and machine-dependent parts [Rashid et al. 1988].

Overview of the main Mach abstractions
We can summarize the abstractions provided by the Mach kernel as follows:

Tasks: A Mach task is an execution environment. It consists primarily of a protected address space and a collection of kernel-managed capabilities used for accessing ports.

Threads: Tasks can contain multiple threads. The threads belonging to a single task can execute in parallel at different processors in a shared-memory multiprocessor.

Ports: A port in Mach is a unicast, unidirectional communication channel with an associated message queue. Ports are not accessed directly by the Mach programmer and are not part of a task. Rather, the programmer is given handles to port rights: these are capabilities to send messages to a port or to receive messages from a port.

Port sets: A port set is a collection of port receive rights local to a task. It is used to receive a message from any one of a collection of ports.
Port sets should not be confused with port groups, which are multicast destinations and are not implemented in Mach.

Messages: A message in Mach can contain port rights in addition to pure data. The kernel employs memory management techniques to transfer message data efficiently between tasks.

Devices: Servers such as file servers running at user level must access devices. The kernel exports a low-level interface to the underlying devices for this purpose.

Memory object: Each region of the virtual address space of a Mach task corresponds to a memory object. This is an object that in general is implemented outside the kernel itself but is accessed by the kernel when it performs virtual memory paging operations. A memory object is an instance of an abstract data type that includes operations to fetch and store data; these are invoked when threads give rise to page faults in attempting to reference addresses in the corresponding region.

Memory cache object: For every mapped memory object, there is a kernel-managed object that contains a cache of the pages of the corresponding region that are resident in main memory. This is called a memory cache object. It supports operations needed by the external pager that implements the memory object.

We shall now consider the main abstractions. The abstraction of devices is omitted in the interests of brevity.
Tasks and threads
A task is an execution environment: tasks themselves cannot perform any actions; only the threads within them can. However, for convenience we shall sometimes refer to a task performing actions when we mean a thread within the task. The major resources associated directly with a task are its address space, its threads, its port rights, its port sets, and the local name space in which port rights and port sets are looked up. We shall now examine the mechanism for creating a new task, and the features related to the management of tasks and the execution of their constituent threads.
Creating a new task
The UNIX fork command creates a new process by copying an existing one. Mach's model of process creation is a generalization of the UNIX model. Tasks are created with reference to what we shall call a blueprint task (which need not be the creator). The new task resides at the same computer as the blueprint task. Since Mach does not provide a task migration facility, the only way to establish a task at a remote computer is via a task that already resides there. The new task's bootstrap port right is inherited from its blueprint, and its address space is either empty or is inherited from its blueprint (address space inheritance is discussed in the subsection on Mach virtual memory below). A newly created task has no threads. Instead, the task's creator requests the creation of a thread within the child task. Thereafter, further threads can be created by existing threads within the task. Some of the Mach calls related to task and thread creation are listed under 'Task and thread management' below.

Invoking kernel operations
When a Mach task or thread is created, it is assigned a so-called kernel port. Mach 'system calls' are divided into those implemented directly as kernel traps and those implemented by message passing to kernel ports. The latter method has the advantage of allowing network-transparent operations on remote tasks and threads as well as local ones. A kernel service manages kernel resources in the same way that a user-level server manages other resources. Each task has send rights to its kernel port, which enables it to invoke operations upon itself (such as to create a new thread). Each of the kernel services accessed by message passing has an interface definition. Tasks access these services via stub procedures, which are generated from their interface definitions by the Mach Interface Generator.

Exception handling
In addition to a kernel port, tasks and (optionally) threads can possess an exception port. When certain types of exception occur, the kernel responds by attempting to send a message describing the exception to the associated exception port. If there is no exception port for the thread, the kernel looks for one for the task. The thread that receives this message can attempt to fix the problem (it might, for example, grow the thread's stack in response to an address space violation), and it then returns a status value in a reply message. If the kernel finds an exception port and receives a reply indicating success, it restarts the thread that raised the exception; otherwise, the kernel terminates it. For example, the kernel sends a message to an exception port when a task attempts an invalid address space access or a division by zero. The owner of the exception port could be a debugging task, which could execute anywhere in the network by virtue of Mach's location-independent communication. Page faults, by contrast, are handled by external pagers.
Task and thread management
About forty procedures in the kernel interface are concerned with the creation and management of tasks and threads. The first argument of each procedure is a send right to the corresponding kernel port, and message-passing system calls are used to request the operation of the target kernel. Some of these task and thread calls are listed below. In summary, thread scheduling priorities can be set individually; threads and tasks can be suspended, resumed and terminated; and the execution state of threads can be externally set, read and modified. The latter facility is important for debugging and also for setting up software interrupts. Yet more kernel interface calls are concerned with the allocation of a task's threads to particular processor sets. A processor set is a subset of the processors in a multiprocessor. By assigning threads to processor sets, the available computational resources can be crudely divided between different types of activity.

task_create(parent_task, inherit_memory, child_task)
inherit_memory specifies whether the child should inherit the address space of its parent or be assigned an empty address space; child_task is the identifier of the new task.

thread_create(parent_task, child_thread)
parent_task is the task in which the new thread is to be created; child_thread is the identifier of the new thread. The new thread has no execution state and is suspended.

thread_set_state(thread, flavour, new_state, count)
thread is the thread to be supplied with execution state; flavour specifies the machine architecture; new_state specifies the state (such as the program counter and stack pointer); count is the size of the state.

thread_resume(thread)
This is used to resume the suspended thread identified by thread.
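Putting the calls above together, here is a schematic C sketch of one task creating and starting a thread in a new child task. The stub definitions are dummy stand-ins so the sketch is self-contained; in a real Mach program these calls come from the kernel interface, with the signatures described in the list above, and the state block is machine-dependent:

#include <stdio.h>

typedef int task_t;
typedef int thread_t;
typedef int boolean_t;

/* Dummy stand-ins so this sketch compiles and runs on its own. */
static int task_create(task_t parent, boolean_t inherit_memory, task_t *child)
{ (void)parent; (void)inherit_memory; *child = 2; return 0; }
static int thread_create(task_t parent_task, thread_t *child_thread)
{ (void)parent_task; *child_thread = 7; return 0; }
static int thread_set_state(thread_t t, int flavour, int *st, unsigned count)
{ (void)t; (void)flavour; (void)st; (void)count; return 0; }
static int thread_resume(thread_t t) { (void)t; return 0; }

int main(void)
{
    task_t   self = 1, child;
    thread_t t;
    int      state[32] = { 0 };    /* machine-dependent register block */

    task_create(self, 1, &child);  /* child inherits our address space */
    thread_create(child, &t);      /* new thread: no state, suspended  */
    thread_set_state(t, 0, state,  /* flavour 0 is a placeholder       */
                     sizeof state / sizeof state[0]);
    thread_resume(t);              /* the new thread starts running    */
    printf("created task %d with thread %d\n", child, t);
    return 0;
}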
Communication model
Mach provides a single system call for message passing: mach_msg. Before describing this, we shall say more about messages and ports in Mach.

Messages
A message consists of a fixed-size header followed by a variable-length list of data items. The fixed-size header contains:
The destination port: For simplicity, this is part of the message rather than being specified as a separate parameter to the mach_msg system call. It is specified by the local identifier of the appropriate send rights.

A reply port: If a reply is required, then send rights to a local port (that is, one for which the sending thread has receive rights) are enclosed in the message for this purpose.

An operation identifier: This identifies an operation (procedure) in the service interface and is meaningful only to applications.

Extra data size: Following the header (that is, contiguous with it) there is, in general, a variable-sized list of typed items. There is no limit to its length, except the number of bits in this field and the total address space size.

Each item in the list after the message header is one of the following (and they can occur in any order in the message):

Typed message data: individual, in-line, type-tagged data items.

Port rights: referred to by their local identifiers.

Pointers to out-of-line data: data held in a separate, non-contiguous block of memory.

Mach messages thus consist of a fixed-size header and multiple data blocks of variable sizes, some of which may be out of line (that is, non-contiguous). However, when out-of-line message data are sent, the kernel – not the receiving task – chooses the location in the receiving task's address space of the received data. This is a side effect of the copy-on-write technique used to transfer the data. Extra virtual memory regions received in a message must be de-allocated explicitly by the receiving task if they are no longer required. Since the costs of virtual memory operations outweigh those of data copying for small amounts of data, it is intended that only reasonably large amounts of data be sent out of line.

The advantage of allowing several data components in messages is that the programmer can allocate memory separately for data and for metadata. For example, a file server might locate a requested disk block in its cache. Instead of copying the block into a message buffer, contiguously with the header information, the data can be fetched directly from where they reside by providing an appropriate pointer in the reply message. This is a form of what is known as scatter-gather I/O, wherein data are written to or read from multiple areas of the caller's address space in one system call. The UNIX readv and writev system calls also provide for this [Leffler et al. 1989].

The type of each data item in a Mach message is specified by the sender. This enables user-level network servers to marshal the data into a standard format when they are transmitted across a network. However, this marshalling scheme has performance disadvantages compared with marshalling and unmarshalling performed by stub procedures generated from interface definitions. Stub procedures have common knowledge of the data types concerned, need not include these types in the messages, and may marshal data directly into the message. A network server may have to copy the sender's typed data into another message as it marshals them.
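A schematic C rendering of the message layout just described (field names are illustrative; this is not the real Mach mach_msg_header_t layout):

#include <stdio.h>
#include <stdint.h>

/* Fixed-size header, as described above: destination, reply port,
 * operation identifier and the size of the extra (typed) data. */
struct msg_header {
    uint32_t dest_port;       /* local id of send rights            */
    uint32_t reply_port;      /* send rights for the reply, if any  */
    uint32_t operation_id;    /* meaningful only to the application */
    uint32_t extra_data_size; /* bytes of typed items that follow   */
};

/* One typed item in the variable-length list after the header: an
 * item carries in-line data, a port right, or a pointer to an
 * out-of-line block. */
enum item_kind { ITEM_DATA, ITEM_PORT_RIGHT, ITEM_OUT_OF_LINE };

struct msg_item {
    enum item_kind kind;
    uint32_t       size;           /* size of the data carried          */
    union {
        unsigned char data[64];    /* in-line, type-tagged data         */
        uint32_t      port_right;  /* local identifier                  */
        void         *out_of_line; /* separate, non-contiguous block    */
    } u;
};

int main(void)
{
    struct msg_header h = { 5, 6, 42, sizeof(struct msg_item) };
    printf("header is %zu bytes, op %u\n", sizeof h, (unsigned)h.operation_id);
    return 0;
}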
Ports
A Mach port has a message queue whose size can be set dynamically by the task with receive rights. This facility enables receivers to implement a form of flow control. When a normal send right is used, a thread attempting to send a message to a port whose message queue is full will be blocked until room becomes available. When a thread uses a send-once right, the recipient always queues the message, even if the message queue is full; since a send-once right is used, it is known that no further messages can be sent from that source. Server threads can therefore avoid blocking by using send-once rights when replying to clients.

Sending port rights
When port send rights are enclosed in a message, the receiver acquires send rights to the same port. When receive rights are transmitted, they are automatically de-allocated in the sending task, because receive rights cannot be possessed by more than one task at a time. All messages queued at the port and all subsequently transmitted messages can be received by the new owner of the receive rights, in a manner that is transparent to tasks sending to the port. The transparent transfer of receive rights is relatively straightforward to achieve when the rights are transferred within a single computer: the acquired capability is simply a pointer to the local message queue. In the inter-computer case, however, a number of more complex design issues arise. These are discussed below.

Monitoring connectivity
The kernel is designed to inform senders and receivers when conditions arise under which sending or receiving messages would be futile. For this purpose, it keeps information about the number of send and receive rights referring to a given port. If no task holds receive rights for a particular port (for example, because the task holding these rights failed), then all send rights in local tasks' port name spaces become dead names. When a sender attempts to use a name referring to a port for which receive rights no longer exist, the kernel turns the name into a dead name and returns an error indication. Similarly, tasks can request the kernel to notify them asynchronously of the condition that no send rights exist for a specified port. The kernel performs this notification by sending the requesting thread a message, using send rights given to it by the thread for this purpose. The condition of no send rights can be ascertained by keeping a reference count that is incremented whenever a send right is created and decremented when one is destroyed. It should be stressed that the conditions of no senders/no receiver are tackled within the domain of a single kernel at relatively little cost. Checking for these conditions in a distributed system is, by contrast, a complex and expensive operation: given that rights can be sent in messages, the send or receive rights for a given port could be held by any task, or could even be in a message queued at a port or in transit between computers.
Port sets
Port sets are locally managed collections of ports that are created within a single task. When a thread issues a receive from a port set, the kernel returns a message that was delivered to some member of the set. It also returns the identifier of that port's receive rights, so that the thread can process the message accordingly. Port sets are useful because a server is typically required to service client messages at all of its ports at all times. Receiving from a port whose message queue is empty blocks a thread, even if a message that it could process arrives on another port first. Assigning a thread to each port overcomes this problem, but it is not feasible for servers with large numbers of ports, because a thread is a more expensive resource than a port. By collecting ports into a port set, a single thread can be used to service incoming messages without fear of missing any. Furthermore, this thread will block if no messages are available on any port in the set, so avoiding a busy-waiting solution in which the thread polls until a message arrives on one of the ports.

mach_msg
The mach_msg system call provides for both asynchronous message passing and request-reply style interactions, which makes it extremely complicated; we shall give only an overview of its semantics. The complete call is as follows:

mach_msg(msg_header, option, snd_siz, rcv_siz, rcv_name, timeout, notify)

msg_header points to a common message header for the sent and received messages; option specifies send, receive or both; snd_siz and rcv_siz give the sizes of the sent and received message buffers; rcv_name specifies the port or port-set receive rights (if a message is received); timeout sets a limit to the total time to send and/or receive a message; notify supplies port rights which the kernel is to use to send notification messages under exceptional conditions.

Thus mach_msg either sends a message, receives a message, or both. It is a single system call that clients use to send a request message and receive a reply, and that servers use to reply to the last client and receive the next request message. A further benefit of using a combined send/receive call is that, in the case of a client and server executing at the same computer, the implementation can employ an optimization called handoff scheduling. This is where a task about to block after sending a message to another task 'donates' the rest of its timeslice to the other task's thread; this is cheaper than going through the queue of RUNNABLE threads to select the next thread to run.

Messages sent by the same thread are delivered in sending order, and message delivery is reliable. At least, this is guaranteed where messages are sent between tasks hosted by a common kernel, even in the face of a lack of buffer space. When messages are transmitted across a network to a failure-independent computer, at-most-once delivery semantics are provided.

The timeout is useful for situations in which it is undesirable for a thread to be tied up indefinitely, for example awaiting a message that might never arrive, or waiting for queue space at what turns out to be a buggy server's port.
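As an illustration of the request-reply pattern, here is a schematic client call built on the seven-parameter signature shown above. The stub, header layout and option constants are paraphrased placeholders (the real definitions live in the Mach headers, with names such as MACH_SEND_MSG and MACH_RCV_MSG), so the stand-in is deliberately named mach_msg_sketch:

#include <stdio.h>

typedef int kern_return_t;

#define OPT_SEND    0x1   /* illustrative stand-in for the send option    */
#define OPT_RECEIVE 0x2   /* illustrative stand-in for the receive option */

struct header { unsigned dest_port, reply_port, operation_id, size; };

/* Dummy stand-in taking the seven parameters described in the text. */
static kern_return_t mach_msg_sketch(struct header *msg, unsigned option,
                                     unsigned snd_siz, unsigned rcv_siz,
                                     unsigned rcv_name, unsigned timeout,
                                     unsigned notify)
{
    (void)snd_siz; (void)rcv_siz; (void)rcv_name; (void)timeout; (void)notify;
    if ((option & OPT_SEND) && (option & OPT_RECEIVE))
        msg->operation_id = 0;   /* pretend the reply overwrote the buffer */
    return 0;
}

/* One RPC: send the request and wait for the reply in a single call,
 * reusing the same buffer for the received message. */
kern_return_t do_rpc(struct header *req, unsigned reply_port,
                     unsigned buf_size, unsigned timeout_ms)
{
    req->reply_port = reply_port;          /* where the server replies   */
    return mach_msg_sketch(req, OPT_SEND | OPT_RECEIVE,
                           req->size,      /* size of the sent message   */
                           buf_size,       /* room for the reply         */
                           reply_port,     /* receive on the reply port  */
                           timeout_ms,     /* bound the total time       */
                           0);             /* no notify port             */
}

int main(void)
{
    struct header req = { 5, 0, 42, sizeof req };
    do_rpc(&req, 6, 256, 1000);
    printf("rpc done, op now %u\n", req.operation_id);
    return 0;
}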
UNIX emulation in MACH
Mach and Chorus were designed to emulate operating systems, notably UNIX; UNIX emulation has also been implemented on the V kernel. Strict binary compatibility requires that all binary files compiled to run on a conventional implementation of a version of UNIX (for example Linux, 4.3BSD or SVR4) should run correctly and without modification on the emulation, for the same machine architecture. This implies that the following requirements should be met:

Address space layout: The emulation must provide the regions expected by the program. If the code is non-relocatable, the machine instructions assume that regions such as the program text and heap occupy certain expected address ranges. Address space regions such as the stack must be grown as necessary.

System call processing: Whenever a program executes a system call with a valid set of arguments, the emulation must handle it correctly according to the defined call semantics; it must handle the associated TRAP instruction and obey the parameter-passing conventions expected by the program.

Error semantics: Whenever a program presents invalid arguments to a system call, the emulation must correctly reproduce the error semantics defined for the system call. In particular, if the user program provides an invalid memory address, the emulation should simply return an error status, and not raise a hardware exception at user level.
Failure semantics: The emulation should not introduce new system call failure modes. An example of a failure mode applicable to a conventional UNIX implementation is the inability to complete a call due to a lack of system resources such as table space (for example, fork may fail in this way).

Protection: User data and the UNIX emulation system itself must not be compromised.

Signals: Signals must be generated, and user-level handlers called, as appropriate when a UNIX program causes an exception such as an address space violation.

Emulation software is required at every computer that can run UNIX processes. One of the aims when emulating UNIX in a distributed system is to implement a single UNIX image across several computers, so that, for example, UNIX processes have globally unique process identifiers and signals can be transmitted transparently between computers. It becomes difficult or impossible to meet the requirement of reproducing UNIX failure semantics – in so far as they are defined – in these circumstances. Effectively, many UNIX system calls would have to be implemented as transactions, because of the independent failure modes of computers and networks. Moreover, strictly speaking, suitable protection mechanisms are required when user data are transferred across a network.
The Mach emulation
A UNIX process is implemented using a Mach task with a single thread. 4.3BSD UNIX services are provided by two software components: an emulation library and a 4.3BSD server. The emulation library is linked as a distinct region into every task emulating a UNIX process; this region is inherited from /etc/init by all UNIX processes. There is one 4.3BSD server (that is, one such task) for every computer running the emulation. This server both handles requests sent by clients and acts as an external pager when clients fault on mapped UNIX files, as we shall discuss.

Applications do not invoke the code in the emulation library directly. Mach provides a call, task_set_emulation, which assigns the address of a handler in the emulation library to a given system call number; this is called for each UNIX system call when the emulation is initialized. When a UNIX process executes a system call, the TRAP instruction causes the Mach kernel to transfer control back to the thread in the UNIX task, so that this same thread executes the corresponding emulation library handler (Figure 1). The handler then either sends a message requesting the required service to the 4.3BSD server task and awaits a reply or, in some cases, performs the UNIX system service itself, using data accessible to the emulation library.
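A sketch of this redirection mechanism in miniature (illustrative only: a user-space table stands in for what task_set_emulation registers with the kernel): each system call number maps to an emulation-library handler, and the simulated trap dispatches through the table:

#include <stdio.h>

#define NSYSCALLS  128
#define SYS_getpid 20   /* illustrative system call number */

typedef long (*handler_t)(long arg);

static handler_t emulation_table[NSYSCALLS];

/* Stand-in for Mach's task_set_emulation: remember which handler
 * services a given system call number. */
void set_emulation(int syscall_no, handler_t h)
{
    emulation_table[syscall_no] = h;
}

/* Emulation-library handler: answers locally rather than contacting
 * the 4.3BSD server (in reality it would read the shared page). */
static long emulated_getpid(long unused)
{
    (void)unused;
    return 1234;
}

/* Stand-in for the kernel redirecting the TRAP back into the task. */
long trap(int syscall_no, long arg)
{
    return emulation_table[syscall_no](arg);
}

int main(void)
{
    set_emulation(SYS_getpid, emulated_getpid);
    printf("pid = %ld\n", trap(SYS_getpid, 0));
    return 0;
}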
Each UNIX process and its local 4.3BSD server share two regions of one page each. One of these is read-only for the process, and it contains information such as the process's identifier, user identifier and group identifier. If a process calls getpid, for example, then the emulation library may read the process identifier directly from this page and return it without communicating with the 4.3BSD server. The other shared region can be written by the UNIX process; it contains signal-related information and an array of data structures relating to the process's open files.

When a UNIX process opens a file, the file is mapped into a region of its address space, with the 4.3BSD server acting as the external pager. The emulation assigns a region of 64 kilobytes for each file; if the file is larger, then the region is used as a movable window onto the file. When the process calls read or write on the file, the corresponding emulation library procedure copies the data between the mapped region and the user's buffer and updates the file pointer. The data copying requires no explicit communication with the 4.3BSD server. However, the library may generate page faults as it accesses the file region, which will result in the kernel communicating with the 4.3BSD server. The emulation library has to synchronize with the 4.3BSD server before it accesses the file data if it needs the file window to be moved, or if the open file is shared (for example, with a child or parent process). In the latter case, the file's read-write pointer must be updated consistently; a token is used to obtain mutual exclusion over its use. The emulation library is responsible for requesting the token from the 4.3BSD server and for releasing it. (See the subsection on distributed mutual exclusion in Chapter 10 for a description of centralized token management.)

The Mach exception handling scheme facilitates the implementation of UNIX signals arising from exceptional conditions. A thread belonging to a 4.3BSD server can arrange to be sent messages pertaining to exceptions, and it can respond to these by adjusting the victim task's state so as to call a signal handler before replying to the kernel. The exception handling scheme also facilitates automatic stack growth and task debugging.

The 4.3BSD server requires internal concurrency in order to handle the calls made upon it efficiently. Recall that in a conventional implementation UNIX processes undergo a context switch and execute within the kernel to handle their system calls; there is thus a process in the kernel for every system call. The 4.3BSD server instead uses many C threads to receive requests and process them. Most of the threads are kept in a pool and assigned dynamically to requests from emulation library calls. There are also a few dedicated threads: the Device reply thread requests device activity from the kernel; the Softclock thread implements timeouts; the Netinput thread handles network device interactions with the kernel; and the Inode pager thread implements an external pager corresponding to mapped UNIX files.
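Finally, a sketch of the read path through the 64-kilobyte file window (hypothetical names; the real library also handles the shared-pointer token and write-back): if the requested range lies outside the current window, the window is first moved in agreement with the 4.3BSD server, and the data are then copied with no explicit message exchange:

#include <stdio.h>
#include <string.h>

#define WINDOW_SIZE (64 * 1024)

struct open_file {
    char window[WINDOW_SIZE];  /* mapped region: a window onto the file */
    long window_base;          /* file offset the window currently maps */
    long file_pointer;         /* current read-write pointer            */
};

/* Stand-in for synchronizing with the 4.3BSD server to remap the
 * window; in reality this is where page faults and messages occur. */
static void move_window(struct open_file *f, long new_base)
{
    f->window_base = new_base;
    memset(f->window, 0, WINDOW_SIZE);  /* pretend new pages arrive */
}

/* Emulated read: copy from the mapped window into the user's buffer
 * and advance the file pointer.  No server message is needed while
 * the range stays inside the current window. */
long emulated_read(struct open_file *f, char *buf, long n)
{
    long off = f->file_pointer;
    if (off < f->window_base || off + n > f->window_base + WINDOW_SIZE)
        move_window(f, off);            /* window must be moved first */
    memcpy(buf, f->window + (off - f->window_base), n);
    f->file_pointer += n;
    return n;
}

int main(void)
{
    static struct open_file f;
    char buf[16];
    emulated_read(&f, buf, sizeof buf);
    printf("file pointer now %ld\n", f.file_pointer);
    return 0;
}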