CS6460: Educational Technology: Spring 2016
Illustrated Notes for CS6410: Advanced Operating Systems course
Bhavin Thaker: [email protected]
Date: 05th May, 2016

Abstract: This document provides succinct yet detailed notes on the lecture videos for the CS6410: Advanced Operating Systems (AOS) course at the Georgia Institute of Technology. These illustrated notes were developed as part of the CS6460: Educational Technology course with the purpose of contributing to the field of Education. In some cases, additional material is presented that is complementary to the lecture material; it is clearly marked as optional reading and colored with a pink background. Future revisions to this document will be made available here. You can provide feedback by completing a short survey here.

A student of the AOS course is expected to complete watching and understanding the AOS videos and only then utilize these notes as an aid for deeper understanding and improved memorization of the material. Additionally, it is highly recommended that students use these notes to make their own personalized notes, since the process of doing so has been shown to help in better comprehension and memory recall of new information.

1. Problem
One of the main problems with Georgia Tech's AOS (Advanced Operating Systems) course that I have heard about informally from students is information overload. I believe that the huge amount of information in the AOS course can prevent a student from understanding the essence of the material taught and the interdependent relationships between concepts. Students are not particularly good at taking complete notes, and estimates of student accuracy in lecture note-taking hover around a low value of 40% [1] [2]. Several studies indicate that students have difficulty organizing lecture material and identifying main points [2]. Thus, the large volume of information in the AOS course compounds the problem of not having complete and succinct lecture notes.

2. Solution
I would like to design succinct, complete and illustrated lecture notes and, while doing so, learn higher-level learning processes that typically help a student understand and remember complex concepts for real-life applications [4] [5].


This document is a set of ready-made notes that follow guiding principles based on various modern learning concepts, such that the notes:
1. are succinct: use chunk-based [8], atomic [10] pieces of information,
2. are complete: cover all important information imparted in the lectures,
3. use illustrations: have a sufficient number of illustrations to help visualize the concept,
4. use examples:
   a. use the "show, don't tell" principle [10] to demonstrate a concept before explaining its English description,
   b. use "specifics before generics", i.e. use examples with specific numbers before generalizing them to examples with variables,
   c. use a complete mini-representation of an actual example instead of using a big example with ellipsis in the diagram,
5. use experiments [6]: that the students can try out on their own,
6. use Q&A style: utilize Question and Answer style to encourage student curiosity [7],
7. use conversational style: use a stimulating, casual, first-person, conversational style [9],
8. use emotions: like humor, surprise, and interest [9],
9. use color: utilize color or different font styles as appropriate,
10. use spatial contiguity: place related concepts on the same page and in succession [11],
11. use redundancy: repeat mnemonics to improve memory recall of the information.

3. Evaluation
The evaluation of the Illustrated Notes will be performed by seeking feedback from EdTech and AOS students on whether the Illustrated Notes helped them to understand and remember AOS concepts. Finally, I plan to publish my learning experiences in a paper that will summarize effective ways to structure illustrated notes.

"Learning results from what the student does and thinks and only from what the student does and thinks. The teacher can advance learning only by influencing what the student does to learn." --- Herb Simon, Nobel laureate in Economics and ACM Turing Award winner.

Acknowledgments
I would like to thank the following people:
1. Dr. Ramachandran for creating the excellent course on Advanced Operating Systems,
2. Dr. Joyner for creating the EdTech course that inspired me to create the Illustrated Notes,
3. My EdTech mentor, Bobbie Eicher, for her guidance and support,
4. Various students for their encouragement and feedback on this interesting and important problem of information overload in the new information age, and
5. My family, who graciously allowed me to use our family time over many weekends to produce the Illustrated Notes.

References
[1] R. Eric Landrum, Faculty and Student Perceptions of Providing Instructor Lecture Notes.
[2] DeZure, et al., Research on Student Notetaking.
[3] Article: Problem-based Learning helps bridge the gap between the classroom and the real world.
[4] YouTube video, Dr. Mehran Sahami, Programming Invitational Talk (see time: 47:40).
[5] YouTube, Dr. Sahami, Engaging Undergraduates in Learning, Teaching and Research.
[6] Lee, et al., Generative Learning: Principles and Implications for Making Meaning.
[7] George Loewenstein, The Psychology of Curiosity: A Review and Re-interpretation, 1994.
[8] Wikipedia: Chunking as a mnemonic technique in psychology.
[9] The Head First formula from O'Reilly's Head First series of books.
[10] Bruce Eckel, Atomic Scala: http://www.atomicscala.com/free-sample/#.VqQd7vkrJhE
[11] John Medina, Brain Rules: 12 Principles for Surviving & Thriving at Work, Home, & School.
[12] Peter Brown, Make It Stick: The Science of Successful Learning.
[13] Daniel Kahneman, Thinking, Fast and Slow.
[14] Donald Norman, The Design of Everyday Things.
[15] Amar Chitra Katha comic series: https://en.wikipedia.org/wiki/Amar_Chitra_Katha
[16] Cartoon Guide comic series: http://www.larrygonick.com/
[17] Feynman Diagram: https://en.wikipedia.org/wiki/Feynman_diagram
[18] Simon, David, et al., UC San Diego, NoteBlogging: Taking Note Taking Public.


Illustrated Notes for CS6410: Advanced Operating Systems course at Georgia Institute of Technology.
Bhavin Thaker: [email protected]

This e-book is dedicated to W. Richard Stevens for his illustrative and lucid style of explaining complex concepts simply in his various books on UNIX and TCP/IP.

Table of Contents

No.  Lesson  Lesson Name                                        Page Number
 1.  L06a    Spring OS                                          Page 05
 2.  L06b    Java RMI: Remote Method Invocation                 Page 18
 3.  L06c    EJB: Enterprise Java Beans                         Page 24
 4.  L07a    GMS: Global Memory System                          Page 32
 5.  L07b    DSM: Distributed Shared Memory                     Page 48
 6.  L07c    DFS: Distributed File System                       Page 73
 7.  L08a    LRVM: Lightweight Recoverable Virtual Memory       Page 86
 8.  L08b    RioVista: Performant Persistent Memory             Page 97
 9.  L08c    QuickSilver: Transactional Operating System        Page 104
10.  L09a    GSS: Giant Scale Services                          Page 113
11.  L09b    MR: Map-Reduce framework                           Page 129
12.  L09c    CDN: Content Delivery Networks                     Page 138
13.  L10a    TS-Linux: Time-Sensitive Linux                     Page 159
14.  L10b    PTS: Persistent Temporal Streams                   Page 168
15.  L11a    Security: Principles of Information Security       Page 179
16.  L11b    Security in AFS: Andrew File System                Page 184

Notes:
1. While I have made every attempt to retain the accuracy and completeness of information from the lecture videos, please use the Illustrated Notes as an aid after watching the videos and not as a substitute for the lecture videos. The accuracy of the notes is not verified by the course owner. In case of any conflicting information, the lecture videos override the information in these notes. Sections of these notes where I differ from the lectures are highlighted in pink.
2. This e-book is derived work based on the video lecture transcripts of the CS6410 Advanced Operating Systems course at Georgia Tech. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without any fee, provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on this page.
3. Click on the hyperlinked questions to view the lecture videos on YouTube (this is work in progress).


Illustrated Notes for L06a: Spring OS

L06a-Q1. What is the context and background of the Spring OS? What are the critical design choices for building a new OS?
1. In general, Distributed Object Technology is used to design for the continuous and incremental evolution of complex, distributed software systems, both in terms of functionality and performance.
2. The Spring Network Operating System was designed and implemented at Sun Microsystems as a network operating system for use in local area networks.
3. Later on, Spring OS was marketed as Sun's Solaris MC (Multi-Computer) Operating System.
4. Spring OS used object-oriented principles in building the OS kernel in order to have state isolation and strong interfaces for each subsystem of the OS.

The critical design choices in building a new OS are:
1. Marketplace demand says that there are many legacy applications running on a current OS, and therefore building a brand new OS may not be viable in an industrial setting.
2. So, Sun Microsystems took the approach of keeping the external interface to the legacy applications the same as in earlier versions of the OS, but innovating under the covers of the OS where it makes sense. Also, introduce new external APIs in a way that does not break legacy applications. In other words, make sure that innovation allows extensibility and flexibility. This is similar to Intel's approach of keeping the x86 processor instruction set interface the same across multiple processors, while continuing to make microarchitecture innovations incrementally and continuously.

In summary, the mantra is: do innovation under the covers, but keep the external interface the same for backward compatibility with legacy applications.


L06a-Q2. Can you compare Procedural Design vs Object-based Design?

Procedural Design has the following properties:
1. Code is written as one monolithic entity.
2. The code has shared state in terms of global variables and has private state in the caller and callee. That is, state is strewn all over the place in the code.
3. Shared state can be manipulated from several different subsystems that are part of one big monolith.

Object-based Design has the following properties:
1. The state is contained entirely within an object and is not visible outside.
2. The object methods can read and write the object state, but this state is not accessible directly to anybody else.
3. So, object-based design provides strong interfaces and complete isolation of the state of an object from everything else.
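To make the contrast concrete, here is a minimal, made-up Java sketch (not from the lectures): the first class exposes mutable shared state that any code can touch, while the second keeps its state private behind methods, which is the "strong interface plus state isolation" idea.

    // Procedural style: shared state as a global, mutable by any code (hypothetical sketch).
    class ProceduralBank {
        static double balance = 0.0;                      // global shared state, strewn across the code
        static void deposit(double amt)  { balance += amt; }
        static void withdraw(double amt) { balance -= amt; }  // any other code may also modify balance directly
    }

    // Object-based style: state is private; the only way in is through the object's methods.
    class Account {
        private double balance = 0.0;                     // state isolated inside the object
        public void deposit(double amt)  { balance += amt; }
        public void withdraw(double amt) { balance -= amt; }
        public double balance()          { return balance; }
    }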


L06a-Q3. Could you describe in detail the Spring OS's approach to building the OS?
1. The Spring approach to building an operating system is to adopt strong interfaces for each subsystem. That is, the only thing that is exposed outside a subsystem is what services are provided by the subsystem, not how the services are actually implemented.
2. At the same time, the strong interfaces should be open, flexible and extensible, and hence the system components can be implemented in a variety of programming languages using the IDL (Interface Definition Language) from the OMG (Object Management Group).
3. The IDL from the OMG allows a programmer to define the interfaces using IDL.
4. Third-party IDL compilers compile the IDL interface definitions into code in a particular language, depending on the IDL compiler used. This generated code is compiled further by the language compiler to build subsystems that can integrate with the Spring system.
5. The Spring OS takes the approach of extensibility, and extensibility naturally leads to a microkernel-based approach.
6. Spring OS follows Liedtke's principle that the micro-kernel provides the abstractions of Threads, IPC and the Virtual Memory Manager, and all other services (e.g. display server, file system, etc.) live outside this micro-kernel.
7. The Spring OS micro-kernel (uK) has 2 parts:
   7a. The Nucleus of the micro-kernel provides the abstractions of Threads and IPC among threads.
   7b. The Virtual Memory Manager (VMM).
   Mnemonic: uK = Nucleus (Threads, IPC) + VMM
8. Thus, Sun Microsystems transitioned from a single-node OS to a Network OS, keeping the external UNIX interfaces the same as in earlier versions of the OS.

In summary, the key design choices in building the Spring OS are:
1. Strong interfaces for each subsystem.
2. Open, flexible and extensible interfaces using OMG's IDL, so that the approach is not tied to a particular language implementation.
3. State isolation by hiding state within an object and by using object methods to manipulate the state of the object.


L06a-Q4. Describe the Nucleus of the Spring Micro-Kernel.
Recall the mnemonic: uK = Nucleus (Threads, IPC) + VMM.
The Spring Micro-Kernel contains 2 components: the Nucleus + the VMM (Virtual Memory Manager).

The Nucleus of the Spring Micro-Kernel (uK) manages two entities:
1. Threads (Computing)
2. IPC among the Threads (Networking)
The VMM (Virtual Memory Manager) of the Spring Micro-Kernel manages Memory. Note that a computer consists mainly of three components: Computing, Networking and Memory. The Spring Micro-Kernel architecture uses a subset of Liedtke's prescription of the uK architecture.

The Nucleus of the Spring Micro-Kernel provides 2 abstractions:
1. A Domain: a container or address space (AS) used to run threads in.
2. A Door: a software capability to enter a target Domain and to invoke the entry points of the objects in the target Domain. A Door is represented by a pointer to a C++ object that represents the target Domain. The Door software capability of a Domain can be passed from one Domain to another Domain.
New mnemonic: uK = Nucleus (Threads run in Domains, IPC through Doors) + VMM.

A Door Table is a table, unique to every Domain, that contains all the Doors accessible to that Domain. Doors are used to make object invocations from the source Domain into the target Domain. Each entry in the Door Table contains a DoorID and a corresponding door pointer, which is a C++ pointer that represents the target Domain. A DoorID is also known as a door handle. The DoorID allows the source Domain to invoke the entry point associated with the Door. There can be many DoorIDs in the Door Table, pointing to different target Domains.

The Nucleus is involved in every Door call invocation. On a Door call invocation by a source Domain into a target Domain, the Nucleus allocates a server thread in the target Domain and executes the PPC (Protected Procedure Call) associated with the DoorID. On receiving a Door call, the Nucleus deactivates the client thread in the source Domain and activates the server thread in the target Domain; after the server thread completes its work, the Nucleus deactivates the server thread in the target Domain and reactivates the client thread in the source Domain.

The PPC is similar to the communication mechanism in the LRPC (Lightweight RPC) paper and is a very fast cross-domain call mechanism using the Door abstraction provided by the Nucleus. This design ensures that the architecture has all the good attributes of object orientation while remaining performant as well.
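To make the Door Table idea concrete, here is a minimal, hypothetical Java sketch. The real Spring mechanism is a C++ pointer plus kernel-mediated thread switching inside the Nucleus; none of these class or method names come from Spring, and the whole PPC/thread hand-off is reduced to a plain method call here.

    import java.util.HashMap;
    import java.util.Map;

    // A Door names one entry point (a protected procedure) in a target Domain.
    interface Door {
        Object invoke(Object args);   // stands in for the Protected Procedure Call (PPC)
    }

    // Every Domain has its own Door Table: DoorID -> Door.
    class Domain {
        private final Map<Integer, Door> doorTable = new HashMap<>();

        void installDoor(int doorId, Door door) { doorTable.put(doorId, door); }

        // A cross-domain call: in real Spring the Nucleus deactivates the client
        // thread, runs a server thread in the target Domain, then switches back.
        Object doorCall(int doorId, Object args) {
            Door d = doorTable.get(doorId);
            if (d == null) throw new IllegalArgumentException("no capability for door " + doorId);
            return d.invoke(args);
        }
    }

The point of the table is that possession of a DoorID is the capability: a source Domain that has been handed DoorID 42 can call doorCall(42, request), and nothing else in the target Domain is reachable.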


L06a-Q5. How is object invocation across the network accomplished?
A source Domain on a node can access any target Domain on the same node using Doors. A source Domain on a node can access any target Domain on another node using Network Proxies on both nodes, as shown in the figure below.

The target Proxy A exports a NetHandle that embeds Door-X. The source Proxy B uses this NetHandle to connect Nucleus-B to Nucleus-A. The client/source Domain B uses the local Nucleus B to get transparently connected to the remote target Nucleus A and then to the server Domain A.

The Network Proxies are outside the Micro-Kernel and are invisible to the client and server Domains. Since they are invisible, the Network Proxies can employ different network protocols (e.g. LAN- or WAN-specialized protocols). This is a key property of a Network Operating System: such decisions are not ingrained or hard-coded in the operating system of a single node, and the network connectivity is flexible and extensible.


L06a-Q6. How is object invocation made secure?
A server object can provide different access privilege levels to different clients. Every server object has an associated Front Object that controls the access privileges to the underlying server object. The security model associates policies with Front Objects that govern access to the underlying objects.

The Front Object registers the Door for accessing it with the Nucleus. The client Domain goes through this Door to access the Front Object [1 and 2 in the figure below], and the Front Object checks the ACL [3] to ensure that the client Domain has the appropriate privilege to access the underlying object; if the check passes, it accesses [4] the server object.

A client Domain can reduce the privilege of a DoorID before passing it to another Domain. For example, a client Domain having full read-write access to a file can pass the associated DoorID to a Printer Object after reducing its privilege to a one-time, read-only privilege that is sufficient to print the file.

It is possible to have multiple Front Objects for an underlying server object, with distinct Doors registered with the Nucleus (see Front Objects Y and Z in the figure below), for different implementations of the control policies for a particular service.
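Here is a rough, hypothetical Java sketch of the Front Object idea: the guarded object and the policy check are separated, and the Door (not shown) would lead to the front, never directly to the server object. All class names, the callerId parameter and the ACL representation are invented for illustration.

    import java.util.Set;

    interface FileService {
        String read(String callerId, String path);
    }

    class FileServer implements FileService {             // the underlying server object
        public String read(String callerId, String path) { return "contents of " + path; }
    }

    class FileFrontObject implements FileService {        // the Door registered with the Nucleus leads here
        private final FileService target;
        private final Set<String> readers;                // stand-in for the ACL / policy

        FileFrontObject(FileService target, Set<String> readers) {
            this.target = target;
            this.readers = readers;
        }

        public String read(String callerId, String path) {
            if (!readers.contains(callerId)) {
                throw new SecurityException("ACL check failed for " + callerId);
            }
            return target.read(callerId, path);           // check passed: forward to the server object
        }
    }

Two different front objects wrapping the same FileServer, each with its own ACL and its own Door, would give two independently controlled views of one service, which is exactly the "multiple Front Objects" point above.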


L06a-Q7. Can you summarize the architecture of Spring OS and compare it to Liedtke's prescription of the Micro-Kernel architecture?
The key properties of the Spring Network Operating System are:
1. Object technology as a structuring mechanism.
2. Extensible through the Micro-Kernel Nucleus.
3. Strong interfaces using OMG's IDL.
4. A dynamic client-server relationship through the Subcontract mechanism.
5. Remote invocation on objects across the network happens:
   5a. transparently through the Network Proxy,
   5b. securely through the Front Object's use of ACLs/policies, and
   5c. efficiently through the Door PPC (Protected Procedure Calls).
(The Subcontract mechanism is explained in subsequent sections.)

Liedtke's Micro-Kernel architecture suggests that Threads, IPC and Address Space Memory should be contained in the Micro-Kernel. Spring OS's Nucleus contains only Threads and IPC, but the Spring Micro-Kernel contains the Nucleus and Virtual Memory Management, and thus the architecture of the Spring OS complies with Liedtke's architecture; it is just that things are named differently in Spring OS compared to Liedtke's terminology.
Mnemonic: uK = Nucleus (Threads run in Domains, IPC through Doors) + VMM.


L06a-Q8. OK, so now can you describe the remaining component of the Micro-Kernel: the Virtual Memory Management (VMM) in Spring OS?
1. The Virtual Memory Manager manages the linear address space by breaking it into sets of pages called Memory Regions.
2. Memory Regions are mapped to Memory Objects.
3. A Memory Object is an abstraction for a backing file on disk; it allows a Memory Region to be associated with the backing file on disk.
4. Backing files could be memory-mapped files or swap files on disk.
5. It is possible that multiple Memory Objects map to the same backing file on disk, and it is possible that multiple Memory Regions map to the same Memory Object.


L06a-Q9. Describe the virtual-to-physical mapping for Memory Objects of the VMM in Spring OS.
1. The Virtual Memory Manager manages the linear address space by breaking it into sets of pages called Memory Regions. Memory Regions are mapped to Memory Objects, which are then mapped to backing files on disk.
2. A Pager Object is responsible for bringing a Memory Object into DRAM from the associated backing file on disk. The Pager Object establishes the connection between virtual memory and physical memory.
3. The Pager Object creates a COR (Cached Object Representation) in DRAM for the Memory Object and maps the Memory Object to the COR in DRAM.
4. Different Pager Objects may be associated with each of the Memory Regions that correspond to a particular Memory Object.
5. All the associations between Memory Regions and Memory Objects can be created dynamically.
6. Coherence of the CORs (cached objects) is NOT provided by the Spring OS. Coherence is the responsibility of the Pager Objects, which must coordinate access.
7. External pagers (Pager Objects) establish the mapping between virtual memory (indicated by the Memory Objects) and physical memory (represented by the cached objects).
8. For a single linear address space, multiple Pager Objects can manage different Memory Regions of the address space.
9. This is the flexibility and power made available by the use of object technology in the Spring system.
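A minimal, hypothetical Java sketch of these relationships (region -> memory object -> pager -> cached pages). None of these class names are Spring's own, and real paging obviously happens in the kernel, not through maps of byte arrays; this is only to fix the shape of the associations in memory.

    import java.util.HashMap;
    import java.util.Map;

    class MemoryObject {                                   // abstraction of a backing file on disk
        final String backingFile;
        MemoryObject(String backingFile) { this.backingFile = backingFile; }
    }

    // Cached Object Representation: the pages of a memory object resident in DRAM.
    class CachedObject { final Map<Integer, byte[]> residentPages = new HashMap<>(); }

    // An external pager brings pages of a memory object into DRAM on demand.
    class PagerObject {
        private final Map<MemoryObject, CachedObject> cache = new HashMap<>();

        byte[] pageIn(MemoryObject mo, int pageNum) {
            CachedObject cor = cache.computeIfAbsent(mo, m -> new CachedObject());
            return cor.residentPages.computeIfAbsent(pageNum,
                    n -> readFromBackingFile(mo.backingFile, n));   // miss: fetch from the backing file
        }

        private byte[] readFromBackingFile(String file, int pageNum) {
            return new byte[4096];                         // placeholder for real disk I/O
        }
    }

    // A memory region of the linear address space is bound to a memory object and a pager.
    class MemoryRegion {
        final MemoryObject object;
        final PagerObject pager;
        MemoryRegion(MemoryObject object, PagerObject pager) { this.object = object; this.pager = pager; }
        byte[] resolveFault(int pageNum) { return pager.pageIn(object, pageNum); }
    }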


L06a-Q10. Can you summarize the key features of the Spring Network Operating System?
1. In the Spring Network Operating System, object technology is used as a system structuring mechanism in constructing a network operating system.
2. Spring uK = Nucleus (Threads run in Domains, IPC through Doors) + VMM.
3. Liedtke's prescription of the Micro-Kernel (uK) is accomplished by the combination of the Nucleus + address space management that is part of the Spring system's kernel boundary.
4. All services of the OS, including the file system, network communication, etc., are provided as objects that live outside the kernel.
5. The objects living outside the kernel are accessed through the Door abstraction.
6. Every Domain (i.e. a container or address space in which to run threads) has a Door Table containing DoorIDs that provide capabilities for accessing Doors of particular Domains. That is, Doors and the Door Table are the basis for cross-domain calls.
7. Object orientation and Network Proxies allow object invocation to be implemented as PPCs (Protected Procedure Calls) both on the same node and across machines.
8. The Virtual Memory Manager provides various features such as the linear address space, Memory Regions, Memory Objects, Pager Objects and CORs (Cached Object Representations).
9. To compare: Tornado uses a Clustered Object as an optimization for implementing services (e.g. singleton representation, multiple representations, etc.), whereas in the Spring Network Operating System, object technology permeates the entire OS design, because object technology is used as a system structuring mechanism and NOT just an optimization mechanism in constructing a network operating system.


L06a-Q11. How is the client-server relationship managed dynamically in the Spring NOS?
1. The client and the server are oblivious to where they are in the network. That is, the client-server interaction is freed from the physical locations of the clients and servers.
2. Client requests are dynamically routed to different, replicated servers, either based on physical proximity between the client and the server to reduce the distance, or based on server load to choose the least-loaded server.
3. A proxy cache can be used for the server to avoid load on the origin server. The decision to route a client to a cached copy is taken dynamically.
4. Thus, client requests are dynamically routed to replicated servers or proxy-cached servers to improve the overall load distribution.


L06a-Q12. What is the Subcontract mechanism?
1. The dynamic relationship between clients and servers is made possible through the secret sauce of the Subcontract mechanism.
2. The Subcontract mechanism is sort of like the real-life analogy of offloading work to a third party, i.e. giving a subcontract to somebody to get the work done.
3. The contract between the client and the server is established through the IDL (Interface Definition Language).
4. The Subcontract is the interface that is provided for realizing the IDL contract between the client and the server. The Subcontract mechanism is a mechanism to hide the runtime behavior of an object from its actual interface. All the details of how the client's IDL interface is satisfied are hidden in the details of the subcontract itself. This makes the generation of the client-side stub very simple.
5. The Subcontract lives under the covers of the IDL contract, and we can change the Subcontract at any time.
6. A Subcontract is something that we can discover and install at runtime. We can also dynamically load new subcontracts.
7. The client uses the server's IDL interface to make invocation calls on the server. An implementation of this IDL interface is accomplished through the Subcontract mechanism.
8. We can seamlessly add functionality to existing services using the Subcontract mechanism.
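A rough, hypothetical Java sketch of the idea: the generated stub only ever talks to a Subcontract interface, so the strategy behind it (singleton server, replicated servers, caching, a different transport) can be swapped at runtime without regenerating the stub. These interface and method names are illustrative, not Spring's API.

    // How the call actually reaches the server is entirely the subcontract's business.
    interface Subcontract {
        byte[] marshal(Object... args);                         // serialize the client's arguments
        byte[] invoke(String operation, byte[] request);        // ship the request, return the raw reply
        Object unmarshal(byte[] reply);                         // deserialize the reply for the client
    }

    // Client-side stub for an IDL-declared operation, e.g. balance().
    class AccountStub {
        private Subcontract subcontract;                        // can be discovered/replaced at runtime

        AccountStub(Subcontract sc)  { this.subcontract = sc; }
        void rebind(Subcontract sc)  { this.subcontract = sc; } // swap strategy without touching the stub

        double balance(String accountId) {
            byte[] request = subcontract.marshal(accountId);
            byte[] reply   = subcontract.invoke("balance", request);
            return (Double) subcontract.unmarshal(reply);
        }
    }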


L06a-Q13. How does the Subcontract interface for stubs work?
1. The dynamic relationship between clients and servers is made possible through the secret sauce of the Subcontract mechanism.
2. The Subcontract is the interface for realizing the IDL contract between the client and the server. The Subcontract hides the runtime behavior of an object from its interface.
3. The Subcontract lives under the covers of the IDL contract and can be changed at any time.
4. The Subcontract can be discovered, installed and dynamically loaded at runtime.
5. We can seamlessly add functionality to existing services using the Subcontract mechanism. We can get new instances of servers instantiated and advertise new services through the Subcontract mechanism so that clients can dynamically bind to new instances of servers, without changing anything in the client-side application or the client-side stub.
6. The client-side stub marshals the arguments from the client by making calls to the Subcontract mechanism. Once the arguments are marshaled, the client side makes the invocation and the Subcontract mechanism does the actual work.
7. Thus, the key properties of the Spring Network Operating System are:
   a. Spring NOS is micro-kernel based and NOT a monolithic OS.
   b. Spring NOS uses object technology as a structuring mechanism in constructing a network OS.
   c. Using object technology, Spring NOS provides strong interfaces and is open, flexible and extensible because of the micro-kernel architecture, where all OS services are provided through object mechanisms living on top of the micro-kernel.
   d. Clients and servers do NOT have to be location-aware.
   e. Object invocations across the network are handled through Network Proxies.
   f. The Subcontract mechanism allows the clients and servers to dynamically change the relationship in terms of whom they are talking to. See point 5 above.
8. The Subcontract mechanism invented as part of the Spring NOS forms the basis for Java RMI (Remote Method Invocation).
9. This is how Spring NOS retains the interface for backward compatibility with legacy applications but provides new powers and does innovation under the covers.


Illustrated Notes for L06b: Java RMI: Remote Method Invocation

L06b-Q1. What is the background context for the history of Java?
1. Java was originally invented as a language for use in embedded devices in the early 90s.
2. Java was invented by James Gosling at Sun Microsystems.
3. It was originally called Oak and was originally intended to be used in PDAs (Personal Digital Assistants).
4. Java was then targeted at programming set-top boxes for the video-on-demand industry, but that did not work out.
5. Java got a new life when the World Wide Web became popular, and Java's framework of supporting containment for Java applications applied well to the Web. Today, a lot of internet e-commerce depends heavily on the Java framework. This lesson will focus mainly on the distributed object model of the Java framework.

L06b-Q2. How is the Distributed Object Model used in the Java programming language?
1. The Java distributed object runtime system does, under the covers, all the heavy lifting that an application programmer would otherwise have to do when building a client-server system using RPC, such as:
   a. marshaling and unmarshaling of arguments,
   b. publishing the remote objects on the network for clients to access, etc.
2. The Subcontract mechanism of the Spring Network Operating System was, in some sense, the origin of Java RMI.
3. The Distributed Object Model of Java contains the following components:
   a. Remote Objects: objects that are accessible from different address spaces.
   b. Remote Interfaces: the declarations for the methods in a Remote Object.
   c. Failure Semantics: RMI exceptions that clients have to deal with.
   d. Similarity to the Local Object Model: object references are passed as parameters in object invocations.
   e. Difference from the Local Object Model: in the local object model, object references passed as invocation parameters are passed as pure references, i.e. if the client changes the object, the server will see the change. In the distributed object model, object references passed as invocation parameters are passed as value-result, i.e. if the client changes the object, the server will not see the change, because a copy of the object is actually sent to the invoked method.


L06b-Q3. Could you take an example to compare the local vs remote implementation?
1. Say we take the example of a Bank Account server having the APIs: Deposit(), Withdraw(), Balance().
2. There are 2 choices for implementation:

Choice 1: Reuse the local implementation:
a. Extend a local implementation of the Account class to implement the Bank Account.
b. Then use the built-in Remote interface to make the methods in the Bank Account visible remotely on the network.
c. Only the interface is visible to the client and NOT the actual implementation or the instantiated objects.
d. The actual location of the object is NOT visible to the client, and so the implementer has to do the heavy lifting of finding a way to make the location of the service (i.e. the instantiated objects) visible to clients on the network.

   Account Interface --extends--> Remote Interface
   Account Implementation --extends--> Account Class (local implementation)

Choice 2: Reuse the Remote Object class:
a. Extend the Remote interface so that the Account interface now becomes visible to any client that wants to access the object.
b. Extend the Remote Object class and Remote Server class in order to get the Account implementation object.
c. Now, when we instantiate the Account implementation object, it becomes magically visible to the network clients through the Java runtime system.

   Account Interface --extends--> Remote Interface
   Account Implementation --extends--> Remote Object Class / Remote Server Class
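Here is a minimal sketch of Choice 2 using the standard java.rmi API (the account operations are the lecture's running example; the concrete class names AccountImpl etc. are my own):

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.rmi.server.UnicastRemoteObject;

    // The Account interface extends Remote, so its methods are remotely visible.
    interface Account extends Remote {
        void deposit(float amount) throws RemoteException;
        void withdraw(float amount) throws RemoteException;
        float balance() throws RemoteException;
    }

    // Choice 2: the implementation extends a remote-object class (UnicastRemoteObject),
    // so the RMI runtime does the heavy lifting of exporting the instance to the network.
    class AccountImpl extends UnicastRemoteObject implements Account {
        private float balance = 0;

        AccountImpl() throws RemoteException { super(); }   // exports the object on construction

        public synchronized void deposit(float amount)  throws RemoteException { balance += amount; }
        public synchronized void withdraw(float amount) throws RemoteException { balance -= amount; }
        public synchronized float balance()             throws RemoteException { return balance; }
    }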


L06b-Q4. Which choice is preferable -- Choice 1 or Choice 2?
Choice 1: Reuse the local implementation: derive the service by extending the local implementation. In this case, we use the local implementation and use only the Remote interface to make the object instances remotely accessible. So the heavy lifting of making the object instances remotely accessible needs to be done by the implementer: NOT preferable. However, this approach has the advantage of providing fine-grained control over selective sharing of services.

Choice 2: Reuse the Remote Object class: derive the service by extending the remote implementation. The Java RMI system does the heavy lifting of making the server object instance visible to network clients, and hence this is the preferred way of building network services and making them available to remote clients anywhere on the network.


L06b-Q5. How does Java RMI work on the server side and client side?
On the server side, the server object is made visible on the network using a 3-step procedure:
1. Instantiate the object,
2. Create a URL, and
3. Bind the URL to the object instance created.
This allows clients to discover the existence of the new service on the network.

On the client side, any arbitrary client can easily discover and access the server object on the network using the following procedure:
1. Look up the service provider URL by contacting a bootstrap name server in the Java RMI system and get a local access point for that remote object on the client side.
2. Use the local access point for the remote object on the client side by simply calling the invocation methods, which look like normal procedure calls. The Java runtime system knows how to locate the server object in order to do the invocation. The client does NOT know or care about the location of the server object.
3. If there are failures in the execution of any of the methods (functions), then remote exceptions will be thrown by the server through the Java runtime system back to the client. A problem with remote exceptions is that the client may have no way of knowing at what point in the call invocation the failure happened.
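A minimal sketch of those two sides with the standard java.rmi.Naming bootstrap registry. The host name, service name and class names are made up, and the sketch assumes an RMI registry is already running on that host (e.g. started with LocateRegistry.createRegistry).

    import java.rmi.Naming;

    // Server side: instantiate, create a URL, bind the URL to the instance.
    class BankServer {
        public static void main(String[] args) throws Exception {
            Account account = new AccountImpl();                      // 1. instantiate (from the earlier sketch)
            String url = "rmi://bank.example.com/AccountService";     // 2. create a URL
            Naming.rebind(url, account);                              // 3. bind URL -> object instance
        }
    }

    // Client side: look up the URL, then invoke methods as if they were local calls.
    class BankClient {
        public static void main(String[] args) throws Exception {
            Account account = (Account) Naming.lookup("rmi://bank.example.com/AccountService");
            account.deposit(100.0f);
            System.out.println("balance = " + account.balance());     // a RemoteException may surface on failure
        }
    }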


L06b-Q6. How is the Java RMI functionality actually implemented?
The Java RMI functionality is implemented using the RRL: the Remote Reference Layer.
1. The client-side stub initiates the remote method invocation call, which causes the RRL to marshal the arguments in order to send them over the network; when the reply is received back from the server, the RRL unmarshals the results for the client.
2. Similarly, the server-side skeleton uses the RRL to unmarshal the arguments from the client message, makes the call to the server implementing the remote object, and then marshals the results from the server into a message to be sent to the client.

Marshaling and unmarshaling are also called serializing and deserializing Java objects and are done by the RRL. The RRL is similar to the Subcontract mechanism in the Spring Network Operating System; Java RMI derives a lot from the Subcontract mechanism.

To summarize, the Remote Reference Layer (RRL) does the following:
1. The RRL hides the details/location of the server (whether it is a replicated or a singleton server, etc.).
2. The RRL supports different transport protocols between the client and the server.
3. The RRL marshals/serializes information to be sent across the network.
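Serialization itself is plain Java: any argument or result object that implements java.io.Serializable can be flattened into bytes and rebuilt on the other side as a copy. A small self-contained illustration (this is the standard library mechanism the RRL relies on, not RMI internals; the Money class is made up):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    class Money implements Serializable {
        final String currency;
        final double amount;
        Money(String currency, double amount) { this.currency = currency; this.amount = amount; }
    }

    class SerializationDemo {
        public static void main(String[] args) throws Exception {
            // Marshal (serialize) the object into bytes, as the RRL would for a call argument.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new ObjectOutputStream(bytes).writeObject(new Money("USD", 42.0));

            // Unmarshal (deserialize) on the receiving side; note this is a copy, not the original object,
            // which is why remote parameters have value-result rather than pure-reference semantics.
            ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            Money copy = (Money) in.readObject();
            System.out.println(copy.currency + " " + copy.amount);
        }
    }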


L06b-Q7. How is the RMI Transport Layer implemented?
The RMI transport layer provides the following 4 abstractions:
1. Endpoint: a protection domain or sandbox, such as a JVM, for executing a server call or client call within the sandbox. The endpoint has a table of remote objects that it can access (similar to the Door Table in the Spring NOS).
2. Transport: The RRL decides which transport to use, TCP or UDP, and gives that command to the transport layer. The transport listens on a channel for incoming connections and is responsible for locating the dispatcher that invokes the remote method. The transport mechanism sits below the RRL and allows the object invocations to happen through the transport layer.
3. Channel: The type of the transport decides the type of the channel: a TCP or UDP channel. Two endpoints make a connection on the channel and do I/O using the connection on the channel.
4. Connection: Connection management is part of the transport layer and is concerned with:
   a. Setting up and listening for client connections, establishing the connections, locating the dispatcher for a remote method, and tearing down the connections.
   b. Liveness monitoring, which is part of connection management and is typically done via periodic heartbeats.
   c. The RRL decides the correct transport mechanism, TCP or UDP, and gives that command to the transport layer.

Thus, the Distributed Object Model of Java is a powerful vehicle for constructing network services:
1. It dynamically decides how to set up the client-server relationship.
2. It provides flexible connection management in choosing different transports, depending on network conditions, client-server locations, etc.

Some subtle issues in the implementation of the Java RMI system are:
1. Distributed garbage collection,
2. Dynamic loading of stubs on the client side, and
3. Sophisticated sandboxing mechanisms on the client side and server side to ward off security threats.

Key insight: Many raw ideas that start out as research become usable technology when the time is ripe.


Illustrated Notes for L06c: EJB: Enterprise Java Beans

L06c-Q1. What is a "Java Bean"? What are some challenges for different enterprises to provide a common service together?
A Java Bean is a "reusable software component" that bundles many Java objects so that the Java Bean can be passed around easily from one application to another for reuse.

Some of the challenges for different enterprises in providing a common service together are to maintain and evolve:
1. Service interoperability and compatibility,
2. Service scalability,
3. Service reliability, and
4. Service cost of operation.

Such cross-enterprise services, like an airline reservation system, Gmail, an internet search engine, etc., are referred to as Giant Scale Services (GSS). In later sections, we will see how object technology facilitates structuring of an operating system at different levels and also how object technology facilitates structuring of distributed services, providing customers various options based on cost, convenience, guarantees, etc. Object technology also handles resource conflicts that might occur between simultaneous requests across space and time coming from several different clients. The main benefit of object technology is the power of reuse of components.

(Figure: Intra-Enterprise View and Inter-Enterprise View.)
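For reference, a plain JavaBean in its classic form is just a serializable class with a no-argument constructor and getter/setter properties, which is what makes it easy to pass around and reuse. A minimal, made-up example:

    import java.io.Serializable;

    // A plain JavaBean: serializable, no-arg constructor, private state exposed via get/set.
    public class CustomerBean implements Serializable {
        private String name;
        private String email;

        public CustomerBean() { }                          // required no-argument constructor

        public String getName()  { return name; }
        public void setName(String name) { this.name = name; }

        public String getEmail() { return email; }
        public void setEmail(String email) { this.email = email; }
    }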


L06c-Q2. What is an example of such a cross-enterprise application?
An example of such a cross-enterprise application is purchasing a round-trip airline ticket from Atlanta (US) to Chennai (India) at a web-site like Expedia. Expedia contacts various airline web-sites running in different enterprises and provides you with the best options among the various choices possible. You take your own time and decide which option to pick, based on cost/convenience/guarantees. You talk to your spouse or relatives and then decide which option you want. Meanwhile, another customer is planning a very similar trip at around the same time as you, but an airline seat can be given to only one customer. So there are resource conflicts that might occur between simultaneous requests across space and time coming from different clients.

In such cases, object technology facilitates structuring of distributed services, providing customers various options based on cost, convenience, guarantees, etc. Object technology also handles resource conflicts that might occur between simultaneous requests across space and time coming from several different clients. The main benefit of object technology is the power of reuse of components.


L06c-Q3. What are N-tier applications?
Distributed systems applications like airline reservation systems, hotel booking web-sites, etc., are called N-tier applications because the software stack of an application comprises several different layers:
1. Presentation Layer: responsible for painting the browser screen and generating the web page based on your request.
2. Application Layer: responsible for the application logic that corresponds to what the service is providing.
3. Business Logic Layer: corresponds to the way airfares are decided, seats are allocated, etc.
4. Database Layer: accesses the database that contains the information that is queried and updated based on the user request.

So, in general, N-tier applications are applications that are separated into multiple tiers [1] [2]. N-tier applications are also called "multi-tier applications" or "distributed applications" [2]. The letter 'n' just refers to some number, "n", of tiers (layers) possible in the n-tier architecture.
References:
[1] https://en.wikipedia.org/wiki/Multitier_architecture
[2] https://msdn.microsoft.com/en-us/library/bb384398.aspx
[3] http://searchnetworking.techtarget.com/definition/n-tier

Various issues handled by N-tier applications are:
1. Persistence: store the data so that the data persists even when electrical power is lost and restored.
2. Transactions: either the data is written completely or not written at all: all or none, no partial writes.
3. Caching: cache the data at various layers, since it is faster to read from cache/DRAM than from disk.
4. Clustering: cluster a set of related services in order to improve the service performance.
5. Security: ensure that financial/personal data sent over insecure links remains secure.
6. Concurrency: exploit concurrency across several simultaneous incoming requests, e.g. simultaneously check the availability of seats on different airlines in parallel for a particular date. Such applications are called embarrassingly parallel (aka enchantingly parallel) applications, since the sub-operations are independent of each other.
7. Reuse of components: portions of application logic are reused in components in order to service simultaneous requests from several different clients.


L06c-Q4. How are N-tier applications structured?
1. To describe the structure of N-tier applications, we will use a particular framework called the JEE framework. JEE stands for Java Enterprise Edition. JEE was a rebranding of J2EE when Sun Microsystems released J2EE v1.5 (in 2006). That is, JEE was originally called J2EE, and the modern name is JEE.
2. A Java Bean is a unit of reuse and contains a bundle of Java objects that provide a specific functionality, e.g. a Java Bean may provide the shopping cart functionality. A Container is a protection domain implemented in a Java Virtual Machine (JVM); it packages and hosts a related collection of Java Beans to provide higher-level functionality. An application service is constructed by using multiple containers, typically present on different servers and used in a distributed manner.
Mnemonic: Java Objects -> Java Beans -> Containers -> Application Service.
Note: Think of a Java Bean as a C library of function objects.

The JEE framework has 4 containers (C-A-W-E) for constructing an application service:
1. Client Container: resides on a web server and interacts with the client browser (together with the Applet Container).
2. Applet Container: resides on a web server and interacts with the client browser.
3. Web Container: contains the presentation logic and is responsible for creating the web pages to be sent back to the client browser.
4. EJB Container: manages the business logic that decides what needs to be done to carry out the client browser request, and communicates with the database server to read/write data corresponding to the client browser request that came in.


L06c-Q5. What are the different types of Java Beans?
There are 3 types of Java Beans (E-S-M):

1. Entity Bean: a persistent object with a primary key so that it can be easily retrieved from a database. E.g. an Entity Bean may represent a row of a database. Two types of persistence are possible:
   a. Bean-managed Persistence: persistence managed by the Bean itself.
   b. Container-managed Persistence: persistence managed by the Container.

2. Session Bean: a bean associated with a client-server session. There are two types of Session Beans:
   a. Stateful Session Bean: remembers the state associated with the session, e.g. remembers the shopping choices put in a shopping cart for a shopping session that lasts multiple days before the items are bought, i.e. the state is remembered across multiple sessions.
   b. Stateless Session Bean: the state is thrown away at the end of each session, e.g. a Gmail session.

3. Message-driven Bean: useful for asynchronous behavior, like receiving messages of interest that are typically event-driven, e.g. stock ticker information, newsfeeds, RSS feeds, etc.

Note that each Java Bean type denotes a particular functionality. Each Java Bean can be constructed in 2 forms, based on the granularity level:
1. Fine-grained version/form of the Java Bean: provides more concurrency in dealing with the individual requests handled by the application server, so that it can handle many concurrent requests simultaneously. The tradeoff/drawback of choosing the fine-grained level of granularity is that the business logic becomes more complex.
2. Coarse-grained version/form of the Java Bean: provides less concurrency, but helps keep the business logic simple.

Thus, the tradeoff in structuring n-tier applications is to choose either:
Fine-grained granularity: to get more concurrency, but make the business logic complex, OR
Coarse-grained granularity: to get less concurrency, but keep the business logic simple.


L06c-Q6. What are the design alternatives for structuring n-tier application servers?
There are 3 design alternatives for structuring n-tier application servers:
1. Coarse-grained Session Beans,
2. Fine-grained Data Access Objects using Entity Beans, and
3. Session Beans with Entity Beans.

The Client Container and the Applet Container live in the web server, so we will not consider them; instead we consider only the Web Container and the EJB Container in the design alternatives below. Recall that the Web Container contains the presentation logic and the EJB Container contains the business logic. A Servlet corresponds to an individual session with a particular client.

Design Alternative 1: Coarse-grained Session Beans:
1. A coarse-grained Session Bean is associated with each Servlet and serves the needs of a client. Each client is associated with one session. So Client 1 connects to Servlet 1, which connects to Session Bean 1, which connects to the database; similarly, Client 2 connects to Servlet 2, which connects to Session Bean 2, which connects to the database.
2. Pros:
   a. Minimal container services are needed from the EJB Container. The EJB Container coordinates concurrent independent sessions.
   b. The business logic is confined to, and NOT exposed beyond, the corporate network, since the business logic is contained in the EJB Container and NOT the Web Container.
3. Cons:
   a. The application structure is akin to a monolithic kernel.
   b. There is very limited concurrency for accessing different parts of the database in parallel, and hence the coarse-grained bean structure represents a lost opportunity for exploiting parallelism.


L06c-Q7. What is the 2nd design alternative?
Design Alternative 2: Fine-grained Data Access Objects using Entity Beans:
1. The business logic is pushed into the Web Container, alongside the Servlet and the presentation logic, so as to have a 3-tier software structure of Servlet, presentation logic and business logic.
2. All data access happens through Entity Beans, which have persistence characteristics. That is, the Data Access Object (DAO) is implemented using Entity Beans. Recall: Entity Beans can have Container-Managed Persistence OR Bean-Managed Persistence. An Entity Bean can represent the granularity of either one row of a database or a set of rows. Multiple Entity Beans can work in parallel for a single client-server session. The EJB Container contains these Entity Beans.
3. Pros:
   a. There is an opportunity for the Entity Bean to cluster the requests from different clients and amortize access to the database server across several different client requests that temporally happen at the same time.
   b. The granularity of the Data Access Object (DAO) determines the level of concurrency desired in constructing the application service: this is reuse of available facilities.
4. Cons: The business logic is exposed outside the corporate network.


L06c-Q8. What is the 3rd design alternative?
Design Alternative 3: Session Beans with Entity Beans:
1. The Web Container contains only the Servlet and the presentation logic associated with the Servlet.
2. The business logic sits along with the Session Façade and the Entity Beans in the EJB Container.
3. A Session Façade is associated with each client session. The Session Façade handles all the data access needs of its associated business logic, i.e. the Session Bean is put in front as a Session Façade to access the DAOs (Data Access Objects). The DAOs are implemented using multiple Entity Beans (having CMP/BMP) so that we get concurrency and can amortize data accesses across different client requests.
4. The Session Bean communicates with the Entity Beans using Java RMI or local interfaces. Using local interfaces makes the communication faster, since no network communication is used, whereas using RMI makes the communication flexible enough to be used anywhere in the network.
5. Pros:
   a. No network communication between the business logic and the Entity Beans (when local interfaces are used).
   b. The business logic is confined to, and NOT exposed beyond, the corporate network.
   c. There is an opportunity for the Entity Bean to cluster the requests from different clients and amortize access to the database server across several different client requests that temporally happen at the same time.
   d. The granularity of the Data Access Object (DAO) determines the level of concurrency desired in constructing the application service: this is reuse of available facilities.

Note: EJB allows developers to write business logic without having to worry about cross-cutting concerns like security, logging, persistence, etc. This is the power of using object technology in structuring complex n-tier application servers.
The video on the conclusion is here.
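A rough sketch of the Session Façade pattern in EJB 3 / JPA style (the lecture's Entity Beans correspond roughly to JPA entities in this newer stack; the OrderFacade and OrderLine names are invented, and the annotations come from javax.ejb and javax.persistence):

    // OrderLine.java -- a persistent entity standing in for the lecture's Entity Bean.
    import javax.persistence.Entity;
    import javax.persistence.GeneratedValue;
    import javax.persistence.Id;

    @Entity
    public class OrderLine {
        @Id @GeneratedValue
        private Long id;
        private String customerId;
        private String itemId;

        protected OrderLine() { }                          // JPA requires a no-arg constructor
        public OrderLine(String customerId, String itemId) {
            this.customerId = customerId;
            this.itemId = itemId;
        }
    }

    // OrderFacade.java -- the session façade: one coarse-grained entry point per use case;
    // the client never touches the entity/DAO layer directly, so business logic stays behind it.
    import java.util.List;
    import javax.ejb.Stateless;
    import javax.persistence.EntityManager;
    import javax.persistence.PersistenceContext;

    @Stateless
    public class OrderFacade {
        @PersistenceContext
        private EntityManager em;                          // container-managed persistence

        public void placeOrder(String customerId, List<String> itemIds) {
            for (String itemId : itemIds) {
                em.persist(new OrderLine(customerId, itemId));   // data access stays inside the façade
            }
        }
    }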


Illustrated Notes for L07a: GMS: Global Memory System

L07a-Q1. What are some key insights based on the course material so far? What's next?
Some key insights based on the course material so far are as follows:
1. Object technology, with its innate concepts of inheritance, composition and reuse, helps in structuring distributed services at different levels of an application service.
2. Technological innovation happens when one looks beyond the obvious and immediate horizon. Often this happens in academia, because academics are NOT bound by market pressures or compliance with existing product lines. This encourages out-of-the-box thinking, which makes innovation possible.
3. History is ripe with examples where the byproducts of a thought experiment have a more lasting impact than the original vision behind the thought experiment. For example, Java would NOT have happened but for the failed video-on-demand trials of the 90s. Many failed attempts offer technological insights that are the reusable products of those thought experiments.

The lesson outline is given in the slide below. We can think of these lessons as organized around different ways of using the memory of peer machines in a cluster:
1. GMS: Global Memory System: use of cluster memory for paging (~Airbnb for memory!).
2. DSM: Distributed Shared Memory: use of cluster memory for shared memory.
3. DFS: Distributed File System: use of cluster memory for cooperative caching of files.


L07a-Q2. Why does GMS use cluster memory for paging and NOT the local disk?
1. The virtual address space of a process in an operating system is much larger than the physical memory that is allocated for a particular process.
2. The working set of a process is defined as the portion of the virtual address space of the process that is actually present in physical memory, even though the Virtual Memory Manager component of an operating system gives the process the illusion that all of its virtual address space is contained in physical memory, by paging in and out from the disk the pages that are frequently accessed by the process. That is, the working set of a process is always contained in physical memory.
3. Memory pressure on a particular node is the amount of physical memory needed to keep the working sets of all the processes in physical memory. If the memory pressure is low, then the physical memory available is sufficient to hold the working sets of all processes in physical memory. If the memory pressure is high, then the physical memory available is NOT sufficient to hold the working sets of all processes in physical memory. So, when multiple nodes are connected on a LAN, the memory pressure on each node will be different, since the characteristics of the processes and the workload on different nodes will be different.
4. If some nodes are idle and some nodes are loaded/busy, here is an idea! Can we use the idle cluster memory of a peer node for paging in and out the working set of processes on a busy node?
5. You may wonder: what is the benefit of doing this? The benefit of paging in and out to peer cluster memory is that accessing remote memory is faster than accessing the local disk. To give you some typical performance numbers, accessing a local disk can take around 10 milliseconds, whereas accessing remote memory can take around 100 microseconds. FYI: if we use specialized network cards with Infiniband and RDMA protocols, accessing remote memory can take around 20 microseconds or far less! Accessing a spinning disk involves seek latency and rotational latency, and so the overall transfer rates are around 200 MegaBytes/second, whereas accessing remote memory through Gigabit Ethernet cards can range from 100 MegaBytes/second to around 5 GigaBytes/second (using RDMA). Think about it!!


L07a-Q3. How does GMS use cluster memory for paging and NOT the local disk?
1. GMS uses cluster memory across the network for paging, and NOT the local disk, because accessing remote memory is faster than accessing the local disk.
2. A page fault happens when the virtual-to-physical memory address translation fails because the physical page is NOT in physical memory. Usually, on a page fault, the normal Virtual Memory Manager would page in the page from disk so that the virtual-to-physical address translation can succeed.
3. In GMS, on a page fault, GMS checks the remote cluster memory of peer nodes for the page it is looking for before it checks the local disk, because accessing remote memory is faster than accessing the local disk.
4. In other words, GMS integrates cluster memory into the normal memory hierarchy of processor caches and DRAM. The normal memory hierarchy means that cached data is first looked for in the processor caches; if not found there, DRAM is checked; and if not found there, the disk is checked. This is the memory hierarchy of processor cache, followed by DRAM, followed by disk. GMS integrates cluster memory into this memory hierarchy, so that it becomes: processor cache, followed by DRAM, followed by cluster memory, followed by disk. In short, GMS cluster memory serves as yet another level in the memory hierarchy, and GMS trades network communication for disk I/O.
5. One point to note about GMS is that it is used only for reads and NOT for writes. All writes happen locally to disk, and so there is never a scenario where a dirty page is present in remote memory and the remote node crashes, causing us to lose data. The only pages that can be in cluster memories are NON-dirty (clean), paged-out pages.
6. So, the big picture is: when the GMS Virtual Memory Manager decides to evict a page from physical memory to make room for the current working set of processes on a node, then the GMS VMM, instead of swapping out to the local disk, goes out on the network, finds an idle peer node, and puts that non-dirty (clean), paged-out (aka evicted) page in peer memory for later retrieval (page-in).
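A toy sketch of that lookup order on a page fault, written as Java pseudocode (GMS itself lives inside the OS kernel; none of these method names come from the paper, and the bodies are placeholders):

    // On a page fault the page is not in local DRAM, so GMS tries the cluster first, then disk.
    class GmsPageFaultHandler {
        byte[] handleFault(long pageId) {
            byte[] page = lookupInPeerGlobalMemory(pageId);   // remote DRAM: roughly 100 microseconds
            if (page != null) return page;
            return readFromLocalDisk(pageId);                 // spinning disk: roughly 10 milliseconds
        }

        private byte[] lookupInPeerGlobalMemory(long pageId) { return null;            /* placeholder for a network lookup */ }
        private byte[] readFromLocalDisk(long pageId)        { return new byte[4096];  /* placeholder for disk I/O */ }
    }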


L07a-Q4. What are the basic principles of GMS: Global Memory System?
1. In GMS, "cache" refers to physical memory, i.e. DRAM, and NOT the CPU/processor L1/L2 caches. A page can be in one of 2 states:
   a. Shared page: the same page is copied into the physical memory of multiple nodes and used by an application that spans multiple nodes of a cluster (e.g. an Oracle RAC application).
   b. Private page: the page is present only in the physical memory of the local node and is NOT present on any other cluster node, since the application using that page runs only on that node.
2. Page faults on a particular node are handled by the community of peer cluster nodes, such that the memories of the peer cluster nodes serve as a supplement to the local disk of that node.
3. The physical memory of a particular node is split into 2 sections:
   a. Local memory: contains the working sets of the local processes on that node. Local memory contains private pages and/or shared pages.
   b. Global memory: spare memory offered to peer cluster nodes (community service by the node). Global memory contains ONLY private pages swapped out by peer cluster nodes on the network. Thus, if a page is in global memory, it is guaranteed to be private and cannot be a shared page.
   The split between local memory and global memory on a particular node is DYNAMIC, in response to the memory pressure on that node.
4. The memory pressure on a particular node is NOT constant and varies with time. If the memory pressure on the node requires more local memory to hold the working sets of the local processes, then the local memory portion increases. On the other hand, if the node is idle (say, the user of that node is out for coffee or lunch), then the local memory portion may shrink, and the node can allow more of the peer nodes' swapped-out pages to be present in the global portion of its physical memory, thereby increasing the global portion.
   In short: more memory needed to hold the working sets of local processes => Local UP, Global DOWN; less memory needed to hold the working sets of local processes => Local DOWN, Global UP. This boundary/split between the local and global memory portions is DYNAMIC and keeps shifting on each node, depending on what is going on on the various nodes in the cluster.
5. Thus, the whole idea of GMS is to serve as a remote paging facility. However, coherence for multiple copies of a page is NOT a GMS problem; it is an application problem. GMS chooses the globally least recently used (global LRU) page as its page replacement policy, and so GMS has to manage age information for the pages across all the nodes in the cluster.

L07a-Q5. Case 1: What happens in the scenario that a Page Fault happens on node P? Let’s recall a few important and foundational points: a. Local Memory contains Private Pages and/or Shared Pages. b. Global Memory contains ONLY Private Pages swapped-out by peer cluster nodes on network. c. Local Memory + Global Memory = Physical Memory (DRAM) of a node. d. A Page Fault on a node means that a process on that node needs a page that is NOT present in its Working Set of memory. e. A Page Fault always happens ONLY for a Local Memory page and NEVER for a Global Memory page because only the Local Memory is used for the Working Set of processes and the Global Memory is used only for the community service of housing non-dirty, paged-out pages of peer nodes. (~AirbnB or Uber for Memory! :-) 1. In figure below, memory on hosts P and Q is split into Local Memory and Global Memory each. The Split between Local Memory and Global Memory on a particular node is DYNAMIC in response to memory pressure on that particular node. 2. Say, a Page Fault of Page X happens on node P. 3. GMS searches the Global Memory of various cluster nodes and finds Page X in Global Memory of node Q. 4. GMS copies Page X from Global Memory of node Q into Local Memory of node P. 5. This new copy of Page X in Local Memory of node P increases Local Memory size by 1 (++). 6. But since Local Memory + Global Memory = Physical Memory (DRAM), which is constant, Global Memory needs to reduce by 1 page. 7. So, GMS picks the oldest page in Global Memory of node P, copies this oldest page in Global Memory of node P to Global Memory of node Q, and decreases the Global Memory size by 1 (--). 8. To summarize, Page Fault of Page X on node P increases Memory Pressure on node P, and on node P: Local Memory size ++, Global Memory size --, BoundaryP moves DOWN, and on node Q: Page Y traded (obtained) for (given away) Page X. BoundaryQ UNCHANGED.

L07a-Q6. Case 2: When a Page Fault happens on node P with its Global Memory = 0? 1. Let’s assume that CASE 1 keeps on happening repeatedly on node P and Local Memory keeps on ++ and Global Memory keeps on --, until Global Memory becomes 0 and Local Memory becomes = Physical Memory, i.e. there is no community service on node P. Case 2 is this common case of too much memory pressure on node P with 0 community service. 2. Now, if a Page Fault happens on node P, then there is no option on node P except to throw out some page from the working set of processes on node P in order to make room for this new page that needs to be got from memory of a peer node. 3. The page on node P that is chosen for replacement is called victim/replacement/evicted page. The victim/replacement/evicted page is usually the LRU page (Least Recently Used page). 4. So, for Case 2, the boundary on both nodes P and Q remains UNCHANGED because on node P: Global Memory is already 0, Local Memory remains = Physical Memory, and Page X traded (obtained) for (given away) Page Y. BoundaryP UNCHANGED. on node Q: Page Y traded (obtained) for (given away) Page X. BoundaryQ UNCHANGED. Remember that the Global Cache of every cluster node acts as a surrogate for local disk because accessing peer memory is faster than accessing local disk.

L07a-Q7. Case 3: What happens when none of the peer nodes have the faulted page of node P?
1. When none of the peer nodes have the faulted page of node P, the only option is to get the faulted page from the local disk on node P – this is slower than getting the page from the memory of a peer node but there is no other option. Note that the figure below shows only one disk, but in reality, each node has its own local disk. Also, the figure has been modified and so compare with the original version to notice the edits. NOTE: This is different from the lecture videos. Please double-check with Prof/TAs.
2. So, a Page Fault of a Page X in Local Memory on node P causes a read of Page X from local disk. This increases Local Memory size by 1. Since Local + Global Memory = Physical (constant), this decreases Global Memory size by 1, i.e. LocalMemSize++ and GlobalMemSize--.
3. GMS picks (evicts) any page Y from Global Memory of node P, copies it to a peer node R and decreases node P’s Global Memory size by 1. Hence, on node P: Local Memory size ++, Global Memory size --, BoundaryP moves DOWN. Note that node R is the node chosen by GMS that has the globally oldest page in the entire cluster.
4. Now, on node R, there are 2 possibilities:
a. Page Y from Global Memory of node P is copied into Global Memory of node R. We know that Global Memory has ONLY CLEAN (not updated, non-dirty) pages, and hence the earlier data in the page being replaced is simply discarded (thrown away). Thus, on node R: Evicted Page Z discarded & replaced by Page Y. BoundaryR UNCHANGED. OR
b. Page Y from Global Memory of node P is copied into Local Memory of node R. Now, there are 2 more possibilities: Page Z in Local Memory of node R can be DIRTY OR CLEAN.
If Page Z is DIRTY: on node R: Dirty Page Z swapped to disk & replaced by Page Y. BoundaryR UNCHANGED.
If Page Z is CLEAN: on node R: Clean Page Z discarded & replaced by Page Y. BoundaryR UNCHANGED.
NOTE: This is different from the lecture videos. Please double-check with Prof/TAs.

L07a-Q8. Case 4: What happens when the faulted page of node P is a Shared Page? 1. Consider the case when a Page X is shared across nodes P and Q, but is currently present in the Working Set of a process on node Q only and NOT on node P. Now, consider the scenario when a process on node P page-faults the Shared Page X. 2. GMS searches for Shared Page X in the peer cluster memory, finds it on node Q and copies the Shared Page X from node Q to node P. However, since Page X is a Shared Page, GMS leaves the Shared Page X in the Working Set of a process on node Q. Now, the same Shared Page X is present in the Local Memory of both nodes P and Q. So, the total memory pressure in the cluster increases by 1 and eventually one page from one of the cluster nodes will need to be swapped out to disk. 3. Increasing Local Memory size by 1 on node P causes decrease of its Global Memory size by 1. On node P: Local Memory size ++, Global Memory size --, BoundaryP moves DOWN. Any page from Global Memory of node P is chosen as the replacement/evicted page and copied to a node, say node R, that has the Globally Oldest Page. Note: We choose any page from Global Memory of node P and NOT LRU page because all pages in Global Memory are clean, peer pages and NOT used locally on node P. 4. Now, on node R, there are 2 possibilities: a. Page Y from Global Memory of node P is copied into Global Memory of node R. We know that Global Memory has ONLY CLEAN (not updated, non-dirty) pages, and hence the earlier data in page being replaced is simply discarded (thrown away). Thus, on node R: Evicted Page Z discarded & replaced by Page Y. BoundaryR UNCHANGED. OR b. Page Y from Global Memory of node P is copied into Local Memory of node R. Now, there are 2 more possibilities: Page Z in Local Memory of node R can be DIRTY OR CLEAN. If Page Z is DIRTY: on node R: Dirty Page Z swapped to disk & replaced by Page Y. BoundaryR UNCHANGED. If Page Z is CLEAN: on node R: Clean Page Z discarded & replaced by Page Y. BoundaryR UNCHANGED. NOTE: This is different from the lecture videos. Please double-check with Prof/TAs.

L07a-Q9. What happens if a node remains idle for a long time-period? Good question! If a node remains idle for a long time-period, its Working Set is NOT utilized locally on that node and hence will be replaced by paged-out pages from peer cluster nodes. Eventually, a completely idle node becomes a memory server for peer cluster nodes. Thus, the Local-Global Memory split/boundary is NOT Static, but is Dynamic, depending on what Local Memory Pressure exists on that node. In summary,
Idle Node => Boundary moves UP (more community service).
Busy Node => Boundary moves DOWN (less community service).

L07a-Q10. Any background information required to understand future material? Yes, understanding the background information below will help: 1. A file on a disk can be “memory mapped” to a process such that any access to the file can be performed using a memory access to a portion of memory in the process that maps to the file instead of doing explicit read() and write() calls to access the file. 2. A TLB is a fast memory cache on a processor for virtual to physical memory address translation. 3. There are 4 possible states for a page on a cluster node: a. Local-Private: Private page in Local Memory portion, b. Local-Shared: Shared page in Local Memory portion, c. Global-Private: Clean, swapped-out Private page in Global Memory portion, and d. Swapped-out onto disk and hence NOT in physical memory.
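To make point 1 concrete, here is a small C sketch of memory-mapping a file with the POSIX mmap() call; the file name "data.bin" and the single-byte access are made up purely for illustration.

/* Sketch: accessing a file through memory-mapping instead of read()/write().
 * POSIX-only; "data.bin" is just an illustrative file name. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the whole file into the process address space. */
    char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* A plain memory load now stands in for an explicit read() ... */
    printf("first byte: %c\n", p[0]);

    /* ... and a plain store stands in for an explicit write().
     * The first touch of an unmapped page causes a page fault, which the
     * OS services by bringing the file page into physical memory. */
    p[0] = 'X';

    munmap(p, st.st_size);
    close(fd);
    return 0;
}

A store through the mapping may touch a page that is not yet in physical memory, which raises exactly the kind of page fault that GMS intercepts.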

L07a-Q11. What is the design philosophy for Page Age management of the various cluster pages? Geriatrics: means relating to old people, e.g. a rest home for geriatrics. Here: it is Page Age! (NOT the Age of Larry Page or Ellen Page :-)

1. Geriatrics in GMS relates to page age management, some examples of which are:
a. identifying the Globally Oldest Page in the cluster,
b. ensuring that the age management work is distributed well across all cluster nodes and does not burden any one node,
c. ensuring that age management work shifts over time across all cluster nodes and does not burden any one node, etc.
Thus, the page age management is broken across the SPACE AXIS and TIME AXIS to have Distributed Page Age Management to avoid Over-Burdening any one node.
2. Across the Time Axis, the page age management is broken into what is called Epochs. An Epoch is a “granularity” of page age management work done by a particular node. The page age management work is either:
a. Time-bound (the node does the page age management work for a maximum time duration T) OR
b. Space-bound (the node does the page age management work for a maximum of M replacements)
In other words, after max T duration OR max M replacements, the current epoch is complete, and so, go to a new epoch, and pick a new node as the new manager for page age management.
Distribute the Page Age Management work over multiple cluster nodes (SPACE AXIS) and Shift the Page Age Management work over time onto different cluster nodes (TIME AXIS) so that no single node in the distributed system becomes Over-Burdened with this work.
T may be in the order of a few seconds and M may be in the order of thousands of replacements.
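As a rough illustration of the time-bound OR space-bound rule, here is a tiny hedged C sketch; the constants, struct and helper names are invented for illustration and are not GMS code.

/* Sketch: an epoch ends after T seconds OR after M replacements,
 * whichever comes first.  T, M and the helpers are illustrative only. */
#include <stdbool.h>
#include <time.h>

#define EPOCH_T_SECONDS      10      /* "a few seconds"               */
#define EPOCH_M_REPLACEMENTS 1000    /* "thousands of replacements"   */

struct epoch {
    time_t start;
    long   replacements;
};

static bool epoch_is_over(const struct epoch *e) {
    bool time_bound  = (time(NULL) - e->start) >= EPOCH_T_SECONDS;
    bool space_bound = e->replacements >= EPOCH_M_REPLACEMENTS;
    return time_bound || space_bound;   /* then pick a new initiator */
}

int main(void) {
    struct epoch e = { time(NULL), 0 };
    return epoch_is_over(&e) ? 1 : 0;
}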

L07a-Q12. How is the Manager for Page Age management chosen? The Manager is chosen in an interesting way. Pay close attention for the next 2 pages!
1. The Manager for Page Age management is called an Initiator. A new Initiator is chosen for every Epoch, at the start of every new Epoch. Recall: An Epoch is a “granularity” of page age management work done by a particular node. Note: Higher the page age, Older the page, & more likely to be unused and chosen for eviction.
2. Every node sends the Page Age information for all Local and Global Memory Pages {Lp, … Gp} to the Initiator (Manager) node for the current Epoch.
3. Let’s say that out of all the pages in the cluster, the Initiator decides to do M replacements (evictions) of old pages in the cluster. That is, if all the pages are sorted by Age, starting from the newest to the oldest page, then the list looks like:
   newest --------------------------------------------------------> oldest
   |  Active Pages (Age < MinAge)  |  M replacements (Age >= MinAge)  |
                                   ^ MinAge
Any Page Age < MinAge is an Active Page. Any Page Age >= MinAge is a Replacement Page. The Initiator chooses the oldest M pages that are going to be replaced in the current Epoch and determines MinAge to be the minimum age among those M replacement pages.
4. Next, out of all the M replacement pages, the Initiator finds the % of pages belonging to a particular node among those M replacement pages and calls this % the Node’s Weight.
Node Weight = Expected share of the M replacements for a node.
Node Weight = Fraction or % of pages to be replaced in the upcoming/next Epoch (NOT current).
The Initiator calculates the Weight for each node in the cluster.
5. Next, the Initiator sends to every node the following information: { MinAge of the M replacements, Weight Distribution of all cluster nodes }
Weight Distribution means the Weight (Wi) of each cluster node i.

L07a-Q13. How does each node do page replacements locally using info sent by Initiator? 6. Each node receives {MinAge, Weight Distribution} from the Initiator. The node that has the Highest Weight in Weight Distribution is Least Active or Most Inactive. Hence, each node locally chooses the node with Highest Weight in Weight Distribution as the Initiator for the upcoming/next Epoch. This decision is made locally without any coordination with any other cluster node. The Initiator uses the principle of “Use the Past to predict the Future”. That is, “Use the page age information (past) to predict where the replacements will happen (future).” 7. When a page fault happens on a node P and a new page X is brought into Local Memory of node P, an old page Y from Global Memory of node P is chosen to be evicted/replaced. 8. Node P locally chooses the following approach to decide which page Y needs to be evicted: If PageY’s Age >= MinAge, then this page is a Replacement Page to be discarded in next Epoch. i.e. do NOT send it to a peer cluster node, just discard it now. If PageY’s Age < MinAge, then this page is an Active Page and so send it to a peer cluster node with the Highest Weight among the Weight Distribution Vector. Thus the Page Age Management (aka Geriatric Management) is approximating a “Global LRU” by making an approximate estimate of what is going to happen in future. Note that we emphasize locally because each node makes decisions locally without any coordination with any other cluster node. 9. So, to summarize, this is how GMS thinks globally but acts locally! a. At the beginning of each epoch, GMS Initiator computes globally to get all Page Age info computes MinAge, computes Weight Distribution for all nodes (i.e. Replacement Page %) and sends {MinAge, Weight Distribution} to all cluster nodes. b. Using this information from Initiator, each cluster node then makes local decisions in terms of what to do with a page that is chosen as an eviction candidate, i.e. discard the page if it will be replaced soon OR send it to peer cluster memory if it is active.
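Here is a hedged C sketch of the per-node decision described in steps 6-8, assuming the node has already received MinAge and the weight vector from the Initiator; every type, constant and helper name is an illustrative stand-in rather than real GMS code.

/* Sketch of the local eviction decision in GMS-style aging.
 * All types/helpers are illustrative, not the real GMS code. */
struct page { unsigned long age; /* higher = older */ };

struct epoch_info {
    unsigned long min_age;     /* MinAge sent by the initiator            */
    double        weight[64];  /* expected share of replacements per node */
    int           num_nodes;
};

/* Pick the node with the highest weight (least active node). */
static int most_idle_node(const struct epoch_info *ei) {
    int best = 0;
    for (int i = 1; i < ei->num_nodes; i++)
        if (ei->weight[i] > ei->weight[best]) best = i;
    return best;
}

/* Returns -1 if the evicted page should simply be discarded,
 * otherwise the id of the peer node that should host it. */
static int where_to_send(const struct page *victim, const struct epoch_info *ei) {
    if (victim->age >= ei->min_age)
        return -1;                 /* would be replaced soon anyway: discard   */
    return most_idle_node(ei);     /* still active: park it on the idlest node */
}

int main(void) {
    struct epoch_info ei = { .min_age = 100, .num_nodes = 3,
                             .weight = { 0.2, 0.5, 0.3 } };
    struct page young = { 10 }, old = { 500 };
    return (where_to_send(&young, &ei) == 1) && (where_to_send(&old, &ei) == -1)
           ? 0 : 1;
}

The same weight vector is what each node consults to pick the Initiator for the next epoch: the node with the highest weight.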

L07a-Q14. How is GMS actually implemented?
1. The basic idea of GMS is that instead of using disk as a paging device, use the cluster memory. The GMS authors used DEC’s OSF/1 operating system to implement GMS.
2. There are 2 components of the OSF/1 OS’s Memory system: (shown as blocks in the figure below)
a. Virtual Memory system: is devoted to managing the page faults that occur for the process address space, in particular, the stack and the heap, and to getting these missing pages from disk. These pages are called Anonymous Pages because they are not backed by a named file on disk: a Virtual Page is housed in a Physical Page Frame, and when a page is replaced, that same Physical Page Frame is used to host a new Virtual Page.
b. Unified Buffer Cache: is the File system cache used to cache pages from frequently used files. The Unified Buffer Cache is responsible for handling page faults to memory-mapped files as well as for handling explicit read and write calls that an application makes to the file system.
3. In a typical OS,
a. the writes from the Virtual Memory manager and Unified Buffer Cache go to the disk,
b. a page fault causes a read of the required page from disk (NOT shown in the figure below),
c. when a physical page frame is freed, it is thrown back into the free list, and
d. the pageout daemon periodically discards clean pages and swaps-out dirty pages to disk in order to avoid this expensive activity of writing to disk during a future page fault.
4. After GMS is integrated into the OS,
a. the writes from the Virtual Memory manager and Unified Buffer Cache go to the disk, i.e. writes remain unmodified after GMS integration into the OS,
b. a page fault causes a read of the required page from GMS instead of reading it from disk, i.e. the VM Manager and Unified Buffer Cache are modified to go to GMS instead of disk,
c. when a physical page frame is freed, it is thrown back into the free list, and
d. the pageout daemon periodically gives clean pages to GMS for later retrieval from a peer node and swaps-out dirty pages to disk.
5. The tricky part is collecting the Page Age information required for the Global LRU approximation:
a. Unified Buffer Cache (UBC) calls are intercepted by modifying the UBC to collect Page Age information for pages that are housed in the UBC.
b. Changing the Virtual Memory manager is complicated because the memory access performed by a process happens in hardware on the CPU and there is no easy way to intercept it. The OS does NOT see the individual memory accesses that a user program is making. So, the trick used is to have a daemon periodically dump TLB contents into a GMS structure and use this information to derive Age information for all Anonymous pages that are being handled by the Virtual Memory system.
This is how GMS is integrated into the OS, interacting with the UBC, VMM, pageout daemon and free-lists.

L07a-Q15. Ok, understood. What are some of the important data structures used by GMS?
1. The 3 workhorse data structures of GMS that make cluster-wide memory management possible are:
a. PFD: Page Frame Directory
b. GCD: Global Cache Directory
c. POD: Page Ownership Directory
2. A Virtual Address (VA) is converted to a UID, a Universal/Global Identifier to be used cluster-wide, using information from the Virtual Memory system and the Unified Buffer Cache like:
a. IP-address of the node containing the Virtual Address,
b. Disk Partition that contains a copy of the page that corresponds to the Virtual Address,
c. i-node data structure of the file that corresponds to the Virtual Address, and
d. offset in that file for the page that corresponds to the Virtual Address.
Note that the UID space spans the entire cluster.
3. PFD: Page Frame Directory: is like a Page Table that converts the Universal Identifier (UID) (Universal Virtual Address) to the Physical Page Frame Number (PFN) hosting that Virtual Address.
4. GCD: Global Cache Directory: is a Partitioned, Cluster-wide Hash Table, used to distribute management of the mapping from a UID to the node hosting the corresponding PFD so as to avoid the static mapping problem of over-burdening any single node. That is, given a UID, the GCD will tell us which node has the PFD corresponding to this UID.
5. POD: Page Ownership Directory: Given a UID, the POD says which node has the corresponding GCD. The POD is replicated on all the cluster nodes and has up-to-date information on each node. That is, the UID space that spans the entire cluster is partitioned into sets of ownership regions called Page Ownership, and every node is responsible for a portion of the UID space, present in the POD for that node. If the nodes in a cluster remain the same, then the POD is static, but if nodes are added or deleted, then the POD needs to be redistributed, which is usually rare.
6. Thus, here is the overall path for page-fault handling (some lookups are local, while others may require remote communication, as described in the next question):
VA -> UID -> POD -> Ni-GCD -> GCD -> Ni-PFD -> PFD -> PFN
UID: Virtual Universal ID
POD: Page Ownership Directory: Locally present on each node, Replicated on all nodes
GCD: Global Cache Directory: Partitioned, Cluster-wide Hash table
PFD: Page Frame Directory: converts UID to PFN.

L07a-Q16. How are lookups and remote communications used for a VA-to-PFN translation?
Here is the overall path for page-fault handling (some lookups are local, while others may require remote communication):
VA -> UID -> POD -> Ni-GCD -> GCD -> Ni-PFD -> PFD -> PFN
VA: Virtual Address, UID: Virtual Universal ID
POD: Page Ownership Directory: Locally present on each node, Replicated on all nodes
Ni-GCD: Node Id that contains the GCD for this UID
GCD: Global Cache Directory: Partitioned, Cluster-wide Hash table
Ni-PFD: Node Id that contains the PFD for this UID
PFD: Page Frame Directory: converts UID to PFN
PFN: Page Frame Number
On a page-fault, there are 3 levels of lookups possible: a) POD lookup, b) GCD lookup, and c) PFD lookup. However, the common case is for a page to be non-shared (i.e. based on a request of a local process on the node), and the POD and GCD are on the same node, and hence GMS can directly go to the PFD node to get the required page. So, the page fault service is quick in most cases since it uses only 1 remote network communication. That is, the common-case page-fault handling path is:
VA -> UID -> POD -> Ni-GCD -> GCD -> Ni-PFD -> PFD -> PFN
where only the hop from the GCD to the node holding the PFD is a remote communication; all the other steps are local lookups.
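The routing just described can be sketched as a chain of lookups. The following self-contained C toy, with single-entry stub functions standing in for the real POD/GCD/PFD tables, only illustrates the control flow, including the rare retry discussed in the next question.

/* Sketch of the VA -> UID -> POD -> GCD -> PFD -> PFN routing.
 * Every type and helper here is an illustrative stand-in, not GMS code. */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t page_uid;   /* cluster-wide universal id for a page */

/* Toy single-entry "directories"; the real ones are per-node tables.   */
static int  pod_lookup(page_uid uid)               { (void)uid; return 2; }       /* node 2 holds the GCD  */
static int  gcd_lookup(int gcd_node, page_uid uid) { (void)gcd_node; (void)uid; return 5; } /* node 5 holds the PFD */
static long pfd_lookup(int pfd_node, page_uid uid) { (void)pfd_node; return (long)(uid & 0xFFF); } /* frame number */

static long gms_resolve(page_uid uid) {
    int  gcd_node = pod_lookup(uid);             /* local: POD is replicated      */
    int  pfd_node = gcd_lookup(gcd_node, uid);   /* local in the common case      */
    long pfn      = pfd_lookup(pfd_node, uid);   /* typically the one remote hop  */
    if (pfn < 0)                                 /* stale directories: retry path */
        return gms_resolve(uid);
    return pfn;
}

int main(void) {
    printf("UID 0x1234 -> PFN %ld\n", gms_resolve(0x1234));
    return 0;
}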

L07a-Q17. Is it guaranteed that being routed to a node having a PFD will yield the desired PFN? No. It is quite possible, though rare, that a node may NOT have the desired UID-PFN mapping. This scenario is described below.
1. Let’s consider the scenario where, after reaching the node expected to have the PFD of interest, GMS finds that the PFD does NOT have the desired UID -> PFN mapping. Uh-oh!
2. This could happen when the node containing this PFD evicts the corresponding page and sends it to a peer cluster node, and the other distributed data structures are still being updated and have not been fully updated yet, due to which this miss happens.
3. Another possibility is that the POD information is stale due to new additions or deletions of nodes in the cluster, and the distributed data structures are still being updated and have not been fully updated yet, due to which this miss happens.
4. In such scenarios, GMS does a re-lookup of the POD data structure, hoping that the POD data structure may have been updated correctly by this time and the subsequent lookups will help GMS get routed to the correct node having the appropriate PFD for the UID-PFN mapping. This is expected to happen very rarely as compared to the common case described earlier.

L07a-Q18. Any summarized insights? 1. GMS takes advantage of Idle Memory in peer nodes by using remote memory as a paging device instead of using the local disk. 2. The key reason for the GMS architecture is that accessing remote memory is faster than accessing local (electro-mechanical) disk. 3. GMS is more relevant in today’s data center servers since no node is individually owned and all cluster nodes together serve common purposes and business needs.

Illustrated Notes for L07b: DSM: Distributed Shared Memory L07b-Q1. What is DSM and what is the use-case for DSM? The Lesson Outline is given in the slide below. We can think of these lessons organized across different ways of using memory of peer machines in a cluster. 1. GMS: Global Memory System: Use of Cluster Memory for Paging (~AirbnB for Memory!) because accessing remote memory is faster than accessing local (electro-mechanical) disk. 2. DSM: Distributed Shared Memory: Use of Cluster Memory for Shared Memory. 3. DFS: Distributed File System: Use of Cluster Memory for Cooperative Caching of Files. DSM: Distributed Shared Memory is an Operating System abstraction that provides an illusion of shared memory to applications, even though the cluster nodes in the Local Area Network (LAN) do NOT physically share the memory. The abstraction of Shared Memory in an OS makes application development easier and hence the abstraction of Distributed Shared Memory should help in application development as well.

L07b-Q2. How can a Cluster work as a Parallel machine for a Sequential program? A program can be written in 2 styles: 1. Implicitly Parallel Program, OR 2. Explicitly Parallel Program. Let’s explain Implicitly Parallel Programming first. Implicitly Parallel Programming uses a user-assisted but automatic parallelizing compiler that has the following characteristics:
1. It identifies opportunities for parallelism in the program.
2. It exploits the available parallelism in the hardware resources and performs all the heavy-lifting in a transparent manner in terms of converting the sequential program to a parallel program to extract performance for the application.
3. It has compiler directives (e.g. pragma directives) for distribution of data and computation such that it efficiently maps computations to the distributed resources of a cluster.
4. It works really well for a certain class of programs, called Data Parallel programs, in which the data accesses are fairly static and determinable at compile-time.
5. However, one drawback of Implicitly Parallel programs is that we are at the mercy of the Automatic Parallelization Directives and hence cannot control the parallelism completely, due to which there is limited potential for exploiting the available parallelism in the cluster.
6. HPF: High-Performance FORTRAN is an example of a programming language that does User-assisted Automatic Parallelization using compiler directives.
Thus, a cluster can work as a Parallel machine for a Sequential program by writing an Implicitly Parallel program with compiler directives, which a User-assisted, Automatic, Parallelizing compiler converts into a Parallel program that exploits parallelism through automated distribution of data and computation onto the distributed cluster resources.

L07b-Q3. Ok, now tell me about Explicitly Parallel programming. A program can be written in 2 styles: 1. Implicitly Parallel Program, OR 2. Explicitly Parallel Program. We already explained Implicitly Parallel Programming. Let’s explain Explicitly Parallel Programming next. In the Explicitly Parallel Programming style, the application programmer thinks about his application and writes an explicitly parallel program using low-level parallelization primitives. Now, there are 2 styles of writing Explicitly Parallel programs: 1. MPI-style, and 2. DSM-style. Let’s describe the MPI-style first.
1. The MPI-style uses parallelization primitives from a message-passing library of the run-time system to explicitly send and receive messages to peer cluster nodes and achieve distribution of data and computation in the cluster.
2. This message-passing style of explicitly parallel program is true to the physical nature of the cluster because each cluster node has its own computation and memory resources, i.e. no resources are shared, and the distribution of data and computation across cluster resources is explicitly achieved by the nodes sending and receiving messages to and from each other.
3. The MPI-style is popular for writing scientific applications running on large-scale clusters at national labs like Lawrence Livermore and Argonne.
4. Some examples of message-passing libraries are MPI, PVM, CLF from DEC, etc.
5. The MPI-style of programming is hard because it requires explicit coordination to be done by the programmer. The programmer has to think in terms of coordinating the activities of different processes by explicitly sending and receiving messages from peer cluster nodes. This calls for a radical change of thinking in terms of how to structure a program.
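For flavor, here is a minimal MPI-style sketch in C using standard MPI calls; the integer being exchanged is made up. Rank 0 explicitly sends a value and rank 1 explicitly receives it, with all coordination done by the programmer.

/* Minimal MPI-style sketch: explicit send/receive between two ranks.
 * Compile with an MPI compiler wrapper (e.g. mpicc) and run with mpirun. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        value = 42;                                   /* some computed result */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}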

L07b-Q4. Now I am curious to know about the DSM-style of programming. Is it better?
1. Yes, the DSM-style of programming can be more intuitive than the MPI-style of programming because DSM gives the illusion that all the memory distributed across the various cluster nodes is shared and transparently available, so that the application programmer can use the DSM abstraction to write parallel programs intuitively.
2. The transition from a sequential program to a parallel program is easy and intuitive using the DSM-style of programming because it is natural to think of shared data structures among different threads of an application rather than think of explicit message-passing between cluster nodes to achieve parallelism.
3. The DSM abstraction provides the same level of comfort to a programmer who is used to the same set of primitives like locks and barriers for synchronization and the Pthreads library for creating threads, but the underlying DSM run-time will do the heavy-lifting of distribution of data and computation across the different cluster nodes.

L07b-Q5. What is the history of Shared Memory systems? Were many DSM systems built? Yes, many Shared Memory systems have been built over the last 20+ years because the DSM abstraction can really make programming easy and intuitive. In fact, DSM systems have been built both in hardware to get maximum performance and in software to get maximum flexibility. The software versions of DSM were first built in the mid 80s:
1. the Ivy system built at Yale University by Kai Li,
2. the Clouds operating system built at Georgia Tech, and
3. similar systems built at UPenn.
Later on, in the 90s, the 2nd generation of DSM systems built were:
1. Munin
2. Treadmarks
3. Cashmere
4. Beehive
Also, Structured DSM systems provide Structured Objects for programming in a cluster, e.g.:
1. Linda
2. Orca
3. Stampede at Georgia Tech, in concert with DEC, and Stampede RT
4. PTS: Persistent Temporal Streams (we will discuss this in a later lesson).
And some Hardware DSM systems built were: BBN Butterfly, Sequent Symmetry, KSR-1, Alewife@MIT, DASH@Stanford, Bluegene@IBM, Origin2000@SGI, Altix@SGI. These Clusters of SMPs (Symmetric MultiProcessors) have now become the workhorses of current datacenters.

L07b-Q6. What are the common primitives used in Shared Memory programming? Some of the common primitives used in Shared Memory programming are: 1. Mutex synchronization primitive: (aka Lock) Only one thread can acquire the Mutex, modify some data and release the Mutex. At the same time, if another thread tries to do the same thing, then it waits to acquire the Mutex. That is, only one thread can acquire a Mutex at any point of time. 2. Barrier synchronization primitive: Only after a pre-defined set of threads reach the same barrier is further execution of each thread allowed and until then each thread waits for the other threads to reach that barrier in code. This is the barrier entry criteria and same is the case for the barrier exit criteria. These primitives are very popular in scientific applications. The memory accesses in a shared memory program are of 2 types: 1. Normal Read/Write to Shared Data used by the application. 2. Special Read/Write to Synchronization Variables provided by the operating system.
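Here is a small self-contained Pthreads example of both primitives; the shared counter is just for illustration, and pthread_barrier_t needs a platform that implements POSIX barriers (e.g. Linux).

/* Mutex + barrier in Pthreads: each thread increments a shared counter
 * under a mutex, then all threads meet at a barrier before main prints. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static pthread_mutex_t   lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t barrier;
static int               counter = 0;

static void *worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);      /* only one thread at a time in here */
    counter++;
    pthread_mutex_unlock(&lock);

    pthread_barrier_wait(&barrier); /* wait until all NTHREADS get here  */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    printf("counter = %d\n", counter);   /* always NTHREADS */
    pthread_barrier_destroy(&barrier);
    return 0;
}

The mutex protects the normal read/write to shared data; the barrier is the special access to a synchronization variable provided by the runtime.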

L07b-Q7. I get confused between Consistency and Coherence. What is the exact difference? I have come across many people who are NOT crystal-clear on this difference. So it may help you to master this difference between Consistency and Coherence. The first thing to note is the prefix for these terms and use it as it is: Memory Consistency Vs Cache Coherence. Avoid using the terms: Cache Consistency OR Memory Coherence. That is, Coherence is always for a Cache and Consistency is typically used with Memory. Here are some excellent links on internet: 1. https://www.quora.com/What-is-the-difference-between-cache-consistency-and-cache-coherence Important: Do read the excellent movie analogy given by a CS professor to Aamir Ogna here! 2. Stanford Primer on Memory Consistency and Coherence (lot of juicy details if more interest). Memory Consistency: Order of multiple updates to the same data. Cache Coherence: Requirement of having latest update to the same data. Memory Consistency is always required, but Cache Coherence is required only if there are Caches. Memory Consistency model is a contract between application programmer and OS, that specifies: how soon a change is going to be made visible to other processes having data from the same memory location in their respective caches. Cache Coherence answers the question of how is the Memory Consistency implemented even though there are many caches present having the same data. In short, Memory Consistency = Model/Contract/Guarantee presented to the programmer Cache Coherence = Implementation/Mechanism/Fulfillment of the Model in presence of Caches. That is, the Memory Consistency Model’s guarantee/contract presented to the programmer is fulfilled by the Cache Coherence Implementation of the contract in the presence of private caches.

L07b-Q8. What are different Memory Consistency models? Let’s start with a popular memory consistency model called the Sequential Consistency Model. We will describe another memory consistency model called the Release Consistency Model later. Let’s consider the merge-shuffle of two card-decks (say card-decks A and B) as shown in the figure below. After the merge-shuffle is complete, the cards in the joint set that came from card-deck A will still be in their original order, and the cards in the joint set that came from card-deck B will still be in their original order, but the interleaved order of cards between card-deck A and card-deck B will be arbitrary. Comparing this card-deck merge-shuffle example to two programs A and B, the textual or program order of memory accesses submitted by program A is guaranteed to be the same order in which the memory accesses will actually happen, irrespective of which processor/CPU runs program A. The same guarantee is also given for the program order of memory accesses of program B, even though programs A and B are executing at the same time. However, there is no guarantee given for any order of interleaved memory accesses between programs A and B, i.e. the interleaved memory accesses from programs A and B are arbitrary. The Sequential Consistency Model guarantees that the program order of memory accesses within each thread of execution is maintained, while the interleaving of accesses from different threads of execution is arbitrary, but the interleaved accesses are still consistent with always maintaining the program order. The Sequential Consistency Model builds on Atomicity of the individual Read-Write operations. In short, the Sequential Consistency Model maintains the program order of each program, but does not provide any guarantee for the interleaved memory accesses between multiple programs.
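A hedged C illustration of the guarantee: C11 atomics default to sequentially consistent ordering, so if the reader thread sees flag == 1 it must also see data == 42, because the writer's program order (data before flag) is preserved; the interleaving of the two threads is otherwise arbitrary. This illustrates the model itself, not any particular DSM system.

/* Sequential consistency illustrated with C11 seq_cst atomics. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int data = 0;
static atomic_int flag = 0;

static void *writer(void *arg) {
    (void)arg;
    atomic_store(&data, 42);   /* program order: data first ... */
    atomic_store(&flag, 1);    /* ... then flag                 */
    return NULL;
}

static void *reader(void *arg) {
    (void)arg;
    while (atomic_load(&flag) == 0)
        ;                      /* spin until the writer's flag becomes visible */
    printf("data = %d\n", atomic_load(&data));   /* always prints 42 under SC */
    return NULL;
}

int main(void) {
    pthread_t w, r;
    pthread_create(&r, NULL, reader, NULL);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}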

L07b-Q9. How does the Sequential Consistency model work for accesses to shared program data and OS synchronization info? This is an important point to note. The Sequential Consistency model does NOT know the association between different accesses to the shared program data (application data) and the OS synchronization information (sync variables like locks, etc.), and hence a Coherence action is needed for every memory access. That is, if a process writes to a memory location, then the SC model has to ensure that this write is inserted into the Global order of writes, and it performs the Coherence action with respect to all other processors to ensure that all the caches have the same, latest data as the main memory. The main point to remember is that: For the SC model, a Coherence action is required for every read-write memory access.

L07b-Q10. Why is Sequential Consistency model NOT sufficient in all cases? A Sequential Consistency model has some drawbacks which we describe next. Some background: A Critical Section of code is the portion of source code protected by a mutually exclusive lock so that only one thread can execute that portion of a code at any point of time. Recall: The Sequential Consistency model does NOT know the association between different accesses to the shared program data (application data) and the OS synchronization information (sync variables like locks, etc.), and hence Coherence action is needed for every memory access. However, one point to observe is that all the cache coherence actions between processes P1 and P2 are NOT required until P1 releases the mutually exclusive lock L. That is, the Cache Coherence mechanism provided by OS for implementing the memory consistency model does more work than required and causes more, un-necessary overhead for maintaining coherence, which leads to poor scalability of the shared memory system. This is why we need a new consistency model, called the Release Consistency Model, which we will describe next.

L07b-Q11. Tell me about the Release Consistency model and how is it better than the SC model? We observed earlier that the Sequential Consistency (SC) model performs a coherence action on every memory access, which causes more overhead and poor scalability of the shared memory system. This motivates the need for the Release Consistency (RC) model, which is better than the SC model. The Release Consistency model associates protected data structures with the lock protecting them and defers the cache coherency action for these protected data structures until the lock release time. So, the Release Consistency model optimizes on top of the Sequential Consistency model. The Release Consistency model allows computation to overlap with communication and defers the communication required for the coherence action until the lock release time. This is how you could visualize it and try to remember it:
No. | Program step  | Sequential Consistency Model | Release Consistency Model
1.  | Acquire Lock  |                              |
2.  | Data Accesses | Perform Coherence Actions    | Avoid Coherence Actions
3.  | Release Lock  |                              | Perform Coherence Actions

In other words, say: Process P1 acquires lock L, does data accesses, releases lock L, and then Process P2 acquires lock L, does data accesses, releases lock L. When process P2 tries to acquire lock L that was held by process P1 earlier, all coherence actions for data accesses made by P1 should complete before lock L is released. This approach of RC model avoids blocking of processors for every memory access and instead blocks processors at the lock release time collectively for all the memory accesses associated with the protected data structures only so that there is more overlap of computation and communication and improved system performance.

L07b-Q12. So, the upshot for the RC model is to perform the coherence action only on lock release, correct? Correct! You got it! The Release Consistency model allows overlap of computation and communication between different processes and defers the communication required for the coherence action until the lock release time, thereby improving overall system performance. Note that the Release Consistency model is able to do this optimization only because it distinguishes between normal data accesses made to program data and special synchronization accesses made to synchronization primitive data like locks, barriers, etc. You can also map the RC model to another synchronization primitive, the barrier, where Arriving at the barrier is equivalent to Acquiring a lock, and Exiting the barrier is equivalent to Releasing a lock. So, before exiting the barrier, the RC model ensures that any changes made to the protected data structures are reflected at all other processes through the cache coherence mechanism. So, you are correct. The Coherence action is only performed when a lock/barrier is released/exited!

L07b-Q13. Can you take a concrete example and explain the Release Consistency model further? Here is a concrete example to understand the Release Consistency model further. The video lecture explains the example nicely. Note that processes P1 and P2 can execute in any order. If process P2 executes first and finds flag == 0, then the wait() call releases lock L before P2 goes into the waiting stage, and the wait() call re-acquires lock L when it returns from the waiting stage. Hence it is safe for the code to check flag again, and this is also why the unlock(L) statement is fine after having performed a wait(). The while(flag == 0) is a defensive re-check: when we have more than 2 processes, another process may acquire lock L and change the shared state between the moment P2 is woken up and the moment its wait() call re-acquires lock L, or P2 may simply get woken up spuriously. So, after wait() returns, P2 re-checks flag; if flag is still 0, the condition P2 is waiting for has not been satisfied yet, and hence P2 does wait() again.
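The same pattern, written as a hedged Pthreads condition-variable sketch (the variable names mirror the lecture example; the exact code shown in the video may differ):

/* The P1/P2 flag example as a Pthreads condition-variable sketch.
 * wait() releases the lock while sleeping and re-acquires it on wakeup,
 * which is why the flag must be re-checked in a while loop.            */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t L    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int             flag = 0;

static void *p1(void *arg) {            /* producer: sets the flag */
    (void)arg;
    pthread_mutex_lock(&L);
    flag = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&L);
    return NULL;
}

static void *p2(void *arg) {            /* consumer: waits for the flag */
    (void)arg;
    pthread_mutex_lock(&L);
    while (flag == 0)                   /* defensive re-check after every wakeup */
        pthread_cond_wait(&cond, &L);   /* atomically releases L, re-acquires it on return */
    printf("P2 saw flag = %d\n", flag);
    pthread_mutex_unlock(&L);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&b, NULL, p2, NULL);
    pthread_create(&a, NULL, p1, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}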

L07b-Q14. Can you summarize the advantages of the RC model over the SC model? Here are the main advantages of Release Consistency model over Sequential Consistency model: 1. No waiting for coherence actions on every memory access and instead having the coherence actions only when required, i.e. at the lock release time. 2. Overlap computation with communication. 3. Better performance due to optimization in performing communication only when required, i.e. at the lock release time. In general, the goal of any computer system is to overlap computation with communication so that both operations can happen in parallel and thus improve overall system performance. This is a key design philosophy behind many performance optimizations and high-performance systems, e.g. the use of RDMA (Remote Direct Memory Access) in HPC (High-Performance Computing) systems aiming for Exascale computing.

L07b-Q15. Hey, can’t the Coherence action be delayed until the lock is acquired? Good thought! Yes, the Coherence action can be delayed until the lock is acquired. That approach is used by the Lazy Release Consistency model. The Lazy Release Consistency model defers the coherence actions from the point of lock release to the point of lock acquisition thereby increasing the opportunity to overlap computation with communication in the time window between lock release and lock acquisition. The principle of procrastination often helps in system design. The vanilla Release Consistency model is also referred to as the Eager Release Consistency model so as to compare with the Lazy Release Consistency Model. Let’s use our earlier table to compare the SC, ERC and LRC models:
No. | Program step   | SC Model                  | ERC Model                 | LRC Model
1.  | Acquire Lock L |                           |                           |
2.  | Data Accesses  | Perform Coherence Actions |                           |
3.  | Release Lock L |                           | Perform Coherence Actions |
4.  | Acquire Lock L |                           |                           | Perform Coherence Actions

L07b-Q16. What are the pros and cons of Eager RC Vs Lazy RC? The Eager RC model performs the coherence actions by broadcast-pushing the updates at lock release time to other processes having the same data, whereas the Lazy RC model performs the coherence actions by unicast-pulling the updates at lock acquisition time from the single process that performed the lock release. Thus, the Eager RC model is a broadcast-push communication from a single process to multiple processes, whereas the Lazy RC model is a unicast-pull communication from a single process to a single process. As we can see, the Lazy RC model has a lot less communication than Eager RC, thereby improving overall system performance. A drawback/con of the Lazy RC model is that at the point of lock acquisition, the process acquiring the lock does NOT have all the coherence actions complete yet, and so the lock acquisition step may have to incur more latency in completing the lock acquisition. That is, for Lazy RC, Pro = Fewer messages, but Con = More Latency at Lock Acquisition.
Mnemonic: ERC: Broadcast-Push@Lock-Release. LRC: Unicast-Pull@Lock-Acquisition.
The coherence actions could be based on 2 types of protocols:
1. Invalidation-based protocol: Invalidate the stale caches so that the next fetch from the cache gets the latest data from main memory, OR
2. Update-based protocol: Update the stale cache directly with the latest data from main memory so that the next fetch from the cache gets the latest data from the updated cache.
Thus, we have seen the following Memory Consistency models:
1. SC: Sequential Consistency Model,
2. ERC: Eager Release Consistency Model, and
3. LRC: Lazy Release Consistency Model.

L07b-Q17. How is the Software DSM implemented? 1. The DSM software partitions the Globally Shared Virtual Memory’s Address Space into Chunks that are managed individually on different cluster nodes. 2. The granularity/unit of Coherence maintenance in a single node is usually words of memory, but communicating words of memory across cluster nodes for each memory access is expensive. Hence, the granularity/unit of Coherence maintenance in software-DSM is an entire page and NOT words of memory in order to exploit Spatial Locality. 3. The DSM software also handles maintenance of coherence by having Distributed Ownership for the different Virtual Pages that constitute the Global Virtual Address Space. 4. The memory access pattern determines which node owns which pages. The page owner is responsible for keeping coherence information for the page, i.e. the page owner knows exactly which node to contact to get the latest update, and performs the coherence actions for that page. 5. Thus, the application programmer views the entire cluster as a globally shared virtual memory.

L07b-Q18. Can you take an example of a page-fault to explain how Software-DSM works? Ok, here is an example: 1. If a page-fault for page X happens on node A, the Virtual Memory manager on node A contacts DSM-software on node A, which then contacts the page-owner, say node B. 2. Now, the node B who is the owner for the faulted page X may have the latest copy of page X OR it may know another node C that has the latest copy of page X. 3. If node B has the latest copy of page X, node B sends page X to node A. 4. If node B does NOT have the latest copy of page X, but knows that node C has the latest copy of page X, then it redirects node A to node C, so that node C sends page X to node A. 5. Finally, on receiving page X, the page-table on node A is updated and the page-fault service routine is considered complete so that the corresponding waiting process can resume execution. Note that the cooperation between DSM software and VM manager happens at the granularity of a page. Some examples of software-DSM systems are: Ivy@Yale, Clouds@GeorgiaTech, Mirage@UPenn, Munin@Rice. These software-DSM systems use the Single-Writer Multiple-Readers Protocol for Cache Coherency maintenance. In this protocol, only a single writer is allowed to write to a page, but multiple readers can read that page at same time. So, if a node A wants to write to a page X that is present on nodes A, B and C, then the DSM-software on behalf of node A invalidates the copies of page X on nodes B and C, so that node A has exclusive access to page X and can perform the write to page X. One drawback of the Single-Writer Multiple-Readers Protocol is potential False Sharing. Recall: False Sharing happens when data appears to be shared even though in reality it is not shared. Say, a single page X contains 10 different data structures, each protected by a different lock. If a node A updates data structure 1 on page X repeatedly, while node B updates data structure 2 on page X repeatedly, then the page X will continue to ping-pong between nodes A and B repeatedly causing False Sharing and thereby decreasing overall system performance.

L07b-Q19. Is a Multi-Writer Coherence Protocol possible? Yes! Let’s describe the Multi-Writer Coherence Protocol in concert with Lazy Release Consistency, detailed nicely in the Treadmarks paper.
1. The key idea in the Multi-Writer Coherence Protocol is to maintain the associated diffs when the protected data structures are updated while holding the associated lock L, i.e. in the critical section (CS) of the code for process P1.
2. Next, at the point of acquisition of lock L by another process P2, the pages modified within the Critical Section of process P1 are invalidated.
3. Then, at the point of accessing the pages, process P2 fetches the Original Copies of the Modified Pages from the page owners and the associated Diffs from process P1 to construct the latest copies of the pages.
4. The benefit of this approach is that fine-grained diffs decrease the communication overhead to only the updates required.
5. The disadvantage of this approach is that there is increased latency overhead of fetching the data from the page owner at the time of data access.

L07b-Q20. What happens if another process P3 had modified pages between the updates of P1 & P2? Interesting thought! Treadmarks handled that scenario by ensuring that the diffs from multiple nodes are obtained by process P2 and then the diffs are applied in order so that the latest updated page can be constructed accurately. This allows the DSM to handle the scenario where only data structure 1 is updated by process P1 and only data structure 2 is updated by process P3, so that only one diff is obtained by P2 from P1 and only one diff is obtained by P2 from P3, thereby reducing the network communication to only the diffs that are required. Notice how the principle of procrastination applied to the lazy fetch of diffs helps in increasing the overall system performance.

L07b-Q21. What happens when multiple data structures are updated independently by 2 different cluster nodes? Let’s take an example describing this scenario in detail. 1. Say, a page X has two data structures D1 and D4, protected by locks L1 and L4 respectively, and updated by processes P1 and P4 respectively. 2. The DSM software brings only the diffs for D1 from previous users of associated lock L1. The DSM software assumes that the changes made by process P4 to data structure D4 on page X using lock L4 is irrelevant to process P1’s critical section of data structure D1 protected by lock L1. That is, the DSM software assumes that the changes made by another process using a different lock to a different portion of the same page is NOT relevant as far as the critical section associated with the current lock and the current process is concerned. 3. The DSM software knows all the pages modified by a critical section associated with its lock L and so a future lock request for lock L results in invalidating all pages associated with lock L. The next critical section associated with lock L will also get the original pages from the page owner and the diffs from the processes that have modified the pages and apply them in order to construct the latest, current version of the pages associated with lock L. 4. Thus, the Multiple-Writer Coherence protocol in concert with Lazy Release Consistency uses fine-grained diffs of changes made to pages, which reduces the amount of communication required in executing critical sections of an application, thereby improving overall system performance.

L07b-Q22. OK, could you describe the implementation of LRC+MW now? The fun part of any Systems Research is actually implementing the ideas! Let’s talk about the implementation details next, specifically how the Diffs are constructed. Recall: The 3 steps used by a process to update critical section data are: Step A: Acquire Lock L Step B: Write data X to a page (update critical section data) Step C: Release Lock L We will describe the implementation on what happens during Steps B and C. 1. At the start of “Step B: Write data X to page”, the Original page is copied to a Twin (aka before-image) page. 2. The Original page is then made Writeable. Note that the Twin page is NOT-accessible by any process since it is NOT mapped into the Page Table. 3. Now, the process goes ahead and does write to data X on the Writeable Original page. 4. Then, on reaching “Step C: Release Lock L”, the diffs are computed between the Twin page and the modified Original page using Run-Length Encoding Diffs. 5. The Original page is made Write-Protect again so that the page cannot be written to unless it is in a critical section and required coherence actions for the page have completed. 6. Finally, the Twin page is freed. Mnemonic: 1. Copy, 2. Orig Writeable, 3. Write Orig, 4. Diffs, 5. Write-Protect Orig, 6. Free Twin. Run-Length Encoding (RLE) example: WWWWWbWW is encoded as 5W 1b 2W. Run-Length Encoding Diff example: Page: (Offset1, Size1, Update1); (Offset2, Size2, Update2)… This is how the Diffs between the Twin page and Modified page are constructed for updates made in the critical section and kept ready for later lazy-fetch at the time of next acquisition of the lock associated with this critical section.
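A hedged, self-contained C sketch of steps 1-4: copy the page to a twin, write to the original, then compute (offset, length, bytes) runs where the modified page differs from the twin. The page size and diff format are simplified stand-ins for what TreadMarks actually uses.

/* Twin-and-diff sketch: record (offset, length, new bytes) runs where the
 * modified page differs from its twin (before-image).  Simplified format. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

struct diff_run { size_t offset, length; unsigned char *bytes; };

/* Compute run-length-encoded diffs between the twin and the modified page.
 * Returns the number of runs written into 'runs' (capacity 'max_runs').   */
static size_t compute_diffs(const unsigned char *twin, const unsigned char *page,
                            struct diff_run *runs, size_t max_runs) {
    size_t n = 0;
    for (size_t i = 0; i < PAGE_SIZE && n < max_runs; ) {
        if (twin[i] == page[i]) { i++; continue; }
        size_t start = i;
        while (i < PAGE_SIZE && twin[i] != page[i]) i++;
        runs[n].offset = start;
        runs[n].length = i - start;
        runs[n].bytes  = malloc(runs[n].length);
        memcpy(runs[n].bytes, page + start, runs[n].length);
        n++;
    }
    return n;
}

int main(void) {
    unsigned char page[PAGE_SIZE] = {0};
    unsigned char twin[PAGE_SIZE];

    memcpy(twin, page, PAGE_SIZE);     /* 1. copy original page to twin          */
    page[10]  = 0xAA;                  /* 3. writes inside the critical section  */
    page[11]  = 0xBB;
    page[100] = 0xCC;

    struct diff_run runs[16];          /* 4. at release time: compute the diffs  */
    size_t n = compute_diffs(twin, page, runs, 16);
    for (size_t k = 0; k < n; k++)
        printf("diff: offset=%zu length=%zu\n", runs[k].offset, runs[k].length);
    for (size_t k = 0; k < n; k++) free(runs[k].bytes);
    return 0;
}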

L07b-Q23. Is this protocol safe when we have multiple writers writing to the same page? 1. Yes, it is perfectly fine to have multiple writers writing to the same page as long as they are using locks to do updates to the associated critical sections. 2. To repeat, the important assumption is that the user’s program always does updates to critical sections using locks and each critical section has a different lock associated with it. 3. So, even if the same page is being modified by different processes concurrently, they are modifying different portions of the same page protected by different locks and generating separate diffs for their updates. 4. For some reason, if two processes perform writes to the same portion of the same page, then it is a bug/defect (called data race bug) because the user should have had the writes to the same critical section be protected by a lock but probably forgot to do so. 5. Note the DSM software has a way of ensuring that the changes made to a critical section under a particular lock are propagated from the current lock owner process to the next process that is going to use these updates by acquiring the same lock. When a shared page is accessed by a thread, the OS generates SIGSEGV exception, i.e. the Segmentation Violation Signal. The exception/signal handler for SIGSEGV triggers the DSM software which gets the original page and the diffs for the page and applies the diffs in order to get the current, latest version of the page. 6. Note that if there are too many diffs, there is space overhead and access latency overhead. So, garbage-collection (GC) of these diffs is triggered by applying these diffs in order to the page and the page owner is updated with this latest copy of the page. All other copies of the page are invalidated. This GC is triggered when a particular threshold of pending diffs (space metric) is reached OR a particular threshold amount of time (time metric) has elapsed. This is how the TreadMarks system implements the LRC Multiple-Writer Coherence protocol.
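Here is a minimal hedged C sketch of the trap mechanism in point 5: a page is write-protected with mprotect(), the first write raises SIGSEGV, and the handler (standing in for the DSM software) re-enables access before the faulting instruction is retried. A real DSM handler would fetch the original page and diffs at this point; this sketch only un-protects the page.

/* Sketch of trapping access to a protected page via SIGSEGV + mprotect. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static char  *page;
static size_t page_size;

static void on_segv(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)info; (void)ctx;
    static const char msg[] = "trap: would fetch page and diffs here\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);
    mprotect(page, page_size, PROT_READ | PROT_WRITE);  /* re-enable access */
}

int main(void) {
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = {0};
    sa.sa_sigaction = on_segv;
    sa.sa_flags     = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    mprotect(page, page_size, PROT_READ);   /* write-protect the shared page        */
    page[0] = 'X';                          /* faults; handler un-protects; retried */
    printf("page[0] = %c\n", page[0]);

    munmap(page, page_size);
    return 0;
}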

L07b-Q24. Are all DSMs always page-based? No, some DSMs are non-page-based DSMs too.
1. Some DSMs do NOT use the granularity of a page for coherence maintenance. For such cases, the DSM has to track the individual reads and writes happening for each process.
2. There are various ways to do that. One approach is called the Library-based approach: The program uses a library which provides the framework to do this. The library framework annotates shared variables that are used in the program. Whenever a shared variable is touched, the library framework causes a trap at the point of access to the shared variable, which contacts the DSM software and completes the corresponding coherence actions. The binary generated has appropriate code inserted in it to ensure that these coherence actions happen at the point of access of each shared variable. Examples of non-page-based DSMs using the library approach: Shasta@DEC, Beehive@GeorgiaTech.
3. The advantages of the Library-based approach are:
a. there is NO OS support required, and
b. since the sharing is happening at the level of variables and NOT at the page-level, there is NO False-sharing.
4. Another approach is the Structured DSM approach, in which a programming library provides shared abstractions at the level of structures that are meaningful for an application and NOT at the level of memory locations or pages. The shared abstractions are manipulated by the application program using API calls that are part of the language runtime. The API calls complete all coherence actions required for the shared abstractions and do NOT require any OS support. Examples of Structured DSM systems: Linda, Orca, Stampede@GeorgiaTech, PTS.
5. The advantages of the Structured DSM approach are:
a. No OS support, and
b. No False Sharing.

L07b-Q25. Do DSMs scale well?
1. As more cluster nodes (processors) are added to the DSM system, the DSM has the benefit/pro of increased parallelism, but at the cost of increased overhead due to implicit communication on the LAN for coherence maintenance.
2. However, overall, we do expect DSMs to scale well as the number of cluster nodes increases. The actual speedup may be less than the expected speedup due to the implicit communication overheads on the LAN for coherence maintenance.
3. The speedup will be almost 0 if the sharing is too fine-grained because there will be too much communication overhead. So the basic principle to achieve good scalability is: The computation to communication ratio should be very high to achieve good speed-up. That is, the critical sections should be large enough so that the communication required for coherency actions is relatively small as compared to the computation that happens within the critical section. In other words, Distributed Shared Memory scales well only when we share coarse-grained memory and NOT fine-grained memory.

Page 72 of 197

Illustrated Notes for L07c: DFS: Distributed File System
L07c-Q1. How is DFS different from NFS?
Historically, NFS (Network File System) was built by Sun Microsystems in 1985 so that user files on a central server could be accessed by client workstations over the local area network. Some NFS limitations are:
1. The NFS central server is a serious scalability bottleneck.
2. The NFS central server has limited bandwidth to access its disks.
3. The NFS central server has a limited file system cache.
The DFS (Distributed File System) avoids the scalability problems of the NFS central server by distributing the file system across the cluster nodes. Each file is distributed across the cluster nodes which together host the Distributed File System. The data and meta-data management of the files in DFS is also distributed across all cluster nodes. All cluster nodes cumulatively provide a bigger "cooperative" cache for the Distributed File System. Some DFS benefits are:
1. No centralization, because the file system is distributed.
2. Distributed meta-data management.
3. Distributed data management.
4. Increased bandwidth provides faster remote access.
5. Increased cache capacity across multiple cluster nodes avoids remote access and hence improves overall performance.

Page 73 of 197

L07c-Q2. What is the key idea behind a Distributed File System? The Lesson Outline is given in the slide below. We can think of these lessons organized across different ways of using memory of peer machines in a cluster. 1. GMS: Global Memory System: Use of Cluster Memory for Paging (~AirbnB for Memory!) 2. DSM: Distributed Shared Memory: Use of Cluster Memory for Shared Memory. 3. DFS: Distributed File System: Use of Cluster Memory for Cooperative Caching of Files. DFS intelligently uses the cluster memory for efficient management of metadata associated with the files and for caching the file content cooperatively among the cluster nodes for satisfying client requests for files. The key idea behind cooperative caching of files is that since electromechanical disks are slow but networks are faster than disks, DFS avoids accessing the disks and instead retrieves data from peer cluster memory in the network.

Page 74 of 197

L07c-Q3. What is the concept of Striping a file to multiple disks?
Let's introduce some background concepts and technologies:
1. RAID: Redundant Array of Inexpensive Disks: The basic concept of RAID is to split the I/O (e.g. a write) of a file across multiple disks instead of sending it all to a single disk. This splitting of I/O across multiple disks in parallel is called Striping a file to multiple disks. The advantage of RAID is that a group of disks collectively provides more I/O bandwidth than a single disk can. However, since there are more disks in a RAID collection, the probability of a failure increases, and hence RAID uses error-correcting technology such as ECC (Error Correcting Code). That is, if the RAID has 5 disks, then 4 disks store data and 1 disk stores the checksum (ECC) for the data on the 4 data disks, so that if one disk fails, the data on that failed disk can be recomputed and recovered. In short, the ECC data augments the striped data and protects against the failure of only one disk.
2. Pros and Cons of RAID:
Pros of RAID:
a. Increased I/O bandwidth due to striping across multiple disks in parallel.
b. Single-disk failure protection via ECC.
Cons of RAID:
a. Increased cost: more disks and the RAID framework increase the overall cost of storage.
b. Small Write problem: it is inefficient to stripe very small writes, because the ECC and per-disk overhead is large relative to the small amount of data actually written.
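For concreteness, here is a simplified sketch of single-disk failure protection using XOR parity, which is the common way the "ECC" disk in the 4+1 example is realized (the exact encoding used by a given RAID implementation may differ).

/* Simplified sketch of single-disk failure protection with XOR parity.            */
#include <stdio.h>

#define NDATA 4          /* 4 data disks + 1 parity disk, as in the 5-disk example */

/* Parity byte = XOR of the corresponding byte on every data disk.                 */
unsigned char parity(const unsigned char d[NDATA]) {
    unsigned char p = 0;
    for (int i = 0; i < NDATA; i++) p ^= d[i];
    return p;
}

/* If one data disk fails, its byte is recovered by XOR-ing the parity byte        */
/* with the bytes from the surviving data disks.                                   */
unsigned char recover(const unsigned char d[NDATA], int failed, unsigned char p) {
    unsigned char r = p;
    for (int i = 0; i < NDATA; i++)
        if (i != failed) r ^= d[i];
    return r;
}

int main(void) {
    unsigned char stripe[NDATA] = { 0x11, 0x22, 0x33, 0x44 };
    unsigned char p = parity(stripe);
    printf("recovered disk 2: 0x%02x (expected 0x33)\n", recover(stripe, 2, p));
    return 0;
}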

Page 75 of 197

L07c-Q4. I get confused between Log-structured File System (LFS) and Journaling File System (JFS). What is the exact difference?
Yeah, the difference between LFS and JFS is an important one to understand and remember.
1. Log-structured File System (LFS): In LFS, here are the steps that happen for a write to a file:
a. Instead of writing multiple file changes (aka mods) directly to disk, the changes are buffered by writing them to a log segment data structure in memory.
b. This log segment data structure in memory is then flushed to a contiguous log segment on disk, producing a sequential write instead of random writes. This improves performance since sequential writes to disk are faster than random writes.
c. The flush of the in-memory log segment to the contiguous on-disk log segment happens on 2 conditions:
i. either the in-memory log segment fills up to a certain extent (space metric), OR
ii. a certain time interval has elapsed since the previous flush (time metric).
d. The flush typically happens only when the log segment has sufficient data, i.e. the write is a big write and NOT a small write. This big write is striped across multiple disks, avoiding the Small Write problem. Thus, the Log-structured File System solves the Small Write problem of striping.
2. A Log-structured File System (LFS) has only log segments and NO data segments. These log segments are append-only. LFS has to reconstruct data from multiple log segments to read any data from disk, though reading data in parallel from striped disks using RAID improves overall I/O bandwidth. Hence, LFS has high read latency for the first read of any data; subsequent reads may be served from the cache.
3. For writes to LFS, overwriting the same data block multiple times invalidates the previous, old blocks in the on-disk log segments, thereby creating lots of holes in them. These holes need to be cleaned up periodically to avoid wasted space.
4. The logs in LFS are similar in spirit to the diffs of the Multiple-Writer protocol in Distributed Shared Memory.
5. Journaling File System (JFS): A JFS has both log segments and data segments. Periodically, JFS applies the information from the log segments to the data segments and discards the log segments. Thus, the lifetime of information in the log segments is only the short duration until it is committed to the data segments. Hence, reads from a JFS do NOT have to reconstruct data as in LFS.
6. Examples: LFS = Pure Storage, AuroraDB@Amazon, WAFL@Netapp, CASL@NimbleStorage. JFS = JFS@IBM, VxFS@Veritas, ext4@Linux.
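A minimal sketch of the LFS write path described in point 1, assuming an in-memory segment buffer with a space threshold and a time threshold; the names (log_append, flush_segment) and the sizes are illustrative, and the actual disk write is stubbed out.

/* Minimal sketch of an LFS-style write path (names and sizes are assumptions).    */
#include <string.h>
#include <time.h>

#define SEG_SIZE        (512 * 1024)   /* in-memory log segment of 512 KB          */
#define FLUSH_INTERVAL  30             /* flush at least every 30 seconds          */

static char   segment[SEG_SIZE];       /* append-only buffer of file modifications */
static size_t used;                    /* bytes filled so far                      */
static time_t last_flush;

/* Write the whole in-memory segment sequentially to a contiguous on-disk segment  */
/* (stubbed out here); a sequential write avoids per-mod seek + rotational cost.   */
static void flush_segment(void) {
    /* a real LFS would issue one big sequential (striped) write of 'segment' here */
    used = 0;
    last_flush = time(NULL);
}

/* Every file modification is appended to the log segment instead of being written */
/* in place; the flush happens on a space OR time threshold.                       */
void log_append(const void *mod, size_t len) {
    if (used + len > SEG_SIZE)                       /* space metric               */
        flush_segment();
    memcpy(segment + used, mod, len);
    used += len;
    if (time(NULL) - last_flush >= FLUSH_INTERVAL)   /* time metric                */
        flush_segment();
}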

Page 76 of 197

L07c-Q5. How is Software RAID different from Hardware RAID?
1. Hardware RAID has 2 problems:
a. Hardware RAID has the Small Write problem during striping.
b. Hardware RAID uses multiple hardware disks and hence is expensive.
2. Software RAID solves the problems of Hardware RAID in the following manner:
a. The Log-structured File System component of Software RAID solves the Small Write problem during striping.
b. Software RAID reduces cost by using commodity hardware: a file access is striped across the disks of the cluster nodes in the LAN so that the I/O is done in parallel, giving improved bandwidth.
3. An example of Software RAID is the Zebra file system developed at UC Berkeley. Zebra combines both a Log-structured File System (LFS) and Software RAID: LFS combines multiple I/Os into a log segment in memory and then flushes it to a contiguous log segment on disk to solve the Small Write problem during striping, and Software RAID performs the I/O in parallel across the cluster nodes to get improved bandwidth and reduced latency for client requests.

Page 77 of 197

L07c-Q6. Didn't UCB also build XFS? How was XFS different from Zebra FS? How does a DFS compare with NFS?
Yes, XFS is a Distributed File System that was also built at UC Berkeley. XFS has the following features:
1. It is a Distributed File System that is truly scalable since it is serverless, i.e. it has no reliance on a central server.
2. It uses Log-based Striping (from Zebra FS).
3. It uses Stripe Groups: a subset of the storage servers forms a group of striping servers.
4. It uses Distributed Log Cleaning to clean up holes.
5. It uses distributed and dynamic management of both meta-data and data across cluster nodes.
6. It uses Cooperative Caching of files across cluster nodes. Accessing peer cache memory is more efficient and faster than accessing the local electromechanical disk. This also helps conserve the total cache capacity in the cluster and use it frugally by exploiting peers' remote memory.
An example of file meta-data is the "i-node". On each file access, the "i-node" helps convert the <filename, offset> of the access into the corresponding data block addresses on disk.
The problems with a centralized NFS server are as follows:
1. The centralized NFS server is unconcerned about the semantics of file sharing and is constrained by the amount of memory on the central server for caching the data and meta-data of the files.
2. Too many requests for the same file result in hot spots, which hurts scalability.
3. Different NFS servers with different I/O workloads cannot share the load imbalance among themselves.
4. The mapping between the meta-data manager for a file and the actual location of the file remains fixed and is NOT dynamic as it is in XFS.

Page 78 of 197

L07c-Q7. How are Log-based Striping and Stripe Groups used in XFS?
Here is how XFS uses Log-based Striping and Stripe Groups:
1. The changes made to files by the various clients are all written to a log segment data structure in memory. Each change is called a log fragment; all the changes together form the log segment data structure in memory. [Do NOT get confused between log fragment and log segment.] The log segment is an append-only data structure, i.e. you can only append data to it and cannot delete data from it.
2. When the in-memory log segment fills up beyond a threshold (space metric) OR when a particular time interval has elapsed since the previous flush (time metric), the in-memory log segment and its ECC data are flushed and written sequentially to a contiguous log segment on disk. Note that the on-disk log segment is striped into log fragments plus ECC data.
3. The log segment data structure is written to a subset of the storage servers, called a Stripe Group. For example, if we have 100 storage servers, the Stripe Group could be 10 of them. That is, the subset of storage servers used for striping a log segment is called the Stripe Group for that particular log segment. If we striped the log segment across all 100 storage servers, each server would receive only a tiny fragment, which would reintroduce the Small Write problem. Hence, by writing the log segment to a subset of the storage servers, i.e. a Stripe Group, we avoid the Small Write problem.
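A small sketch of what striping one full log segment over a stripe group might look like: the segment is cut into equal log fragments, one per storage server in the group, plus one parity (ECC) fragment. The group size, fragment size and function name are assumptions for illustration.

/* Sketch of log-based striping over a stripe group (illustrative names/sizes).    */
#include <string.h>

#define GROUP_SIZE 4                    /* stripe group: 4 data servers + 1 parity  */
#define FRAG_SIZE  (64 * 1024)          /* each server receives a 64 KB log fragment */

/* Split one full log segment into GROUP_SIZE fragments and compute the parity     */
/* fragment; each fragment would then be written to a different storage server.    */
void stripe_segment(const char segment[GROUP_SIZE * FRAG_SIZE],
                    char frags[GROUP_SIZE][FRAG_SIZE],
                    char ecc[FRAG_SIZE]) {
    memset(ecc, 0, FRAG_SIZE);
    for (int s = 0; s < GROUP_SIZE; s++) {
        memcpy(frags[s], segment + s * FRAG_SIZE, FRAG_SIZE);
        for (int i = 0; i < FRAG_SIZE; i++)
            ecc[i] ^= frags[s][i];      /* ECC fragment protects one server failure */
    }
}

Keeping the group small keeps each fragment large, which is exactly why the stripe group avoids the Small Write problem.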

Page 79 of 197

L07c-Q8. What are the features of a Stripe Group?
A Stripe Group has the following features:
1. A Stripe Group for a log segment is the subset of storage servers used for striping that log segment.
2. A Stripe Group helps solve the Small Write problem.
3. Stripe Groups allow parallel client activity on each Stripe Group.
4. Stripe Groups provide increased availability, because the failure of a few disks affects only the stripe groups that contain them and hence only some, not all, client requests.
5. Different Stripe Groups allow different cleaning service processes to be assigned to different stripe groups, thereby increasing parallelism and making log cleaning efficient.
6. Stripe Groups provide high overall I/O throughput.

L07c-Q9. Remind me: What is the use of the "i-node" data structure?
Recall: An example of file meta-data is the "i-node". On each file access, the "i-node" helps convert the <filename, offset> of the access into the corresponding data block addresses on disk.

Page 80 of 197

L07c-Q10. How does the Cooperative Caching of files actually take place in XFS?
1. XFS uses the peer cluster memory for Cooperative Caching of files and for reducing the stress on the management of data files.
2. XFS handles cache coherence in the presence of multiple processes and distributed shared memory by using the Single-Writer, Multiple-Readers protocol. This means that there can be multiple readers reading a file at the same time, but only one writer to a file.
3. The granularity/unit of cache coherence in XFS is a file block and NOT an entire file.
4. The manager for a file is responsible for the meta-data management of that file. The manager is aware of the clients that have the file contents in their caches; e.g. in the figure below, the manager node is aware that cluster nodes c1 and c2 have cached file block f1.
5. Now, if a client c3 makes a write request to the manager for block f1, the manager knows that file block f1 is currently in "read-shared" mode, whereas c3 wants file block f1 in "read-write" mode.
6. Next, the manager sends a "cache invalidation" message to cluster nodes c1 and c2 so that they invalidate their locally cached copies of file block f1.
7. Then, the manager gives the write token to client c3 so that c3 can complete the write.
8. When a future read request for file block f1 arrives, the manager revokes the write token from c3 so that c3 can no longer write to file block f1, but can still read it. For c3 to write to file block f1 again, it has to request the write token from the manager again. The future read request for file block f1 gets the data cached in the memory of c3, and this is cooperative caching.
In short, Cooperative Caching satisfies a read request for a file block from the cache of a peer node in the cluster instead of reading the file block from the slow, electromechanical local disk.
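The manager's token logic can be sketched roughly as follows; every name here (block_info, send_invalidate, grant_write_token, ...) is invented, and the real XFS manager is considerably more involved.

/* Illustrative sketch of the manager's Single-Writer/Multiple-Readers logic.      */
#define MAX_READERS 64

typedef struct {
    int readers[MAX_READERS];   /* client nodes caching this file block in read mode */
    int nreaders;
    int writer;                 /* current write-token holder, or -1 if none         */
} block_info;

/* No-op stand-ins for the messages the manager would send over the network.       */
static void send_invalidate(int client, int block)    { (void)client; (void)block; }
static void grant_write_token(int client, int block)  { (void)client; (void)block; }
static void revoke_write_token(int client, int block) { (void)client; (void)block; }

/* Client c asks to write block b: invalidate all cached read copies first, then    */
/* hand the single write token to c.                                                */
void handle_write_request(block_info *info, int b, int c) {
    for (int i = 0; i < info->nreaders; i++)
        if (info->readers[i] != c)
            send_invalidate(info->readers[i], b);
    info->nreaders = 0;
    info->writer = c;
    grant_write_token(c, b);
}

/* A later read request for b revokes the write token; the reader is then served    */
/* from the previous writer's cache (cooperative caching), not from the local disk. */
void handle_read_request(block_info *info, int b, int c) {
    if (info->writer != -1) {
        revoke_write_token(info->writer, b);
        info->writer = -1;
    }
    if (info->nreaders < MAX_READERS)
        info->readers[info->nreaders++] = c;
    /* the block's data would be forwarded from the peer's cache here               */
}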

Page 81 of 197

L07c-Q11. Could you explain XFS Log Cleaning with an example?
The figure below illustrates a nice example of multiple writes to the same file blocks that generate holes in the log segments.
File block 1' in log segment 1 is updated to 1'' in log segment 2, leaving a hole for 1' in seg 1.
File block 2' in log segment 1 is updated to 2'' in log segment 3, leaving a hole for 2' in seg 1.
File blocks 1' and 2' are marked deleted and hence are holes in the log segment. File blocks 5', 3', 4' are live (active) blocks.
Each log segment is a contiguous area on disk. So, when portions of the log segment are marked deleted (shown by the red cross inside the log segment in the figure below), they leave "holes" in the log segment. These holes cause data fragmentation and wasted disk space. Remember that the benefit of log segments being contiguous portions of disk is that sequential writes are faster than random writes. Having holes makes us lose this benefit, so XFS periodically performs log cleaning: it reads all live (active) data from the log segments and copies it to a new log segment, reclaiming wasted disk space and preserving the benefit of sequential writes.
Remember that the Distributed Log Cleaning activity happens concurrently with writes to files, by separate threads working on different tasks on various cluster nodes in the distributed system. The clients that generate the data appended to log segments are also responsible for log cleaning, so there is no separation between clients and servers. Each Stripe Group has a designated leader cluster node which is responsible for assigning cleaning activity to the members of that Stripe Group. This Stripe Group Leader responsibility is different from the Manager responsibility, which handles meta-data management and is responsible for data integrity.
To summarize, the XFS Log Cleaning procedure aggregates (coalesces) all "live" (active) file blocks into a new log segment so that the old log segments can be garbage-collected (cleaned up), thereby removing the holes in the existing log segments.
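A rough sketch of the cleaning step itself: walk the old segments, skip the holes, and copy only the live blocks into a fresh segment. The types and sizes are invented for illustration.

/* Sketch of log cleaning; the old segments can be garbage-collected afterwards.   */
#include <string.h>

#define SEG_BLOCKS 128
#define BLOCK_SIZE 4096

typedef struct {
    int  live[SEG_BLOCKS];               /* 1 = block still current, 0 = a hole     */
    char data[SEG_BLOCKS][BLOCK_SIZE];
} segment_t;

/* Copies the live blocks of nsegs old segments into new_seg and returns how many   */
/* were copied.                                                                      */
int clean_segments(const segment_t *old_segs, int nsegs, segment_t *new_seg) {
    int out = 0;
    for (int s = 0; s < nsegs; s++)
        for (int b = 0; b < SEG_BLOCKS; b++) {
            if (!old_segs[s].live[b])
                continue;                /* skip the holes                           */
            if (out == SEG_BLOCKS)
                return out;              /* new segment full; a real cleaner would
                                            simply start another new segment         */
            memcpy(new_seg->data[out], old_segs[s].data[b], BLOCK_SIZE);
            new_seg->live[out] = 1;
            out++;
        }
    return out;
}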

Page 82 of 197

L07c-Q12. How does XFS generate the list of data blocks corresponding to a filename?
XFS has meta-data management that is not static, i.e. the meta-data management is distributed across various cluster nodes (space metric) and is dynamic since it changes over a period of time (time metric).
1. Each client has a manager-map (m-map) table, which is a lookup table that converts a file name to the corresponding meta-data manager for that file. Note that the m-map table is replicated on every cluster node for fast lookup.
2. The client contacts the manager node for a particular filename and goes through the following path/flow of table lookups to generate the associated data structures:
{Filename} → [FileDir] → {i-no.} → [i-map] → {Log i-node} → [StripeGroupMap] → [StorageServers] → {DataBlocks}
The filename is mapped to the log segment i-node, then to the Stripe Group containing that log segment, which is present on the various storage servers. The client goes to those storage servers and gets the data blocks corresponding to the file name.
3. These lookups do NOT have to be repeated for later accesses to the same file name, because the actual data blocks and the results of all the table lookups are cached.

Page 83 of 197

L07c-Q13. What are the various possibilities of table lookups when a file is accessed by a client?
Let's see the 3 possibilities described below when a client accesses a file. (In each path, some of the lookup steps are local accesses and some are remote/network accesses; the remote accesses dominate the latency.)
1. The data blocks for the file are already cached locally in the client's memory, which is the fastest of these possibilities (on the order of 10 nanoseconds).
{Filename, Offset} → [FileDir] → {Index#, Offset} → [UNIX Cache] → {Data Blocks}
2. The 2nd possibility is that the data blocks for the file are not in the client's local memory cache but are present in a peer cluster node's memory. In this case, the data block is fetched from the peer's cluster memory. This is the second-best path for file access, since accessing remote memory (on the order of 100 microseconds) is still faster than accessing a disk (around 10 ms).
{Filename, Offset} → [FileDir] → {Index#, Offset} → [m-map] → {Mgr} → [Peer UNIX Cache] → {Data Blocks}
3. If the data blocks are not present in a peer cluster node's memory, then various table lookups distributed over the cluster nodes provide the set of storage servers from which the data blocks are read. This could take up to 3 network hops. This is the slowest of the 3 possibilities, since it involves reading from electromechanical disks (on the order of tens of milliseconds).
{Filename, Offset} → [FileDir] → {Index#, Offset} → [m-map] → {Mgr} → [i-map] → {Log i-node} → [StripeGroupMap] → [StorageServers] → {DataBlocks}

Page 84 of 197

L07c-Q14. What happens in the scenario when clients write to a file in XFS?
1. At a high level, the client writes to a Stripe Group and notifies the manager about the latest status of the file.
2. At a low level, the client aggregates all the changes to files into the log segment data structure in memory. After a space or time threshold is reached, this in-memory log segment is flushed to a contiguous portion of a Stripe Group of storage servers. Finally, the client notifies the manager of the latest status of these modified files so that the manager has up-to-date information about the files it manages.
Other examples of DFS: the Andrew File System and the Coda File System at CMU.
To summarize, some key technical innovations of XFS are:
1. Log-based Striping.
2. Stripe Groups: striping across a subset of the storage servers.
3. The fusion of cooperative caching and dynamic management of data and meta-data.
4. Distributed log cleaning, such that file system clients can perform data mutations as well as perform log cleaning efficiently by keeping count of the changes they make to log segments.
These technical innovations are reusable nuggets of technology. Modern distributed file systems reuse these nuggets to make their implementations scalable by removing centralization and utilizing memory across the LAN nodes intelligently.

Page 85 of 197

Illustrated Notes for L08a: LRVM: Lightweight Recoverable Virtual Memory
L08a-Q1. What is the lesson outline?
This lesson is on System Recovery, i.e. how to build systems that can survive failures. Failures in computer systems can be of 3 types:
1. Power failures,
2. Hardware failures, and
3. Software failures.
We will be looking at 3 systems:
1. LRVM: provides Persistent Virtual Memory as an OS system service or OS subsystem.
2. RioVista: provides performance-conscious Persistent Virtual Memory.
3. QuickSilver: provides Recovery as a first-class citizen built into the OS design.
Some background context for this lesson: Disk access is slow because a disk is an electromechanical device with a rotating platter and a head that accesses data over the rotating platter. Disk access latency involves:
1. Rotational latency (time for the platter to spin to the correct location) and
2. Seek latency (time for the head to move to the correct position).
Random writes to disk incur both rotational and seek latency for each write and hence are slow. Sequential writes incur rotational and seek latency only for the first write, and subsequent sequential writes are very fast. So, sequential writes are much faster than random writes.
Many papers and people use similar terminology for different concepts, which can be confusing when the concepts are already complex. In this lesson, we will use some new terminology that builds on top of the lectures but is not present in the lectures (e.g. "redo log entry"). This new terminology is expected to aid in a better understanding of the concepts.

Page 86 of 197

L08a-Q2. Why is Persistence important for OS subsystems? What is the key idea of LRVM? Persistence of data means to store data on “stable” storage such that the data is available on a power recycle of switching the computer off and on. Persistence is important for various OS subsystems. For example, for a file system, it is important for the file system to persist critical meta-data information like i-nodes and other important structural information on how files are stored on disk. The OS subsystems read such critical information from disk and use the cached data from memory for performance reasons since data access from memory (~50 ns) is faster than from disk (~10 ms). However, this cached data needs to be flushed back to disk (aka stable storage) for consistency reasons. If the virtual memory system is made persistent then the data structures in the virtual memory automatically become persistent and the OS subsystems do NOT have to worry about explicit flush of data to stable storage (disk). This would make recovery from failures easy since critical data is available in persistent, stable storage. Such an abstraction to provide persistent virtual memory will need to be performant (fast) and efficient (use less OS resources), and be intuitive and flexible. The key idea in LRVM is to make only certain important portions of virtual memory persistent, by converting all the random writes of these virtual memory portions to log segment writes in memory and then write this log segment in memory to a contiguous area on disk so that the random writes are converted into sequential writes, thus improving performance significantly. See figure below that explains this LRVM idea. 2 main accomplishments or effects of the LRVM idea are: 1. I/O operations per commit are reduced thereby making the implementation efficient. 2. Random writes are converted to Sequential writes thereby making implementation performant.

Page 87 of 197

L08a-Q3. Could you explain the LRVM Server Design? 1. First let’s define some terms. An in-memory Data Segment is a collection of data structures that need to be persistent on disk. 2. An application maps selected, important portions of its Address Space (NOT complete AS) to an external Data Segment on disk (aka backing store) in order to create an in-memory version of the persistent data structures. We will refer to this file as Data Segment. 3. The application completely manages its own persistence needs using the LRVM API. 4. The server design can choose to use either a single or multiple external data segments. 5. The mapping between the virtual address space portions and the external data segments is one-to-one and there is no overlap between external data segments, which makes the design simple. 6. LRVM has a Data Segment that represents the in-memory persistent data structures, and a Log Segment that aggregates changes to files to convert random writes to sequential writes and reduces the number of disk I/O operations required for each transaction commit. The Log Segment is also called a “redo log” file. Typically, an application will have multiple Data Segments but only 1 Log Segment (aka redo log file). Some applications could have a few (3-10) log segments. 7. In short, the LRVM goals are to be: a. Performant (fast) b. Efficient (use less OS resources) c. Flexible (extensible) d. Easy-to-use (intuitive). 8. LRVM is NOT part of the OS kernel, but is provided as a runtime library that sits on top of the OS in user-land. In the subsequent slides, the approach used to explain concepts is to explain first at a high-level, then go to a little deeper-level and so on. So, if you do not fully understand at first, continue reading to the subsequent slides and reach the end of the lesson and then retry one more time. I think it will help! This is not as complex as it seems.

Page 88 of 197

L08a-Q4. What are the LRVM primitives? Initialization primitives: performed once at startup and shutdown of LRVM. 1. initialize(options): identifies a log segment (aka redo file) to be used for recording data structures persistently. Every process declares its own log segment to record its changes to persistent backing store. 2. map(region, options): map a region of the Virtual Address Space (AS) to the specified external data segment. 3. unmap(region): decouple the region of the Virtual AS from the specified external data segment. Server code primitives: executed many times for each transaction. 1. begin_xact(&tid, restore_mode): starts a transaction and returns a new unique transaction id, tid. In some implementations, this can be a transaction pointer. This call does not perform any writes to log or data segments. 2. set_range(tid, addr, size): specifies portion of memory that will be modified next. This call does not perform any writes to log or data segments. 3. end_xact(tid, commit_mode): persistently flush the modifications (aka mods) to disk. This call writes the changes made between begin and end transaction as redo log entries to the log segment. Note that the changes are NOT written to the data segment yet. These changes in the redo logs of the log segment are written to (called “applied to”) the data segment lazily at a later point of time. 4. abort_xact(tid): throw away the modifications made as part of this transaction. This call does not perform any write to the log or data segments. The flow is usually either of the following sequences: begin_xact(); set_range(); Do the modifications; end_xact(); : transaction is committed OR begin_xact(); set_range(); Do the modifications; abort_xact(); : transaction is aborted. The code between begin and end transaction is similar to a critical section of code.
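Putting the primitives together, here is a minimal sketch of the canonical usage pattern; the exact C signatures of the real LRVM library differ, so the stubs below are no-op stand-ins that only make the sketch compile, and the file names are made up.

/* Minimal sketch of the canonical LRVM usage pattern described above (stubs only). */
#include <stdio.h>

typedef int tid_t;

static void initialize(const char *log_file)             { (void)log_file; }
static void map(void *region, int size, const char *seg) { (void)region; (void)size; (void)seg; }
static void begin_xact(tid_t *tid, int restore_mode)     { *tid = 101; (void)restore_mode; }
static void set_range(tid_t tid, void *addr, int size)   { (void)tid; (void)addr; (void)size; }
static void end_xact(tid_t tid, int commit_mode)         { (void)tid; (void)commit_mode; }

struct inode_table { int n_inodes; };     /* some critical, persistent data structure */
static struct inode_table table;

static void update_metadata(void) {
    tid_t tid;
    begin_xact(&tid, 0);                  /* open a transaction                       */
    set_range(tid, &table, sizeof table); /* declare what will be modified (undo)     */
    table.n_inodes += 1;                  /* the actual change to persistent memory   */
    end_xact(tid, 0);                     /* commit: redo log entry goes to the log   */
}

int main(void) {
    initialize("/tmp/lrvm_redo.log");                   /* once per process            */
    map(&table, sizeof table, "/tmp/inode_table.seg");  /* make the table persistent   */
    update_metadata();
    printf("n_inodes = %d\n", table.n_inodes);
    return 0;
}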

Page 89 of 197

L08a-Q5. LRVM primitives (Continued). How does a server actually use LRVM primitives?
1. The changes in the redo logs of the log segment are lazily applied to the data segment at opportune points in time.
2. When all the redo log entries in a log segment have been applied to the data segment, the log segment is cleaned up (aka truncated) using the truncate() call.
3. The flush() and truncate() operations are performed automatically by LRVM, but they are also exposed to the application programmer for flexibility in writing the application.
4. The concept of a transaction here is intended for Recovery Management only; it does NOT provide all the ACID properties of a database. It provides only the A and D of the ACID properties, i.e. Atomicity: Yes, Consistency/Concurrency control: No, Isolation: No, Durability: Yes. [I think that LRVM does provide Consistency but it does NOT provide Concurrency control.]
5. The intuitive LRVM API is easy for a developer to grok (understand deeply).
6. LRVM does not allow nested transactions, and there is no concurrency control.
7. LRVM provides an optimization configuration parameter to defer flushing of changes to disk until control reaches the commit-transaction call.
The first part of the code is a one-time, per-process initialization of log segments using:
initialize(options); map(region, options);
Each process performs these initialization steps only once at the beginning of LRVM API use.
1. initialize() initializes a log segment on disk.
2. map() maps application-specific portions of the application's address space to a data segment to make those portions "persistent".
The flow then uses either of the following sequences for each transaction:
begin_xact(); set_range(); do the modifications; end_xact(); : the transaction is committed, OR
begin_xact(); set_range(); do the modifications; abort_xact(); : the transaction is aborted.
The code between begin and end transaction is similar to a critical section of code.
Finally, the process calls the one-time de-initialization steps to clean up log segments:
unmap(region, options); deinitialize(options);

Page 90 of 197

L08a-Q6. Can you describe the inner workings of a transaction in detail? Let’s dig deeper into what happens for each transaction: 1. begin_xact() and end_xact() are used before and after the code that makes changes to the mapped persistent data structures. Only one transaction can be open for each process. 2. set_range() specifies the portions of memory within the mapped range of address that will be modified within the current transaction. It returns failure if a transaction is not open. The set_range() call is needed because it helps LRVM create an “undo” record for the upcoming modification, i.e. it creates an in-memory before-image of the memory portion that will be modified within the current transaction. If something goes wrong, this “undo” data will help LRVM restore this “undo” data back to that memory portion making it consistent. 3. There can be multiple set_range() and corresponding write() calls within one transaction, and it is usually a good programming practice to have the set_range() call be immediately followed by the corresponding write() call for better code readability. 4. end_xact() generates “redo log entries” in memory for the modifications made corresponding to set_range() calls. All these “redo log entries” are the collection of modifications made for the current transaction. The end_xact() call has an optional optimization called “sync mode” or “async mode”. The default is “sync” mode which forces a flush of “redo log entries” of all pending transactions not flushed yet. An optional mode is “async mode” in which the in-memory “redo log entries” for the current transaction are NOT flushed to disk at transaction commit. Note that end_xact() closes (aka commits) the transaction but the flush of the redo log entries is performed if “sync mode” is set (default behavior). And, the flush of the redo log entries is deferred only if “async mode” (aka no_flush) is set. The default “sync mode” blocks the API caller until the write to disk is complete, whereas the “async mode” does NOT block the API caller since there is no disk write. The “async mode” reduces the number of disk I/O operations and improves performance, but has the drawback of introducing a time window of vulnerability, during which if the application crashes or dies, then all the pending writes in memory are lost. So the choice of using “sync” or “async” mode in end_xact() is a tradeoff decision between Reliability or Performance respectively. 5. abort_xact() performs the important step of restoring the undo record back to the corresponding memory location so that the application gets the original before-image data back. Then it throws-away the in-memory undo record and closes the current transaction. Remember that only one of end_xact() OR abort_xact() calls can be called to close a transaction. 6. An optional optimization is to specify restore_mode in begin_tx() to be “no_restore_mode”. This is used in cases where the application knows that the transaction will never abort and so the application tells LRVM to not expect a set_range() call OR if a set_range() call is made by the application, then LRVM skips undo record creation since it will never be needed. This helps squeeze more performance from LRVM. 
The advantage of wrapping such application writes within a transaction, even when we know that the transaction will never abort, is that the transaction helps aggregate writes, thereby reducing the number of I/O operations per transaction commit, and also converts random writes to sequential writes, making the writes more performant and more efficient.
Page 91 of 197

L08a-Q7. Can you describe the inner workings of a transaction in detail with a concrete example? 1. begin_xact(&tid, restore_mode); // creates a new transaction with tid = say, 101 2. set_range(tid, 0x100, 5); // creates undo record1 by copying into it data from 0x100 of 5 bytes Assuming address 0x100 had data = “A2345”, undo record1 = “A2345” (aka before-image). 3. write(0x100, “hello”, 5); // writes “hello” to address 0x100 of size 5 bytes 4. write(0x100, “HELLO”, 5); // over-writes “HELLO” to address 0x100 of size 5 bytes Note: undo record1 has data = “A2345”. 5. set_range(tid, 0x200, 7); // creates undo record2 by copying into it data from 0x200 of 7 bytes Assuming address 0x200 had data = “B234567”, undo record2 = “B234567” (aka before-image). 6. write(0x200, “NAMASTE”, 7); // write “NAMASTE” to address 0x200 of size 7 bytes Note: undo record1 has data = “A2345” and undo record2 has data = “B234567”. 7. end_xact(tid, commit_mode); // persistently flush the modifications (aka mods) to disk. mod1 = “HELLO” for address=0x100 of size=5 bytes, and mod2 = “NAMASTE” for address=0x200 of size=7 bytes. This call writes the mods made between begin and end transaction as redo logs to log segment. Note that the changes are NOT written to the data segment yet. These changes in the redo logs of the log segment are written to (called “applied to”) the data segment lazily at a later point of time. The undo records are thrown away (discarded). OR 7. abort_xact(tid); // throw away the modifications (mods) made as part of this transaction and restores (copies-back) the undo records back to the appropriate memory locations. This call does not perform any write to the log or data segments. i.e. copy undo record1=”A2345” to address=0x100 for size=5 bytes, and copy undo record2=”B234567” to address=0x200 for size=7 bytes.

Page 92 of 197

L08a-Q8. Could you summarize the optimization opportunities in LRVM API? The LRVM API provides 2 opportunities for the application programmer to optimize transactions: 1. Use no_restore_mode in begin_xact(&tid, restore_mode): No need to create the in-memory undo record for data modifications within the transaction. 2. Use no_flush_mode in end_xact(tid, commit_mode): No need to sync flush the redo log entry to log segment on disk. The persistence is deferred and performed lazily. The “sync mode” blocks the API caller until the write to disk is complete, whereas The “async mode” does NOT block the API caller since there is no disk write. The “async mode” reduces the number of disk I/O operations and improves performance, but has the drawback of introducing a time window of vulnerability, during which if the application crashes or dies, then all the pending writes in memory are lost. So the choice of using “sync” or “async” mode in end_xact() is a tradeoff decision between Reliability or Performance. Recall that the LRVM transaction semantic is a stripped-down version of the traditional transaction semantic and has only A and D of the ACID properties of a traditional transaction. Transactional systems perform and scale well when we do NOT use the full ACID semantic requirements of Transactions and in particular, do NOT use Sync I/O, which improves Performance but deteriorates Reliability (due to increased vulnerability of data loss). One of the most important goals of LRVM is an efficient implementation and the restricted semantics for transactions help in achieving that goal and hence is called Light-weight RVM.

Page 93 of 197

L08a-Q9. How is the implementation of LRVM made efficient?
One of the strategies used for an efficient implementation of LRVM is the "no undo, redo value logging" strategy.
It is called "no undo logging" because the undo record is retained in memory (NOT written to disk) only for the duration of the transaction and is thrown away at the close of the transaction.
It is called "redo value logging" because LRVM records only the redo log entries (the new values of the data modifications) in the log segment on disk.
The forward displacements of the redo log entries allow LRVM to easily append new redo log entries to the log segment. The reverse displacements of the redo log entries allow LRVM to easily traverse the log records backwards during recovery.
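One way to picture a redo log record with forward and reverse displacements is sketched below; the field names and layout are assumptions, not the actual LRVM on-disk format.

/* Sketch of one redo log record with forward/reverse displacements (illustrative). */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t tid;            /* transaction this record belongs to                   */
    uint32_t num_mods;       /* how many (addr, len, data) modifications follow      */
    uint32_t fwd_disp;       /* bytes from the start of this record to the next one  */
    uint32_t rev_disp;       /* bytes back to the start of the previous record       */
    /* followed in the log by num_mods modification entries and their new values     */
} redo_record_header;

/* Appending uses fwd_disp: the next record starts fwd_disp bytes ahead.             */
size_t next_record_offset(size_t this_offset, const redo_record_header *h) {
    return this_offset + h->fwd_disp;
}

/* Recovery can walk backwards from the tail of the log using rev_disp.              */
size_t prev_record_offset(size_t this_offset, const redo_record_header *h) {
    return this_offset - h->rev_disp;
}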

Page 94 of 197

L08a-Q10. How does crash recovery work in LRVM? 1. Each redo log entry has a transaction header and the data modifications made as part of the transaction. So, after resuming from a crash, all redo log entries of committed transactions from the log segment are “applied” to the data segment. 2. The forward and reverse displacements help the recovery procedure traverse the log segment easily and apply the redo log entries to the data segments so that the data segments are made consistent. 3. Once all the redo log entries from the log segment are applied to the data segment, the log segment is thrown away since that data is no longer needed. This is how LRVM recovery works at a high-level. An alternate approach: A more fine-grained way of implementing Log Truncation would be to look at the in-memory copy of the log segment and apply it directly to external data segments so as to avoid incurring the cost of writing a disk version of the redo log, but it would have no performance guarantees.

Page 95 of 197

L08a-Q11. How does log truncation happen in parallel with redo log entry writes to the log? 1. Log Truncation is the procedure of reading the redo log entries from the log segment and applying them to the data segments so that those redo log entries can be deleted from the log segment. 2. The Log Truncation happens in parallel along with forward processing of adding new redo log entries to the log segment. 3. This parallel processing of Log Truncation and logging new transaction data is enabled by having each step work on a set of redo entries, identified by a term called “Epoch”. So, the Log Truncation process works on the “Truncation Epoch”, a particular portion of redo log entries in the log segment and in parallel, the application works on the “Current Epoch”, another portion of redo log entries in the log segment that correspond to data modifications happening as part of the current transaction. In other words, the log segment is split into 2 Epochs: a. Truncation Epoch: used by Log Truncation, and b. Current Epoch: used by current transaction to log data modifications in parallel with Log Truncation. 4. Log Truncation can be deferred indefinitely and performed only when the map(region, options) call is performed during the initialization phase. This has the drawback of requiring sufficient disk space to hold all the transaction modifications in the log segment until it is mapped for use by another process. The map() procedure performs log truncation, applies the data modifications from the redo log entries in the log segment to the data segment, truncates the log segment and makes the system ready for further changes. 5. Managing the log segments including truncating them is one of the heaviest problems in an LRVM implementation because it has a direct consequence on the performance of LRVM. To conclude, this is how LRVM uses light-weight transactions to manage persistence for critical data structures.

Page 96 of 197

Illustrated Notes for L08b: RV: RioVista: Performant Persistent Memory
L08b-Q1. What is the lesson outline?
In this section, we are studying 3 systems:
1. LRVM: provides Persistent Virtual Memory as an OS system service or OS subsystem.
2. RioVista: provides performance-conscious Persistent Virtual Memory.
3. QuickSilver: provides Recovery as a first-class citizen built into the OS design.
We have studied LRVM so far. To recap, LRVM uses lightweight transaction semantics to provide Persistent Virtual Memory as an OS system service so that applications can reliably recover from failures. The transaction semantics are termed lightweight because LRVM drops most of the heavyweight ACID properties associated with a transaction, keeping only Atomicity and Durability. In LRVM, changes to virtual memory are synchronously written as redo log entries to a log segment (aka redo file) on disk, blocking the application at the end of every transaction; later, these redo log entries are applied from the log segment to the data segment so that the redo log entries can be cleaned up and the log truncated. This heavyweight, synchronous I/O at the end of every transaction blocks the application, and RioVista solves this problem of LRVM.
In this lesson, we will study RioVista. RioVista is a performance-conscious design and implementation of persistent memory that eliminates the heavyweight, synchronous I/O at every transaction commit in LRVM.

Page 97 of 197

L08b-Q2. How is RioVista design different from LRVM?
There are 2 types of failures:
1. Power failure: loss of power wipes out the contents of volatile memory (DRAM).
2. Software failure: an application crash due to some software defect/bug.
RioVista poses a rhetorical question: if we postulate that the only source of a system crash is software failure, and that power failures are somehow handled so that they can never happen, then how does that change the design and implementation of failure recovery?
The way power failures can be handled so that they can never happen is by having battery-backed memory or UPS-backed memory so that the memory contents always survive. UPS = Uninterruptible Power Supply.
RioVista is an interesting experiment and is applicable in today's environment if we think of the battery-backed persistent memory as an SSD (Solid-State Drive), in which the memory is non-volatile and remains intact even after a power cycle.

Page 98 of 197

L08b-Q3. Could you summarize LRVM so that we can easily compare it with RioVista? Here are the steps of LRVM: 1. begin_xact() starts a new transaction and returns a transaction id or a transaction pointer. Then, set_range() creates an in-memory “undo record” as before-images of data that will be modified next. This is the 1st “UNDO” memory copy by LRVM. Next, the application program performs writes to the memory portions for which set_range() created the in-memory undo records. 2. end_xact() commits the transaction, in which it does a synchronous disk write to log segment of the redo log entries corresponding to the writes performed by the application and then discards the in-memory undo records. This is the 2nd “REDO” disk copy from memory to disk. 3. At a later point of time, LRVM applies (aka replays) the redo log entries from the log segment to the data segment and gets rid of the redo log entries in the log segment to truncate the log. This is the 3rd “APPLY” disk write. Thus, LRVM makes 3 copies corresponding to the UNDO, REDO and APPLY steps to manage persistence for recoverable objects. An optimization available for the application programmer is to set the no_flush mode during transaction commit so that the synchronous flush of the redo log entries to log segment can be deferred to a later point of time. This no_flush optimization increases the system vulnerability in favor of system performance because if a failure happens after the no_flush transaction commit but before the in-memory redo log entries are flushed to disk, then the data writes made as part of the committed transactions will be lost. So, one of the biggest sources of vulnerability in LRVM is power failure because data could be lost when we try to use performant no_flush-enabled LRVM. RioVista solves this problem of LRVM which we will explain further.

Page 99 of 197

L08b-Q4. Ok, I am now curious to know more about RioVista. Tell me more.
Good, here is more information on RioVista. First, let's talk about the Rio layer of RioVista.
1. At a high level, Rio provides a persistent file cache using battery-backed DRAM.
2. The application program writes directly to the persistent file cache (battery-backed DRAM). Since the persistent file cache is resilient to power failures, there is no need to do synchronous I/O.
3. "Write-back" of files from the Rio persistent file cache to the disk can now be arbitrarily delayed to a later point in time. This has the benefit of aggregating I/O and improving system performance.
4. As an example, if the application uses short-lived, temporary files, then these files will be created in the persistent file cache and soon deleted, so they are never written to disk, saving work. The creation of temporary files is common in many applications; for example, compiling programs creates such short-lived temporary files, database queries can create short-lived temporary files, etc.
Typically, an application uses one of the following models for file I/O:
1. The application opens a file using the open() call, reads from the file using pread() at an explicit offset, writes to the file using pwrite() at an explicit offset, flushes data from the file system cache to disk using fsync(), and finally closes the file using close(). OR
2. The application opens a file using the open() call, memory-maps the file into its address space using the mmap() call to get a pointer to the mapped memory region, reads and writes directly through this pointer, flushes the cached data to disk using the msync() call, and finally closes the file using close().
The mmap mode of application I/O is the most convenient for application programmers because they can directly read from and write to memory and do NOT have to make explicit pread() and pwrite() calls. Additionally, this mmap() mode of implicit I/O is more performant and efficient compared to the pread/pwrite mode of explicit I/O.
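The two I/O models above correspond to the standard POSIX calls shown in this minimal sketch (the file name is arbitrary and error handling is omitted for brevity).

/* The two POSIX I/O models described above; error handling omitted for brevity.   */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* Model 1: explicit I/O with pread/pwrite and fsync                            */
    int fd = open("/tmp/example.dat", O_RDWR | O_CREAT, 0644);
    char buf[6] = "hello";
    pwrite(fd, buf, 5, 0);        /* write 5 bytes at offset 0                      */
    fsync(fd);                    /* force the file system cache out to disk        */

    /* Model 2: implicit I/O via mmap; loads/stores go straight to the file cache   */
    ftruncate(fd, 4096);
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    memcpy(p, "HELLO", 5);        /* ordinary memory writes, no pwrite needed       */
    msync(p, 4096, MS_SYNC);      /* flush the mapped pages to disk                 */
    munmap(p, 4096);
    close(fd);                    /* a real program would also check for MAP_FAILED */
    return 0;
}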

Page 100 of 197

L08b-Q5. Interesting! Tell me about Vista now. Vista is a RVM library on top of Rio’s persistent file cache. That’s RioVista!

Rio means river in Spanish. Vista means a pleasant view. Here are the summarized steps performed by an application to use RioVista:
1. The application maps its memory region to an external data segment using the RioVista library to make the memory region persistent.
2. The application calls begin_xact(), which starts a new transaction.
3. The application calls set_range() on the portions of memory that it will modify next. The set_range() call copies the before-image as undo records into the Rio persistent file cache.
4. The application writes to the mapped portion of the memory region, and these writes are automatically persistent because they go to the battery-backed DRAM of the Rio persistent file cache.
5. The application calls end_xact(), which simply throws away the undo records. Note that end_xact() does NOT need to perform any work or disk I/O for redo log entries! Pleasing, isn't it?
You may wonder: so, what happens if the transaction is aborted? If the application calls abort_xact(), then the undo records are used to "restore" the modified memory portions back to their original state from before the beginning of the transaction, and then the undo records are thrown away. Again, there is no need to perform any disk I/O. Cool!
In short, the implication of having RVM on top of the Rio persistent file cache is that there is no need for expensive disk I/O at all!
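The reason end_xact() needs no disk I/O can be sketched as follows, using invented names; the key point is that the before-image saved by set_range() already lives in persistent (battery-backed) memory, so commit only discards it and abort only copies it back.

/* Sketch of the Vista idea; names are invented. In a real system the undo record   */
/* would be allocated inside Rio's battery-backed file cache, not ordinary heap.    */
#include <string.h>
#include <stdlib.h>

typedef struct {
    void  *addr;        /* region the application is about to modify                 */
    size_t size;
    void  *before;      /* before-image, held in the persistent file cache           */
} vista_undo;

vista_undo *vista_set_range(void *addr, size_t size) {
    vista_undo *u = malloc(sizeof *u);        /* imagine this living in Rio's cache  */
    u->addr = addr;
    u->size = size;
    u->before = malloc(size);
    memcpy(u->before, addr, size);            /* save the before-image               */
    return u;
}

void vista_end_xact(vista_undo *u) {          /* commit: just drop the undo record   */
    free(u->before);
    free(u);
}

void vista_abort_xact(vista_undo *u) {        /* abort (or crash recovery): restore  */
    memcpy(u->addr, u->before, u->size);
    free(u->before);
    free(u);
}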

Page 101 of 197

L08b-Q6. Hey, what happens in the scenario of a crash?
A crash scenario is simple! Treat the crash like an abort of the transaction, abort_xact(). Remember that the undo records in the Rio persistent file cache survive a software crash, and we know that power failures are already handled through the use of battery-backed DRAM. So, after a software crash, the persistent undo records from the Rio persistent file cache are used to "restore" the modified memory portions back to their original state from before the beginning of the transaction, and then the undo records are thrown away. Again, there is no need to perform any disk I/O. Super-cool!
RioVista also handles the scenario of a software crash that happens during crash recovery itself, because the recovery operation is idempotent. Super-pleasing! Rio-Vista!
Some background on idempotent operations:
1. Setting a variable to some value (e.g. var = 11) is an idempotent operation. The effect is the same whether you do it once or multiple times.
2. Incrementing a variable (e.g. var = var + 1) is a NON-idempotent operation. The effect is different if you do it once versus multiple times.

Page 102 of 197

L08b-Q7. RioVista seems simple to use and implement, correct? Absolutely Yes! You are correct. The Vista code-base is merely 700 LOC (Lines Of Code) as compared to 10 KLOC of LRVM. To summarize, here are the benefits of RioVista: 1. RioVista is simple to use, design and implement. The checkpointing and recovery implementation is simplified significantly. 2. RioVista avoids expensive, synchronous disk I/O by having no redo log files (log segments) and thus no log truncation. RioVista does NOT need any group commit optimizations either. 3. RioVista performance is 3x better than LRVM performance. Thus, RioVista is simple like LRVM but is also performant (fast) and efficient (uses less resources).

RioVista is an interesting thought-experiment. It basically shows how, if you change the starting assumption for a problem, then you can come to a completely different design! In the RioVista case, the starting assumption was that the source of crashes is only software failures and NOT power failures. This changes everything. Remember that the simple trick that enables the RioVista magic is to make the DRAM persistent. Notice how the presence of some assumptions and hardware resources simplifies software! See also: The Butterfly effect in Chaos Theory.

Page 103 of 197

Illustrated Notes for L08c: QS: Quick Silver: Transactional Operating System L08c-Q1. What are these screen snapshots about? 1. These screen snapshots demonstrate the effect of some non-hygienic programs that do not perform appropriate cleanup at graceful shutdown (e.g. application stops) or at ungraceful shutdown (e.g application crashes). 2. When an application has ungraceful shutdown, the application usually has some intermediate state (aka breadcrumbs) strewn at various places in the computer, for example, as temporary files in directories and/or as intermediate data in shared memory, as orphaned windows, as memory leaks, etc. 3. These breadcrumbs occupy precious resources in a computer and hence it is important to avoid generation of these breadcrumbs or ensure that they are cleaned-up in a timely manner. 4. For the example of a stateless file server like the NFS file server, the stateless server retains no state pertaining to each client. Each operation made by the client to the server is an Idempotent operation. Therefore, when a client that has initiated a session with the NFS file server quits in the middle of a session, the NFS server has no way of knowing that some state has been left over and needs to be cleaned-up. 5. This is the problem that is solved by QuickSilver, which we will study in this lesson.

Page 104 of 197

L08c-Q2. Hmm, so what does QuickSilver do to solve the problem of breadcrumb state cleanup? 1. QuickSilver makes Recovery a 1st class citizen in the OS design. This means that Recovery is available in-built into the OS itself so that an application can use it readily. 2. Conventional wisdom says that Performance and Reliability are opposing concerns and usually there is a tradeoff between Performance and Reliability, so that we can achieve only one of them at the expense of the other. 3. QuickSilver’s approach is that if recovery is taken seriously from the beginning, then we can design the system to be robust to failures without losing much on performance. That is, we can achieve both Performance and Reliability, i.e. we can eat the cake and have it too. 4. QuickSilver was conceived as a workstation OS at IBM in 1984. Recall that NFS was conceived at Sun Microsystems in 1985 (around same time-frame as QuickSilver).

Page 105 of 197

L08c-Q3. So, QuickSilver aims to provide performance in addition to functionality similar to the Micro-kernel architecture, correct? 1. Correct! In fact, QuickSilver uses a micro-kernel based design and has the same vintage as Mach OS from CMU (Carnegie Mellon University). 2. Recall that a Micro-kernel structure for OS design lends itself to OS extensibility and yet has high performance. In a Micro-kernel structure used in distributed systems today, the micro-kernel is a thin framework layer that provides only process management, IPC, and manages hardware resources. Other system services like the file server, web server, window manager, network stack, etc. are implemented as server processes that sit on top of the micro-kernel and could be distributed on various cluster nodes in a distributed system. This Micro-kernel based distributed structure lends itself to extensibility and yet has high performance. 3. QuickSilver was conceived as a distributed, workstation OS at IBM in 1984. QuickSilver was the first distributed system to propose transaction as a unifying concept for recovery management of the servers, i.e. QuickSilver was the first to propose Transactions in Operating Systems.

Page 106 of 197

L08c-Q4. How is IPC structured in QuickSilver?
1. QuickSilver is a distributed OS, and hence both intra-node and inter-node IPC is a fundamental system service that QuickSilver provides to applications.
2. The figure below shows the semantics of the IPC call. The OS kernel has a data structure called the service queue, service_q, that is created by the server to queue service requests from clients. A client puts a service request on the service_q and waits for a completion response. Once a service request is queued in the service_q, the OS makes an upcall to the server so that the server can process the request and queue the response back onto the service_q. This response is then handed by the OS kernel back to the waiting client. This is an example of a synchronous client call, in which the client does a blocking wait for the completion of each request.
3. Next, let's discuss how a non-blocking, asynchronous client call works. A client puts a service request on the service_q and is given a request id, since it does NOT wait for a completion response. Once the service request is queued in the service_q, the OS makes an upcall to the server so that the server can process the request and queue the response back onto the service_q. The client can later call wait on the request id in a blocking manner, or it can periodically check for a completion response corresponding to that request id in a non-blocking manner. The asynchronous semantics allow multiple requests and responses to be buffered in the service_q.
4. The service_q data structure in the kernel is globally unique for each system service. There is location transparency for client-server interactions, i.e. a client does not need to know where in the network a particular service-providing server (or client) is located.
5. Any number of servers can offer their services to be used by clients in the distributed system. The IPC semantics also allow multiple servers to wait on a single service queue, where requests are handed to the appropriate server based on its current load.
6. The client-server relationship is interchangeable, meaning that a particular server can become a client of another server.
7. The RPC (Remote Procedure Call) paradigm was invented around the same time as the QuickSilver OS. Since all services are contained in several processes distributed across various cluster nodes, the IPC subsystem is fundamental to QuickSilver.
8. Thus, QuickSilver IPC provides the following fundamental guarantees:
a. No loss of requests,
b. No duplication of requests, i.e. each request is handled exactly once,
c. Location transparency, and
d. Synchronous and asynchronous IPC semantics.
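A purely illustrative sketch of the service_q idea follows; every structure and field name is invented, and the blocking, upcall and response-delivery machinery is elided.

/* Illustrative sketch only; QuickSilver's real kernel data structures differ.      */
#include <string.h>

#define QLEN 32

typedef enum { REQ_PENDING, REQ_IN_SERVICE, REQ_COMPLETED } req_state;

typedef struct {
    int       request_id;      /* returned to an async client so it can wait later   */
    int       client;
    req_state state;
    char      payload[256];
    char      response[256];
} request_t;

typedef struct {
    request_t slots[QLEN];     /* globally unique queue for one system service       */
    int       next_id;
    int       tail;
} service_q;

/* Synchronous call: enqueue, then block until the server fills in the response.     */
/* Asynchronous call: enqueue, return the request_id immediately, and wait() or poll */
/* on that id later. Either way, queuing the request triggers an upcall to the       */
/* server that owns this service_q (upcall machinery not shown here).                */
int enqueue_request(service_q *q, int client, const char *payload) {
    request_t *r = &q->slots[q->tail % QLEN];
    q->tail++;
    r->request_id = ++q->next_id;
    r->client = client;
    r->state = REQ_PENDING;
    strncpy(r->payload, payload, sizeof r->payload - 1);
    r->payload[sizeof r->payload - 1] = '\0';
    return r->request_id;
}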

Page 107 of 197

L08c-Q5. How does QuickSilver use transactions to provide recovery management? 1. QuickSilver bundles IPC with recovery management. QuickSilver uses the notion of a distributed, light-weight transaction tree bundled with IPC as the secret sauce to provide recovery management. Since QuickSilver predates LRVM, LRVM inherits the transaction semantics proposed in QuickSilver. 2. Because of the feature of location transparency provided by QuickSilver, a client on node A contacts the QuickSilver kernel on node A, which in turn, contacts the Communication Manager (CM) on node A, which then contacts the CM on node B, to transparently get the work done across the distributed cluster nodes. The communication between the CMs across different cluster nodes is reliable, i.e. it recovers from link failures. Additionally, any state left behind in the distributed system due to server failures is recoverable. 3. So, when a client on node A initiates connection to a server, it starts the “root” (aka “coordinator”) of a transaction tree which, in turn, contacts other cluster nodes which are called “participants” of this distributed transaction. Each participant, in turn, contacts other cluster nodes forming a distributed transaction tree. This functionality of transactions is provided by the Transaction Manager (TM) module on each cluster node. Note that the initiator of the transaction is at the root of the transaction tree and is its owner, whereas all other cluster nodes participating in this transaction tree are called participants. However, the transaction owner can change the ownership i.e the coordinator role of a transaction tree. The transaction tree gets created automatically as a by-product of the client-server interactions and it can span several cluster nodes. 4. The transaction messages are piggybacked on top of regular IPC due to which communication happens naturally as part of IPC. Hence, there is no extra overhead for communication between Transaction Managers (TMs) on different cluster nodes. Also, the IPC calls are automatically tagged with the corresponding Transaction ID for the transaction tree created. 5. The Transaction tree is used for the cleanup of state during abnormal termination of the distributed transaction, e.g. a client opening multiple windows on different cluster nodes will cleanup all windows if the distributed transaction aborts, and a client opening multiple files on different cluster nodes will cleanup state for the open files if the distributed transaction aborts. Thus, recoverability uses distributed transactions to provide multi-node atomicity of operations.

Page 108 of 197

L08c-Q6. Could you describe the Distributed Transaction further?
1. The Transaction Manager (TM) for a distributed transaction is responsible for maintaining the Transaction Tree corresponding to each one of the client-server interactions initiated by a client. The transaction initiator is the root of the transaction tree and is also its owner.
2. The Transaction Tree of QuickSilver is used purely for recovery management only. It does not provide any concurrency control or any of the other ACID properties provided by a heavy-weight transaction. The root of the transaction tree is the coordinator, while all other members of the distributed transaction are called participants. The coordinator and its participants form a graph/tree structure.
3. The Transaction Manager (TM) at each node periodically logs state that is created on behalf of the client or server and creates checkpoint records that are used to provide state recovery.
4. The tree structure of a transaction tree helps in reducing network communication, because each node reports status only to its parent node and not to the coordinator node at the root of the transaction tree. In the figure below, nodes C and D report their status to node B, and nodes B and E report their status to node A.
5. The transaction is not aborted at the first indication of a node failure, because we do NOT want error reporting to stop as a result of the failure. Instead, the error propagates back to the coordinator, and appropriate cleanup of the whole transaction tree is triggered only by the transaction tree coordinator. This ensures that any stray state left behind by partial failures can be cleaned up when the coordinator of the transaction tree initiates termination of the distributed transaction.

Page 109 of 197

L08c-Q7. What happens when a commit is initiated by the transaction tree coordinator?
1. An example of Transaction Tree creation is a client opening a file, doing many reads and writes, and closing the file. On close, the transaction tree is terminated and appropriate cleanup of resources is performed.
2. The Coordinator sends the commands to its subordinates in the transaction tree, which in turn propagate the commands to their respective subordinates. This propagation happens both for requests sent down the transaction tree and for responses sent up the transaction tree. (A small sketch of this propagation follows below.)
3. The transaction termination issued by the coordinator to its subordinates can be either a transaction commit or an abort. When a commit is initiated by the coordinator, each TM commits the transaction and frees up resources from the leaves of the transaction tree upwards towards the root of the tree. When an abort is initiated by the coordinator, each TM aborts the transaction and performs the appropriate local cleanup of state from the leaves of the transaction tree upwards towards the root of the tree.
4. Persistent servers such as a file system may need sophisticated commit processing such as a 2PC (2-Phase Commit) protocol, whereas a window manager needs only a simple 1PC protocol. Thus, different classes of service may require different recovery management, including different commit protocols.
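The sketch below, referenced in item 2 above, illustrates under simplified assumptions how a commit or abort decision might propagate down a transaction tree and how cleanup proceeds from the leaves back up to the coordinator. The node names and resources are made up; this is not QuickSilver's actual TM code.

class TxNode:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.resources = [f"resource-of-{name}"]   # breadcrumbs left by this node

    def terminate(self, decision):
        """Propagate 'commit' or 'abort' down the tree; clean up bottom-up."""
        for child in self.children:                # requests flow down the tree
            child.terminate(decision)
        # responses flow back up: by the time we get here, all subtrees are done
        print(f"{self.name}: {decision}, freeing {self.resources}")
        self.resources.clear()

# A, the coordinator, owns the tree; B..E are participants.
tree = TxNode("A", [TxNode("B", [TxNode("C"), TxNode("D")]), TxNode("E")])
tree.terminate("commit")   # leaves C, D print first, then B, E, and finally A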

Page 110 of 197

L08c-Q8. So, in what scenario does the recovery built into the OS help?
1. The upshot of bundling IPC and Recovery Management is that a service can safely collect all the breadcrumbs that the service has left behind in all the places that it touched throughout the course of its service, with no extra communication required for breadcrumb cleanup or reclamation.
2. Some examples of breadcrumbs are memory allocated but not freed, open file handles, temporary files created, open communication handles, orphaned display windows, etc.
3. A Transaction Tree records a trail of all resources used by each transaction in the tree, so that breadcrumbs can be cleaned up and resources freed on transaction completion.
4. The transaction messages are piggybacked on top of regular IPC, due to which communication happens naturally as part of IPC. Hence there is no extra overhead for communication between Transaction Managers (TMs) on different cluster nodes. Also, the IPC calls are automatically tagged with the Transaction ID of the corresponding transaction tree.
5. QuickSilver provides the recovery mechanisms, and the policy for when to use the mechanisms is entirely decided by each service.
6. The overhead for recovery management in QuickSilver is similar to that of LRVM. Every node uses log records on disk to recover persistent state; a synchronous flush of these log records affects performance, whereas a no-flush mode can improve performance but decreases reliability by increasing the window of vulnerability, similar to LRVM. So, there is a performance-reliability tradeoff in how the Transaction Manager (TM) writes the log records to disk.
7. Some services like file systems use heavy-weight mechanisms like a 2PC transaction, whereas some services like window managers use light-weight mechanisms like a 1PC transaction.

Page 111 of 197

L08c-Q9. Any special points on log maintenance in QuickSilver’s implementation?
1. One of the key aspects of the QuickSilver implementation is log maintenance. To recover persistent state, the Transaction Manager (TM) writes the updates made as redo log entries to in-memory data structures, which are periodically flushed to disk synchronously. This synchronous flush to disk is called a “log force” operation.
2. The TM of a cluster node does log maintenance for all the processes running on that node. Note that the log file contains the redo log entries required by all processes on the cluster node. Therefore, if an individual client decides to do a log force, it actually impacts the performance of not only that client but also all other clients on that cluster node. Hence, the services have to be careful to choose OS mechanisms that are commensurate with their recovery requirements.
3. This is how QuickSilver uses transactions as a fundamental OS mechanism to provide recovery of OS services.

Page 112 of 197

Illustrated Notes for L09a: GSS: Giant Scale Services
L09a-Q1. What is GSS or ISC?
Giant Scale Services or Internet Scale Computing addresses the following questions:
1. What are the systems issues in managing large data centers? [DQ principle]
2. How do you program big data applications (e.g. search engines) to run on massively large clusters? [Map-Reduce programming model]
3. How do you store and disseminate content on the web in a scalable manner? [CDN]
4. How do you handle failures? It is NOT a question of if a failure will happen, BUT a question of when a failure will happen.
Some examples of GSS are:
1. Online reservation systems (makemytrip.com)
2. Online purchasing systems (amazon.com)
3. Web mail (gmail.com)
4. Online movie streaming (netflix.com)
One of the key characteristics of Giant Scale Services is that the client requests are Embarrassingly Parallel requests – i.e. they are all independent of each other and hence can be handled in parallel as long as there is enough server capacity to meet all the incoming requests.
Note: There is nothing to be embarrassed about in such requests, and hence some textbooks call these requests Enchantingly Parallel requests.

Page 113 of 197

L09a-Q2. What is the Generic service model of Giant Scale Services? The Generic service model of GSS is shown below. It consists of many embarrassingly parallel requests issued to the Giant Scale Service which processes these independent requests in parallel. The backend servers handling these requests are connected by a high bandwidth communication backplane and these requests are fielded by a Load Manager that balances the incoming traffic onto the backend servers. The benefits of the Load Manager/Balancer are as follows: 1. Load-balance client traffic for effective server utilization (Mnemonic: LB) 2. Provide high availability by hiding partial failures. (Mnemonic: HA)

Page 114 of 197

L09a-Q3. What are the advantages of using clusters of machines in Giant Scale Services? Computational Clusters are the work-horses of Giant Scale Services. Their advantages include: 1. Incremental Scalability (++): the ability to incrementally add more resources without worrying about re-architecting the internals of the data center and get better performance. Also, if the volume of requests goes down, we can scale down the resources. 2. Reduced Costs ($$): the ability to easily mix and match hardware of different generations. 3. Improved Performance (^^): the ability to get improved performance by adding more resources due to the presence of Embarrassingly Parallel Queries. See the diagram below to get a feel of the number of nodes in a cluster in year 2000 (from Eric Brewer’s paper).

Page 115 of 197

L09a-Q4. Can the Load management be made more intelligent?
Yes, load management for dealing with client requests can be done at any of the levels of the 7-layer OSI model. The higher the layer in the 7-layer OSI stack, the more functionality can be associated with the load manager, in terms of:
1. how the load manager deals with server failures,
2. how the load manager directs incoming client requests to different servers,
3. how the load manager deals with the load on each backend server, etc.

Page 116 of 197

L09a-Q5. What are the choices for the load management?
Load Management can be done at various layers of the 7-layer OSI model.
1. Load management at layer 3 (IP): RR-DNS (Round-Robin Domain Name Server): (see the round-robin sketch after this list)
a. Each DNS request for a domain name (say, gmail.com) is answered in a round-robin manner with a different IP address from a small set of IP addresses (say, 3 IP addresses) so as to balance the load. The assumption here is that all servers are identical and the data is fully replicated across all servers, so that any server can service any request.
b. The pro is that the RR-DNS can choose a least-loaded server or redirect an incoming client request to a particular server.
c. The con is that it cannot hide failed server nodes.
2. Load management at layer 4 (TCP): Transport-level switches:
a. Layer-4 transport-level switches can be architected as switch pairs so that there is hot failover: fail over the client request from a failed node to a healthy node.
b. Provides the opportunity to dynamically isolate failed server nodes from the external world.
c. Service-specific front-end node functionality: send gmail requests to a set of gmail servers, send picasa requests to a set of picasa servers, etc.
d. Use device-specific characteristics: e.g. send mobile requests to a set of mobile servers that serve mobile-display-optimized web pages rather than to regular web servers.
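The round-robin sketch referenced in item 1 above: a toy illustration (not a real DNS implementation) of how RR-DNS hands out a different replica IP for each lookup, and why it cannot hide a failed node.

import itertools

replica_ips = ["203.0.113.10", "203.0.113.11", "203.0.113.12"]  # example addresses
rr = itertools.cycle(replica_ips)

def resolve(domain):
    """Return the next IP for every lookup of the domain (round-robin)."""
    return next(rr)

for _ in range(5):
    print(resolve("example-mail.com"))
# .10, .11, .12, .10, .11 ... note: a failed server's IP keeps being handed out,
# which is exactly the "cannot hide failed nodes" limitation mentioned above.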

Page 117 of 197

L09a-Q6. When is data partitioned and when is data replicated? Data is Partitioned for Performance. Data is Replicated for Availability. When data is partitioned (aka “sharded”) across shards, each shard can be processed for queries in parallel, thereby improving performance of a particular query since the sub-components of a query can be processed in parallel. If a particular shard is offline, then there is data loss for the query and hence data coverage is lost. When data is replicated, a full copy of the data is made across multiple servers so that if one copy goes offline, the other copies are still available thereby leading to better availability. Sometimes, the textbooks say that data is replicated for redundancy – that is true, but the purpose of redundancy is actually availability and hence it seems more appropriate to say that: Data is Replicated for Availability. The servers communicate with each other and the load-balancer using a high bandwidth communication backplane.

Page 118 of 197

L09a-Q7. What is the famous “DQ principle”?
First, let’s define some terms to be able to understand the DQ principle.
Note: Even though the terminology and the diagrams use the singular word “server”, the DQ principle applies to a set of servers as well as to a single server, and hence the definitions use the word “server(s)” to signify either a single server or a set of servers.
Df = the full data set required to handle any incoming query to the server(s), aka the full corpus of data.
Dv = the available data set used in query processing, i.e. the partial portion of the full data set that is available. The remaining portion may be unavailable due to failures or server load.
Harvest D = Dv/Df = ratio of the available data to the full corpus of data.
Harvest D is a fraction between 0.0 and 1.0. Ideal Harvest D = 1.0 if an incoming request is completely served with all the data it wants. Harvest D is also known as Fidelity. If a web search is not able to look at the full corpus of data due to some server failures, then the Quality (Fidelity) of the search results is less than 1.0.
Qo = the offered load to the server(s), i.e. the number of requests hitting the server(s) per unit time.
Qc = the completed requests, i.e. the number of requests completed by the server(s) per unit time.
Note: The server(s) cannot complete all submitted requests per unit time and hence can only complete a fraction of the submitted requests = Qc.
Yield Q = Qc/Qo = ratio of completed requests to offered load (aka submitted requests).
Yield Q is a fraction between 0.0 and 1.0. Ideal Yield Q = 1.0 if all the client requests are serviced.
The DQ principle states that: “For a given server capacity, the Harvest-Yield D*Q product is a constant.”
That is, the tradeoff is Quality (D) vs. Quantity (Q).
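A small worked example of these definitions, with made-up numbers:

D_f = 100_000   # full corpus: documents the service *could* search
D_v = 90_000    # documents actually available after a partition failure
Q_o = 2_000     # queries offered per second
Q_c = 1_800     # queries completed per second

harvest = D_v / D_f      # 0.9  -> fidelity of each answer
yield_q = Q_c / Q_o      # 0.9  -> fraction of clients actually served
dq = harvest * yield_q   # the DQ "capacity" currently being delivered
print(harvest, yield_q, dq)   # 0.9 0.9 0.81

With dq held fixed by the available capacity, raising yield_q (serving more clients) forces harvest down (each answer uses less data), and vice versa.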

Page 119 of 197

L09a-Q8. What is the practical application of the DQ principle in real-life?
The DQ principle states that: “For a given server capacity, the Harvest-Yield D*Q product is a constant, i.e. maximize the Quantity of query responses OR the Quality of query responses.”
Maximize Quantity of query responses: reduce Harvest to increase Yield, i.e. use less data so that more clients can be served. OR
Maximize Quality of query responses: increase Harvest at the expense of Yield, i.e. use more data to provide better results.
In other words, for a given system capacity (DQ product), we can increase the number of clients served (Yield Q) by reducing the amount of data used (Harvest D) to process the incoming queries. That is, we are increasing Yield Q by reducing Harvest D.
Example: Google would like to have web-search results served to a large number of clients (Yield Q) and may not want to use the full corpus of data that it has on its servers (Harvest D).
OR we can do the opposite: we can reduce the number of clients served (Yield Q) by increasing the amount of data used (Harvest D) to process the incoming queries and thus improve the quality (fidelity) of these queries. That is, we are decreasing Yield Q by increasing Harvest D (aka Fidelity or Quality).
Example: Gmail would like to serve as much Harvest D as possible for a single user, so that the user is satisfied with seeing all of his/her emails, rather than serve more Gmail users (Yield Q).
Note: A key requirement for the DQ principle to apply to Giant Scale Services is: in Giant Scale Services, system performance is limited (bound) by Network capacity and NOT by I/O capacity. In Database applications (just as a comparison), system performance is limited (bound) by I/O (disk) capacity and NOT by Network capacity.
So, if some nodes in the server farm fail, then the SysAdmin can play with the DQ knob to deal with the reduced system capacity (DQ value), i.e. either sacrifice Yield Q for Harvest D OR sacrifice Harvest D for Yield Q. You cannot increase both Yield Q and Harvest D without increasing server capacity.

Page 120 of 197

L09a-Q9. What is the Uptime metric?
Here is my Piazza post on Uptime when I took the AOS class in Fall 2014:
MTBF = Mean Time Between Failures (unit = time)
MTTR = Mean Time To Repair (unit = time)
Note that typically MTBF is quite large as compared to MTTR. Below, F indicates a Failure and R indicates that the Repair for that Failure has completed. The repair could be a reboot OR a replacement OR a repair of the device.

F --------------------> F   (MTBF = 10 hours)

This is the MTBF, say, 10 hours; that is, averaged over many failures, the mean time between 2 consecutive failures is 10 hours.

F --> R   (MTTR = 1 hour)

This is the MTTR, say, 1 hour; that is, averaged over many failures, the mean time to repair a failure is 1 hour. Typically, MTTR is always less than MTBF. That is why we subtract MTTR from MTBF to get a positive value.

F  R                   F
+--+------------------+
| D|         U        |
+--+------------------+

In this diagram, D = Mean Down time and U = Mean Up time. Note that MTBF and MTTR have time units, and so U and D also have time units.
Now, Uptime is the ratio of the time duration that a service was up between failures, so we divide (MTBF - MTTR) by MTBF; MTBF is the complete box above.
So, Uptime Ratio = (MTBF - MTTR) / MTBF. Since this is a ratio, it has no units.
Therefore, Uptime Ratio = (MTBF - MTTR) / MTBF (a number between 0 and 1), and
Uptime Percentage = Uptime Ratio * 100 (a number between 0 and 100).
For our example, where MTBF = 10 hours and MTTR = 1 hour:
Uptime Ratio = (10 - 1) / 10 = 9 / 10 = 0.9
Uptime Percentage = 0.9 * 100 = 90%
When you see a value for Uptime, the context should help us understand whether it is a number between 0 and 1 (Uptime Ratio) or a number between 0 and 100 (Uptime Percentage).
You could write a similar formula for Downtime too:
Downtime Ratio = MTTR / MTBF = 1 / 10 = 0.1
Downtime Percentage = Downtime Ratio * 100 = 10%
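A tiny sanity check of the formulas above in Python:

mtbf, mttr = 10.0, 1.0                      # hours
uptime_ratio = (mtbf - mttr) / mtbf         # 0.9
downtime_ratio = mttr / mtbf                # 0.1
print(f"{uptime_ratio:.0%} up, {downtime_ratio:.0%} down")   # 90% up, 10% down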

Page 121 of 197

Some references on Uptime: 1) NOTE: This uptime is different from the "absolute" uptime displayed on a UNIX machine: http://en.wikipedia.org/wiki/Uptime 2) Related reading: http://en.wikipedia.org/wiki/High_availability 3) Gmail's 2013 uptime = 99.978% (close to 5 nines) (approx Mean Downtime = 30 minutes per year, 30 seconds per week) http://umzuzu.com/blog/2014/3/21/gmails-2013-uptime-99978 4) Google's SLA for uptime: https://cloud.google.com/compute/sla http://www.google.com/apps/intl/en-in/terms/sla.htm

L09a-Q10. How does the Harvest-Yield DQ product compare with Uptime as a metric?
Hmm, Uptime has been a popular metric for a long time and still is, but for network-bound GSS applications the DQ product is a better metric than the Uptime metric, for the following reasons. The Uptime metric is not a very intuitive measure of how well a server is performing. Instead, the Yield and Harvest metrics say how well a server is able to deal with the dynamism and scale of requests handled by a particular GSS. Thus, the DQ principle is very powerful in advising the sysadmin on how to architect the system in terms of:
1. How much to replicate?
2. How much to partition the data set that the server is handling?
3. How to deal with failures?
4. How to gracefully degrade the servers when the volume of incoming traffic increases beyond the server capacity (DQ product)?
Remember that the underlying assumption in the DQ principle is that Giant Scale Services are Network-bound and NOT I/O-bound.

Page 122 of 197

L09a-Q11. What is the effect of failure on the DQ product for GSS when data is replicated and when data is partitioned?
When data is Replicated, on failure: Harvest D is unchanged, but Yield Q decreases.
When data is Partitioned, on failure: Harvest D decreases, but Yield Q is unchanged.
Note: Harvest D is also known as the Fidelity or Quality of results. When the full corpus of data is NOT available (the case when data is partitioned and there are failures), the Fidelity (Quality) of the results will be lower than when the full corpus of data is available.
The DQ product is independent of whether we are replicating or partitioning the data. Remember that the underlying assumption in the DQ principle is that Giant Scale Services are Network-bound and NOT I/O-bound.
For the rare scenario in GSS when there is significant write traffic to disk (i.e. requests are I/O-bound and NOT network-bound), Replication may require more DQ than Partitioning. Beyond a certain point, a good strategy is to BOTH replicate and partition the data.
In short, Data is Partitioned for Performance. Data is Replicated for Availability. As long as the GSS services are Network-bound and NOT I/O-bound, for a given system capacity, the Harvest-Yield DQ product is a constant.

Page 123 of 197

L09a-Q12. Wait a minute, tell me again why do we need both: Replication and Partitioning?
Users like gmail users would normally prefer complete data, i.e. a complete harvest, and for such use-cases replication is more important than partitioning. In other words, serving a smaller set of users and keeping them satisfied with the service is more important than serving a larger number of gmail users and leaving all of them dissatisfied with the service.
As an alternate use-case, for internet search results, serving a larger number of users is more important than serving the most accurate results to a smaller set of users. In other words, for internet search results, it may be okay to have a Harvest D which is less than 1.0 and instead have a Yield Q that is as close to 1.0 as possible.
As a concrete example, Inktomi uses Partial Replication for the Web-cache in CDN servers (more Yield Q, less Harvest D), and Inktomi uses Full Replication for Email (complete Harvest D, less Yield Q), because the typical user expectation is: a partial Harvest D for the Web-cache is acceptable for search results, whereas a complete Harvest D is required for email.
Another example: Google Images can decrease the Fidelity/Harvest D of the images served to increase Yield Q, or Google Images can increase the Fidelity/Harvest D of the images served and decrease Yield Q. The fidelity of images can be decreased by serving images at a lower bitrate.
Remember: The Harvest-Yield DQ product defines the total system capacity as long as the GSS services are Network-bound and NOT I/O-bound.
For Network-bound GSS applications, use the Harvest-Yield DQ product metric.
For I/O-bound Database applications, use the IOPS metric (I/O operations per second).

Page 124 of 197

L09a-Q13. How does the DQ principle help when there are failures or server saturation? When there are failures or server saturation, the Harvest-Yield DQ principle is very useful in managing graceful degradation of the service from a client’s point of view (instead of sudden failures that are visible to all users). We have 2 choices/options: 1. Keep the Harvest D to be the same, i.e. every client request has the same Fidelity (aka Quality) in terms of query results, so Harvest D is fixed, but Yield Q decreases since the Harvest-Yield DQ product is constant. 2. Keep the Yield Q to be the same, i.e. keep the volume of clients serviced to be the same, but Harvest D decreases, i.e. the fidelity of the query results returned to the users is less than 100% since the Harvest-Yield DQ product is constant. The Harvest-Yield DQ product being constant allows us to gracefully degrade the service being provided by the system depending on the choice that we want to make in terms of either: 1. the Fidelity or Harvest D of the results, OR 2. the Yield Q that we want to provide to the user community. In other words, the Harvest-Yield DQ principle gives us an explicit strategy for managing failures and saturation. Other related options for a sysadmin in dealing with server failures, saturation , data freshness or structuring system services are that the sysadmin can do: 1. Cost-based Admission Control: you pay more, you get more. 2. Priority-based Admission Control: Critical/Important applications are treated with priority. To summarize, the Harvest-Yield DQ principle gives us an explicit strategy for managing graceful service degradation when there is a failure or saturation by choosing between data freshness (Harvest D) or query volume (Yield Q), thus helping in the overall structuring of the system services.

Page 125 of 197

L09a-Q14. How does the DQ principle help in the online evolution and growth of a Giant Scale Service?
As the services evolve continuously (e.g. machine upgrades, software upgrades, failed server replacements, etc.) over a period of time, the online evolution and growth of a GSS is handled by managing the service loss, or DQ loss. There are a few choices for managing service loss during server upgrade/replacement time:
1. Fast Reboot: Bring down all servers at once, upgrade all of them and then turn them back on. See the diagram below.
Note: The Y-axis is DQ Loss per node. Number of cluster nodes = n = 4.
The height of each green rectangle represents the DQ Loss per node (DQ). The width of each green rectangle represents the upgrade time per node (u).
The total DQ loss = the complete green area = DQ Loss per node (DQ) * upgrade time per node (u) * number of nodes (n) = DQ * u * n.
At any point of time during the upgrade, the partial DQ Loss = the total DQ Loss = DQ * u * n.
The Fast Reboot approach is particularly useful in diurnal-server situations where the user community is segmented, such that the fast reboot of off-peak servers can be performed while the users are offline during off-peak hours (say, night time).
2. Rolling Upgrade or Wave Upgrade: Rather than bringing all the servers down at the same time, bring down one server at a time, upgrade it and then do the same for the next server, and so on. A Rolling Upgrade is done for batches of servers and takes a longer time than a Fast Reboot. However, a Rolling Upgrade has the benefit that the service is partially available during the upgrade, whereas with a Fast Reboot the service is completely unavailable during the upgrade.
At any point of time, the partial DQ Loss = DQ Loss per batch (DQ-delta) * u * n-delta, where n-delta = number of nodes in the batch.
The total DQ Loss for a Rolling Upgrade is the same as that of a Fast Reboot.
The total DQ loss = the complete pink area = DQ Loss per node (DQ) * upgrade time per node (u) * number of nodes (n) = DQ * u * n.

Page 126 of 197

L09a-Q15. How does the DQ principle help in the online evolution and growth of a Giant Scale Service? (Cont’d)
(Cont’d)
3. Big Flip: Bring down half the nodes at once. The service is available at 50% of its full capacity. The Big Flip upgrade will last (n/2) * u because the entire server capacity is partitioned into 2 halves. The Big Flip upgrade time is more than a Fast Reboot but less (better) than a Rolling Upgrade. A Big Flip reduces the total DQ capacity available by 50% for u units of time.
The total DQ Loss is the same for all 3 strategies = blue area = DQ * u * n.
To summarize (a back-of-the-envelope comparison follows below):
1. Fast Reboot: all nodes upgraded at once; off-peak; diurnal servers; segmented user community.
2. Rolling Upgrade or Wave Upgrade: batch upgrade; takes a long time.
3. Big Flip: upgrade half the nodes at once; upgrade duration = u * (n/2); capacity reduction = 50%.
In short, the sysadmin has a choice of whether to make the DQ Loss apparent or not apparent to the user community. That is, the sysadmin can make an informed decision on how to do online evolution by controlling the DQ Loss that is experienced by the user community at any point of time. In other words, the DQ principle helps the sysadmin turn maintenance and upgrades into controlled failures.
Mnemonic: Maintenance and Upgrades → DQ Loss → Controlled Failures
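The back-of-the-envelope comparison referenced in the summary above, with illustrative numbers: all three strategies lose the same total DQ, but they differ in how many nodes are down at once and hence in how much capacity the users see while the upgrade is in progress.

n, u, dq = 4, 1.0, 100.0     # nodes, upgrade time per node (hours), DQ per node

total_dq_loss = dq * u * n   # identical for all three strategies (DQ node-hours)

strategies = {
    "fast reboot":     {"nodes_down_at_once": n,      "capacity_during": dq * 0},
    "rolling upgrade": {"nodes_down_at_once": 1,      "capacity_during": dq * (n - 1)},
    "big flip":        {"nodes_down_at_once": n // 2, "capacity_during": dq * (n // 2)},
}
print(total_dq_loss)         # 400.0 DQ node-hours, no matter which strategy is chosen
for name, s in strategies.items():
    print(name, s)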

Page 127 of 197

L09a-Q16. Any key insights?
Some key insights are:
1. Giant scale services are Network-bound and NOT disk-I/O-bound.
2. This is what helps in defining the DQ Principle.
3. The DQ Principle really helps the system designer in optimizing either for Harvest D or Yield Q for a given system capacity.
4. This also helps the system designer in coming up with explicit policies for graceful degradation of services when either servers fail or load saturation happens or upgrades are planned.
5. Mnemonic: Maintenance and Upgrades → DQ Loss → Controlled Failures

Page 128 of 197

Illustrated Notes for L09b: MR: Map Reduce framework
L09b-Q1. What are Big Data applications?
1. Big Data applications work over large data sets and hence are called Big Data applications.
2. Big Data applications take a long time to compute since they work over large data sets.
3. Some examples of Big Data applications are:
a) Search for John Kennedy’s photos in all documents on the internet.
b) Search for tickets for a particular route on all airlines.
c) Page rank of web pages.
d) Word indexes to facilitate document searches on the internet.
4. Big Data applications use computation elements on the order of 10,000+ nodes, i.e. they use vast computational resources.
5. Big Data computations are Embarrassingly (aka Enchantingly) Parallel computations, i.e. they are independent computations that do NOT require any synchronization or coordination among each other and hence can be run in parallel.
6. Big Data applications have a common programming model. One such model uses the Map-Reduce (MR) programming framework, which can handle the following common tasks:
a) Parallelize running the application on thousands of nodes,
b) Build a pipeline of tasks and handle data distribution and plumbing between them,
c) Handle scheduling of tasks, monitoring them and re-executing failed tasks.

Page 129 of 197

L09b-Q2. What is the Map-Reduce programming environment (or framework)?
The Map-Reduce programming environment is a software framework for easily writing big data applications which process vast amounts of data in parallel on a large cluster containing thousands of nodes, in a reliable and fault-tolerant manner.
The following inputs to the MR framework are provided by a domain expert:
1. A set of records identified by Key-Value pairs.
2. Two functions: a map() function and a reduce() function.
The domain expert is an expert in the appropriate field, e.g. a financial analyst is the domain expert for financial data. The domain expert is a user of the MR framework. The MR framework itself is developed by a different team of software developers. Hadoop MapReduce is an example of an MR framework and is implemented in the Java programming language.

Page 130 of 197

L09b-Q3. Can you describe an example of a MapReduce application?
MR example: find the count of names of individuals in a corpus of documents; say, find the number of occurrences of the names "Kishore", "Arun", "Drew". (A minimal single-machine sketch follows below.)
Inputs to the MR framework, supplied by the domain-expert developer:
a) the whole corpus of documents as key-value pairs, i.e. <filename, file-contents> KV pairs, and
b) the map() and reduce() functions.
1. n files => n KV pairs.
2. map() is input-specific: Number of Mappers = Number of input files.
Map1() searches for Kishore, Arun, Drew in file1 and produces 3 counts, one for each string searched.
Map2() searches for Kishore, Arun, Drew in file2 and produces 3 counts, one for each string searched.
Map3() searches for Kishore, Arun, Drew in file3 and produces 3 counts, one for each string searched.
3. reduce() is output-specific: Number of Reducers = Number of search strings.
Reduce1() aggregates the counts for Kishore from the 3 Maps and produces <Kishore, total-count>.
Reduce2() aggregates the counts for Arun from the 3 Maps and produces <Arun, total-count>.
Reduce3() aggregates the counts for Drew from the 3 Maps and produces <Drew, total-count>.
In this example, map() = search and reduce() = aggregate.
4. The MR programming environment (e.g. Hadoop MR) automatically handles various tasks:
a) instantiates the mappers and reducers and does the plumbing of the data pipeline between them,
b) coordinates data movement between mappers and reducers in parallel on thousands of nodes,
c) handles scheduling of mappers and reducers, monitoring them and re-executing failed ones.
The domain expert (application developer) does NOT have to worry about how the MR runtime works. The domain expert only supplies the input files and the map() and reduce() functions, and instantiates the MR programming library. All heavy lifting is done by the MR runtime to produce the results.
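The minimal single-machine sketch referenced above: it simulates the framework's shuffle step with a dictionary and runs user-supplied map() and reduce() functions over a toy corpus. A real MR runtime (e.g. Hadoop MR) would execute the mappers and reducers in parallel on many nodes; the function and variable names here are illustrative.

from collections import defaultdict

NAMES = ["Kishore", "Arun", "Drew"]

def map_fn(filename, contents):
    """One mapper per input file: emit <name, count-in-this-file> pairs."""
    return [(name, contents.count(name)) for name in NAMES]

def reduce_fn(name, counts):
    """One reducer per name: aggregate the per-file counts."""
    return (name, sum(counts))

corpus = {                      # toy stand-in for the <filename, contents> KV pairs
    "file1": "Kishore met Arun. Kishore met Drew.",
    "file2": "Arun emailed Drew about Kishore.",
    "file3": "Drew and Arun and Drew again.",
}

# "Shuffle" phase done by the framework: group the intermediate pairs by key.
intermediate = defaultdict(list)
for fname, text in corpus.items():
    for key, value in map_fn(fname, text):
        intermediate[key].append(value)

results = [reduce_fn(name, counts) for name, counts in intermediate.items()]
print(results)   # [('Kishore', 3), ('Arun', 3), ('Drew', 4)]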

Page 131 of 197

L09b-Q4. Why do Giant Scale Services use MapReduce framework? Mainly because in Giant Scale Services, several processing steps are expressible as map() and reduce() functions. Sometimes, they are pipelined to each other to build larger services.

GSS share common properties: 1. Use Big Data sets 2. Use Vast Computational resources 3. Have Embarrassingly (aka Enchantingly) Parallel computations Mnemonic: Big Data, Vast Computations, Enchantingly Parallel The MR programming environment (e.g. Hadoop MR) automatically handles various tasks: a) Instantiates the mappers and reducers and does plumbing of the data pipeline between them. b) Coordinates data movement between mappers and reducers in parallel on thousands of nodes. c) Handles scheduling of mappers and reducers, monitors them and re-executes failed ones. The Domain expert (application developer) does NOT have to worry about how MR runtime works. The Domain expert only supplies the input files, map() and reduce() functions and instantiates the MR programming library. All heavy-lifting is done by MR runtime to produce results.

Page 132 of 197

L09b-Q5. Can you describe the heavy lifting done by MapReduce runtime (programming env)? The Domain expert (application developer) does NOT have to worry about how MR runtime works. The Domain expert only supplies the input files, map() and reduce() functions and instantiates the MR programming library. All heavy-lifting is done by MR runtime to produce results. The MR programming environment (e.g. Hadoop MR) automatically handles various tasks: a) Instantiating the mappers and reducers and does plumbing of the data pipeline between them b) Coordinates data movement between mappers and reducers in parallel on thousands of nodes c) Handles scheduling of mappers and reducers, monitoring them and re-executing failed ones. The MR programming environment also does the following: (user = domain-expert below) d) Spawn one Master thread and multiple Worker threads for the desired computation. The Master thread orchestrates all the Worker threads, monitors and controls them. e) Auto-split of the input files, based on a user-specified key, into M splits of Key-Value pairs. Each split given to one Mapper thread (aka Worker Thread in general). Thus, there are M mappers. The value of M can be user-configured or can be a default value from the MR library. f) Each Mapper thread produces R intermediate files (e.g. one intermediate file for one search string). An intermediate file contains the intermediate results generated by the Mapper Worker thread and saved on local disk of the Mapper. g) The Master thread waits for all the Mapper Worker threads to complete and then starts R Reducers. The number of Reducer Worker threads depends on the application (e.g. number of search strings). h) Thus, M splits are worked on by M mappers. Each Mapper generates R intermediate files, with a total of (M * R) intermediate files, which are aggregated by R Reducers to generate R result files.

Page 133 of 197

L09b-Q6. Can you describe the heavy lifting done by MapReduce runtime (continued)?
We have described the Mapper operation. Let’s continue with the Reducer operation next.
i) Once the Mapper Worker threads complete, the Master thread starts R Reducer Worker threads. Each Reducer does a remote read of the intermediate files from the local disks of the Mappers, sorts this data and then calls the user-defined reduce() function to aggregate the intermediate data and produce the final result file.
j) Once R result files have been generated by the R Reducers, the MR job is considered complete.
k) We have M splits of input files and R result files. But the number of machines available, N, may be less than M or R. For example, say N = 100 machines (worker nodes) and M = 1000 splits. In this case, the Master assigns the first 100 of the 1000 splits to the 100 worker nodes, then the next 100, and so on. In short, the Master manages the available resources to carry out the work that needs to be done.
l) Similarly, say R = 500 search strings. In this case, the Master assigns the first 100 of the 500 intermediate files to the 100 worker nodes, then the next 100, and so on.
m) In short, typically M > N and/or R > N, and in such cases the Master distributes work among the N worker nodes and iterates task scheduling until all work for the Map phase is complete. Then it repeats the same approach for the Reduce phase until all the results are generated. Thus, the Master manages the available resources to carry out the work that needs to be done.
n) All this heavy lifting is done transparently and automatically by the MR runtime without any involvement from the domain expert user, who only supplies the input and the map() and reduce() functions.
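One concrete piece of this plumbing is deciding which of the R intermediate files a mapper's output pair goes into. A hedged sketch follows; the partition function shown is a deterministic stand-in for the framework's default hash partitioner, which the user may override.

import zlib

M, R = 3, 2                         # illustrative sizes: 3 mappers, 2 reducers

def partition(key, R=R):
    # deterministic stand-in for the framework's default hash partitioner
    return zlib.crc32(key.encode()) % R

mapper_outputs = [("Kishore", 2), ("Arun", 1), ("Drew", 1)]
for key, value in mapper_outputs:
    print(f"pair {key!r} -> intermediate file {partition(key)} of this mapper")
# With M mappers each writing R partitions, there are M*R intermediate files in total,
# and each reducer later fetches M of them (one from every mapper).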

Page 134 of 197

L09b-Q7. What is the list of issues that are handled by the MapReduce runtime?
The MapReduce runtime handles the following tasks required for the distributed data processing:
1. Location of the intermediate files created by the Mappers on GFS (Google FileSystem).
2. A scoreboard of the current assignment of Mappers and Reducers to worker nodes, which work in parallel on input splits and intermediate files respectively. Typically, M Mappers > N Worker Nodes and/or R Reducers > N Worker Nodes.
3. One of the most critical tasks of the MapReduce runtime is to provide Fault Tolerance.
a) Start new instances if there is no timely response from a Mapper or Reducer (slow or unresponsive). Failures can happen due to:
i) A node being down or slow
ii) A network link being down or slow
iii) Non-homogeneous machines: different architectures (Intel Celeron vs. Pentium Pro), different capacities (memory size, disk size), etc.
b) Handle completion messages from redundant stragglers:
The inherent assumption in the Fault Tolerance model is that all operations are Idempotent operations. An idempotent operation has the same effect whether you do the operation once or multiple times. Example: assigning a value to a variable, say x = 4, is an idempotent operation. Non-example: incrementing the value of a variable, say x = x + 4, is NOT an idempotent operation.
When a task is complete, the mapper’s or reducer’s temporary output file is atomically renamed to the final output filename, marking the task as complete. Because the rename(tmpFile, finalFile) operation is atomic, the final file reflects the output of exactly one execution of the task, even if redundant stragglers complete the same task again. This is what is meant by “atomic rename”, and it lets the Master safely ignore completion messages from redundant stragglers for an already completed task.
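A hedged sketch of the atomic-rename idea in Python (the file names and the compute callback are illustrative): each execution of a task writes to its own temporary file and publishes the result with one atomic rename, so a redundant straggler that re-runs the task leaves the final file in exactly the same state.

import os
import tempfile

def run_task(task_id, compute):
    final_path = f"result-{task_id}.txt"
    # write to a temp file in the same directory so the rename stays atomic
    fd, tmp_path = tempfile.mkstemp(dir=".", prefix=f"tmp-{task_id}-")
    with os.fdopen(fd, "w") as f:
        f.write(compute())
    os.replace(tmp_path, final_path)   # atomic: readers never observe partial output
    return final_path

run_task("reduce-7", lambda: "aggregated counts\n")
run_task("reduce-7", lambda: "aggregated counts\n")   # redundant straggler: same end state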

Page 135 of 197

L09b-Q8. What is the list of issues that are handled by the MapReduce runtime? (Continued)
(Answer Continued)
4. Another important issue is Locality Management: The MR runtime ensures that the working set of a computation fits in the closest level of the memory hierarchy of a process, so that the computation can make good forward progress and complete efficiently. Locality management is performed using GFS (Google FileSystem), which provides a way to efficiently migrate intermediate data from Mappers to Reducers.
5. The MR runtime is also responsible for dividing the tasks appropriately, i.e. coming up with the correct task granularity to get good load balancing and utilization of the computational resources.
The MR runtime has default values that can be configured by the user to change its behavior.
e.g. the user can override the default partitioning hash function with a user-supplied one in order to organize the input data better, say, in terms of how the keys are ordered.
e.g. the user can incorporate combining functions to be included in the map()/reduce() functions.
Remember that the fundamental assumption of the Fault Tolerance model is that the Map() and Reduce() functions are idempotent operations.
See References for image credits.

Page 136 of 197

L09b-Q9. Any key insights? Some key insights are: 1. The power of MapReduce framework is its simplicity for the user. 2. The domain expert user has to supply to the MapReduce runtime only the Input data and write 2 functions: the Map() function and the Reduce() function which are specific to the application. 3. All the heavy-lifting is done transparently and automatically by the MR runtime without any involvement from the domain expert user.

References: 1. https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html 2. http://dme.rwth-aachen.de/en/system/files/file_upload/project/MapReduce.jpg 3. http://www.laureateiit.com/projects/bacii2014/projects/coa_anil/images/me_hie.jpg 4. http://www.cheatography.com/uploads/sporkbomber_1366575922_memory_hierarchies.png 5. Memory Hierarchy: http://csapp.cs.cmu.edu/2e/ch6-preview.pdf

Page 137 of 197

Illustrated Notes for L09c: CDN: Content Delivery Networks
L09c-Q1. What is a CDN?
A Content Distribution Network is:
- a globally distributed overlay network of proxy web servers
- deployed in multiple data centers
- to serve cached content to end-users efficiently
- providing high availability and improved performance.
Mnemonic: CDN = Overlay, Cache, HA, Performance.
Examples of content providers are CNN.com, NYTimes.com, BBC.com, etc.
Examples of CDN companies are Akamai.com, LimeLight Networks, etc. Some large internet companies like Amazon, Google, Microsoft, etc. own their own CDNs.
The roots of CDN are from Napster, the pioneering peer-to-peer music file-sharing service.
References: [1], [2], [4]

L09c-Q2. What are the benefits of a CDN?
- Offloading web-site traffic from the content provider results in:
o Improved performance of web-sites
o Possible cost savings for the content provider
- Improved Latency: the time taken for a client to receive information from the server [3]
- Improved Security: a CDN’s large distributed server infrastructure can absorb DoS (Denial of Service) attack traffic and thus provide a degree of protection from DoS attacks to the content provider.
Page 138 of 197

L09c-Q3. How do I find out what CDN does a web-site use? (Optional) There are websites that display this information. An example is given below. http://www.cdnplanet.com/tools/cdnfinder/#site:http://cnn.com Results: We believe the site http://cnn.com is using Fastly as a CDN. L09c-Q4. What is the difference in response time with and without CDN? (Optional) Well, I tried the example in [2] and the page load time of No-CDN (10 ms) was a lot better than that of the CDN case (532 ms), but the answer depends on the geographical location of the client. Try it out yourself! CDN example: http://stevesouders.com/hpws/ex-cdn.php No-CDN example: http://stevesouders.com/hpws/ex-nocdn.php The author of the book [2] mentions that the results will vary depending on your connection speed and geographical location. The closer you live to Washington DC (server location), the less of a difference you will see in response times in the CDN example. Do try it out!

Page 139 of 197

L09c-Q5. What is the relationship between CDN and DHT?
CDNs are implemented using a DHT (Distributed Hash Table). Let’s take an example:
1. We want to store a video such that users in various geographies (geos) can access a locally cached copy from the CDN.
2. We will store the video at multiple locations (geos) using a DHT.
3. Take the video file and generate a hash of the video file content, say, 149.
4. Let’s say we store the video file at some node with id = 80.
5. So, the key-value pair for the meta-data of the video file is <149, 80>, i.e. Key = ContentHash and Value = NodeID of the location where the video file is stored.
6. Now, store this meta-data tuple on the node with id = 149, or on a node in a nearby mathematical range, say 150, i.e. the meta-data is stored on node 150.
7. The meta-data is NOT stored on a central server, so as to make the solution scale to a very large number of nodes.
8. In summary, CDNs exploit the DHT technology to store content on the internet so that the content can be discovered and disseminated to the users.
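A toy sketch of this placement rule, reusing the numbers 149 (content hash) and 80 (node storing the video); the node ids and data structures below are made up:

nodes = [20, 80, 150, 400, 900]          # ids of nodes currently in the overlay

def closest_node(key, nodes=nodes):
    # place (and look up) metadata at the live node whose id is closest to the key
    return min(nodes, key=lambda n: abs(n - key))

metadata = {}                             # {node_id: {content_hash: storage_node}}
content_hash, storage_node = 149, 80      # video stored on node 80, content hash is 149
home = closest_node(content_hash)         # -> 150
metadata.setdefault(home, {})[content_hash] = storage_node

# Lookup: any client hashes the content, walks to the closest node, reads the value.
print(metadata[closest_node(149)][149])   # -> 80, i.e. fetch the video from node 80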

Page 140 of 197

L09c-Q6. How is the content hash generated?
1. The content hash of the content (the video in the above example) is generated using the SHA-1 algorithm, which guarantees (with overwhelming probability) that the generated hash is unique for different content.
2. There are two name spaces: the content-hash (key) namespace and the nodeId namespace.
3. The generic objective is: store the metadata at nodeId = ContentHash.
4. Example: store the metadata at the nodeId close to 149, i.e. 150.
5. We have used the numbers 149, 150 and 80 for convenience. In reality, running the SHA-1 algorithm on the content gives us a 160-bit hash (aka Key), and running the SHA-1 algorithm on the IP address of a node gives us its 160-bit nodeId.
The CDN software has two APIs:
a. putkey(ContentHash, ContentNodeId), and
b. ContentNodeId = getkey(ContentHash)
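For instance, Python's hashlib can show what those 160-bit values look like (the content bytes and IP address below are placeholders, not values from the lecture):

import hashlib

content_key = hashlib.sha1(b"...video bytes...").hexdigest()   # 160-bit content hash
node_id     = hashlib.sha1(b"192.0.2.7").hexdigest()           # 160-bit id from a node's IP
print(len(content_key) * 4, content_key[:12], node_id[:12])    # 160 bits each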

Page 141 of 197

L09c-Q7. Why is CDN called an Overlay Network?
A CDN is an example of an overlay network. An overlay network is a virtual network on top of a physical network. A CDN is an overlay network that allows content to be shared and distributed with a set of users.
Let’s take an example: say we want to reach from Node A to Node C. On Node A, we check the user-level routing table of Node A to find C. Since C is not present in the user routing table of Node A, we use the * entry, which means “default entry”: match every destination that has not been matched by an earlier entry. Using the default entry we go from Node A to Node B. The NodeId in the diagram below corresponds to the IP address.
Now, on Node B, we check its user-level routing table. Hey, we find Node C there, and it is directly connected, and so the packet reaches its final destination, Node C.
Thus, note that the nodes have exchanged information with one another so that they can discover other nodes in either of two ways:
a. directly, by knowing about each other, OR
b. indirectly, by knowing about each other through friends of friends.
This routing information is stored in a user-level routing table which is used as follows: given a nodeName, what is the nodeId of the next hop for the packet to be delivered to the final destination?
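A tiny sketch of the user-level routing tables in this example, with a '*' default entry (the IP addresses are illustrative):

routing_tables = {
    "A": {"B": "10.0.0.2", "*": "10.0.0.2"},   # A's default: send everything via B
    "B": {"A": "10.0.0.1", "C": "10.0.0.3"},
}

def next_hop(node, dest):
    table = routing_tables[node]
    return table.get(dest, table["*"])          # fall back to the default entry

print(next_hop("A", "C"))   # 10.0.0.2  -> packet goes A -> B
print(next_hop("B", "C"))   # 10.0.0.3  -> B delivers directly to C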

Page 142 of 197

L09c-Q8. What the heck is an Overlay Network?
The concept of an Overlay Network is a generic principle. An overlay network is a virtual network on top of a physical network. An overlay network is kind of a simulated reality on top of actual reality, similar to the concept in the movie “The Matrix”. Let’s take a few examples which help solidify the concept:
1. The IP network is an overlay network on top of the Ethernet network. Hence a packet with an IP address needs to include the Ethernet (MAC) address in the packet before it can be routed to the next hop. This IP-address-to-Ethernet-address translation is done using the ARP table. You can see the contents of the ARP table on your Linux or Windows machine using the command: $ arp -a
2. The Skype network is an overlay network on top of the TCP/IP network.
3. Similarly, the CDN network is an overlay network on top of the TCP/IP network. The CDN nodeName (or NodeId) is converted to an IP address using a user-level routing table in the CDN software.

Page 143 of 197

L09c-Q9. Ok, can you help solidify the relationship between CDN and DHT? 1. DHT or Distributed Hash Table is an implementation vehicle for a CDN to populate the user-level CDN routing table. 2. The CDN software is at the user-level (as compared to software running in the operating system kernel which is called kernel-level software). 3. The CDN software has two APIs: a. putkey(ContentHash, ContentNodeId): The ContentHash is the key/hash of the content that you want to store as a cache in CDN, and the ContentNodeId is the node id (proxy) of where the Content is stored. b. ContentNodeId = getkey(ContentHash) The ContentHash is the key/hash of the content that you want to get from the CDN cache repository. ContentNodeId can also be thought of as ContentLocationNodeId, which is the location/nodeId where the content is located.

Page 144 of 197

L09c-Q10. Good, how is the DHT concept utilized for large-scale scalability in a CDN network?
We will now discuss two approaches for utilizing the DHT concept for large-scale scalability in a CDN network:
a. the traditional (greedy) approach using a traditional DHT, which leads to some problems (given below), and
b. the Coral approach using a Sloppy DHT, which solves these problems. Sloppy DHT is real cool!
Let’s first discuss the traditional greedy approach using a traditional DHT that leads to problems. In the greedy approach, we do the following: store the metadata at nodeId = ContentHash, where the nodeId chosen is equal to the ContentHash or close to the ContentHash (mathematically). Note that this <Key, Value> pair is also called the <ContentHash, ContentNodeId> pair.
Similarly, to get (retrieve) the key ContentHash, the CDN goes to nodeId = ContentHash or a nearby node.
This is called the Greedy approach because the CDN tries to get to the desired destination as quickly as possible with the minimum number of hops at the virtual (overlay) network level. This greedy approach is what eventually leads to various problems, and we will see later that if the approach is less greedy (i.e. less aggressive) in reaching the destination node, then the problems of the greedy approach are minimized.
Let’s take an example to solidify the concept (see the diagram below): if the CDN software wants to go to nodeId = 58, but the CDN routing table does not have the nodeId=58-to-IP-address mapping, then the CDN chooses the nodeId that is closest (mathematically) to the destination nodeId. In this case, it is nodeId 60. This is the best bet that the CDN has, with the hope that nodeId 60 may actually know how to communicate with nodeId 58, or even better, nodeId 60 itself may have the next key (meta-data) that the CDN may be looking for.

Page 145 of 197

L09c-Q11. Ok, what are the problems with the Greedy approach of CDN?
The first problem with the greedy approach of CDN is meta-data server overload. If a video becomes suddenly very popular (e.g. the Gangnam Style video on YouTube), then it will result in a lot of putkey() operations for the same contentHash, which will all go to the same meta-data server, causing meta-data server overload.
Another problem with the greedy approach is that it not only overloads the meta-data server but also overloads the network, causing network congestion on the path from the intermediate nodes to the destination node. The network congestion has the form of a tree that is rooted at the destination meta-data server, and the nodes and the network near the root of this tree get congested. This is referred to as the Tree-saturation effect.
The problem happens when multiple different videos have contentHashes that are mathematically close to each other, causing multiple putkey() operations with keys close to each other and thereby resulting in meta-data server overload. See the left diagram below.
The problem also happens when multiple users view the same video, causing multiple getkey() operations on the same contentHash (aka key) and thereby resulting in meta-data server overload. See the right diagram. This will also cause the origin server hosting the actual content to be overloaded, referred to as the origin server overload problem.
A related term is the Slashdot effect: a popular website (like the Slashdot news website) links to a smaller site, causing a massive increase in inbound traffic to the smaller site, which becomes slow or temporarily unavailable. The same phenomenon is also called the Reddit effect, and nowadays the generic term used is Flash crowd [5].
To summarize, the greedy approach of using the minimum number of hops to the destination node leads to meta-data server overload, origin server overload and network congestion. The network congestion causes the tree-saturation effect.
Mnemonic: Meta-data server overload, Origin server overload, Tree-Saturation effect.

Page 146 of 197

L09c-Q12. Hey, how does CDN help if I am getting live information on sport events or latest news? Aha, the problem of getting stale information from a CDN is solved by re-routing the user’s request for a web-site to the geo-local mirror of that web-site so that the web-site (origin server) is not overloaded. Content providers like CNN.com buy such CDN services from CDN providers like Akamai that automatically mirror the latest information to their geo-local mirror website, thus serving as a CDN cache for live CNN.com content. The use of geo-local CDN mirror services is expensive and hence companies like Amazon, Google, etc. build their own CDN networks for caching the content they produce. These companies use approaches similar to non-greedy Coral approach that we will describe next. Mnemonic: Live content request redirected to a geo-local mirror (e.g. Akamai)

Page 147 of 197

L09c-Q13. I am eager to learn about Coral’s non-greedy approach. Tell me more about it.
The non-greedy approach used by Coral CDN is implemented using a DHT called a Sloppy DHT.
Recollect that the traditional DHT has its putkey() and getkey() operations satisfied by nodes whose nodeId is mathematically close in range to the ContentHash (key). Example: store the metadata at the nodeId close to 149, i.e. 150.
In contrast, the sloppy DHT may have its putkey() and getkey() operations satisfied by nodes whose nodeId is mathematically far apart from the ContentHash (key). Example: store the metadata at nodeId = 1000 (say), which is far apart from 149.
The sloppy DHT implementation spreads the meta-data load so that, in the democratic process of helping one another store the meta-data, no single node is overloaded and the network near it is not saturated.
The exact nodeId selection (1000 in the above example) is done using a novel key-based routing algorithm, which is described next. The key-based routing algorithm calculates the XOR distance between the source node doing the put/get operation and the destination node for the put/get operation. Recall that the source nodeId and destination nodeId are 160-bit SHA-1 hash values, and hence the XOR operation is a quick and efficient operation, as compared to subtraction, for computing the distance between the source nodeId and destination nodeId.
Example: If source nodeId = 14 and destination nodeId = 4, XOR distance = 14 XOR 4 = 10. The bigger the XOR value, the larger the distance between source and destination in the application namespace.

Page 148 of 197

L09c-Q14. How is the Key-based Routing approach non-greedy?
The greedy approach is to get from the source nodeId to the destination nodeId with the fewest number of hops, using the user-level routing table that has information on directly reachable nodes. The Coral key-based routing approach does not rush from the source nodeId to the destination nodeId and hence is non-greedy.
In contrast to the greedy approach, the Coral key-based routing approach slowly progresses non-greedily on each hop by going approximately half the remaining distance towards the destination nodeId.
Example: See the diagrams below: source nodeId = 14, destination nodeId = 4.
The greedy approach rushes as fast as possible to go from 14 to 4 and optimizes only its own lookup – it does not care about causing problems to others and hence is called greedy. See the figure on the left below.
The Coral non-greedy approach goes half the distance from source to destination on each hop and thus tries to be a good citizen, caring for and not causing problems to the infrastructure, i.e. from 14 to 5 to 2 to 1. See the calculation in the figure on the right below.
Greedy Approach

Coral’s Non-Greedy Approach
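A simplified sketch of the key-based routing idea (this is not Coral's actual algorithm or routing-table format): at every hop, aim for a known node roughly half the remaining XOR distance away instead of jumping straight to the node closest to the key. With the made-up membership list below, the walk from node 14 towards key 4 takes several shrinking hops rather than one greedy jump.

def xor_dist(a, b):
    return a ^ b

def next_hop(current, key, known_nodes):
    remaining = xor_dist(current, key)
    if remaining == 0:
        return current
    target = remaining // 2                     # try to cover about half the distance
    # among known nodes that are closer to the key than we are, pick the one whose
    # hop distance from us is closest to that half-way target
    candidates = [n for n in known_nodes if xor_dist(n, key) < remaining]
    return min(candidates, key=lambda n: abs(xor_dist(current, n) - target))

# walk from node 14 towards key 4 through whatever nodes this toy peer knows about
hops, node, known = [14], 14, [0, 2, 5, 7, 11]
while xor_dist(node, 4) > 0:
    node = next_hop(node, 4, known + [4])       # the destination itself is also reachable
    hops.append(node)
print(hops)   # [14, 7, 5, 4]: several shortening hops instead of one greedy jump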

Page 149 of 197

L09c-Q15. What happens if the user-level routing table of an intermediate node does not have an entry for the next hop?
If there is no direct path to the next hop in the routing table, then the Sloppy DHT implementation goes to a node whose nodeId is approximately half the distance towards the final destination. Note that the distance metric used here is NOT physical distance but distance in the nodeId namespace. That is, if the Sloppy DHT wants to go to nodeId = 5 and it is not in its CDN routing table, then it will search for a mathematically close entry and, say it finds nodeId = 4, then it will use that entry and hop to nodeId = 4. In short, the Sloppy DHT hops to a node that is mathematically close to the desired next hop.
Example: See the diagram below. Source nodeId = 14, destination nodeId = 4. Instead of rushing from 14 to 4 using the greedy approach, the key-based routing approach slowly progresses non-greedily on each hop by going approximately half the remaining distance towards the destination nodeId.
The XOR distance between nodeIds 14 and 4 = 10, and 10 / 2 = 5. So Coral CDN wants to go to a nodeId that is at distance 5 from 14. But there is no such nodeId in its routing table. So Coral checks for any other nodeId that is at a distance of approximately 5 from 14. It finds nodeId=0, which is at distance 4 from nodeId=14, and hops to nodeId=0.
NOTE: In the figure below, do NOT confuse the top row, which lists nodeIds, with the table below it, which shows the distances from the source nodeId=14.

Page 150 of 197

L09c-Q15. Continued.
Next, Coral CDN calculates the XOR distance between nodeIds 0 and 4 = 4, and 4 / 2 = 2. Now nodeId=0 has information about nodeIds={4,5,7}, none of which is at exactly distance 2 from the destination nodeId=4. So, Coral CDN checks if there is a nodeId with a distance less than 2 but nearest to 2 mathematically. Note that Coral CDN does NOT use a distance higher than 2 because it wants to get closer to the destination … obviously. In short, Coral CDN chooses a node that is close enough in distance to the destination nodeId.
Note that Coral CDN could have jumped directly from nodeId=0 to the destination nodeId=4, but the Coral CDN approach is NON-greedy, so it goes to the intermediate nodeId=5 and then to the destination nodeId=4.
The tradeoff disadvantage of the Coral approach is that it increases the latency in reaching the desired destination due to the increased number of hops; the tradeoff advantage is that it results in the common good of less load on the network and servers. Thus, the Coral approach places the common good of less network and server load above its personal interest of better latency in reaching the desired destination node.
The Coral approach reminds me of the Dalai Lama, who says: “Be kind whenever possible. It is always possible.” So, in the future, the Dalai Lama will remind me of the Coral approach and vice-versa. :-)


L09c-Q16. What are the primitives for the Coral Sloppy DHT? Note that the put() and get() primitives of the Coral Sloppy DHT have the exact same interface as the put() and get() primitives of a standard DHT, but the implementation semantics of the Sloppy DHT are completely different from those of the standard DHT. Recall the two primitives of a DHT: putkey() and getkey():
a. putkey(ContentHashKey, ContentNodeId): The ContentHashKey is the key/hash of the content that you want to store as a cache in the CDN, and the ContentNodeId is the node id (proxy) of where the content is stored.
b. ContentNodeId = getkey(ContentHashKey): The ContentHashKey is the key/hash of the content that you want to get from the CDN cache repository.
The putkey() can be:
1. initiated by the origin server with new content, OR
2. initiated by a node that just downloaded the content and wants to serve as a proxy for the content, so that it can help in reducing the load on the origin server.
This putkey() operation is performed in a way that avoids metadata server overload. In short, the putkey() operation by a node announces the willingness of that node to become a Proxy that serves the content whose signature is the Key, in order to reduce load on the origin server.
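In C-like terms, the interface might look like the sketch below. The typedef and function names are illustrative only; they paraphrase the primitives described in these notes, not Coral's real API.

/* Sloppy-DHT primitives: same interface as a regular DHT, different
 * implementation semantics underneath. */
typedef unsigned long content_key_t;   /* hash/signature of the content         */
typedef unsigned long node_id_t;       /* id of a node (origin server or proxy) */

/* "Node content_node is willing to serve the content whose hash is key." */
void putkey(content_key_t key, node_id_t content_node);

/* Returns the nodeId of some node currently serving the content. */
node_id_t getkey(content_key_t key);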


L09c-Q17. So, how does the putkey() operation avoid metadata server overload? First, let’s define 2 states for any node:
1. Full: A node is considered to be Full for a key if it already stores the maximum of L values for that key, i.e. L = how many values the node is willing to store for a particular key.
2. Loaded: A node is considered to be Loaded for a key if it already sees the maximum request rate of B for that key, i.e. B = how many requests per unit time the node is willing to entertain for a particular key.
Note that the values of L and B are pre-configured for a Coral CDN instance. Observe that “Full” is a Space metric and “Loaded” is a Time metric. As the Coral CDN’s putkey() operation slowly progresses non-greedily to the destination nodeId, it checks the state of each intermediate hop (nodeId), and if that hop is either Full or Loaded, then it infers that the remaining network path to the destination (origin server) is all clogged up because of tree saturation. So it retracts and performs putkey() at the previous hop (nodeId). In short, there are 2 phases of the putkey() operation:
1) Forward phase: Slowly progress non-greedily by going approximately half the distance towards the final destination, until an intermediate node is either Full or Loaded.
2) Retract phase: If an intermediate node is either Full or Loaded, then retract backwards to the previous hop, recheck the state of that previous nodeId, and if it is neither Full nor Loaded, perform putkey() on that nodeId.
This is how the putkey() operation avoids server and network overload.
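The forward/retract behaviour can be sketched as a toy, self-contained simulation (not Coral's code): the chain of half-distance hops is pre-computed as an array, the nodeIds in it are made up for illustration, and the re-check of the previous hop's state during retraction is left out for brevity.

#include <stdbool.h>
#include <stdio.h>

struct hop {
    unsigned node_id;
    bool     full;     /* already stores L values for this key     */
    bool     loaded;   /* already sees B requests/sec for this key */
};

/* Returns the nodeId where the <key, value> metadata finally gets stored. */
static unsigned sloppy_putkey(const struct hop *path, int nhops,
                              unsigned key, unsigned value)
{
    int i = 0;

    while (i < nhops - 1) {
        i++;                                   /* forward phase: next half-distance hop */
        if (path[i].full || path[i].loaded) {  /* saturation ahead (tree saturation)?   */
            i--;                               /* retract phase: back to previous hop   */
            break;
        }
    }
    printf("putkey(%u, %u) stored on nodeId %u\n", key, value, path[i].node_id);
    return path[i].node_id;
}

int main(void)
{
    /* A node announces itself as a proxy for key 100; the destination turns
     * out to be Full, so the metadata lands one hop earlier. */
    struct hop path[] = {
        { 60, false, false },   /* source                        */
        { 35, false, false },   /* intermediate hop              */
        { 20, false, false },   /* intermediate hop              */
        { 12, true,  false },   /* destination: Full -> retract  */
    };
    sloppy_putkey(path, 4, 100, 60);
    return 0;
}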

L09c-Q18. How does the getkey() operation avoid metadata server overload? On every hop, the Coral CDN’s getkey() operation slowly progresses non-greedily by going approximately half the distance towards the final destination, in the hope that it will find the key somewhere along the way, i.e. that some intermediate node is already serving the metadata for that particular key. If not, then the getkey() operation will get the metadata for the particular key from the final destination, since nobody has retrieved the key before. But the hope is that if the content is popular enough, then multiple proxies will have gotten the Key-Value pair, and when they got the content, they will in turn have performed putkey() with their own nodeId as a potential proxy for the content. This will cause a metadata server to appear as an intermediate node along the path to the destination node if the content has been retrieved by somebody else earlier. Thus, the Coral CDN avoids metadata server overload by distributing the putkey() and getkey() operations in a democratic manner, so that the load of serving both as a metadata server and as a content server gets naturally distributed.
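A matching sketch of the lookup side (again illustrative, not Coral's code): the walk takes the same half-distance hops, but returns as soon as any intermediate node already holds metadata for the key, so the original metadata server is often never bothered. The nodeIds below are made up.

#include <stdio.h>

struct meta_hop {
    unsigned node_id;
    int      has_key;        /* does this hop hold <key, proxy> metadata? */
    unsigned proxy_node_id;  /* the value stored for the key, if any      */
};

/* Returns the nodeId of a content proxy, or 0 if no hop knows the key. */
static unsigned sloppy_getkey(const struct meta_hop *path, int nhops)
{
    for (int i = 0; i < nhops; i++)
        if (path[i].has_key)            /* found metadata before the destination */
            return path[i].proxy_node_id;
    return 0;
}

int main(void)
{
    /* The second hop is already a metadata server for the key, so the walk
     * stops there and the origin metadata server at the end is not loaded. */
    struct meta_hop path[] = {
        { 21, 0, 0  },
        { 17, 1, 60 },   /* intermediate metadata server -> proxy nodeId 60 */
        { 12, 1, 30 },   /* destination: original metadata server           */
    };
    printf("content proxy = nodeId %u\n", sloppy_getkey(path, 3));
    return 0;
}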

L09c-Q19. Enough of theory, could you show me an example now? Ok, here we go… let’s take some examples and show the Coral CDN in action. Let’s say Naomi, who is at nodeId=30, has some interesting content that she wants to share with the world, and she has created the ContentHashKey=100 for the interesting content. So, Naomi performs the operation: putkey(ContentHashKey=100, ContentNodeId=30). This results in a series of RPC calls. Notice the numbers in the figure below that match the steps below.
1) Try an intermediate hop, which returns the list of next hops.
2) Select the next hop that is half the distance to the destination nodeId and get the list of next hops.
3) Finally, reach the destination node, David, which hosts this key-value pair of (100, 30). Note that David’s node is neither Full nor Loaded, so it is ready to serve as a metadata server.


L09c-Q20. Hmm, so how does getkey() work now? Another user, Jacques, wants to get the interesting content, whose ContentHashKey=100. He knows the ContentHashKey=100 of the content, but not which node serves it, so Jacques issues the operation: ContentNodeId = getkey(ContentHashKey=100). Coral’s key-based routing approach then issues a series of RPC calls. Notice the numbers in the figure below that match the steps below.
1) Get the list of next hops from this intermediate node.
2) Get the list of next hops from this intermediate node.
3) Reached David. David says: Hey, I have the Key-Value pair that you are looking for, and here is the associated ContentNodeId=30, which is Naomi’s computer.
So, then Jacques goes to ContentNodeId=30 (Naomi’s computer) and gets the content, as shown in the two figures placed next to each other below.


L09c-Q21. So, does the story end here? No, Jacques tries to be “nice and kind” (remember, Dalai Lama) and says: “I want to serve as a proxy for Naomi”. So, Jacques performs the operation: putkey(ContentHashKey=100, ContentNodeId=60), where nodeId of Jacques=60. This results in a series of RPC calls, slowly progressing non-greedily to the destination nodeId: Notice the numbers in the figure below that match the steps below. 1) Get the list of next hops from this intermediate hop, assuming it is neither full nor loaded. 2) Get the list of next hops from this intermediate hop, assuming it is neither full nor loaded. 3) Reached David. Let’s assume that David is either full or loaded. Therefore, the previous hop is used as the metadata server for the content nodeId=60 (Jacques node).


L09c-Q22. Hmm, do we have two metadata servers now: David and Jacques? Correct! We have two metadata servers: David and Jacques for the content that is hosted by Naomi. But note that the content is also hosted at Naomi and Jacques and so there are two content servers too. Say, another user, Kamal, now wants to get the interesting content and issues the following operation to Coral CDN: ContentNodeId = getkey(ContentHashKey=100) This results in RPC calls to get to the destination nodeId=30 for Naomi but the intermediate nodeId=60 (Jacques), who is serving as the content proxy, returns the content to Kamal and hence Kamal does not have to go all the way to nodeId=30 (Naomi). If Kamal decides to be a good Samaritan, then the Coral CDN will now have 3 metadata servers. In general, the hope is that the getkey() operation hits one of the intermediate metadata servers and that way the request for the actual content may go to different content proxy servers dynamically as the system evolves over time. As a result, the metadata server load gets distributed and the origin content server is also NOT overloaded. Thus, the Coral CDN’s non-greedy approach places the common good of less network and server load to be more important than its personal interest of better latency in reaching the desired destination node. In other words, the Coral CDN system evolves dynamically to reduce the stress on both the origin content server as well as the metadata server by naturally and dynamically distributing it.




Illustrated Notes for L10a: TS Linux: Time-sensitive Linux L10a-Q1. Why do we need Time-Sensitive Linux? Traditionally general-purpose operating systems (OS) have catered to the needs of: 1. Throughput oriented applications: E.g. Databases, Scientific applications, etc. that are throughput-sensitive. Nowadays, we want the operating systems to also cater to the needs of: 2. Real-time oriented applications: e.g. Synchronous Audio/Video players, video games, etc. that are latency-sensitive and need soft real-time guarantees on performance. Latency-sensitive applications are time-sensitive and require quickly responding to an event. e.g. Shooting at a target in a video game requires the event to appear instantaneously on the screen. Time-Sensitive Linux is an extension of the commodity general-purpose Linux operating system that addresses 2 questions: 1. How to provide soft real-time guarantees for real-time applications in the presence of background throughput oriented applications? 2. How to bound the performance loss of throughput oriented applications in the presence of latency-sensitive applications? Note: The tradeoff is between the OS being Throughput-sensitive OR Latency-sensitive.


L10a-Q2. What are the 3 sources of Latency in an OS? There are 3 sources of Latency in an OS: Mnemonic: T-P-S:
1. Timer Latency: due to the granularity of the timer mechanism in the OS. The event of interest happens at time T-H, but because of the timer granularity the timer interrupt only goes off at time T-T. For instance, periodic timers tend to have a 10 millisecond granularity in the Linux OS.
2. Preemption Latency: due to the non-preemptibility of the OS at the moment the timer interrupt happened. The Preemption Latency comes from the fact that the timer interrupt could have happened while the kernel was in the middle of doing something from which it cannot be preempted (aka a critical section), OR while the kernel was in the middle of handling another higher priority interrupt. Due to the Preemption Latency, the timer interrupt actually gets handled only at time T-P.
3. Scheduler Latency: Once the OS has handled the timer interrupt, the OS scheduler has to schedule the application process waiting for the timer interrupt, so that the application can take appropriate action for this external event. But the application may not be scheduled immediately, because another higher priority process is already waiting in the OS scheduler’s queue and needs to be serviced first. Due to the Scheduler Latency, the application is activated only at time T-A.
The time difference from Event Occurrence to Application Activation is what prevents the application from reacting in a time-sensitive manner. It is extremely important to shrink this time difference, i.e. shrink the difference between the Event time and the Activation time = (T-A – T-H).


L10a-Q3. What are the different types of Timers available in an OS? Typically, there are 4 types of timers available in an OS:
1. Periodic timer: Pro: Periodicity: The timer interrupts the OS periodically at regular intervals. Con: Event recognition latency: Because of the granularity of the periodic timer, the event is recognized at a much later point in real time than when it actually happened. As an example, if the timer granularity is 10 ms, then event recognition can happen only at 10 ms intervals and NOT in-between the 10 ms time periods. That is, the worst-case latency = the periodicity of the Periodic timer itself. Analogy: You “periodically” check email every 5 minutes.
2. One-shot Timer (Exact timer): A One-shot timer is an exact timer that can be programmed to be triggered exactly when we want the event to be delivered. Pro: Timeliness (exactness/preciseness). Con: Extra interrupt overhead for the OS to field these interrupts and to reprogram them. Analogy: You get a “one-shot” pop-up on your screen when you receive an email.
3. Soft Timer: For Soft Timers, the OS polls at strategic times, like system calls into the OS, or external interrupts to the OS (say, network packet arrival), to check if there is an external event of interest. Pro: Reduced overhead, since there are no timer interrupts to be handled periodically by the OS. Con: Increased polling overhead, since the OS has to poll periodically, and increased latency in the OS being triggered for an event after the event has happened. Analogy: You “softly” check email whenever you take a break, typically, say, every 45 minutes.
4. Firm Timer (the new mechanism proposed in TS Linux; combines the pros of all timers): Firm Timer = Soft Timer + One-shot Timer + Periodic Timer. Analogy: You “firmly” do what is necessary, combining all approaches.


L10a-Q4. Firm Timer seems to be Cool. Could you describe it? Firm Timer is a new mechanism proposed in TS Linux. It combines One-shot Timer and Soft-Timer to provide accurate timing with very low overhead. The Periodic Timer has the con of event recognition latency and hence let’s not use it. The One-shot Timer has the con of processing overhead of reprogramming the timer. The Soft-Timer has the con of polling overhead and increased latency. So let’s combine Soft-Timer with One-Shot Timer to get Firm Timer. In Time-Sensitive Linux, the Firm Timer Design has a knob called the Overshoot parameter, which is the time-interval between the actual event happening and the point at which the one-shot timer is programmed to interrupt the CPU. Within the Overshoot parameter time window, there could be a system call, which is a soft interrupt. Note that applications typically make system calls and it is likely that the OS will be in the kernel space within the Overshoot parameter time window, which is when the expired timers will be dispatched and the one-shot timer will be reprogrammed for the next one-shot timer interrupt. Since system calls happen frequently, the OS does the expired timer processing when the system calls happen and thus avoids the expensive one-shot timer interrupt processing. However, if a system call does not happen in a timely manner, then the one-shot timer interrupt gets triggered, the expired timer processing gets done and the one-shot timer gets reprogrammed for the next one-shot timer interrupt. Thus, the Firm Timer combines the pros of Soft-Timer with One-Shot Timer to provide accurate timing using One-Shot Timer, but at the same time avoids the overhead associated with One-Shot Timer by processing expired events within the overshoot parameter time window when Soft-timers like system calls or external interrupts happen. In other words, the Firm Timer gets the accuracy of One-Shot Timers along with the low overhead of Soft-Timers. By choosing the Overshoot parameter (knob) value between One-Shot (Hard) Timer and Soft Timer, we reduce the number of times that the one-shot timer actually interrupts the OS.
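The interplay described above can be seen in a tiny, self-contained simulation (illustrative only, not TS-Linux code): the one-shot interrupt is deliberately programmed "overshoot" ticks after the expiry, and if a system call (a soft-timer polling point) happens first, the expiry is handled there and the hard interrupt has nothing left to do.

#include <stdio.h>

#define OVERSHOOT 3                     /* ticks of slack before the hard one-shot fires */

static int expiry   = 10;               /* when the timed event is due                    */
static int one_shot = 10 + OVERSHOOT;   /* one-shot interrupt deliberately programmed late */
static int fired    = 0;

static void dispatch_expired(int now, const char *why)
{
    if (!fired && now >= expiry) {
        fired = 1;
        printf("tick %2d: timer dispatched via %s\n", now, why);
    }
}

int main(void)
{
    for (int tick = 0; tick <= 20; tick++) {
        /* A system call happens to occur at tick 11, inside the overshoot
         * window, so the soft-timer path handles the expiry cheaply. */
        if (tick == 11)
            dispatch_expired(tick, "soft timer (system call)");

        /* The one-shot APIC interrupt is only the backstop: it does the work
         * only if no soft-timer poll got there first. */
        if (tick == one_shot)
            dispatch_expired(tick, "one-shot interrupt");
    }
    return 0;
}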


L10a-Q5. How is the Firm Timer actually implemented? In Linux, the term task is used to signify a schedulable entity. The Firm Timer implementation contains a timer_q data structure: a linked list of task_q structures, sorted by ascending expiry time, i.e. earliest expiry time first. Fleshed out with concrete field types (the field names follow these notes; the real kernel structure differs), it looks roughly like this:
struct task_q {
    const char     *taskName;
    unsigned long   taskExpiryTime;       /* the queue is kept sorted on this field   */
    void          (*taskHandler)(void);   /* callback run when the timer expires      */
    struct task_q  *nextTaskPointer;
};

In the figure below, Task T1 expires first, Task T2 expires next, and then Task T3 expires after that. This is the way the OS kernel maintains the timer_q data structure, so that it knows when a particular task’s expiry time is due for processing the event associated with that task. The basis for the Firm Timer implementation is the availability of the APIC hardware. APIC stands for Advanced Programmable Interrupt Controller and is implemented on-chip in modern CPUs, starting from the Intel Pentium onwards. The advantage of the Firm Timer using the APIC is that reprogramming a one-shot timer takes only a few cycles and hence is inexpensive. So, when the APIC (hardware) timer expires, the interrupt handler goes through the timer_q data structure, looks for tasks whose timers have expired, calls the corresponding callback handlers for these tasks, and removes them from the timer_q queue. If a task in the timer_q queue corresponds to a periodic timer, then it is removed from timer_q, its callback handler is processed, its expiry time is updated, and the task is re-enqueued onto timer_q. If a task in the timer_q queue corresponds to a one-shot timer, then the interrupt handler will reprogram the one-shot timer for the next one-shot event.
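A sketch of that expiry walk, in terms of the task_q structure defined above (illustrative only; the real TS-Linux code differs):

#include <stddef.h>

/* Assumes the struct task_q definition shown above. timer_q is the head of
 * the queue, kept sorted by ascending taskExpiryTime. */
static struct task_q *timer_q;

/* Called from the APIC one-shot interrupt handler, or from a soft-timer
 * polling point such as a system call: dispatch every expired task. */
static void process_expired_timers(unsigned long now)
{
    while (timer_q != NULL && timer_q->taskExpiryTime <= now) {
        struct task_q *t = timer_q;
        timer_q = t->nextTaskPointer;   /* dequeue the earliest entry */
        t->taskHandler();               /* run its callback           */
        /* A periodic task would be re-enqueued here with an updated expiry
         * time; for a one-shot task the APIC timer would be reprogrammed
         * for the next earliest expiry remaining in the queue. */
    }
}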


L10a-Q5. How is the Firm Timer actually implemented? (Continued) Note that the APIC timer is a hardware-based timer and it works by setting a value into a register which is decremented at each memory bus cycle until it reaches 0, at which point it generates an interrupt. The APIC timer has a theoretical accuracy of 10 ns, but the actual timer interrupt processing time is significantly higher, which becomes a limiting factor in the granularity that can be obtained with one-shot timers implemented using the APIC hardware. But still, the APIC hardware allows implementation of very fine-grained timers in the OS. By choosing an appropriate Overshoot parameter value for reprogramming the APIC timer, we can eliminate the need for fielding One-Shot Timer interrupts, by using the occurrence of soft timers (syscalls, external interrupts) going off within that Overshoot time period. Another optimization is to dispatch a One-Shot event at a preceding Periodic event. That is, if a One-Shot event is coming up fairly soon, then simply dispatch that One-Shot event at the preceding Periodic event. This approach has the following advantages: 1) Processing of Periodic events is very efficient [O(1)] as compared to the processing of One-Shot events which is less efficient [O(log n)]. This approach also helps avoid the overhead of dealing with One-Shot event and the cost of reprogramming the One-shot event. 2) By choosing the appropriate Overshoot parameter, we can eliminate the need for fielding the One-shot timer interrupts if Soft timers (system calls, external interrupts) go off within that overshoot time period. 3) If the distance between One-shot timers is really long, then instead of using One-shot timers, we simply use Periodic timers and dispatch the One-shot event at the preceding periodic timer event. This is how the Firm Timer implementation reduces the Timer Latency, the first component of the latency from the point of event occurrence to event activation.


L10a-Q6. How is the Kernel’s Preemption Latency reduced in Time-Sensitive Linux? Recollect the 3 types of Latencies: T-P-S: Timer, Preemption and Scheduling latencies. The Preemption Latency comes from the fact that the timer interrupt could have happened while the kernel was in the middle of doing something from which it cannot be preempted (aka a critical section), OR while the kernel was in the middle of handling another higher priority interrupt. The Preemption Latency is reduced in Time-Sensitive Linux using the following approaches:
1. Explicitly insert preemption points in the kernel code. The preemption points are where the kernel actually looks for events that may have gone off and processes them.
2. Allow preemption of the kernel at all times when it is not manipulating shared data structures.
Robert Love’s Lock-breaking Preemptible Kernel combines these two ideas to reduce Preemption Latency. e.g. A long critical section can be broken up into shorter critical sections so that the kernel becomes preemptible between the shorter critical sections. This is the time to preempt the kernel and a great opportunity to check for expired timers, dispatch them, and reprogram one-shot timers, if required.
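The lock-breaking idea can be sketched in user space as follows. This is only an analogue of Robert Love's kernel patch, not the patch itself: pthread mutexes and sched_yield() stand in for the kernel's spinlocks and its preemption/timer check.

#include <pthread.h>
#include <sched.h>

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_table[2];

/* A long critical section split into two shorter ones, with an explicit
 * "preemption point" in between where, in the kernel, expired timers would
 * be dispatched and one-shot timers reprogrammed. */
static void update_big_shared_table(void)
{
    pthread_mutex_lock(&table_lock);
    shared_table[0]++;                 /* first half of the critical work   */
    pthread_mutex_unlock(&table_lock);

    sched_yield();                     /* stand-in for the preemption point */

    pthread_mutex_lock(&table_lock);
    shared_table[1]++;                 /* second half of the critical work  */
    pthread_mutex_unlock(&table_lock);
}

int main(void) { update_big_shared_table(); return 0; }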


L10a-Q7. How is the Scheduling Latency reduced in Time-Sensitive Linux? Time-Sensitive Linux uses a combination of 2 principles to reduce the Scheduling Latency:
1. Proportional Period Scheduling: On process startup, allocate a fixed proportion Q within the time window T to the process/task. In the figure below, on startup, task T1 says it needs a proportion Q=2/3 of the CPU time to be allocated to it in every time quantum T, and task T2 says it needs a proportion Q=1/3 of the CPU time to be allocated to it in every time quantum T. The time quantum T is exposed to the application. The scheduler provides temporal protection by allocating each task a fixed proportion Q of the CPU during each task period T. If the scheduler does not have sufficient capacity, the process’s request to the scheduler will fail. To summarize, on process startup, the scheduler does Admission Control (a small sketch follows below). The proportion Q and time quantum T are parameters adjustable using a feedback control mechanism, so that the accuracy of the scheduling analysis performed on behalf of the time-sensitive processes improves. An advantage of Proportional Period Scheduling is that the TS Linux OS can control how much of the CPU time is devoted to time-sensitive tasks, so that the OS can reserve a portion of the time for throughput-oriented tasks and can thus balance well between supporting the timeliness of time-sensitive tasks and ensuring that throughput-oriented tasks are able to make forward progress.
2. Priority-based Scheduling: The scheduler schedules higher priority processes before lower priority processes. However, one problem with Priority-based Scheduling is the problem of Priority Inversion. Let’s take an example to explain Priority Inversion. Say a high priority task C1 makes a blocking call to a low priority server C2. However, this low priority server C2 is preempted by a medium priority task C3. This creates the problem of the high priority task C1 not being able to run, since it is waiting indirectly for the medium priority task C3 to complete. This is the Priority Inversion problem from the point of view of the high priority task C1. The solution for the Priority Inversion problem is to boost the low priority of server C2 to be equal to the high priority of client C1 for the duration of the time that server C2 services client C1’s request.
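Admission control for Proportional Period Scheduling reduces to a simple utilization check, sketched below as a toy model (not the TS-Linux scheduler): a task asking for proportion Q of every period T is admitted only while the summed utilization stays at or below 1.

#include <stdbool.h>
#include <stdio.h>

static double admitted_utilization = 0.0;   /* running sum of Q_i / T_i */

static bool admit(double Q, double T)
{
    double u = Q / T;
    if (admitted_utilization + u > 1.0)
        return false;                        /* reject: not enough capacity */
    admitted_utilization += u;
    return true;
}

int main(void)
{
    printf("T1 (Q=2, T=3) admitted? %d\n", admit(2, 3));   /* 1: total 2/3       */
    printf("T2 (Q=1, T=3) admitted? %d\n", admit(1, 3));   /* 1: total 1         */
    printf("T3 (Q=1, T=4) admitted? %d\n", admit(1, 4));   /* 0: would exceed 1  */
    return 0;
}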


L10a-Q8. Any key insights? Some key insights are:
1. TS Linux provides Quality of Service (QoS) guarantees for real-time applications running on a commodity OS such as Linux.
2. Using Admission Control in Proportional Period Scheduling, TS Linux ensures that throughput-oriented tasks are not shut out from getting CPU time and are able to make forward progress even in the presence of time-sensitive, latency-oriented tasks.
3. The 3 major ideas enshrined in TS Linux for dealing with time-sensitive tasks are:
a) The Firm Timer design, which increases the accuracy of the timer without exorbitant overhead in dealing with One-Shot Timer interrupts.
b) Using a preemptible kernel to reduce the kernel Preemption Latency.
c) Using Priority-based Scheduling (with priority boosting to avoid Priority Inversion) and using Proportional Period Scheduling to guarantee that a portion of the CPU is allotted to both time-sensitive and throughput-oriented tasks.


Illustrated Notes for L10b: PTS: Persistent Temporal Streams L10b-Q1. What are the standard programming paradigms? Can they be used for emerging distributed multimedia applications? The standard programming paradigms are:
1. Parallel programs: the Pthreads API.
2. Distributed programs: the sockets API and the RPC API (the RPC API is built on top of the sockets API). Con: The sockets API is too low-level and does NOT have the semantic richness needed for emerging, novel distributed multimedia applications.
3. Distributed multimedia applications: the PTS API. PTS = Persistent Temporal Streams. The PTS API provides a simple programming model for live stream analysis. PTS allows us to capture, prioritize, process and propagate temporal causality throughout the system. In this lesson, we will study more about Persistent Temporal Streams, aka PTS.


L10b-Q2. What are the characteristics of the novel multimedia applications? Novel multi-media applications tend to be based on distributed sensors like: Temperature sensors, Humidity sensors, Cameras, Microphones, Radars, etc. These sensors generate continuous stream of data and applications that perform live-stream analysis of such continuous stream of data are called Situation-aware applications. Situation-aware applications perform the following steps: 1. They perform real-time sensing of input data, called “sensing” 2. Then, they prioritize the sense data to figure out what data are important or more interesting than others. 3. They devote more computational resources to important and interesting data and based on the analysis, they take some action – this is called “actuation”. 4. Part of the feedback loop may be feedback to the sensors themselves to re-target and perform some changes to the sensors, e.g. move the camera towards an object that is moving. Thus Sensors trigger Actuators in a real-time manner and there is a need to shrink the latency from Sensing to Actuation, so that we can take actions based on the sensed data. Since such situation-aware applications are computationally intensive, cloud-based clusters are used to provide the horsepower to run these large-scale, novel, distributed, sensor-based, multi-media applications. Properties of Situation-aware applications: Real-time, Sensor-based, Distributed, Computationally-intensive


L10b-Q3. Can you explain an example of a large-scale situation awareness application? A good example of a large-scale situation-awareness application is Monitoring, e.g. monitoring activities in an airport. The requirement is to monitor and report any abnormal (anomalous) events by sending triggers to software agents or humans. For instance, the city of London has 400K cameras. The amount of data generated by these cameras continuously, 24/7, can overload the infrastructure. Hence, to avoid overload, the data from these sensor streams is pruned at the source. Having humans monitor the large number of cameras is impractical due to the large cognitive overload. Another important problem to avoid is False positives and False negatives. A False positive happens when the system thinks an event is anomalous but it is not. A False negative happens when the system thinks an event is non-anomalous but it actually is anomalous. False positives are harmless. False negatives are harmful. False positives and False negatives are important metrics in a large-scale situation awareness application. In short, the typical problems with large-scale situation awareness applications are:
1. Infrastructure Overload,
2. Cognitive Overhead in manual monitoring,
3. False +ves (harmless) and False –ves (harmful) metrics.


L10b-Q4. What are the requirements of the programming infrastructure for such Large-scale situation awareness applications? The programming model for situation awareness application has the following requirements: (pain-points that need to be solved) 1. Provide simple and easy-to-use programming abstractions that allow seamless migration of computation between sensors at network edges and cluster computational resources in a datacenter. 2. Capture the temporal ordering of events, process the time-sensitive data to create digest and propagate temporal causality of events through the network. 3. Correlate live data with historical data for high-level inference. For instance, if we see a speeding car at this point of time on the highway, then we want to ask: Was this car involved in an incident in the last n days?


L10b-Q5. What is the programming model for Situation Awareness applications? Let’s look at the Computation Pipeline for a Video Analytics application. The requirement is to detect and track an anomalous event from the camera video. Say the camera detects an object (e.g. a gun) or a suspicious individual that needs to be tracked in the video. The domain expert has to write programs using a sequential computation pipeline of detection, tracking and recognition algorithms that can handle the scale of processing video streams from thousands of camera sensors in real time, derive high-level inference and finally generate appropriate alarms. Persistent Temporal Streams (PTS) is an example of a Distributed Programming System that caters to the needs of Situation-Awareness applications in a scalable and simple-to-use manner.


L10b-Q6. Hmm, interesting. How is the PTS programming model used? The PTS programming model is a simple model that provides 2 high-level abstractions: 1. Threads (for Computation), and 2. Channels (for Communication). The PTS programming model uses Threads and Channels to form a Computation Graph. This is similar to the UNIX programming model that uses Processes and Sockets to form a Computation Graph. However, the semantics of the Channel abstraction is very different from the Socket abstraction.

A Channel contains a continuous stream of data items that have been snapshotted, with a different timestamp associated with each data item. That is, a Channel contains time-sequenced data items. The way a Channel differs from a Socket is that a Channel allows many-to-many connections between producers and consumers of data items: multiple producers can write to a channel and multiple consumers can read from the channel. The contents of a Channel for a particular thread show the Temporal Evolution of the data produced by that thread. The APIs are:
o putDataItem(dataItem, timestamp)
o getDataItem(lower bound for timestamp, upper bound for timestamp)
o getOldestDataItem(channel)
o getNewestDataItem(channel)
To build a computation pipeline of tasks, a channel from one task is connected to another task. For example, a camera feeds data into a channel that is connected to a recognizer thread, and the getDataItem(lower bound TS, upper bound TS) API helps generate a digest of the required information, which is then tagged as a new data item and time-stamped before it is put on other output channels. Thus, the PTS programming model facilitates the propagation of Temporal Causality in a distributed system. Quite often, in Situation awareness applications, the computation may have to use composite results from multiple data streams in order to do high-level inferencing. The fact that every stream is temporally indexed allows a computation to correlate the incoming streams and recognize which data items in the input streams are temporally correlated to one another, using the timestamps of data items from different channels.
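In C-like terms, the channel interface described above might look as follows. The names and signatures below paraphrase the primitives listed in these notes; they are not the real PTS API.

#include <stddef.h>

typedef struct channel channel_t;    /* opaque, network-wide named channel */
typedef unsigned long  timestamp_t;

/* Append a time-stamped data item to the channel. */
void put_data_item(channel_t *ch, const void *item, size_t len, timestamp_t ts);

/* Fetch items whose timestamps fall in [lo, hi] (live or archived). */
int  get_data_items(channel_t *ch, timestamp_t lo, timestamp_t hi,
                    void *buf, size_t buflen);

int  get_oldest_data_item(channel_t *ch, void *buf, size_t buflen, timestamp_t *ts);
int  get_newest_data_item(channel_t *ch, void *buf, size_t buflen, timestamp_t *ts);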


L10b-Q7. Does PTS support bundling multiple data streams from different sensors? Yes, PTS supports computations that may need to get correspondingly time-stamped data items from multiple, different sensor sources in order to do robust high-level inference. For instance, it could use different modalities of sensing, like a video source, an audio source, a text source and a gesture source. FYI: A modality means any of the various types of sensations, such as vision, hearing, etc. PTS allows multiple streams to be grouped together and labelled as a Stream Group or Stream Bundle. A Stream Group has one “Anchor Stream” and all other streams are “Dependent Streams”. In the figure below, the video stream is the Anchor Stream and the audio, text and gesture streams are Dependent Streams – dependent on the Anchor Stream. The related PTS primitive operation is getStreamGroup(StreamGroup, TimeStamp), which gets correspondingly time-stamped data items from all streams in the specified StreamGroup. Remember that using multiple modalities makes the high-level inference robust.
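A corresponding C-style shape for the stream-group primitive might be (again illustrative only, not the real PTS signature):

typedef struct stream_group stream_group_t;   /* one anchor stream + its dependent streams */

/* Fetch the items from every stream in the group whose timestamps correspond
 * to ts, using the anchor stream as the reference point. */
int get_stream_group(stream_group_t *group, unsigned long ts,
                     void **items_out, int max_items);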


L10b-Q8. What are the various tasks in the computational pipeline of video analytics app? Any system design should be as simple as possible because the power of simplicity is the key for adoption of the framework. Let’s take the example of video analytics sequential program pipeline. Recollect that Channels are named entities that can be used to hold the temporal evolution of output from a particular computation. Channels are network-wide and can be discovered and accessed from anywhere. A camera thread periodically captures frames from the camera sensor and places the frames onto a frame channel. This frame channel now contains the temporal evolution of output produced by capture computation. The detection algorithm in the detector thread discovers the frame channel, connects to the frame channel, gets images from the frame channel, processes the images, and produces blobs corresponding to objects in a frame. The tracking thread takes these blobs and places them in the output channel for the recognition thread. The recognition thread consults a database of known objects and compares the observed objects to known objects in order to detect and report an anomalous situation via an alarm. The sequential program for video analytics is converted into a distributed PTS program that uses the channel abstraction and the get/put primitives available in the PTS programming model. In the figure below, the ovals are threads of the PTS abstraction and the rectangles are channels between the computational threads. Note the use of verb-noun pairs for the thread-channel pairs. Capture Frames, Detect Blobs, Track Objects, Recognize Events.
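The detector stage of this pipeline might be written roughly as below. This is a sketch that reuses the hypothetical channel helpers from the earlier interface sketch; frame_t, blobs_t and detect_blobs() are made-up stand-ins, not PTS or real detector code.

#include <stddef.h>

typedef struct channel channel_t;
typedef unsigned long  timestamp_t;
typedef struct { unsigned char pixels[64]; } frame_t;   /* toy frame type  */
typedef struct { int count; }                blobs_t;   /* toy result type */

/* Channel helpers as sketched earlier, plus a hypothetical detector. */
int  get_newest_data_item(channel_t *ch, void *buf, size_t len, timestamp_t *ts);
void put_data_item(channel_t *ch, const void *item, size_t len, timestamp_t ts);
blobs_t detect_blobs(const frame_t *f);

static void detector_thread(channel_t *frame_ch, channel_t *blob_ch)
{
    for (;;) {
        frame_t frame;
        timestamp_t ts;

        get_newest_data_item(frame_ch, &frame, sizeof frame, &ts);
        blobs_t blobs = detect_blobs(&frame);

        /* Propagate temporal causality: the blobs carry the timestamp of the
         * frame they were derived from before going onto the next channel. */
        put_data_item(blob_ch, &blobs, sizeof blobs, ts);
    }
}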


L10b-Q9. What are the PTS Design Principles? PTS provides simple abstractions and interfaces – the channel abstraction and the get/put interfaces. Similar to UNIX sockets, Channels are network-wide unique, named entities that can be present anywhere in the system and can be discovered and accessed from anywhere. i.e. a thread can discover the channels, connect to the channel and do I/O on the channels using the get/put primitives. The PTS system does all the heavy lifting underneath to keep the interfaces simple. The PTS channels are particularly attractive for situation-awareness applications because: 1. The PTS runtime and APIs provided to manipulate the Channel treat Time as a 1st class entity. i.e. the application queries the runtime system using Time as an index into the Channel. 2. The PTS abstraction allows streams to be persistent under application control. 3. The PTS runtime system and the semantics of the channels allows seamlessly handling of live and historical data by specifying lower and upper time-bounds for data items of interest.


L10b-Q10. How does persistency of the channel help?
* Various data producers perform put() operations on PTS while consumers perform get() operations.
* These put() and get() operations trigger various threads of the PTS runtime system.
* A Garbage Collection (GC) trigger activates GC threads that either clean old data items from the channel OR archive and persist the data items for later analysis.
* Persistence triggers: When data items become old, the Live Channel Layer generates Persistence Triggers to indicate old items that need to be archived/persisted.
* The implementation of the Persistent Channel Architecture (PCA) uses a 3-layer architecture:
1. Live Channel Layer:
* The Live Channel Layer reacts to “new item” triggers from worker threads.
* It holds a snapshot of the data items that have been generated on a particular channel.
* Channel characteristics are configurable at channel creation time, e.g. retain the data of the last 30 seconds.
2. Interaction Layer:
* The Interaction Layer is the glue layer between the Live Channel Layer and the Persistence Layer.
3. Persistence Layer:
* Based on the Persistence triggers from the Live Channel Layer, the Persistence Layer of the channel architecture takes data items from the channel and decides how to persist them. It calls the pickling callback handler functions, specified by the application, before a data item is persisted. An example of a pickling callback handler function is one that creates a digest of the information.
* It supports different backend datastores like a MySQL database, a Unix file system, and IBM’s GPFS.
* Pickling handler functions are called before the persistence of a data item is done.
* get(Lower-bound TimeStamp, Upper-bound TimeStamp) can fetch live items or archived items from the live channel or archival storage respectively.
In short, some unique features of the PTS programming model are:
1. Time-based, distributed data structures for data streams.
2. Automatic data management.
3. Transparent stream persistence.
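The application-supplied pickling handler could have a shape like the following (illustrative only; these notes do not give the real PTS callback signature):

#include <stddef.h>

typedef struct {
    const void *data;   /* digest to be written to the backend store */
    size_t      len;
} pickled_item_t;

/* Called by the Persistence Layer just before a data item is archived to the
 * chosen backend (MySQL, a Unix file system, GPFS, ...); the application can
 * reduce the raw item to a smaller digest here. */
typedef pickled_item_t (*pickling_handler_t)(const void *item, size_t len,
                                             unsigned long timestamp);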


L10b-Q11. Can you summarize various programming models? Here is the list of various programming models: 1. Pthreads for Parallel programs 2. Sockets for Distributed programs 3. Map-Reduce for Distributed, Big-data applications 4. PTS for Distributed, Situation-awareness applications performing Live-Stream Analysis in a real-time manner. Thus, similar to how the MapReduce programming framework provides a simple and intuitive programming model for the domain expert to develop big data applications, the PTS framework provides a simple programming model for the domain expert to develop “live stream analysis” applications, while the PTS run-time handles all the heavy-lifting required to perform the work underneath in a transparent manner.


Illustrated Notes for L11a: Security: Principles of Information Security L11a-Q1. Can you provide some context for Information Security? Various issues of Information Security are so interesting that computer visionaries thought about them even before computers were connected to one another. Cryptography has been studied for thousands of years and it is still a popular, important and practical course of study. Many universities offer an entire master’s degree program on Information Security. More recently, Hellman and Diffie were awarded the 2015 Turing Award for their ground-breaking work on public-key cryptography and for helping make cryptography a legitimate area of academic research [Source: Wikipedia]. Here is some history on Information Security:
1. 1963: Memorandum on the Intergalactic Computer Network.
2. 1969: First computer-to-computer communication, from UCLA to SRI in Menlo Park; Prof. Leonard Kleinrock.
3. First email: Ray Tomlinson @ BBN Technologies (Ray passed away in March 2016).
We will next describe the seminal paper by Jerome Saltzer that identified various information security issues like Denial Of Service (DoS), Firewalls, Sandboxing, etc., in the year 1975!


L11a-Q2. What are the common terminologies related to Information Security? Privacy Vs Security: Both Privacy and Security relate to when to release information. Privacy relates to an individual’s preference as to when to release information. Privacy is the individual’s right and responsibility in terms of the information that they own, i.e. Privacy is an individual-specific function: how the individual’s information is protected and when it is released. Security, in contrast, deals with how to make sure that the system respects the guarantees that the user needs, both in terms of privacy of information as well as when to release information, i.e. Security is a system-specific function: the system guarantees certain properties about the information that it preserves on behalf of the user community. The system needs to provide Authentication and Authorization for the Release, Modification or Denial of information in the system that relates to individuals. Authentication and Protection of information go hand-in-hand in building a secure system. The following is a comprehensive set of additional security concerns regarding Authorization:
1. Unauthorized Release of Information (e.g. your family photos are not released to strangers).
2. Unauthorized Modification of Information (e.g. your family photos cannot be photoshopped/modified by strangers).
3. Unauthorized Denial of Information use (i.e. DoS: Denial Of Service) (e.g. DoS = you cannot access your own family photos).
Preventing all such violations is a goal of a Secure System. However, this is a Negative statement, and a Negative statement is hard to achieve. e.g. There is no way to prove that a non-trivial program has no bugs. Similarly, there is no way to assure that the system prevents all violations, i.e. there is no way to guarantee that bad guys cannot break into a system. Jerome Saltzer, in his seminal paper on Information Security from 1975, argues that the goal of a Secure System should NOT be stated negatively, but positively. If the goal of a Secure System is stated in the negative manner that it has to prevent all violations, then it can give a false sense of security, because that is NOT achievable in practice.


L11a-Q3. What are the various levels of protection possible? The various levels of protection possible are as follows:
1. Unprotected: e.g. MS-DOS had hooks only for mistake prevention, e.g. a program could still accidentally corrupt the memory of another program; but Prevention != Security.
2. All or nothing: e.g. IBM’s VM-370 has the notion of a virtual machine that provides the illusion of personalized resource access even though the resources are shared with other users. This is the all-or-nothing property of the time-sharing systems of the 60s and 70s time-frame.
3. Controlled Sharing: e.g. ACLs (Access Control Lists) associated with files in a file system. The ACLs control which other users have read, write and execute permissions for files owned by a particular user. This is Controlled Sharing.
4. User-programmed Sharing Controls: e.g. the ACLs of a file in a file system for the file owner, Group and Others, i.e. access rights to files for different groups of users (a small chmod() example follows below).
5. Strings on information: e.g. military physical files labeled as top-secret that can be opened only by some privileged set of users.
A Secure System does NOT have these protection levels cast in concrete and needs to deal with the “dynamics” of use of information as the system and user community evolve, e.g. an administrator may deal with information that was not to be shared yesterday but can be shared today, and must decide with which users.
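As a tiny concrete illustration of levels 3 and 4 above, classic UNIX permission bits already express owner / group / others access, a simple form of ACL. The snippet below gives the owner read+write, the group read-only, and others no access.

#include <sys/stat.h>

/* rw-r----- : owner may read/write, group may read, others get no access. */
int restrict_to_group_read(const char *path)
{
    return chmod(path, S_IRUSR | S_IWUSR | S_IRGRP);
}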


L11a-Q4. What are the Design Principles to build a Secure System? Jerome Saltzer identifies 8 design principles that go hand-in-hand with the levels of protection:
1. Economy of Mechanism: The mechanism should be simple enough that it can be verified.
2. Fail-Safe Defaults: Default = No Access; allow access to information only by explicit configuration.
3. Complete Mediation: No shortcuts to Authentication, e.g. no caching of passwords.
4. Open Design: Publish the Design, but protect the Keys used by the Design. The Open Design should make breaking the Keys computationally infeasible. Detect that a violation has happened rather than trying to prevent it; the underlying tenet is that Detection is easier than Prevention.
5. Separation of Privileges: e.g. requiring two keys for a particular bank account vault, where the two keys are held by two separate individuals, so that both individuals have to come together to open the vault.
6. Least Privilege: Use the absolute minimum capability needed in order to carry out a certain task. e.g. a user should need admin privileges to install a package, add a new user, etc. Controls in the system should be based on “need-to-know” – the origin of the idea of Firewalls. Firewalls ensure that individuals within an organization are able to access external information from inside the corporate network only on a need basis, and that inside information is allowed to get out only under authorized conditions.
7. Least Common Mechanism: Limit the amount of damage that a malfunctioning mechanism can do to the system as a whole. Compromised user library code can do less damage than compromised kernel code, hence put the functionality in a user library if possible.
8. Psychological Acceptability: Mechanisms should be easy to use so that users completely understand what they are doing, e.g. a good UI.


L11a-Q5. Any key takeaways that we need to remember? Remember these 2 key takeaways for Information Security:
1. Make cracking the protection boundary computationally infeasible.
2. Build the system to detect violations rather than prevent violations.


Illustrated Notes for L11b: Security in AFS: Andrew File System L11b-Q1. What was the reason for building the distributed Andrew File System in the 1980s? The Andrew File System was a bold, new experiment in the CS department at CMU in 1988. The intent was to enable students across campus to be able to walk up to any workstation on campus and log in to start using their personal files on the central server, in a safe and secure manner, from that workstation. The network is untrusted, but a Private Key Cryptographic infrastructure is used for security and authentication. Local disks on a workstation (WS) served as efficient caches of files downloaded from the central server (S), as shown by WS and S in the figure below. Many of the technologies, like cloud computing and mobile computing, that we take for granted today had their modest beginnings in experiments such as the Andrew file system and the Coda file system at CMU (image credits: CMU). The Coda FS descended from AFS.


L11b-Q2. What is the high-level architecture of the Andrew File System? Note: Though Wikipedia accepts the usage of both Filesystem and File System, we will use File System as two separate words. Client workstations, called Virtues, are connected by insecure network links to a LAN, which is connected to a secure environment that houses the servers, called Vices. Communication inside the secure environment is unencrypted. Communication outside the secure environment (between Virtue and Vice) is encrypted. Venus is a special UNIX process on each Virtue client workstation for: 1. User Authentication, and 2. Client-side caching of files. Secure RPC encrypts all communication over the insecure links between the Virtue client workstations and the Vice file servers.


L11b-Q3. Can you provide some basic information on Encryption? There are 2 families of Encryption systems:
1. Private key cryptosystem: Both sender and receiver use symmetric keys for the encryption and decryption of data, e.g. passwords used for login to a system. Encryption(Data, Key) => Ciphertext => Decryption(Ciphertext, Key) => Data. One of the major problems with a Private key cryptosystem is the Key Distribution problem, especially as the size of the organization becomes larger and larger.
2. Public key cryptosystem: The Public key cryptosystem overcomes the Key Distribution problem. One-way, irreversible functions are the mathematical basis for the Public Key Cryptosystem. The Public key cryptosystem uses a pair of asymmetric keys, called the Public key and the Private key, for information exchange. The Public key of the sender and the receiver is published in a central directory, like the yellow pages, etc. The Private key of the sender is known only to the sender, and the same is the case for the receiver. The Sender encrypts the data using the Public key of the Receiver to get the Ciphertext. This data conversion is one-way and cannot be reversed without the Receiver’s Private key. The Receiver decrypts the ciphertext using its own Private key to get the data back. Remember: Encrypt using the Public key and Decrypt using the Private key:
ClearText --(encrypt with Receiver's Public Key)--> CipherText --(decrypt with Receiver's Private Key)--> ClearText
Jerome Saltzer’s Design Principles for Security:
1. Publish the Design, but Protect the Key.
2. Make breaking the Key computationally hard enough that the system is secure.


L11b-Q4. How does Private Key Encryption work? Two entities A and B have exchanged private keys KA and KB before-hand. A uses private key KB to send data to B. B uses private key KA to send data to A. Both the entities need to know when they get an encrypted message as to who is the author of the message so that they know which key to use to decrypt the message. But the identity of the sender is sent in cleartext to the receiver along with the ciphertext so that the receiver knows which key to use to decrypt the ciphertext. In Private Key Encryption, KA is the same as KB.
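A toy program makes the "same key on both sides" property concrete. This is purely illustrative: XOR-ing with a short key is NOT real encryption, and nothing below is AFS code.

#include <stdio.h>
#include <string.h>

/* Toy "cipher": applying the same key twice gives back the original text,
 * which is exactly the symmetric property of a private-key system. */
static void xor_with_key(char *buf, size_t len, const char *key, size_t klen)
{
    for (size_t i = 0; i < len; i++)
        buf[i] ^= key[i % klen];
}

int main(void)
{
    char msg[] = "hello from A";
    const char key[] = "KB";        /* secret exchanged by A and B beforehand */
    size_t len = strlen(msg);

    xor_with_key(msg, len, key, strlen(key));   /* A encrypts with the shared key */
    xor_with_key(msg, len, key, strlen(key));   /* B decrypts with the same key   */
    printf("%s\n", msg);                        /* prints the original message    */
    return 0;
}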


L11b-Q5. What are the challenges for the Andrew File System? 1. Authenticate the User trying to login to the system. 2. Authenticate the Server to ensure that there is no Trojan horse pretending to be the server. 3. Prevent Replay Attacks: Some man-in-the-middle intruder could sniff the packet, make a copy of the packet, potentially modify it and then resend that packet fooling the receiver into receiving the rogue packet. This is called a Replay Attack and should be prevented. 4. Ensure User Isolation to prevent one user from interfering with another user due to: a. Unintended Interference. b. Malicious Interference. The Andrew File System uses Secure RPC as the basis for Client-Server communication and to implement Secure RPC, Private Key Cryptosystem is used. The Key Distribution Problem is NOT a big challenge for a closed community like the university campus environment and hence the Andrew File System used Private Key Cryptosystem. Jerome Saltzer’s Design Principle for Security: 1. Publish the Design, but Protect the Key. 2. Protecting the key means make breaking the Key computationally hard enough, but do NOT use a pair of identity and private keys for a long time, since overexposing them gives an intruder long time to break the key. This becomes a Security hole. If we change the pair of identity and private keys frequently, then breaking the security by trying various combinations of the private keys in a short time-period becomes infeasible. The dilemma for AFS was what to use as Identity sent in Cleartext and what to use as Private Key so that we do NOT use any pair of Identity and Private Key for a long time – prevent overuse.


L11b-Q6. How are the security problems solved in the Andrew File System? The following techniques are used to solve the security problems in the Andrew File System:
1. The username and password are used only once, for the login session, over the insecure links and are not used later on for any communication. The username and password are communicated securely over the insecure links.
2. An ephemeral id and keys are used for the subsequent secure Venus-Vice communication over the insecure links. These are used for each session, i.e. several times during a login session.
Recall that Venus is the surrogate process that resides on the Virtue workstation, acting as a surrogate for file caching. This gives rise to 3 classes of Client-Server interactions:
1. Login Session using (Username, Password): happens once for the entire login session.
2. RPC Session Establishment using (Ephemeral Id, Key): may happen several times during a login session.
3. File access during the RPC Session using (Ephemeral Id, Key): may happen several times during a login session.
The RPC Session is closed once the remote file is downloaded and cached locally. If the file needs to be accessed again, or another file needs to be accessed, then a new RPC Session is established.


L11b-Q7. How does the Login procedure work? Here are the steps for the Login procedure (this may seem a little complicated, but try going over it multiple times and drawing diagrams):
1. The user walks up to the Virtue Workstation and performs a login from the Virtue Workstation by entering the user’s username and password.
2. The username and password are communicated over insecure links, in a secure manner, from the Virtue Workstation to the Login Server Process on the Vice Server.
3. After the Login Server Process performs authentication using the Auth Server Process, the Login Server Process sends a pair of tokens back to the Virtue Workstation, in a secure manner, over the insecure links. The pair of tokens is <ClearToken, SecretToken>. The ClearToken is a data structure that contains a Handshake Key, called HKC. The Login Server Process encrypts the ClearToken using a Private Key, PK1, known only to the Vice Server, to produce the SecretToken. This Private Key, PK1, is different from the Handshake Key, HKC. The SecretToken is unique for the Login session.
4. The Login Client Process on the Virtue Workstation receives the message from the Login Server Process, decrypts the message, and extracts the pair <ClearToken, SecretToken> from the message.
5. The Login Client Process then extracts the Handshake Key, called HKC, from the ClearToken. The SecretToken is just a bit string generated by the Login Server Process. This SecretToken is used as an Ephemeral ClientID for the login session, which helps prevent overexposure of the username and password on the insecure link.
6. Remember that the SecretToken is an encryption of the ClearToken, and that the Private Key, PK1, used to decrypt this SecretToken is known only to Vice. Hence, when the SecretToken is used as an Ephemeral ClientID, Vice can extract the corresponding ClearToken from the SecretToken by decrypting it, and then extract the Handshake Key, HKC, from the ClearToken. The SecretToken represents the encrypted, ephemeral ClientID and remains unique for each Login session. The Handshake Key, HKC, represents the actual client identity and is used by Venus as another private key, PK2, for establishing a new RPC session with Vice (discussed later).
7. The pair of tokens is stored on the Virtue Workstation by the Venus process on behalf of the user for the entire Login Session. Once the Login Session ends, this pair of tokens associated with the user is thrown away by Venus.
8. To summarize, Venus uses the SecretToken as the ephemeral ClientID for the duration of the Login Session to send information to Vice, and Venus establishes a new RPC session using the Handshake Key, HKC, from the ClearToken.
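A hypothetical C rendering of the two tokens may help keep their roles straight (the real AFS structures are richer; this just captures what the notes describe):

/* ClearToken: usable by the client; carries the Handshake Key HKC. */
struct clear_token {
    unsigned char handshake_key[8];   /* HKC, used to bind new RPC sessions     */
    unsigned long valid_until;        /* lifetime == the current Login Session  */
};

/* SecretToken: the ClearToken encrypted with PK1, a key known only to Vice,
 * so to everyone else it is just an opaque bit string that can safely serve
 * as the ephemeral ClientID on insecure links. */
struct secret_token {
    unsigned char opaque[32];
};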


L11b-Q8. So, how does RPC session establishment happen after Login session establishment? After the Login session establishment, the RPC session establishment happens using the Bind mechanism, which is explained below:
1. The Venus Client process sends a message to the Server in order to establish a new RPC session. The message contains <SecretToken (used as the Ephemeral ClientID), ClientCipher = E[Xr, HKC]>, where Xr is a Random Sequence Number generated for each new RPC session and encrypted using HKC. Recall: HKC was extracted from the ClearToken sent back by the Server during the Login Session; HKC is the key used to encrypt Xr. E[Data, Key] is an Encryption function that encrypts Data using Key, here generating the ClientCipher. A Cipher is an encrypted message in security parlance. Recall: the Venus process on the Virtue Client Workstation sends this message to the Vice Server.
2. Recall: the Ephemeral ClientID = SecretToken is an encryption of the ClearToken using a Private Key known only to the Server. So, the Server uses its Private Key to decrypt the Ephemeral ClientID = SecretToken portion of the message received from the client to get the ClearToken, from which the Server gets the HKC key. The Server then uses the HKC key to decrypt the Cipher portion of the message to get Xr.
3. Next, the Server increments Xr, generates a new Random Sequence Number Yr, and encrypts both (Xr+1) and Yr using another key called HKS to generate the ServerCipher. By design, HKC and HKS are exactly the same, i.e. the Server uses HKS to be the same as HKC.
4. The procedure of the Server incrementing Xr avoids a Replay Attack, because only the Server and nobody else will be able to extract Xr from the encrypted token, increment Xr and then re-package (Xr+1) as an encrypted token to be sent back to the Client. This increment procedure by the Server establishes that the Server is genuine. Note that anybody else (aka a Man-in-the-Middle) can capture a packet in transit and fake being a Server, but only a Server that can successfully increment Xr is a genuine Server and NOT a Trojan Horse. A fake Server is also called a Trojan Horse in security parlance.
5. Similarly, the procedure of the Client incrementing Yr avoids a Replay Attack, because only the Client and nobody else will be able to extract Yr from the encrypted token, increment Yr and then re-package (Yr+1) as an encrypted token to be sent back to the Server. This increment procedure by the Client establishes that the Client is genuine. Note that anybody else (aka a Man-in-the-Middle) can capture a packet in transit and fake being a Client, but only a Client that can successfully increment Yr is a genuine Client and NOT a Trojan Horse. A fake Client is also called a Trojan Horse in security parlance.
(Points 4 and 5 are exactly the same except for the interchange of Client-Server and Xr-Yr.)
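The challenge/response at the heart of this bind can be demonstrated with a toy program (illustrative only: XOR is not real encryption and this is not the AFS RPC code). The point is that only a party holding the shared handshake key can hand back Xr + 1.

#include <stdio.h>

static unsigned toy_encrypt(unsigned value, unsigned key) { return value ^ key; }
static unsigned toy_decrypt(unsigned cipher, unsigned key) { return cipher ^ key; }

int main(void)
{
    unsigned hkc = 0xC0FFEE;        /* handshake key held by client and server  */
    unsigned xr  = 41;              /* client's random challenge for this bind  */

    unsigned client_cipher = toy_encrypt(xr, hkc);        /* client -> server   */

    /* Genuine server: holds HKS == HKC, so it can answer with xr + 1. */
    unsigned server_cipher = toy_encrypt(toy_decrypt(client_cipher, hkc) + 1, hkc);

    /* Trojan-horse server: guesses a key, so it cannot answer correctly. */
    unsigned trojan_cipher = toy_encrypt(toy_decrypt(client_cipher, 0xBAD) + 1, 0xBAD);

    printf("genuine server ok? %d\n", toy_decrypt(server_cipher, hkc) == xr + 1);  /* 1 */
    printf("trojan  server ok? %d\n", toy_decrypt(trojan_cipher, hkc) == xr + 1);  /* 0 */
    return 0;
}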


L11b-Q9. How does RPC Session Establishment avoid over-exposure of the ID/Password over insecure links? Notice the last arrow from the Server to the Client in the figure below. Remember that the username and password are used exactly once, for the Login Session. In general, the Venus Client process will have multiple RPC Sessions over one Login Session. And the Venus Client process will have multiple file system calls within one RPC Session. For all of those file system calls, we want to avoid overexposing the Handshake Key, HKC/HKS. Recall that by design, HKS is the same as HKC. Note that the Handshake Key, HKC, is used exactly once, for establishing an RPC Session. To avoid overexposure of this Handshake Key, HKC, the Server generates an RPC Session Key, SK. The Server also generates a new Random Sequence Number for the RPC Session, and uses HKS to encrypt SK and Sr (shown as num below) to generate the SessionCipher to be sent to the Client. This Session Key, SK, is used by the Venus Client Process for each secure RPC call. The Sequence Number Sr establishes that the Client is Genuine and NOT a Trojan Horse and prevents a Replay Attack against the Server. Also, recall that: the Sequence Number Xr establishes that the Server is Genuine and NOT a Trojan Horse and prevents a Replay Attack against the Client; the Sequence Number Yr establishes that the Client is Genuine and NOT a Trojan Horse and prevents a Replay Attack against the Server. Thus, once the genuineness of the Client and Server is established, the Server says that for a particular RPC Session, the HKC/HKS Handshake Key is NOT used anymore; instead, a new Session Key, SK, along with a Sequence Number, Sr, is used for all the RPC calls during that RPC Session. Thus, over-exposure of the Handshake Key, HKC/HKS, is avoided over insecure links.

Page 192 of 197

L11b-Q10. Can you summarize how a Client or Server is determined to be Genuine?
One way to remember the genuineness check that prevents a replay attack is: whoever increments the sequence number is established to be genuine, and this prevents a replay attack against its peer. That is, if the Server increments the sequence number Xr, it is established to be genuine, since only the Server, and NOT a Trojan horse, can increment Xr securely; this prevents a replay attack against the Client. Similarly, if the Client increments the sequence number Yr, it is established to be genuine, since only the Client, and NOT a Trojan horse, can increment Yr securely; this prevents a replay attack against the Server.

Page 193 of 197

L11b-Q11. How is the Login Session Establishment a special case of the Bind Mechanism?
The Login Session Establishment is a special case of the Bind mechanism in the following sense (a toy sketch follows this list):
1. The Username and Password are used as the ClientID and the Handshake Key, respectively.
2. After validation, the Server returns to the Client a message containing the two tokens: the SecretToken and the ClearToken.
3. These 2 tokens are encrypted using the Password as the Handshake Key.
4. The Login process on the Virtue client uses the Password to decrypt the message and obtain the 2 tokens.
5. These 2 tokens are kept by the Venus process for the duration of the Login session.
6. The ClearToken contains the Handshake Key, HKC, needed by the Venus process for establishing an RPC session.
This is how the Login Session Establishment is a special case of the Bind mechanism.
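As a rough illustration of this special case, the toy sketch below has the server return the two tokens encrypted under the password, which plays the role of the handshake key. The names, keys, and token fields are invented and much simplified; they are not the real AFS token format.
```python
# Toy sketch of login as a special case of bind: the password acts as the
# handshake key, and the server's reply carries the two tokens.
import json

def E(data, key): return {"key": key, "data": data}     # placeholder cipher
def D(cipher, key):
    assert cipher["key"] == key, "wrong key"
    return cipher["data"]

# Client login process -> Server: <Username, Password> act as <ClientID, key>.
username, password = "alice", "correct-horse-battery-staple"

# Server: after validating the password, mint the two tokens.
HKC = "freshly-minted-handshake-key"
clear_token  = {"client": username, "HKC": HKC}                   # readable by the client
secret_token = E(json.dumps(clear_token), "server-private-key")   # opaque to the client
reply = E((secret_token, clear_token), password)                  # encrypted with the password

# Client login process: decrypt with the password; Venus keeps both tokens.
got_secret, got_clear = D(reply, password)
venus_state = {"secret_token": got_secret, "HKC": got_clear["HKC"]}
print(venus_state["HKC"])   # used later as the handshake key for the bind step
```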

Page 194 of 197

L11b-Q12. How do we put these concepts together and compare them to the taxonomies proposed by Jerome Saltzer in his seminal paper?
The Andrew File System enables authorized users to log in remotely from Virtue client workstations and securely access their data files on Vice file servers over insecure links, using a private-key (symmetric) cryptosystem. The secure communication over insecure links happens using 3 classes of client-server interactions (a sketch of a per-call secure RPC follows this list):
1. Login Session Establishment: the Virtue client sends the Username and Password to the Vice server. The Vice server returns the SecretToken and the ClearToken, encrypted with the Password, to the Virtue client. Note: the Password is exposed only once per login session. The Handshake Key set up as part of the Login session is used only for establishing new RPC sessions, and the Handshake Key is valid for the duration of the Login session.
2. RPC Session Establishment: the Venus process on the Virtue client establishes an RPC session by sending the SecretToken (as the ClientID) together with E[Xr, HKC]. After the RPC session is established, the Vice server sends a Session Key, SK, to the Venus process. Note: the Handshake Key, HKC, is valid for the duration of the Login session, while the Session Key, SK, is valid for the duration of the RPC session.
3. RPC calls on an RPC Session to access files on the central server: for each open, read, write, and close of files by the user on the Virtue client workstation, the Venus process performs a secure RPC call with the message encrypted under SK. The private key (= SK) is used by Venus to encrypt the message and by Vice to decrypt it.
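To make class 3 concrete, here is a toy sketch of per-call secure RPC with the session key. The operation names, the file path, and the detail that the sequence number increases by one per call are my assumptions for illustration; the notes only say that SK and Sr protect the calls.
```python
# Toy sketch of class-3 interactions: each file-system RPC is encrypted with
# the per-session key SK (a shared symmetric key) plus a running sequence
# number, so neither the password nor the handshake key reappears on the wire.
def E(data, key): return {"key": key, "data": data}     # placeholder cipher
def D(cipher, key):
    assert cipher["key"] == key, "wrong key"
    return cipher["data"]

SK, seq = "per-session-key", 1000         # agreed during RPC session setup

def secure_rpc(op, *args):
    """Venus side: wrap one file-system call for transmission to Vice."""
    global seq
    seq += 1                              # monotonically increasing (assumed) for anti-replay
    return E({"op": op, "args": args, "seq": seq}, SK)

def vice_handle(cipher, last_seen_seq):
    """Vice side: decrypt with the shared SK and reject replayed requests."""
    req = D(cipher, SK)
    assert req["seq"] > last_seen_seq, "replayed RPC!"
    return req, req["seq"]

state = 1000
for op, args in [("open", ("/afs/alice/notes.txt", "r")),
                 ("read", (4096,)),
                 ("close", ())]:
    req, state = vice_handle(secure_rpc(op, *args), state)
    print(req["op"], req["args"])
```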

Page 195 of 197

L11b-Q13. Summarize the AFS Security Report Card.
1. Mutual suspicion: Yes.
2. Protection from the system for users: No. The user has no choice but to trust the system; if the system misbehaves, the user has no recourse.
3. Confinement of resource usage: No. If a user misbehaves and uses more than its fair share of resources, it can cause network-bandwidth problems, leading to denial-of-service attacks.
4. Authentication: Yes.
5. Server integrity: No? The servers are assumed to be in a secure environment, but that secure environment itself could be a source of vulnerabilities due to physical or social attacks.

Page 196 of 197

L11b-Q14. What are the other security features that a file system could have?
A filesystem could have additional features like the following:
1. Extend the privilege levels provided by the file-system semantics to include groups and sub-groups of users.
2. Control access to files with both positive (+ve) and negative (-ve) rights; negative rights are useful for quick revocation of access (see the sketch after this list).
3. Support an audit trail of system administrators' modifications to the file system.
An important insight gained from Jerome Saltzer's seminal paper on information security is to benchmark a solution against the information-security design principles laid out in that paper and to know the vulnerabilities that remain in the solution. Being aware of these vulnerabilities is important in order to safeguard the system against attackers.
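Expanding on point 2, the following toy sketch shows why negative rights make revocation quick: one matching negative entry overrides all positive grants. The users, groups, and rights are invented for the example and do not reflect AFS's actual access-list format.
```python
# Toy illustration of positive and negative access rights: a matching negative
# right overrides any positive right, so revoking access means adding a single
# negative entry instead of editing every group that grants access.
ACL = {
    "/afs/project/design.doc": {
        "positive": {"group:designers": {"read", "write"},
                     "user:bob":        {"read"}},
        "negative": {"user:mallory":    {"read", "write"}},   # quick revocation
    }
}
GROUPS = {"group:designers": {"user:alice", "user:mallory"}}

def principals_for(user):
    """The user plus every group the user belongs to."""
    return {user} | {g for g, members in GROUPS.items() if user in members}

def allowed(user, path, right):
    entry = ACL[path]
    who = principals_for(user)
    denied  = any(right in rights for p, rights in entry["negative"].items() if p in who)
    granted = any(right in rights for p, rights in entry["positive"].items() if p in who)
    return granted and not denied

print(allowed("user:alice",   "/afs/project/design.doc", "write"))  # True
print(allowed("user:mallory", "/afs/project/design.doc", "read"))   # False: negative right wins
```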

Conclusion
Thank you for reading the Illustrated Notes for the Advanced Operating Systems course. I have spent a considerable amount of time and effort preparing these illustrated notes, and I hope they turn out to be useful to at least some of the AOS students. This is a work in progress, so I would appreciate any feedback on these notes that helps them serve their purpose: helping students master the AOS material efficiently. Please complete a short survey by clicking here. New versions of this document will be posted at the following location (you can subscribe for automated updates): https://www.researchgate.net/profile/Bhavin_Thaker3/contributions
Cheers,
Bhavin Thaker: [email protected]
End-of-File.
Page 197 of 197