Experience of Adaptive Replication in Distributed File Systems

Copyright 1996 IEEE. Published in the Proceedings of EUROMICRO '96, September 1996 at Praha, Czech Republic. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 732-562-3966.

Giacomo Cabri, Antonio Corradi
Dipartimento di Elettronica Informatica e Sistemistica - Università di Bologna
2, Viale Risorgimento - 40136 Bologna - ITALY

Franco Zambonelli
Dipartimento di Scienze dell’Ingegneria - Università di Modena
213/b, Via Campi - 41100 Modena - ITALY

E-mail: {gcabri, acorradi, fzambonelli}@deis.unibo.it

Abstract

Replication is a key strategy for improving locality, fault tolerance and availability in distributed systems. The paper focuses on distributed file systems and presents a system that transparently manages file replication across a network of workstations sharing the same distributed file system. The system integrates an adaptive file replication policy that reacts to changes in the patterns of access to the file system by dynamically creating or deleting replicas. The paper evaluates the efficiency of the system in several situations and shows its effectiveness.

1. Introduction

Distributed operating systems are becoming increasingly widespread because they can provide a uniform view of a (possibly heterogeneous) set of computing resources [Dis93]. This allows users to ignore distribution-related issues, such as the allocation of and access to remote resources, which are transparently managed by the system itself. The diffusion of distributed operating systems also stems from other advantages:
• high performance: applications can be distributed across the system, which is exploited as a parallel computer [HooM95];
• fault tolerance: a single local failure need not compromise the whole system [Bir85].
Among the resources of which a distributed system can provide a uniform view, the file system is an important one: dispersed users and applications can access a unique global file system, transparently distributed over the physical computing resources of the system. In distributed file systems, a variety of mechanisms can be used to improve performance and fault tolerance: file migration grants locality of access by moving files to the sites from which they are accessed [Hac89, GavL90]; replication provides multiple copies of a file, granting both locality and better availability [Bal92, Lom93]. However, giving users transparent access to the file system, without involving them in low-level allocation issues, requires automated policies that decide whether and when to apply these mechanisms.

The paper focuses on this issue and presents a system for dynamic file replication in distributed file systems. The system is based on a distributed implementation that limits the degree of coordination needed among its components. This grants scalability and minimises the overhead the policy imposes on applications. The replication policy integrated in the system is based on an adaptive scheme: the replication degree of a file is not statically decided but is tuned to the current system status, i.e., to the current pattern of access to the file system from the application level. The system has been implemented for a network of heterogeneous workstations on top of standard commercial software, i.e., UNIX™ and NFS™ [Sun90]. This allowed us to explore the possibility of enhancing the performance of existing systems without any specialised hardware or operating system support. The behaviour of the system has been tested under a variety of traffic situations, from static and quasi-static ones (i.e., with patterns of access to the file system that do not change substantially over time) to very dynamic ones. In both cases the policy is able to increase the throughput of the file system: in the former by imposing a very low overhead; in the latter, by adapting the replication degree of files so as to minimise the costs associated with replica management.


The paper is organised as follows. Section 2 introduces the general issues connected with the definition and implementation of a replication policy for distributed file systems. Section 3 describes the implemented system and Section 4 evaluates its performance.

2. The Replication Problem

This section presents the concept of replication and its role in distributed operating systems and, in particular, in distributed file systems. In addition, it sketches the problems related to the implementation of a policy for the management of replicated files.

2.1 General concepts

Replication is the technique that associates a single logical entity with several physical copies within a system. The replication degree of an entity is the number of its replicas currently available in the system. Replication may achieve several goals:
• high performance: by replicating an entity at the site where it is most frequently referenced, one can achieve better performance and minimise access time. In addition, replication can improve the overall system performance by decreasing the communication load imposed on the network [Hac89, Bal92, HooM95];
• greater availability: if different physical copies of an entity are distributed in the system, a replicated resource is more available than a single-copy one [Lad92];
• fault tolerance: a failure of one physical replica of an entity does not compromise access to the logical entity, since other replicas exist in the system [Bir85, HuaT93].
With regard to the relationship between a logical entity and its physical replicas, two main schemes are possible: active and passive. In an active (or flat) replication scheme all replicas are at the same level (and are accessed without any preference), so any distinction between the original entity and its copies is lost. A passive (or hierarchical) replication scheme identifies a primary copy and a set of backup replicas (possibly organised in a multilevel hierarchy), to be used when the primary is not available. An active replication scheme is oriented mainly towards high performance, since it allows locality of reference, whereas a passive scheme is oriented mainly towards fault tolerance. Despite the advantages described above, the presence of replicated entities within a distributed system introduces several additional overheads:

• every physical replica consumes system resources; thus, the higher the replication degree, the more resources are used, proportionally;
• coordination is needed among the replicas to maintain their consistency: if the state of a physical replica changes, all other replicas of the same logical entity (or, at least, a quorum of them) must be made aware of the change in order to keep the entity coherent. This obviously implies overhead and network traffic.
Because of these costs, the performance improvements granted by replicated entities may even be outweighed by the overhead they introduce. A further requirement is the transparency of replication: users should not be involved in low-level details about allocation and replication. A system-defined replication policy must exploit the replication mechanism effectively without user involvement, i.e., it must automatically decide when and where an entity has to be replicated in the system.
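To make the difference between the two schemes concrete, the following Python sketch shows how a read request could choose the replica to access. It is illustrative only: host names, the `alive` set and the function names are hypothetical and are not part of the system described in this paper. Under the active scheme any live replica may serve the request, a local one being preferred for locality; under the passive scheme the primary is always used, and backups only when it is unavailable.

```python
from typing import Iterable, List, Optional, Set


def first_alive(hosts: Iterable[str], alive: Set[str]) -> Optional[str]:
    """Return the first host in 'hosts' that is currently reachable, if any."""
    for h in hosts:
        if h in alive:
            return h
    return None


def pick_active(replicas: List[str], local: str, alive: Set[str]) -> Optional[str]:
    # Active (flat) scheme: all replicas are peers; prefer the local one.
    if local in replicas and local in alive:
        return local
    return first_alive(replicas, alive)


def pick_passive(primary: str, backups: List[str], alive: Set[str]) -> Optional[str]:
    # Passive (primary-backup) scheme: backups are used only if the primary is down.
    if primary in alive:
        return primary
    return first_alive(backups, alive)
```

With replicas on hosts A, B and C and a request issued on C, the active scheme serves it from C, while the passive scheme still goes to the primary A as long as A is reachable.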

2.2 Replication Policies

A replication policy is the decision module in charge of managing the replication of logical entities within a system. In particular, a replication policy has to decide [Lom93]:
• the replication degree of a given entity, i.e., the number of physical replicas to be provided at a given time for a given logical entity of the system;
• the allocation of the replicas within the system, i.e., the nodes of the system in which the replicas must be physically stored.
In addition, a replication policy is in charge of issuing the protocols that keep the replicas consistent, whenever a change in the state of one of them makes it necessary [Bal92]. Depending on when decisions are taken, we can distinguish between static and dynamic replication policies:
• a static replication policy decides the replication of an entity when it enters the system, i.e., at its creation. In this case, both the replication degree and the allocation are decided at entity creation time;
• a dynamic replication policy, instead, takes its decisions during the life of the entity in the system, after it has been created. For instance, both the replication degree and the allocation of the replicas are likely to change over time.
We emphasise that the distinction between static and dynamic replication policies is not always so sharp: several parameters may influence the decisions of a given policy; some of them may change dynamically while others may be fixed. For example, a policy can behave statically with regard to the replication degree and dynamically with regard to the allocation of the replicas. In this case, a replica is only allowed to migrate from one node of the system to another during its life, and no additional replica can be generated nor any existing one deleted. As a general rule, a static replication policy takes its decisions independently of the current state of the system: once a decision has been taken it is maintained permanently, even if the conditions that caused it have substantially changed. A dynamic policy, instead, bases its decisions on the current state of the system, adapting the replication degree and the allocation of the replicas to the configuration that best fits the current situation. In the rest of the paper we mainly refer to dynamic replication policies. With regard to the implementation of a replication policy within a distributed system, two main solutions arise:
• a centralised replication policy, based on a single module, located on one node of the system and in charge of all replica decisions for the whole system;
• a distributed replication policy, where multiple decision modules are distributed in the system, typically one module per node.
A centralised solution is the simplest to implement: it can take its decisions on the basis of a global view of the state of the system, making them near-optimal. However, a centralised policy is likely to become a bottleneck in large distributed systems, and it is less resilient. A distributed solution, instead, is better suited to large systems, since it does not require any centralisation point. However, some form of coordination may be required among the different decision modules in order to agree on replication decisions.
At one extreme, a decision can influence the whole system and require global coordination among all decision modules. At the other extreme, a decision can require only limited, local coordination, or be taken autonomously by a single decision module. In general, a solution that requires very limited coordination among decision components is preferable, since it is likely to impose less overhead on the system than a globally coordinated one.

2.3 Replication Policies for Distributed File Systems

When dealing with distributed file systems, file replication can be used to store multiple copies of the same file (or of parts of the same file [HarO95]) on different physical supports [Hac89]. Depending on the goals of file replication, one can choose between passive and active file replication techniques [Dis93]: the former is mainly oriented to fault tolerance, the latter can also provide better availability and performance. A passive file replication technique, widely known as primary backup, identifies a primary copy of the file, which is normally accessed from anywhere in the system. Replicas of the same file are created for fault recovery and are accessed only when the primary copy is not available. Replicas are only periodically updated, which makes this technique simple and minimally intrusive. However, the technique does not provide any advantage from the performance point of view. An alternative replication technique, known as caching, is based on an active scheme: all replicas are at the same level and any of them can be accessed. The choice of which replica to access can be based on a locality principle: for example, a local replica of a remote file can be created to grant local availability of the file. Though more powerful and capable of granting both locality and fault tolerance, caching techniques tend to be more expensive to implement: since the system must handle different copies of the file on different sites, which can be independently modified, the replication policy must grant the consistency of all replicas. As a general rule, read operations on files do not cause any problem and can be performed concurrently. Write operations on a replica, instead, introduce inconsistencies and make it necessary first to grant exclusive access to the file and then to update all other replicas.
Different algorithms can be devised and implemented to grant different degrees of consistency at different costs: though the analysis of these algorithms is beyond the scope of the paper, we emphasise that the higher the degree of consistency an algorithm grants, the higher its implementation cost.
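The consistency requirement described above (concurrent reads, exclusive writes followed by an update of every replica) can be sketched in Python with a readers-writer lock. This is a single-process model under stated assumptions: the in-memory `replicas` dictionary stands in for copies held on different hosts, and the hypothetical `RWLock` stands in for whatever distributed locking protocol a real system would use. It is not the algorithm of any particular system, and this simple lock lets a continuous stream of readers starve waiting writers.

```python
import threading
from typing import Dict, Iterable


class RWLock:
    """Many concurrent readers or one exclusive writer (readers may starve writers)."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self) -> None:
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self) -> None:
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self) -> None:
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def release_write(self) -> None:
        with self._cond:
            self._writer = False
            self._cond.notify_all()


class ReplicatedFile:
    """Toy caching scheme: any replica serves reads; a write is exclusive
    and updates all replicas before the lock is released."""

    def __init__(self, hosts: Iterable[str], data: bytes = b"") -> None:
        self.replicas: Dict[str, bytearray] = {h: bytearray(data) for h in hosts}
        self._lock = RWLock()

    def read(self, host: str) -> bytes:
        self._lock.acquire_read()              # reads may run concurrently
        try:
            return bytes(self.replicas[host])
        finally:
            self._lock.release_read()

    def write(self, offset: int, payload: bytes) -> None:
        self._lock.acquire_write()             # 1. exclusive access to the logical file
        try:
            end = offset + len(payload)
            for buf in self.replicas.values():  # 2. update every replica
                if len(buf) < end:
                    buf.extend(bytes(end - len(buf)))  # zero-fill any gap
                buf[offset:end] = payload
        finally:
            self._lock.release_write()
```

Holding the write lock until every replica is updated is what makes the scheme consistent, and it is also where its cost lies: the longer the replica list, the longer readers are blocked.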

3. The System for Adaptive File Replication

This section describes the file replication system implemented on a network of UNIX™-based workstations, with throughput and fault tolerance as its main goals. The target environment of the system is a transparent distributed file system [LevS90]: from every site, users see the same directory and file structure, independently of the physical location of each part of the file system, which may reside on different disks.


The system is based on an active replication scheme: all replicas of a file are at the same level. Caching is exploited to improve file availability: a local replica of a file can be provided on the nodes that access it intensively. In addition, replication is managed dynamically, depending on resource usage and respecting the locality principle. The file replication policy is adaptive, i.e., it is not based on a fixed replication scheme: new replicas can be created and/or deleted depending on the pattern of access. At one extreme, the policy can decide to replicate the file fully; at the other, it can decide not to replicate the file at all (section 3.2 presents the criteria the policy uses to decide the creation or deletion of a replica). However, to grant fault tolerance, the system provides, similarly to backup techniques, the capability of storing multiple copies of a file regardless of whether any client needs to access them locally. Thus, even if only a single process accesses a file, faults on it can still be recovered. From the user point of view, replication is managed transparently by the system. A library of primitives, semantically equivalent to the standard UNIX™ ones, lets programs deal with files without any knowledge of replication. Only one additional file creation primitive has been introduced, which makes it possible to define the replication degree of a file and the initial allocation of its replicas.

3.1 The System Architecture

The implementation of the system is based on the multiple active agents model [Weg95], which leads to a distributed management: one agent, called the file server, runs on every node of the system. All requests to access the file system coming from application processes are redirected to the local file server. The local server provides access to the local physical replicas, if present; otherwise it redirects the access request to the other file servers. Coordination is needed among file servers in order to provide remote access to files and to issue the consistency protocols.

3.1.1 Structure of the Client

In more detail (see figure 1), the client works as follows: when the user program calls a primitive to handle a replicated file, a “stub” translates the standard UNIX call into a special call to the local server. When the server sends an answer, the same stub translates it into a “program friendly” answer. To limit the duties of the servers, each server is stateless, i.e., it does not keep track of the files in use by clients [LevS90]. Thus, the stub of each client has to keep track of its opened files. This state consists of the following items: the name of the file, the kind of access (read-only, write-only, read-and-write), and the current position of the file pointer. Integrating a stub in each client aims at improving efficiency and concurrency, by allowing each process to access the server directly without any kernel involvement. However, this solution implies a larger memory occupation for each process: information such as the server location, the types of primitives, and the client-server protocol interface must be duplicated within each client.

Figure 1. Structure of the client (blocks: User Program, System Calls Interface, Stub, Local Server).

3.1.2 Structure of the Server

As already stated, the implemented server is stateless and does not keep track of any file opened by clients. Every operation is self-contained, i.e., when the server receives a request, it opens the specific file, performs the operation and then closes the file. The major advantage of this solution is to reduce the computational load of the server, by distributing the control of files among clients; moreover, it avoids situations in which the server cannot reach a client and does not know whether and how to maintain the state of its files. The well-known drawback of this approach is that locking cannot be implemented at a level higher than that of single write instructions. With regard to the implementation, the server is modular and structured in three layers (see figure 2):
• the upper layer (interface layer) defines the interface to the clients and is in charge of receiving requests from clients and answering them;
• the intermediate layer (coordination layer) serves the clients' requests by coordinating with other servers and by commanding the mechanisms for accessing local replicas;


• the lower layer (replication policy layer) implements the replication policy by exploiting information on the local replicas.
Whenever a client requests an operation on a file, the coordination layer checks whether the file is locally replicated. If a local replica is present, it can be accessed directly via the local mechanisms, i.e., read and write operations. Otherwise, the request is broadcast (via a broadcast RPC call [BirN84]) to all other servers and the server waits for one answer. In addition, when a write operation is invoked on a local replica, the coordination layer starts the integrated consistency protocol before accessing the replica. Since a server does not keep track of which servers hold replicas of a given file, servers coordinate with each other via broadcast messages, which ensures that every interested server can take part in the consistency protocol. However, every server knows how many replicas of the file are present in the system (as explained in section 3.2), which enables it to detect whether all interested servers have correctly participated in the protocol. In the current implementation, the integrated consistency protocol is based on write-through: when a process needs to write to a file, it contacts all other servers that hold a replica of the same file. When at least half plus one of the contacted servers notify that no other process is currently writing to the same file, the client is allowed to write to the file, and the change is immediately notified to all other replicas. In any case, the modular structure of the server and, in particular, of the coordination layer makes it easy to change the integrated consistency protocol without side effects on other parts of the server.
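The write admission rule of the coordination layer can be sketched as follows. This is an illustrative Python model, not the system's code: the callables `nobody_writing` and `propagate` are hypothetical stand-ins for the broadcast RPC replies and the write-through update, and granting the write immediately when no other server holds a replica is our assumption, since the text does not cover that case.

```python
from typing import Callable, List


def majority(contacted: int) -> int:
    """'At least half plus one' of the contacted servers."""
    return contacted // 2 + 1


def try_write(holders: List[str],
              nobody_writing: Callable[[str], bool],
              propagate: Callable[[str], None]) -> bool:
    """Grant a write if a majority of the other replica holders report
    no concurrent writer, then push the change to every one of them."""
    if not holders:
        return True          # assumption: no other replica, nothing to coordinate
    grants = sum(1 for s in holders if nobody_writing(s))
    if grants < majority(len(holders)):
        return False         # refused: another writer may be active
    for s in holders:        # write-through: the change reaches all other replicas
        propagate(s)
    return True
```

For instance, with three other replica holders at least two favourable replies are required; two refusals out of three block the write and nothing is propagated.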


Figure 2. Structure of the server (blocks: Interface Layer, Coordination Layer, Replication Policies Layer, Access Mechanism (read, write, ...), Consistency Protocols, Other Servers).
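The split of per-file state between the client stub (section 3.1.1) and the stateless server (section 3.1.2) can be sketched in Python as below. It is a minimal model under explicit assumptions: `server_read` and `server_write` are hypothetical local stand-ins for requests to the local file server, the mode strings "r", "w" and "rw" are an illustrative encoding of the three kinds of access, and the forwarding of requests to remote servers via broadcast RPC is omitted. Every server call is self-contained (open, perform, close) and carries the file position, which only the stub remembers.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Dict


def server_read(path: str, offset: int, length: int) -> bytes:
    """Self-contained request: open, seek, read, close. No state survives."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def server_write(path: str, offset: int, data: bytes) -> int:
    """Self-contained request: open, seek, write, close. No state survives."""
    with open(path, "r+b") as f:
        f.seek(offset)
        return f.write(data)


@dataclass
class OpenFile:
    name: str           # name of the file
    mode: str           # kind of access: "r", "w" or "rw" (illustrative encoding)
    position: int = 0   # current position of the file pointer


class ClientStub:
    """Keeps the per-process state that the stateless server does not keep."""

    def __init__(self) -> None:
        self._table: Dict[int, OpenFile] = {}
        self._next_fd = 0

    def open(self, name: str, mode: str) -> int:
        fd = self._next_fd
        self._next_fd += 1
        self._table[fd] = OpenFile(name, mode)   # in this sketch, state lives only here
        return fd

    def read(self, fd: int, length: int) -> bytes:
        of = self._table[fd]
        if "r" not in of.mode:
            raise PermissionError("file not open for reading")
        data = server_read(of.name, of.position, length)  # offset travels with the request
        of.position += len(data)
        return data

    def write(self, fd: int, data: bytes) -> int:
        of = self._table[fd]
        if "w" not in of.mode:
            raise PermissionError("file not open for writing")
        written = server_write(of.name, of.position, data)
        of.position += written
        return written

    def close(self, fd: int) -> None:
        del self._table[fd]   # nothing to tell the server: it holds no state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "f.txt")
        with open(path, "wb") as f:
            f.write(b"hello world")
        stub = ClientStub()
        fd = stub.open(path, "r")
        print(stub.read(fd, 5), stub.read(fd, 6))   # b'hello' b' world'
        stub.close(fd)
```

Because the position is kept by the stub, a server that restarts between two requests loses nothing, which is the robustness argument made in section 3.1.2; the price is that the stub, not the server, must be trusted with the file pointer.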

The third layer (replication policy layer) implements the replication policy for replica management. In particular, this layer decides whether to create a new replica or to delete an existing one. This decision can be based on several pieces of information available to the replication policy:
• the number of replicas of a given file present in the system;
• the number and type of accesses (i.e., the number of read and write operations) made by local clients to a given file, whether locally replicated or not.
Whenever a decision is taken, the replication mechanism is started: it is realised via standard access mechanisms, such as create and unlink, and by transferring data with the help of the coordination layer.

3.2 The Adaptive Replication Policy

The replication policy has been designed with locality as its main goal. In particular:
• a replication policy module is present on each node of the system and is responsible for local replicas only, without any capability of influencing the state of the replicas on other nodes;
• at every node, decisions are based on local information only (i.e., the number of locally requested read and write operations).
A great advantage of this local approach is that it avoids the need for expensive coordination protocols among the policy modules of different nodes. In addition, locality makes the system more robust: a failure in a single site compromises only the data of that site and does not interfere with the activities of other policy modules. A drawback of the local replication policy is that, lacking a global view, it cannot guarantee reaching the globally optimal configuration of replicas. However, the advantages in terms of robustness and low overhead that a local policy provides outweigh the advantage that would come from a globally coordinated implementation, which is likely to be fault-prone and expensive. The replication policy is adaptive, i.e., it dynamically adapts the replication degree of files and the allocation of file replicas to the current pattern of accesses. At each local site, the policy periodically evaluates the local statistical data and decides whether the current conditions make it convenient to remove the local replica, or to create a local replica of a remotely accessed file. When a server decides to change the current situation, it must notify all other servers of the change, so that they can update the current number of replicas of that file. The decisions about the new state of replicas are based on the following criteria.


Let us consider a system with N hosts, where the cost of a broadcast RPC call is $b$ and the cost of a single answer is $a$. Let F be a file with current replication degree $n$, and let $r$ and $w$ be the numbers of read and write operations, respectively, recorded in the local statistics for F. If a replica is local to a node, the total cost of accessing the local replica is:

$$T_1 = 0 \cdot r + (b + na)\,w$$

In fact, read operations do not cause any network traffic, while each write operation implies a broadcast call plus a cost $a$ for each of the $n$ replicas of the file (consistently with the write cost assigned below to a host that holds a replica). If the replica were removed from the host, the cost of accessing the file would become:

$$T_2 = (b + (n-1)a)\,r + (b + (n-1)a)\,w$$

In fact, both read and write operations issued from the host imply a broadcast request and the reception of $n-1$ answers from the hosts that still hold a replica. Removing the local replica is therefore convenient if $T_2 < T_1$. By substituting:

$$(b + (n-1)a)\,r + (b + (n-1)a)\,w < (b + na)\,w$$

and so:

$$\frac{r}{w} < \frac{a}{b + (n-1)a}$$

In the case of a broadcast network, we can assume $b = a$ and obtain the simplified condition:

$$\frac{r}{w} < \frac{1}{n} \qquad \text{(i)}$$

Expression (i) is the condition under which a local replica can be deleted (unless $n$ is already the minimum number of replicas to be granted for the file). Now consider a host that does not hold a replica of a file with replication degree $n-1$. The situation is complementary to the previous one. Since there is no local replica of the file, the cost of accessing it is:

$$T_1 = (b + (n-1)a)\,r + (b + (n-1)a)\,w$$

while the cost of accessing it if a local replica were present would be:

$$T_2 = 0 \cdot r + (b + na)\,w$$

In this case the server can take advantage of creating a local replica if $T_2 < T_1$. By substituting:

$$(b + na)\,w < (b + (n-1)a)\,r + (b + (n-1)a)\,w$$

and so:

$$\frac{r}{w} > \frac{a}{b + (n-1)a}$$

Again, in the case of a broadcast network, we can use the reduced condition:

$$\frac{r}{w} > \frac{1}{n} \qquad \text{(ii)}$$


Expression (ii) is the condition under which a local replica can be created for a file with replication degree $n-1$ (unless $n-1$ is already the maximum number of replicas allowed for the file). The above policy can behave unstably: whenever the r/w ratio is close to $1/n$, small variations of its value can cause a replica to be continuously created and deleted, leading to system thrashing. A corrective coefficient K, ranging from 0 to 1, can be introduced to tune the inertia of the system and to smooth temporary instabilities. K is the weight given to previously collected statistics in the decisions of the policy. In particular, at each activation the statistics the policy takes into account are updated as:

$$\mathit{Statistics}(r, w) = \mathit{NewlyCollected} + K \cdot \mathit{Statistics}(r, w)$$

If K is zero, only the statistics collected during the last period are considered, without any smoothing. The larger the coefficient, the more important old statistics are in the decisions and the less influential temporary variations become. Since K is lower than 1, old data become progressively less important: the operations counted $j$ periods ago are weighted by $K^j$. To summarise, the algorithm implemented by the replication policy on each node is reported in figure 3. In the pseudo-code, n is the current replication degree of the file; hence the creation test uses $a/(b + na)$, which is condition (ii) written for the degree $n+1$ reached after the creation.

    do
        sleep(interval_time);
        Statistics(r, w) = NewlyCollected + K * Statistics(r, w);
        for each file do
            if replica is local then begin
                if (r/w < a/(b + (n-1)*a) and n > min_rep_deg) then begin
                    remove local replica;
                    advise all sites;
                end;
            end else begin
                if (r/w > a/(b + n*a) and n < max_rep_deg) then begin
                    create a local replica;
                    advise all sites;
                end;
            end;
    while true

Figure 3. Pseudo-code of the replication policy.
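One pass of the loop in figure 3 can be transcribed into Python as follows. This is a sketch, not the authors' implementation: `FileStats`, `decide` and `policy_round` are hypothetical names, the sleep and the "advise all sites" broadcast are left to the caller (the function only returns the decisions to be advised), and in the real system the replication degree known to a node is also updated by the advices of other sites. The r/w tests are cross-multiplied, which is equivalent for w > 0 and also handles files with no recorded writes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class FileStats:
    reads: float = 0.0      # smoothed read count r
    writes: float = 0.0     # smoothed write count w
    degree: int = 1         # replication degree n, as known to this node
    is_local: bool = False  # whether this node holds a replica


def decide(st: FileStats, a: float, b: float,
           min_deg: int, max_deg: int) -> Optional[str]:
    """Figure 3 tests; a and b are the answer and broadcast costs."""
    n = st.degree
    if st.is_local:
        # remove if r/w < a / (b + (n-1)a) and n > min_rep_deg
        if st.reads * (b + (n - 1) * a) < a * st.writes and n > min_deg:
            return "remove"
    else:
        # create if r/w > a / (b + n*a) and n < max_rep_deg
        if st.reads * (b + n * a) > a * st.writes and n < max_deg:
            return "create"
    return None


def policy_round(files: Dict[str, FileStats],
                 new_counts: Dict[str, Tuple[int, int]],
                 k: float, a: float, b: float,
                 min_deg: int, max_deg: int) -> List[Tuple[str, str]]:
    """One iteration of the figure 3 loop body, without the sleep.

    Returns the (file, action) pairs that would be advised to all sites.
    """
    advised: List[Tuple[str, str]] = []
    for name, st in files.items():
        new_r, new_w = new_counts.get(name, (0, 0))
        # Statistics(r, w) = NewlyCollected + K * Statistics(r, w)
        st.reads = new_r + k * st.reads
        st.writes = new_w + k * st.writes
        action = decide(st, a, b, min_deg, max_deg)
        if action == "remove":
            st.is_local, st.degree = False, st.degree - 1
        elif action == "create":
            st.is_local, st.degree = True, st.degree + 1
        if action is not None:
            advised.append((name, action))
    return advised
```

With a = b (the broadcast case) the two tests reduce to conditions (i) and (ii): a write-dominated local replica of a file with two replicas is dropped, while a read-dominated remote file with one replica gains a local copy.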


4. Evaluation

The replication system has been implemented on a network of Sun and HP workstations connected by an Ethernet network and sharing a file system via NFS™ [Sun90]. Because the network is broadcast-based, the policy can adopt the simplified conditions (i) and (ii) of section 3.2 to decide on the presence of replicas. To test the efficiency of the system, we performed repeatable tests under a variety of conditions, varying both the patterns of access to the files by the clients and the internal parameters of the policy. The tests performed with static patterns of access to the files make it possible to evaluate the overhead of the policy. The tests with dynamic patterns of access make it possible to evaluate the capability of the system to adapt to changing situations, depending on the inertia produced by the corrective coefficient.
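The inertia introduced by the corrective coefficient K can be previewed with a small synthetic simulation of the smoothing rule of section 3.2 on a broadcast network (b = a). The access pattern is invented for illustration and is not taken from the measurements reported in this section: a node holding one of n = 2 replicas first reads heavily (10 reads and 1 write per period) and then switches to writing heavily (1 read and 10 writes per period). The hypothetical function `reaction_delay` counts the periods until condition (i) triggers the removal of the local replica.

```python
from typing import Optional, Tuple


def reaction_delay(k: float, before: Tuple[int, int], after: Tuple[int, int],
                   n: int, warmup: int = 50, horizon: int = 200) -> Optional[int]:
    """Periods needed, after a change of access pattern, for condition (i),
    r/w < 1/n, to hold on the smoothed statistics (None if it never does)."""
    s_r = s_w = 0.0
    for _ in range(warmup):              # reach the steady state of 'before'
        s_r = before[0] + k * s_r
        s_w = before[1] + k * s_w
    for period in range(1, horizon + 1):
        s_r = after[0] + k * s_r         # Statistics = NewlyCollected + K * Statistics
        s_w = after[1] + k * s_w
        if n * s_r < s_w:                # r/w < 1/n, cross-multiplied
            return period
    return None


for k in (0.0, 0.5, 0.9):
    print(k, reaction_delay(k, before=(10, 1), after=(1, 10), n=2))
```

With these synthetic numbers the removal happens after 1, 2 and 12 periods for K = 0, 0.5 and 0.9 respectively: the larger K, the longer old read-heavy statistics keep the replica in place.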

4.1 Static Evaluation

When the patterns of access to files do not change substantially over time, the distribution of replicas in the system becomes stable after an initial transient, and the policy does not move replicas. As a consequence, the decision activities periodically performed by the policy module bring no further benefit and only add overhead to the system. We measured this overhead as the system throughput, both with the policy active and with it disabled (including statistics collection). Figure 4 reports the total number of operations performed on average by one host (in a system of 4 hosts) when the policy is active, compared with the case in which the policy activities have been stopped. The x-axis reports different values of the interval of periodicity of the policy. As can be seen, the overhead is limited: with an interval of 30 seconds, only 2% fewer operations are performed than when the policy is disabled. The same tests have been repeated while varying the number of hosts involved. The main result is that the average overhead imposed by the policy on one host does not depend on the overall system size, which makes the policy scalable.

Figure 4. Number of operations performed on one host with and without policy in static situations

4.2 Dynamic Evaluation

In dynamic situations, the patterns of access to the files in the system change over time. This dynamicity shows up as changes in the r/w ratio of the operations that clients request on files at the nodes of the system. Several tests have been performed by varying the r/w ratio at different frequencies. As stated in the previous section, the policy tends to remove replicas from nodes where the replica is intensively accessed for writing: this avoids issuing a large number of consistency-protocol operations from that node. Conversely, a file intensively read from a node is likely to be replicated locally on that node: a local replica that is mostly read does not significantly increase the traffic due to the consistency protocols. Given these considerations, it is important to tune the internal parameters of the algorithm so that it is neither too inertial in reacting to a changed situation nor so prompt that it becomes unstable. A first consideration concerns the close interrelation between the interval of periodicity of the policy and the coefficient K. Since the statistics are updated by the policy once per interval, decreasing either the interval of periodicity or the coefficient K should have the same effect on system inertia. When K is high or the interval between two control activities is long, the system tends to exhibit high inertia, i.e., it tends to delay any reaction to a changed situation. An example of this situation is shown in figures 5 and 6. Figure 5 reports the temporal diagram of the r/w ratio in the accesses to a given file on 3 (out of 5) hosts of the system. Figure 6 reports the corresponding temporal diagram of the presence of replicas on these nodes when the policy is acting, with K = 0.5 and an interval of periodicity of 60 seconds. On host A, a local replica begins to be intensively accessed for reading (r/w = 10) at time 0; in this case the algorithm does not remove the replica. After three minutes the local access pattern changes and the replica becomes intensively accessed for writing (r/w = 0.1). In this case, the algorithm should locally decide to remove the local replica. However, as can be seen from figure 6, this decision is taken only 2 minutes after the situation has changed. An even worse situation occurs on host B. Here, a local replica starts, at time 0, to be intensively accessed for writing. However, the policy decides to remove it only after three minutes, when the replica has already begun to be accessed mainly for reading. This is easily explained by:
• the rather high corrective factor and, therefore, the strong influence of past operations on the decisions of the policy;
• the interval between two controls, long enough to prevent any prompt reaction to changes in the load.
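The lag described above can be reproduced qualitatively by simulating the smoothing rule alone. The Python sketch below is a toy model, not a reconstruction of the experiment: it assumes hypothetical per-interval counts of 10 reads and 1 write for three intervals, followed by 1 read and 10 writes, and a removal threshold of 1/2, which is what a/(b + (n − 1)a) gives for a = b and n = 2. The function names smoothed_ratios and reaction_delay are hypothetical.

```python
from typing import List, Optional, Tuple


def smoothed_ratios(pattern: List[Tuple[int, int]], k: float) -> List[float]:
    """Apply Statistics = NewlyCollected + K * Statistics once per interval
    and return the smoothed r/w ratio seen by the policy after each interval."""
    r = w = 0.0
    ratios = []
    for new_r, new_w in pattern:
        r = new_r + k * r
        w = new_w + k * w
        ratios.append(r / w)  # every interval in the toy pattern has new_w >= 1
    return ratios


def reaction_delay(pattern: List[Tuple[int, int]], change_at: int,
                   threshold: float, k: float) -> Optional[int]:
    """Intervals elapsed from the change until the smoothed ratio first falls
    below the removal threshold (None if it never does within the pattern)."""
    ratios = smoothed_ratios(pattern, k)
    for i in range(change_at, len(ratios)):
        if ratios[i] < threshold:
            return i - change_at + 1
    return None


# Read-dominated for 3 intervals, then write-dominated (hypothetical counts).
pattern = [(10, 1)] * 3 + [(1, 10)] * 5
for k in (0.0, 0.5, 0.9):
    print(k, reaction_delay(pattern, change_at=3, threshold=0.5, k=k))
# prints "0.0 1", then "0.5 2", then "0.9 5"
```

With K = 0.5 the removal comes two intervals after the change, and with K = 0.9 it takes five; only this qualitative trend, not the specific numbers, should be compared with figure 6.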

Figure 6. Temporal diagram of the presence of replicas with K = 0.5 and an interval of 30 seconds (local replica present/absent on hosts A, B and C versus time in minutes).

Figure 7. Temporal course of the presence of replicas with K = 0 and an interval of 30 seconds (local replica present/absent on hosts A, B and C versus time in minutes).

Figure 5. Temporal diagram of the r/w ratio on hosts A, B and C

If either the coefficient K or the interval decreases, the system becomes more reactive and can adapt the allocation of replicas to the current load more quickly. This situation is shown in figure 7, which refers again to the access patterns reported in figure 5 but shows the temporal diagram of the presence of replicas when K is 0. In this case, the replica on host A, for example, is removed only one minute after the r/w ratio has changed. A synthesis of all the tests performed is shown in figure 8. This figure points out that excessive inertia can lead to a lower throughput than when the replication policy is not activated at all. Throughput can be increased by decreasing either the coefficient K or the interval of periodicity. However, the figure also shows that decreasing the interval of periodicity too much is not effective: increasing the frequency at which the policy is activated increases its overhead on the system and, in the end, reduces the throughput.
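Unrolling the smoothing rule shows why K and the interval of periodicity act on the same quantity. In the following derivation the notation is introduced only for illustration: T is the interval of periodicity, $S_m$ stands for Statistics(r,w) after m intervals and $N_m$ for the counts NewlyCollected during interval m, with the statistics starting from zero.

$$S_m = N_m + K\,S_{m-1} = \sum_{j=0}^{m} K^{j}\,N_{m-j}$$

Counts collected t = jT seconds earlier therefore carry weight

$$K^{t/T} = e^{-t/\tau}, \qquad \tau = \frac{T}{\ln(1/K)}, \qquad 0 < K < 1.$$

The memory time constant τ grows with both K and T, so lowering either one reduces inertia; for instance, with K = 0.5, τ ≈ 87 s for T = 60 s and τ ≈ 43 s for T = 30 s, while K = 0 retains only the last interval. The derivation captures only the memory of the statistics, not the cost of running the policy more often, which is why shortening the interval stops paying off once the overhead dominates.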

(Plots for figures 8 and 9: number of operations versus the corrective coefficient K, one curve per interval of periodicity of 10, 30 and 40 seconds, compared with the case without the policy.)


Figure 8. Throughput for different values of K and for different intervals of periodicity (10, 30 and 40 seconds).

Figure 9. Throughput for different values of K and for different intervals of periodicity in the case of highly dynamic access patterns.

Promptness is an important requirement for the algorithm in dynamic systems. However, it is equally important to ensure that the algorithm is stable: it should not waste a lot of system resources without reaching a better situation. Further tests have shown that when the access patterns become more dynamic and less regular (changing every few seconds or less), too little inertia in the algorithm may cause unstable behaviour. In this case the trend shown in figure 8 is reversed, as shown in figure 9. If the inertia of the system is low, the overall throughput worsens, both because of the high number of replica movements and because each decision remains valid only for a short time (the information on which a decision is based is likely to be obsolete). If the inertia of the system is high, instead, the throughput can be increased; this can be achieved by increasing either K or the interval of periodicity.
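The instability under fast-changing patterns can be illustrated with a toy simulation of a single node. The Python sketch assumes a = b = 1, exactly one replica of the file held by other nodes, no limits from min_rep_deg and max_rep_deg, and a hypothetical access pattern that alternates every interval between a read-dominated phase (4 reads, 1 write) and a write-dominated phase (1 read, 4 writes). Under these assumptions the creation and removal thresholds both reduce to r/w = 1/2. The function count_replica_moves is hypothetical; it only counts how many times the local replica is created or removed.

```python
from typing import List, Tuple


def count_replica_moves(pattern: List[Tuple[int, int]], k: float,
                        others: int = 1, a: float = 1.0, b: float = 1.0) -> int:
    """Count create/remove actions taken by one node for one file.

    `others` is the number of replicas held by other nodes, assumed constant;
    the replication degree limits of figure 3 are ignored in this sketch.
    """
    r = w = 0.0
    has_local = False
    moves = 0
    for new_r, new_w in pattern:
        # Smoothing: Statistics = NewlyCollected + K * Statistics
        r = new_r + k * r
        w = new_w + k * w
        n = others + (1 if has_local else 0)  # current replication degree
        if has_local and r * (b + (n - 1) * a) < a * w:    # r/w < a/(b+(n-1)a)
            has_local, moves = False, moves + 1
        elif not has_local and r * (b + n * a) > a * w:    # r/w > a/(b+na)
            has_local, moves = True, moves + 1
    return moves


# Pattern flipping every interval between read- and write-dominated phases.
pattern = [(4, 1), (1, 4)] * 5
for k in (0.0, 0.8):
    print(k, count_replica_moves(pattern, k))
# prints "0.0 10", then "0.8 1"
```

In this toy setting K = 0 creates or removes the replica at every interval, while K = 0.8 creates it once and keeps it, which suits the long-run average r/w of 1, well above the threshold of 1/2.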

As a final remark, the dynamic tests confirmed the locality principle on which the system is based: only by respecting it can the scope of each decision be bounded. Starting from the access patterns of figure 5, one can change the r/w ratio of the accesses to a given file on one host, say B, while leaving those of hosts A and C unchanged. Figure 8 reports the temporal diagram of the presence of replicas of that file on hosts A, B and C. A comparison with figure 6 shows that the presence of replicas on hosts A and C does not change, and that the effect of the changed access pattern on B remains confined to B. Global effects are limited and arise only because the algorithm uses the total number of replicas in the system to evaluate the appropriate decision at each local site. The orthogonality shown by this test is fundamental to the validity of a local algorithm: if the behaviour of one node can be kept largely independent of the other nodes, every site can make good choices on the basis of local data alone. Locality thus allows the use of more efficient protocols that require less coordination.
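The locality argument can be sketched by letting each node evaluate the figure 3 thresholds on its own statistics, with the global replication degree n as the only shared input. This Python sketch assumes a = b = 1 and ignores min_rep_deg and max_rep_deg; the node counts and the helpers local_decision and decide_all are hypothetical.

```python
from typing import Dict, Tuple

# node name -> (smoothed reads, smoothed writes, holds a local replica?)
Nodes = Dict[str, Tuple[float, float, bool]]


def local_decision(r: float, w: float, has_local: bool, n: int,
                   a: float = 1.0, b: float = 1.0) -> str:
    """Figure 3 thresholds in cross-multiplied form, using only local data and n."""
    if has_local and r * (b + (n - 1) * a) < a * w:
        return "remove"
    if not has_local and r * (b + n * a) > a * w:
        return "create"
    return "keep"


def decide_all(nodes: Nodes) -> Dict[str, str]:
    """Every node decides independently; n, the global replica count, is the only shared value."""
    n = sum(1 for _, _, has_local in nodes.values() if has_local)
    return {name: local_decision(r, w, loc, n) for name, (r, w, loc) in nodes.items()}


before = decide_all({"A": (20, 2, True), "B": (10, 2, True), "C": (1, 10, False)})
# Only host B's access pattern changes: it becomes write-dominated.
after = decide_all({"A": (20, 2, True), "B": (1, 20, True), "C": (1, 10, False)})
print(before)  # {'A': 'keep', 'B': 'keep', 'C': 'keep'}
print(after)   # {'A': 'keep', 'B': 'remove', 'C': 'keep'}
```

Only B's decision changes; A and C would be affected only indirectly, in a later round, through the new value of n after B removes its replica.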


Figure 8. Temporal diagram of the presence of replicas with different values of load on host B (local replica present/absent on hosts A, B and C versus time in minutes).

5. Conclusions

The paper presents a system for file replication in distributed file systems, implemented on top of commercial products. The replication policy integrated within the system is distributed and local: it can take near-optimal decisions even though it uses only local data, and it minimises its intrusion on the system. In addition, the policy is adaptive and can vary the replication degree of files depending on the patterns of access to them. Experimental results show that the implemented system can significantly improve the performance of accesses to a distributed file system, without any specialised hardware or operating system support. One problem of the implemented system is that the replication policy requires tuning of its internal parameters to achieve efficiency: on the one hand, when the patterns of access to the file system are highly dynamic, it is convenient to increase the inertia of the policy to avoid thrashing; on the other hand, when the patterns of access change slowly, too much inertia makes the system slow to react. Future work will deal with the integration into the system of a replication policy able to adapt its internal parameters automatically to the dynamicity of the patterns of access to the file system.

6. References

[Bal92] H. E. Bal et al., “Replication Techniques for Speeding Up Parallel Applications on Distributed Systems”, Concurrency: Practice and Experience, Vol. 5, No. 5, Aug. 1992.
[BirN84] A. D. Birrell, B. J. Nelson, “Implementing Remote Procedure Calls”, ACM Transactions on Computer Systems, Vol. 2, No. 2, Feb. 1984.
[Bir85] K. P. Birman, “Replication and Fault-Tolerance in the ISIS System”, Proceedings of the 10th ACM Symposium on Operating Systems Principles, Dec. 1985.
[Fra95] M. C. Franky, “DGDBM: Programming Support for Distributed Transactions over Replicated Files”, ACM Operating Systems Review, Vol. , No., 1995.
[GavL90] B. Gavish, O. R. Liu Sheng, “Dynamic File Migration in Distributed Computer Systems”, Communications of the ACM, Vol. 33, No. 2, Feb. 1990.
[Hac89] A. Hac, “A Distributed Algorithm for Performance Improvement through File Replication, File Migration and Process Migration”, IEEE Transactions on Software Engineering, Vol. 15, No. 11, Nov. 1989.
[HarO95] J. H. Hartman, J. K. Ousterhout, “The Zebra Striped Network File System”, ACM Transactions on Computer Systems, Vol. 13, No. 3, Aug. 1995.
[HooM95] P. Hoogerbrugge, R. Mirchandeney, “Experiences with Networked Parallel Computing”, Concurrency: Practice and Experience, Vol. 7, No. 1, Feb. 1995.
[HuaT93] Y. Huang, S. K. Tripathi, “Resource Allocation for Primary-Site Fault-Tolerant Systems”, IEEE Transactions on Software Engineering, Vol. 19, No. 2, Feb. 1993.
[LevS90] E. Levy, A. Silberschatz, “Distributed File Systems: Concepts and Examples”, ACM Computing Surveys, Vol. 22, No. 4, Dec. 1990.
[Lom93] M. E. S. Loomis, “Managing Replicated Objects”, The Journal of Object-Oriented Programming, Vol. 6, No. 5, Sept. 1993.
[Lad92] R. Ladin et al., “Providing Availability Using Lazy Replication”, ACM Transactions on Computer Systems, Vol. 10, No. 4, April 1992.
[Dis93] S. Mullender (Ed.), “Distributed Systems”, Addison-Wesley, 1993.
[Sun90] Sun Microsystems, Network Programming Guide, 1990.
[Weg95] P. Wegner, “Tutorial Notes: Models and Paradigms of Interaction”, Technical Report No. CS-95-11, Brown University, Providence (RI), Sept. 1995.