Handling Heterogeneous Storage Devices in Clusters
André Brinkmann, University of Paderborn
Toni Cortes, Barcelona Supercomputing Center
Randomized Data Placement Schemes
• Introduction Randomization – Balls into Bins
• Randomized Data Placement Schemes – Distributed Hash Tables
• Consistent Hashing and Share
• Redundancy and Randomized Data Placement Schemes
• Distributed Metadata Management
Introduction Randomization
• Deterministic data placement schemes suffered from many drawbacks for a long time
  – Heterogeneity has been an issue
  – It has been costly to adapt to new storage systems
  – It is difficult to support storage-on-demand concepts
• Is there an alternative to deterministic schemes?
• Yes, randomization can help to overcome these drawbacks, but …
  … new challenges are introduced!
Balls into Bins Games I
• Basic task of balls into bins games
  – Assign a set of m balls to n bins
• Motivation
  – Bins = hard disks
  – Balls = data items
  – L = maximum number of data items on each disk
[Figure: five bins, labeled 0 to 4 – where should I place the next item?]
Balls into Bins Games II
• Basic results:
  – Assign n balls to n bins
  – For every ball, choose one bin independently, uniformly at random
  – The maximum load is sharply concentrated: it is Θ(ln n / ln ln n) w.h.p., where w.h.p. abbreviates "with probability at least 1 − n^(−α)", for any fixed α ≥ 1
Balls into Bins Games III
• This sounds terrible:
  – The maximum loaded hard disk stores Θ(ln n / ln ln n)-times more data than the average
  – This seems not to be scalable, or …
• The model assumes that only very few data items are stored inside the environment,
  – but each disk is able to store many objects
  – Let's assume that "many objects" means m ≫ n · ln n balls
  – Then it holds w.h.p. that the maximum load is only (1 + o(1)) · m/n
See, e.g., M. Raab, A. Steger: Balls into Bins – A Simple and Tight Analysis
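A minimal simulation sketch (not part of the tutorial; the helper max_load is a hypothetical name) makes both regimes visible: for m = n the fullest bin holds Θ(ln n / ln ln n) balls, while for m ≫ n ln n the loads are almost perfectly balanced.

```python
# Sketch: balls-into-bins simulation illustrating the two regimes above.
import random
from collections import Counter

def max_load(m: int, n: int, seed: int = 0) -> int:
    """Throw m balls into n bins independently and uniformly at random;
    return the load of the fullest bin."""
    rng = random.Random(seed)
    loads = Counter(rng.randrange(n) for _ in range(m))
    return max(loads.values())

n = 10_000
print(max_load(n, n))          # small, grows like ln n / ln ln n
print(max_load(1_000 * n, n))  # close to the average load m/n = 1000
```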
Distributed Hash Tables
• Randomization introduces some (well-known) challenges
• Key questions are:
  – How can we retrieve a stored data item?
  – How can we adapt to a changing number of disks?
  – How can we handle heterogeneity?
  – How can we support redundancy?
• These are the key tasks of Distributed Hash Tables (DHTs)
Consistent Hashing I
• Introduced in the context of Web Caching
• Bins are mapped by a pseudo-random hash function h: Bins → [0,1) onto a ring of length 1
• Bins become responsible for "their" interval
• Balls are mapped by an additional hash function g: Balls → [0,1) onto the ring
• Each bin stores the balls in its interval
[Figure: bins placed on the [0,1) ring with their intervals]
See D. Karger, E. Lehman et al.: Consistent Hashing and Random Trees: Tools for Relieving Hot Spots on the World Wide Web
Consistent Hashing II
• The average load of each bin is m/n, but the deviation from the average can be high: the maximum arc length on the ring becomes Θ(log n / n) w.h.p.
• Solution: Each bin is mapped by a set of independent hash functions to multiple points on the ring
  – The maximum arc length assigned to a bin can be reduced to (1 + ε)/n for an arbitrarily small constant ε > 0, if Θ(log n) virtual bins are used for each physical bin
See I. Stoica, R. Morris, et al.: Chord: A Scalable Peer-To-Peer Lookup Service for Internet Applications.
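The following is a hedged sketch of consistent hashing with virtual bins; SHA-1 as the pseudo-random hash function and the "successor point" rule of Chord are my assumptions, the slides do not fix them.

```python
# Sketch: consistent hashing with virtual bins on the [0,1) ring.
import hashlib
from bisect import bisect_right

def point(key: str) -> float:
    """Pseudo-randomly map a string onto the [0,1) ring."""
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class ConsistentHash:
    def __init__(self, bins, virtual=64):
        # Every physical bin appears as `virtual` points on the ring.
        self.ring = sorted((point(f"{b}#{v}"), b)
                           for b in bins for v in range(virtual))

    def lookup(self, ball: str) -> str:
        # A ball is stored by the bin owning the next point on the ring
        # at or after its hash value, wrapping around at 1.
        i = bisect_right(self.ring, (point(ball),))
        return self.ring[i % len(self.ring)][1]

ch = ConsistentHash(["disk0", "disk1", "disk2"])
print(ch.lookup("block-42"))
```

Note that adding a "disk3" only moves balls whose lookup changes to "disk3": this is the monotonicity property discussed on the next slides.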
Join and Leave Operations I
• In a dynamic network, nodes can join and leave at any time
• The main goal of a DHT is the ability to locate every key in the network at (nearly) any time
• (Planned) removal of bins changes the length of their neighbors' intervals
  – Data has to be moved to the neighbor
• Insertion of bins changes the interval length of their new neighbors
[Figure: ring before and after a bin joins or leaves]
Join and Leave Operations II
• Definition of a view V: A view V is the set of bins of which a particular client is aware
• Monotonicity: A ranged hash function f is monotone if for all views V1 ⊆ V2, f_V2(b) ∈ V1 implies f_V1(b) = f_V2(b)
• Monotonicity implies that in case of a join operation of a bin i, all moved data items have destination i
• Consistent Hashing has the monotonicity property
Heterogeneous Bins
• Consistent Hashing is (nearly) optimally suited for homogeneous environments, where all bins (disks) have the same capacity and performance
• Heterogeneous bins can be mapped to Consistent Hashing by using a different number of virtual bins for each physical bin
• But the relation between the numbers of virtual bins of the different physical bins constantly changes
• Monotonicity (and some other properties) cannot be maintained
Share Strategy I
[Figure: bin d mapped to start point g(d) on the [0,1)-interval, covering an interval of length l(c_d)]
• The Share strategy tries to map the heterogeneous problem to a homogeneous solution
• Each bin d is assigned by a hash function g: Bins → [0,1) to a start point g(d) inside the [0,1)-interval
• The length l of its interval is proportional to the capacity c_d (performance, or another metric) of bin d
See A. Brinkmann, K. Salzwedel, C. Scheideler: Compact, adaptive placement schemes for non-uniform distribution requirements.
Share Strategy II
[Figure: data item x mapped to point h(x) on the [0,1)-interval]
• How can we retrieve the location of a data item x inside this heterogeneous setting?
• Use a hash function h: Balls → [0,1) to map x onto the [0,1)-interval
• Use a DHT for homogeneous bins to retrieve the location of x from all intervals cutting h(x)
Share Strategy III
[Figure: data item x mapped to point h(x) on the [0,1)-interval, covered by several bin intervals]
• Properties:
  – Distribution of balls among bins arbitrarily close to optimal
  – Computational complexity in O(1)
  – Competitive ratio concerning join and leave is (1+ε) for every ε > 0
• But:
  – Share has been optimized for usage in data center environments
  – Share is not monotone and only partially suited for P2P networks
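Below is a simplified sketch of the Share idea under stated assumptions: the stretch factor, the hash function, and the use of a seeded uniform choice instead of a full homogeneous DHT among the covering bins are mine, not the paper's.

```python
# Sketch: Share-style placement -- capacity-proportional intervals plus a
# homogeneous strategy among the bins covering h(x).
import hashlib, random

def point(key: str) -> float:
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class Share:
    def __init__(self, capacities: dict, stretch: float = 3.0):
        total = sum(capacities.values())
        # Bin d covers an interval starting at g(d) whose length is
        # proportional to its capacity c_d (stretched so that every
        # point of [0,1) is covered w.h.p.).
        self.intervals = {d: (point(d), stretch * c / total)
                          for d, c in capacities.items()}

    def covering(self, x: float):
        """All bins whose (wrapping) interval contains x."""
        return sorted(d for d, (start, length) in self.intervals.items()
                      if (x - start) % 1.0 < length)

    def lookup(self, ball: str) -> str:
        # Stand-in for the homogeneous DHT: a per-ball seeded choice.
        bins = self.covering(point(ball))
        return random.Random(ball).choice(bins)

s = Share({"disk0": 100, "disk1": 80, "disk2": 60})
print(s.lookup("block-42"))
```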
V:Drive
[Figure: SAN with servers and a metadata appliance (MDA)]
• V:Drive
  – out-of-band virtualization environment
  – each (Linux) server includes an additional block-level driver module
  – a metadata appliance ensures a consistent view on storage and servers
  – the Share strategy is used as the data distribution strategy
See A. Brinkmann, S. Effert, et al.: Influence of Adaptive Data Layouts on Performance in dynamically changing Storage Environments
Performance V:Drive – Static
Synthetic random I/O benchmark, static configuration
[Charts: throughput (MB/s) and average latency (ms) over the number of physical volumes, V:Drive vs. LVM]
Performance V:Drive – Dynamic
Synthetic random I/O benchmark, dynamic configuration
[Charts: throughput (MB/s) and average latency (ms) over the number of physical volumes, V:Drive vs. LVM]
V:Drive – Reconfiguration Overhead
Randomization and Redundancy
• Randomized data distribution schemes do not include mechanisms to protect data against disk failures
• Question:
  – How can randomization and RAID schemes be used together?
• Assumption:
  – k copies of a data block have to be distributed over n disks
  – No two copies of a data block are allowed to be stored on the same disk
Trivial Solutions
• Trivial Solution I:
  – Divide the storage system into k storage pools
  – Distribute the first copies over the first pool, …, the k-th copies over the k-th pool
  ⇒ Missing flexibility
• Trivial Solution II:
  – The first copy is distributed over all disks
  – The second copy is distributed over all but the previously chosen disk, …
  ⇒ Not able to use the capacity efficiently
[Figure: placement of first and second copy; the labels give the probability that each disk is not chosen for the second copy]
Observation
• Trivial Solution II is not able to use the capacity efficiently, because big storage systems are penalized compared to smaller devices
• Theorem: Assume a trivial replication strategy that has to distribute k copies of m balls over n > k bins. Furthermore, the biggest bin has a capacity c_max that is at least (1 + ε) · c_j, where c_j is the capacity of the next biggest bin j. In this case, the expected load of the biggest bin will be smaller than the expected load required for optimal capacity efficiency.
See A. Brinkmann, S. Effert, et al.: Dynamic and Redundant Data Placement
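A quick Monte-Carlo illustration of the theorem (my own sketch, not from the paper): with Trivial Solution II and k = 2, a bin holding 40% of the capacity receives measurably less than 40% of the copies.

```python
# Sketch: trivial replication penalizes the biggest bin (k = 2).
import random

caps = [200, 100, 100, 100]        # bin 0 holds 40% of the capacity
load = [0] * len(caps)
rng = random.Random(1)

for _ in range(100_000):
    # First copy: capacity-proportional over all bins.
    first = rng.choices(range(len(caps)), weights=caps)[0]
    # Second copy: capacity-proportional over the remaining bins.
    rest = [i for i in range(len(caps)) if i != first]
    second = rng.choices(rest, weights=[caps[i] for i in rest])[0]
    load[first] += 1
    load[second] += 1

total = sum(load)
for i, l in enumerate(load):
    print(f"bin {i}: load share {l / total:.3f} vs. capacity share "
          f"{caps[i] / sum(caps):.3f}")
# Bin 0 ends up near 0.35 instead of 0.40 -- below optimal efficiency.
```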
Idea
• The algorithm has to ensure that bigger bins get data items according to their capacities
• This can be ensured by an algorithm that iterates over a list of bins sorted by capacity:
  1. At each iteration, the algorithm randomly decides whether or not to place the ball
  2. If one of the k copies of a ball has been placed, use the optimal strategy for (k−1) copies with the remaining bins as input
• Challenge:
  – How to make the random decision in step 1 of each iteration
Example for Mirroring (k = 2)

  Disk capacity c_i:          100 GB  100 GB  80 GB  80 GB  60 GB
  ĉ_i = c_i / Σ_j c_j:          0.24    0.24   0.19   0.19   0.14
  c̄_i = c_i / Σ_{j≥i} c_j:     0.24    0.31   0.36   0.57   1.00
  k · c̄_i:                     0.48    0.62   0.72   1.14   2.00

• ĉ_i denotes the relative capacity of disk i to all disks
• c̄_i denotes the relative capacity of disk i to all disks starting with index i
• k · c̄_i is the weight for the random decision!
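The table can be reproduced in a few lines (a sketch; the symbol names ĉ_i and c̄_i follow the reconstruction above, and Python's rounding yields 0.73 where the slide shows 0.72):

```python
# Sketch: computing the weights from the mirroring example (k = 2).
caps = [100, 100, 80, 80, 60]                    # disk capacities in GB
total = sum(caps)

c_hat = [c / total for c in caps]                # relative to all disks
c_bar = [caps[i] / sum(caps[i:]) for i in range(len(caps))]  # to disks i..n
weights = [2 * cb for cb in c_bar]               # k * c_bar

print([round(x, 2) for x in c_hat])    # [0.24, 0.24, 0.19, 0.19, 0.14]
print([round(x, 2) for x in c_bar])    # [0.24, 0.31, 0.36, 0.57, 1.0]
print([round(x, 2) for x in weights])  # [0.48, 0.62, 0.73, 1.14, 2.0]
```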
Example for Mirroring (k = 2)
[Table as on the previous slide]
• If, e.g., disk 2 is chosen for the first copy of a mirror, just distribute the second copy according to Share over disks 3, 4, and 5
• Some adaptation is necessary if disk 3 is chosen, because the weight of disk 4 is greater than 1
Observations
[Table as on the previous slides]
• The strategy can easily be extended to arbitrary k
• The data distribution is optimal
• Redistribution of data in a dynamic environment is k²-competitive
• The computational complexity can be reduced to O(k)
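A hedged sketch of the placement loop described above (my simplification of the paper's algorithm; capping the weights at 1 stands in for the "adaptation" mentioned for weights greater than 1):

```python
# Sketch: iterative placement of k copies over bins sorted by capacity.
import random

def place(ball: str, caps: list, k: int) -> list:
    """Return the indices of the k distinct bins receiving a copy of
    `ball`; caps must be sorted in descending order."""
    rng = random.Random(ball)              # deterministic per ball
    chosen, remaining = [], k
    for i in range(len(caps)):
        if remaining == len(caps) - i:     # must use all remaining bins
            chosen.extend(range(i, len(caps)))
            break
        # Weight k' * c_i / sum(c_i..c_n), capped at 1.
        p = min(1.0, remaining * caps[i] / sum(caps[i:]))
        if rng.random() < p:
            chosen.append(i)
            remaining -= 1
            if remaining == 0:
                break
    return chosen

print(place("block-42", [100, 100, 80, 80, 60], k=2))  # two distinct bins
```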
Fairness of k-fold Replication
Adaptivity of k-fold Replication
Metadata Management
• The assignment of data items to disks can be solved efficiently by randomized data distribution schemes
  – Very good distribution of data and requests
  – Low computational complexity
  – Adaptivity to new infrastructures: optimal without redundancy, acceptable with redundancy
  – Over-provisioning can be efficiently integrated
• … but how to find the position of a data item on the disks?
  – Equal to the dictionary problem
  – Requires O(n) entries to find the location of n objects!
  – Defines the bulk of the metadata
Dictionary Problem
Extent size vs. volume size (resulting metadata size, assuming 32 bytes per extent entry):

  Extent \ Volume   1 GB      64 GB     1 TB      64 TB     1 PB
  4 KB              8 MB      512 MB    8 GB      512 GB    8 TB
  16 KB             2 MB      128 MB    2 GB      128 GB    2 TB
  256 KB            128 KB    8 MB      128 MB    8 GB      128 GB
  4 MB              8 KB      512 KB    8 MB      512 MB    8 GB
  16 MB             2 KB      128 KB    2 MB      128 MB    2 GB
  256 MB            128 Byte  8 KB      128 KB    8 MB      128 MB
  1 GB              32 Byte   2 KB      32 KB     2 MB      32 MB
• Extent: smallest continuous unit that can be addressed by the virtualization solution
• The dictionary easily becomes too big to be stored inside each server system for small extent sizes
• Solutions:
  – Caching
  – Huge extent sizes
  – Object-based storage systems
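The table above follows from a one-line computation (sketch; the 32-byte entry size is the value implied by the 1 GB / 1 GB cell):

```python
# Sketch: dictionary size = (volume size / extent size) * entry size.
KB, MB, GB, TB, PB = 2**10, 2**20, 2**30, 2**40, 2**50

def metadata_bytes(volume: int, extent: int, entry: int = 32) -> int:
    return (volume // extent) * entry

print(metadata_bytes(1 * PB, 4 * KB) // TB, "TB")    # 8 TB (top-right cell)
print(metadata_bytes(64 * TB, 16 * MB) // MB, "MB")  # 128 MB
```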
Summary and Conclusions
• Introduction into Disk Arrays
• Why Heterogeneity?
• Deterministic Data Placement Schemes
• Randomized Data Placement Schemes
• Summary and Conclusions
Summary
• Problem to be solved: scalable storage systems supporting heterogeneous devices
• Two solutions developed concurrently
  – Deterministic
    • Modify RAID technology while keeping its flavor
  – "Non-deterministic"
    • Distribute data blocks by using randomization
    • RAID encoding on top of the randomization process
Conclusions
• Advantages of each version
  – Deterministic
    • Easy metadata management
    • Easy recovery
  – "Non-deterministic"
    • Good support for storage-on-demand concepts
    • Lower probability of getting into a degraded state?
• Both approaches are complementary concerning their advantages, but have many similarities
  – A zone is very similar to a group of extents
    • Not fully described in this tutorial
• Next step: work on a mixed version
Bibliography I
• A. Brinkmann, S. Effert, F. Meyer auf der Heide, C. Scheideler: Dynamic and Redundant Data Placement. In Proceedings of the 27th IEEE International Conference on Distributed Computing Systems (ICDCS), 2007
• A. Brinkmann, S. Effert, M. Heidebuer, M. Vodisek: Influence of Adaptive Data Layouts on Performance in dynamically changing Storage Environments. In Proceedings of the 14th Euromicro Conference on Parallel, Distributed and Network-based Processing, 2006
• A. Brinkmann, K. Salzwedel, C. Scheideler: Compact, adaptive placement schemes for non-uniform distribution requirements. In Proceedings of the 14th ACM Symposium on Parallel Algorithms and Architectures (SPAA), 2002
• T. Cortes, J. Labarta: Taking Advantage of Heterogeneity in Disk Arrays. Journal on Parallel and Distributed Computing (JPDC), Volume 63, Number 4, pp. 448–464, April 2003
• J. L. Gonzalez, T. Cortes: An Adaptive Data Block Placement based on Deterministic Zones (AdaptiveZ). International Conference on Grid computing, high-performAnce and Distributed Applications (GADA'07), Vilamoura, Algarve, Portugal, November 29–30, 2007
Bibliography II
• J. L. Gonzalez, T. Cortes: Evaluating the Effects of Upgrading Heterogeneous Disk Arrays. International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS 2006), Calgary, Canada, July 31 – August 2, 2006
• M. Holland, G. A. Gibson: Parity declustering for continuous operation in redundant disk arrays. In Proceedings of the fifth international conference on Architectural Support for Programming Languages and Operating Systems, Boston, Massachusetts, 1992
• D. Karger, E. Lehman, T. Leighton, M. Levine, D. Lewin, R. Panigrahy: Consistent Hashing and Random Trees: Tools for Relieving Hot Spots on the World Wide Web. In Proceedings of the Symposium on Theory of Computing (STOC), 1997
• P. Lyman, H. R. Varian: How Much Information? 2003. School of Information Management and Systems, University of California at Berkeley
• D. A. Patterson, G. A. Gibson, R. H. Katz: A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proceedings of the International Conference on Management of Data (SIGMOD), 1988
Bibliography III
• M. Raab, A. Steger: Balls into Bins – A Simple and Tight Analysis. In Proceedings of the 2nd Workshop on Randomization and Approximation Techniques in Computer Science (RANDOM'98), 1998
• I. Stoica, R. Morris, D. Karger, F. Kaashoek, H. Balakrishnan: Chord: A Scalable Peer-To-Peer Lookup Service for Internet Applications. In Proceedings of the 2001 ACM SIGCOMM Conference, 2001
• R. Yellin: The data storage evolution. Has disk capacity outgrown its usefulness? Teradata Magazine, 2006