HPC File Systems: From Cluster to Grid, Oct 4, 2007, IRISA, Rennes
Gfarm Grid File System
Osamu Tatebe, University of Tsukuba
Petascale Data Intensive Computing
High Energy Physics
- CERN LHC, KEK-B Belle
- ~MB/collision, 100 collisions/sec
- ~PB/year
- 2000 physicists, 35 countries
Detector for LHCb experiment
Detector for ALICE experiment
Astronomical Data Analysis
- Data analysis of the whole data set
- TB~PB/year/telescope
- Subaru telescope: 10 GB/night, 3 TB/year
Petascale Data Intensive Computing: Requirements
- Storage capacity: Peta/Exabyte-scale files, millions of millions of files
- Computing power: > 1 TFLOPS, hopefully > 10 TFLOPS
- I/O bandwidth: > 100 GB/s, hopefully > 1 TB/s, within a system and between systems
- Global sharing: group-oriented authentication and access control
Goal and Features of the Grid Datafarm Project
- Began in summer 2000, in collaboration with KEK, Titech, and AIST
- Goal: dependable data sharing among multiple organizations, with high-speed data access and high-performance data computing
- Grid Datafarm → Gfarm Grid File System: a global, dependable virtual file system that federates local disks in PCs
- Scalable parallel & distributed data computing
  - Decentralization of disk access, giving priority to the local disk
  - File affinity scheduling
Gfarm Grid File System [CCGrid 2002]
- Commodity-based distributed file system that federates the storage of each site
- It can be mounted from all cluster nodes and clients
- It provides scalable I/O performance with respect to the number of parallel processes and users
- It supports fault tolerance and avoids access concentration by automatic replica selection
[Figure: a global namespace rooted at /gfarm, with directories such as ggf, jp, aist, and gtrc holding file1 through file4, is mapped onto the Gfarm File System; file replicas are created on the underlying storage]
Gfarm Grid File System (2)
- Files can be shared among all nodes and clients
- Physically, a file may be replicated and stored on any file system node
- Applications can access a file regardless of its location
- File system nodes can be distributed; access is also possible through GridFTP, samba, and NFS servers
[Figure: compute & file system nodes in Japan and the US form a single Gfarm file system; a Gfarm metadata server holds the /gfarm metadata, files A, B, and C are replicated across the nodes, and client PCs and note PCs access them directly or via GridFTP, samba, and NFS servers]
Scalable I/O Performance
- Decentralization of disk access, giving priority to the local disk
- When a new file is created:
  - the local disk is selected when there is enough space
  - otherwise, a near and least busy node is selected
- When a file is accessed:
  - the local disk is selected if it holds one of the file replicas
  - otherwise, a near and least busy node holding one of the file replicas is selected
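As a rough illustration of this selection policy, the C sketch below encodes the two rules; the node structure, its fields, and the helper names are assumptions made for the example, not the actual Gfarm data structures.

```c
/* Illustrative sketch of the node-selection rules above (not Gfarm code).
 * All structures and fields here are assumptions for the example. */
#include <stdio.h>
#include <stddef.h>

struct fsnode {
    const char *name;
    int    is_local;     /* runs on the requesting client? */
    int    has_replica;  /* already holds a replica of the file? */
    long   free_mb;      /* free space in MB */
    double load;         /* recent load reported by heartbeat */
    double rtt;          /* network distance to the client */
};

/* "nearer, then less busy" preference used by both selections below */
static int better(const struct fsnode *a, const struct fsnode *b)
{
    return b == NULL || a->rtt < b->rtt ||
           (a->rtt == b->rtt && a->load < b->load);
}

/* New file: the local disk wins when it has enough space; otherwise the
 * nearest, least busy node with enough space is chosen. */
static struct fsnode *select_for_create(struct fsnode *n, size_t cnt, long need_mb)
{
    struct fsnode *best = NULL;
    for (size_t i = 0; i < cnt; i++) {
        if (n[i].free_mb < need_mb)
            continue;
        if (n[i].is_local)
            return &n[i];
        if (better(&n[i], best))
            best = &n[i];
    }
    return best;
}

/* Existing file: only replica holders are candidates, with the same
 * local-first, then near-and-idle preference. */
static struct fsnode *select_for_read(struct fsnode *n, size_t cnt)
{
    struct fsnode *best = NULL;
    for (size_t i = 0; i < cnt; i++) {
        if (!n[i].has_replica)
            continue;
        if (n[i].is_local)
            return &n[i];
        if (better(&n[i], best))
            best = &n[i];
    }
    return best;
}

int main(void)
{
    struct fsnode nodes[] = {
        { "local",  1, 0, 500000, 0.9, 0.0 },
        { "rack",   0, 1, 500000, 0.2, 0.3 },
        { "remote", 0, 1, 500000, 0.1, 9.0 },
    };
    printf("create on: %s\n", select_for_create(nodes, 3, 200)->name);
    printf("read from: %s\n", select_for_read(nodes, 3)->name);
    return 0;
}
```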
File affinity scheduling
- Schedule a process on a node that holds the specified file
- Improves the opportunity to access the local disk
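The sketch below illustrates file-affinity scheduling in the same spirit: each job is dispatched to an idle node that already stores a replica of its input file, falling back to a remote read only when no replica holder is free. The replica map and the dispatch loop are invented for the example and do not reflect the Gfarm scheduler's implementation.

```c
/* Toy file-affinity dispatch loop (illustrative only, not the Gfarm
 * scheduler): prefer nodes that already hold the job's input file. */
#include <stdio.h>

#define NJOBS  3
#define NNODES 4

/* replica_map[j][n] != 0 means node n holds a replica of job j's input */
static const int replica_map[NJOBS][NNODES] = {
    {1, 0, 0, 1},   /* Job A: replicas on nodes 0 and 3 */
    {1, 0, 0, 0},   /* Job B: replica on node 0 only */
    {0, 0, 1, 1},   /* Job C: replicas on nodes 2 and 3 */
};

int main(void)
{
    int busy[NNODES] = {0};

    for (int j = 0; j < NJOBS; j++) {
        int target = -1;
        /* prefer an idle node that already holds the input file */
        for (int n = 0; n < NNODES && target < 0; n++)
            if (replica_map[j][n] && !busy[n])
                target = n;
        /* fall back to any idle node (remote read) otherwise */
        for (int n = 0; n < NNODES && target < 0; n++)
            if (!busy[n])
                target = n;
        busy[target] = 1;
        printf("Job %c -> node %d (%s read)\n", 'A' + j, target,
               replica_map[j][target] ? "local" : "remote");
    }
    return 0;
}
```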
Scalable I/O performance in a distributed environment
User's view vs. physical execution view in Gfarm (file-affinity scheduling):
User A submits Job A that accesses File A
Job A is executed on a node that has File A
User B submits Job B that accesses File B
Job B is executed on a node that has File B
[Figure: jobs run on the CPUs of a cluster or Grid over the network; the Gfarm file system spans those nodes as a shared network file system]
File system nodes = compute nodes
Do not separate storage and CPU (no SAN is necessary). Move and execute the program instead of moving large-scale data; exploiting local I/O is the key to scalable I/O performance.
Particle Physics Data Analysis
• Constructed a 26 TB Gfarm FS using 1112 nodes
• Stored all 24.6 TB of Belle experiment data
• 52.0 GB/s in parallel read → 3,024 times speedup
• 24.0 GB/s in the skimming process for b → s γ decays using 704 nodes → 3 weeks reduced to 30 minutes
O. Tatebe et al., "High Performance Data Analysis for Particle Physics using the Gfarm File System", SC06 HPC Storage Challenge, Winner – Large Systems, 2006
[Chart: I/O bandwidth (GB/sec) vs. number of client nodes, 0 to 1200; parallel read bandwidth reaches 52.0 GB/sec at 1112 nodes]
PRAGMA Grid
C. Zheng, O. Tatebe et al., "Lessons Learned Through Driving Science Applications in the PRAGMA Grid", Int. J. Web and Grid Services, Inderscience Enterprises Ltd., 2007
• Worldwide Grid testbed consisting of 14 countries and 29 institutes
• The Gfarm file system is used as the file-sharing infrastructure
→ executables and input/output data can be shared across the Grid
→ no explicit staging to a local cluster is needed
More Features of the Gfarm Grid File System
- Commodity-PC-based scalable architecture
  - Add commodity PCs to increase storage capacity during operation, to well beyond petabyte scale
  - Even PCs at distant locations over the internet
- Adaptive replica creation and consistent management
  - Create multiple file replicas to increase performance and reliability
  - Create file replicas at distant locations for disaster recovery (see the policy sketch below)
- Open Source Software
  - Linux binary packages, ports for *BSD, . . .
  - Included in NAREGI, the Knoppix HTC edition, and the Rocks distribution
Existing applications can access it without any modification
SC03 Bandwidth
SC05 StorCloud
SC06 HPC Storage Challenge Winner
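The adaptive replica creation listed above can be pictured with a small policy sketch; the thresholds, replica counts, and helper names below are assumptions for illustration, not Gfarm's actual policy.

```c
/* Hypothetical adaptive replication policy (not Gfarm's implementation):
 * hot files get extra local replicas for read bandwidth, and every file
 * keeps at least one replica at a distant site for disaster recovery. */
#include <stdio.h>

struct replica_state {
    int    local_replicas;    /* replicas within this site */
    int    remote_replicas;   /* replicas at distant sites */
    double reads_per_minute;  /* recently observed access rate */
};

static void adapt_replicas(struct replica_state *s)
{
    /* always keep one off-site copy for disaster recovery */
    if (s->remote_replicas < 1) {
        s->remote_replicas++;
        printf("create a replica at a distant site\n");
    }
    /* add local replicas when the file becomes a hot spot */
    if (s->reads_per_minute > 100.0 && s->local_replicas < 4) {
        s->local_replicas++;
        printf("create an extra local replica to spread read load\n");
    }
}

int main(void)
{
    struct replica_state hot_file = { 1, 0, 250.0 };
    adapt_replicas(&hot_file);
    return 0;
}
```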
Design and implementation of Gfarm v2
Problems in Gfarm v1
- Security issues
  - No access control of metadata
  - Not enough physical file protection
- Inconsistency between physical files and metadata
  - Renaming, removing, or changing metadata is not supported while a file is open
- Metadata access overhead
- No user and group management
- No privileged user
Design policy of Gfarm v2
- Inherit the architectural benefit of scalable I/O performance using a commodity platform
- Design as a POSIX-compliant file system
- Solve the previous problems
- Improve performance for small files
  - Reduce metadata access overhead
- Grid file system → distributed file system
  - Still benefit from improvements of the local file system
  - Compete with NFS, AFS, and Lustre
Software components (http://datafarm.apgrid.org/)
- gfarm2fs and libgfarm: mount the Gfarm file system and provide the Gfarm API
- gfmd (metadata server): namespace, replica catalog, host information, process information
- gfsd (I/O server): file access
[Figure: an application uses gfarm2fs and libgfarm to reach the metadata server (gfmd) and the gfsd I/O servers running on the compute and file system nodes]
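Because gfarm2fs exposes the Gfarm namespace as an ordinary mount point, an unmodified application uses plain POSIX I/O, as in the sketch below; the mount point and file path are illustrative and depend on where gfarm2fs was mounted.

```c
/* Reading a file stored in Gfarm through a gfarm2fs mount with ordinary
 * POSIX calls.  The path /gfarm/jp/file2 is an assumed example; use the
 * actual mount point chosen when running gfarm2fs. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    ssize_t n;

    /* replica selection and metadata lookups happen transparently below */
    int fd = open("/gfarm/jp/file2", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    while ((n = read(fd, buf, sizeof buf)) > 0)
        fwrite(buf, 1, (size_t)n, stdout);
    close(fd);
    return 0;
}
```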
Implementation of Gfarm v2 (1)
Metadata server – gfmd
- Follows directory-based access control faithfully
- Caches all metadata in memory
  - No disk wait when responding to a request
  - Backend database access in the background
- Manages file descriptors for each process, and the processes that open each file
  - Consistency management among different processes
- Monitors the status of all I/O servers
  - Heartbeat, CPU load, capacity, ...
- Introduces compound RPC procedures
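The compound RPC idea can be sketched as packing several metadata operations into a single request, so that resolving a path, opening the file, and fetching its attributes cost one round trip instead of three. The message layout and operation names below are invented for the example and are not the Gfarm wire protocol.

```c
/* Sketch of a compound RPC request (hypothetical layout, not Gfarm's
 * protocol): several metadata operations travel in one message. */
#include <stdio.h>

enum op { OP_LOOKUP, OP_OPEN, OP_GET_ATTR };

struct rpc_op {
    enum op     code;
    const char *arg;    /* path, open mode, ... (NULL if unused) */
};

struct compound_request {
    int           nops;
    struct rpc_op ops[8];
};

/* Client side: build one request for "resolve path, open it, read its
 * attributes" instead of issuing three separate RPCs. */
static void build_open_request(struct compound_request *req, const char *path)
{
    req->nops = 3;
    req->ops[0] = (struct rpc_op){ OP_LOOKUP,   path };
    req->ops[1] = (struct rpc_op){ OP_OPEN,     "r"  };
    req->ops[2] = (struct rpc_op){ OP_GET_ATTR, NULL };
}

int main(void)
{
    struct compound_request req;
    build_open_request(&req, "/gfarm/jp/file2");
    printf("one round trip carrying %d operations\n", req.nops);
    return 0;
}
```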
Implementation of Gfarm v2 (2)
I/O server – gfsd
- Accesses files based on a file descriptor, not a path name
  - Obtains the inode number from the metadata server
- Updates metadata at close time
- Does not keep unnecessary metadata information
  - No need to access gfsd for rename, chmod, . . .
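A minimal sketch of the close-time update, with hypothetical helper names rather than the real gfsd/gfmd interfaces: the I/O server works on a descriptor handed out by the metadata server and reports the new size and modification time only once, at close, which is why operations such as rename or chmod never need to touch gfsd.

```c
/* Illustrative close-time metadata update on the I/O server side
 * (hypothetical helpers, not the gfsd source). */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

struct open_file {
    uint64_t ino;    /* inode number obtained from the metadata server */
    uint64_t size;   /* bytes written so far */
    int      dirty;  /* was the file modified? */
};

/* stand-in for the RPC that reports new metadata back to gfmd */
static void gfmd_update_metadata(uint64_t ino, uint64_t size, time_t mtime)
{
    printf("gfmd: inode %llu -> size %llu, mtime %ld\n",
           (unsigned long long)ino, (unsigned long long)size, (long)mtime);
}

static void io_server_write(struct open_file *f, uint64_t nbytes)
{
    f->size += nbytes;
    f->dirty = 1;                     /* metadata is NOT updated here */
}

static void io_server_close(struct open_file *f)
{
    if (f->dirty)                     /* metadata updated once, at close */
        gfmd_update_metadata(f->ino, f->size, time(NULL));
}

int main(void)
{
    struct open_file f = { 42, 0, 0 };
    io_server_write(&f, 4096);
    io_server_write(&f, 4096);
    io_server_close(&f);
    return 0;
}
```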
Opening files in read-write mode (1)
Semantics (the same as AFS):
- [without advisory file locking] Updated content is available only when opening the file after a writing process has closed it
- [with advisory file locking] Among processes that lock a file, up-to-date content is available in the locked region; this is not ensured when a process writes the same file without file locking
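The toy model below mimics these close-to-open semantics in plain C; it is not Gfarm code, only an illustration of the rule: an open takes a snapshot of the last committed content, and a writer's changes become visible only to opens performed after its close.

```c
/* Toy model of close-to-open (AFS-like) semantics, for illustration only. */
#include <stdio.h>
#include <string.h>

struct gfile {
    char committed[64];    /* content visible to new opens */
};

struct handle {
    struct gfile *f;
    char snapshot[64];     /* what this opener sees when reading */
    char pending[64];      /* a writer's not-yet-published content */
    int  writing;
};

static struct handle gopen(struct gfile *f, int writing)
{
    struct handle h = { f, "", "", writing };
    strcpy(h.snapshot, f->committed);   /* reads see this snapshot */
    strcpy(h.pending, f->committed);
    return h;
}

static void gwrite(struct handle *h, const char *data)
{
    strcpy(h->pending, data);           /* invisible until close */
}

static void gclose(struct handle *h)
{
    if (h->writing)
        strcpy(h->f->committed, h->pending);   /* publish at close */
}

int main(void)
{
    struct gfile f = { "old" };
    struct handle writer = gopen(&f, 1);
    struct handle early_reader = gopen(&f, 0);

    gwrite(&writer, "new");
    gclose(&writer);

    struct handle late_reader = gopen(&f, 0);
    printf("reader opened before the close sees: %s\n", early_reader.snapshot);
    printf("reader opened after the close sees:  %s\n", late_reader.snapshot);
    return 0;
}
```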
Opening files in read-write mode (2)
[Figure: Process 1 opens /grid/jp/file2 in read-write mode on file system node FSN1 while Process 2 opens it read-only on FSN2; when Process 1 closes the file, the now-invalid file copy on FSN2 is deleted from the metadata, but Process 2's ongoing file access continues; before the close, any file copy can be accessed]
Consistent update of metadata (1)
Gfarm v1 – the Gfarm library updates metadata
[Figure: the application's Gfarm library opens a file on file system node FSN1 and updates the metadata server at close time]
- Metadata is not updated if the application crashes unexpectedly
Consistent update of metadata (2)
Gfarm v2 – the file system node updates metadata
[Figure: the application opens a file on file system node FSN1 through the Gfarm library; at close, or on a broken pipe, the file system node updates the metadata server]
- Metadata is updated by the file system node even at an unexpected application crash
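A sketch of why Gfarm v2 stays consistent after a client crash, using hypothetical helpers rather than the real gfsd source: the I/O server owns the metadata update and triggers it both on a normal close request and when the client connection breaks.

```c
/* Illustrative server-side handling of a normal close vs. a broken
 * connection (hypothetical helpers, not the gfsd implementation). */
#include <stdio.h>

enum client_event { EV_CLOSE_REQUEST, EV_BROKEN_PIPE };

/* stand-in for the RPC that commits size/mtime/replica info to gfmd */
static void gfmd_update_metadata(const char *reason)
{
    printf("metadata updated (%s)\n", reason);
}

static void handle_client_event(enum client_event ev)
{
    switch (ev) {
    case EV_CLOSE_REQUEST:        /* application closed the file */
        gfmd_update_metadata("normal close");
        break;
    case EV_BROKEN_PIPE:          /* application crashed mid-access */
        gfmd_update_metadata("connection lost, server-side close");
        break;
    }
}

int main(void)
{
    handle_client_event(EV_CLOSE_REQUEST);
    handle_client_event(EV_BROKEN_PIPE);
    return 0;
}
```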
Software Stack
[Figure: Gfarm applications, UNIX applications, and Windows applications (Word, PPT, Excel, etc.) reach the Gfarm Grid File System either through UNIX/Gfarm commands or through NFS, Windows (samba), FTP, SFTP/SCP, and Web clients served by nfsd, samba, ftpd, sshd, and httpd; all of these sit on top of GfarmFS-FUSE, which mounts the file system in user space, and libgfarm, the Gfarm I/O library]
Open Source Development
- Sourceforge.net: http://sourceforge.net/projects/gfarm
- Source code:
  svn co https://gfarm.svn.sourceforge.net/svnroot/gfarm/gfarm/branches/gfarm_v2
  svn co https://gfarm.svn.sourceforge.net/svnroot/gfarm/gfarm2fs/trunk gfarm2fs
Summary
- Gfarm file system
  - Scalable commodity-based architecture
  - File replicas for fault tolerance and hot-spot avoidance
  - Capacity increase/decrease during operation
- Gfarm v1
  - Used for several production systems
  - Scales to 1000+ clients and file system nodes
- Gfarm v2
  - Plugs the security holes of Gfarm v1 and improves metadata access performance
  - Performance comparable with NFS for small files in a LAN (0.2 ~ 0.5 milliseconds)
  - Scalable file I/O performance even in a distributed environment
  - 1433 MB/sec parallel read I/O performance from 22 clients in Japan and the US
- Open Source Development
  - http://sourceforge.net/projects/gfarm
- Collaboration with the JuxMem team
  - Gfarm provides persistence for JuxMem