On-Demand File Staging System for Linux Clusters

Atsushi Hori, Yoshikazu Kamoshida, Hiroya Matsuba
Information Technology Center, The University of Tokyo
2-11-16 Yayoi, Bunkyo-ku, Tokyo, 113-8658 JAPAN
hori(a)cc.u-tokyo.ac.jp, kamo(a)cc.u-tokyo.ac.jp, matsuba(a)cc.u-tokyo.ac.jp

Kazuki Ohta, Yutaka Ishikawa (also with the Information Technology Center)
Graduate School of Information Science and Technology, The University of Tokyo
7-3-1 Hongou, Bunkyo-ku, Tokyo, 113-8654 JAPAN
kzk(a)il.is.s.u-tokyo.ac.jp, ishikawa(a)is.s.u-tokyo.ac.jp

Takashi Yasui
Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji, Tokyo 185-8601, JAPAN
takashi.yasui.xn(a)hitachi.com

Shinji Sumimoto
Fujitsu Laboratories, Ltd., 4-1-1 Kamikodanaka, Nakahara-ku, Kawasaki, Kanagawa 211-8588, JAPAN
s-sumi(a)labs.fujitsu.com
Abstract—An on-demand file staging system, Catwalk, is proposed. Catwalk is designed to run on any Linux cluster without any special or additional hardware. By hooking the file-operation system calls, the file staging system becomes transparent to users, who are thus freed from writing incorrect file staging scripts. In Catwalk, file copying is done over normal TCP, so Catwalk can run over ordinary, widely-used Ethernet. The stage-in file copy is pipelined to maximize the bandwidth obtained from a single file server. The performance of Catwalk is evaluated and compared with NFS using synthetic but realistic workloads. The evaluations show that the stage-in performance with the pipeline technique is much better than that of NFS. The stage-out performance is comparable with that of NFS despite the extra copying of files, and the file server is lightly loaded during Catwalk stage-out while NFS entails much heavier server loads. The biggest problems of NFS are its centralized design and its lack of scheduling for parallel workloads. The performance of Catwalk shows that remote file access performance can be improved considerably if file accesses are scheduled in a proper way. Thus the proposed file staging system can be a strong complement to NFS, especially for small clusters that often have no dedicated parallel file system.

Index Terms—file staging, cluster file system, network file system, parallel file system
I. INTRODUCTION

The technique of file staging has been used for decades. The idea of file staging is to copy input files to directly accessible storage before computation begins and to copy the output files from that storage when a job ends. This explicit file copying can eliminate the need for distributed or parallel file systems in clusters. In many cases, file staging is a built-in function of a job scheduler. The batch schedulers widely used on clusters, such as SGE[1], Torque[2] and others, support file staging. One drawback of file staging is the need for scripting. Users must declare the input files for stage-in and the output files for stage-out in the job script to be submitted. If a user has an
incorrect staging declaration in the script, the execution of the job fails, and the waiting time, which can be hours, is simply wasted. The best way to avoid this is to declare nothing. In clusters having a distributed or parallel file system, there is no need for scripting and no chance of incorrect scripts. If a file staging system can obtain the information on the required files at runtime, then there is no need for file staging scripts. Further, the file staging system can be transparent to users and as convenient as a network or parallel file system.

Fig. 1. Hierarchy of Cluster Users (a pyramid of four categories, Supercomputer, Technical Divisional, Technical Departmental and Technical Workgroup, ranging from a few expensive systems at the top to many cheap systems at the bottom)
According to a market research report on high performance computing[3] published by IDC (International Data Corporation), the users of HPC clusters fall into four groups (Figure 1). It is natural to think that the numbers of users in these categories form a pyramid, as shown in Figure 1. In 2006, IDC also conducted a questionnaire survey of cluster users with the following multiple-answer question: "Do you plan to use a cluster-wide file system with your clusters in the next 2 years, and if so which?" The results showed that 67.2% chose NFS (N=58)[3]. This was the most common choice, well ahead of the percentages for Lustre[4] and GPFS[5], both of which were 46.6%. Those parallel file systems require a certain budget for dedicated file server nodes, a network, and a number of disks. The ratio of revenues between storage and compute resources was almost constant between 2004 and 2006, roughly one to three[3]. Considering the cost of introducing a parallel file system and the pyramid shown in Figure 1, it is natural to think that low-end users would rather pay for the parallel computing environment than for parallel storage. Thus, many low-end users rely on NFS.

It has been more than 20 years since NFS was first developed, and it was designed for a distributed computing environment, not a parallel computing environment. The most problematic point of the design is its centralized server architecture. In a distributed environment, client requests arrive at the NFS server in a random fashion. In a parallel computing environment, the NFS server can often be flooded by a large number of simultaneous client requests. The recent trend towards multi-core makes things even worse. The cluster of the supercomputing center at the University of Tokyo, for example, consists of 952 nodes, and each node has 16 cores (4 sockets by 4 cores)[6]. Even a very small sub-cluster of 4 nodes can run an MPI program with 64 processes. Thus the gap between the bandwidth a single NFS server can provide and the aggregate bandwidth required by the compute nodes gets wider and wider as the number of cores per compute node increases.

File staging for clusters copies files from a file server to compute nodes (stage-in) and from compute nodes back to the file server (stage-out), according to job scripts or on-demand. Here, the local disks of the compute nodes are utilized, and the number of local disks is proportional to the number of compute nodes. Thus the aggregate bandwidth for accessing local disks scales with the number of compute nodes. The stage-in process copies input files, and the copied files are then read by user processes. The stage-out process copies output files after the files are created by user processes. One may therefore argue that this file copying overhead is large and that there is no opportunity to perform better than a network file system where no file copying takes place. This is not true. By combining an on-demand mechanism and the pipeline technique described in this paper, a file staging system can be used as a network file system and can exhibit better performance than that of the widely used NFS.

In this paper, an on-demand file staging system, Catwalk, is designed, implemented, and evaluated. The performance of Catwalk is analyzed with relatively small file accesses, rather than file accesses larger than the memory size of a compute node, and compared with the performance of NFS. Synthetic workloads are used for the evaluation, mimicking the access patterns of real parallel applications.

II. DESIGN AND IMPLEMENTATION OF CATWALK

A. Process Structure

Figure 2 shows the distributed process structure of Catwalk. The Catwalk server runs on a server node and a Catwalk client process runs on each compute node. The server process and the client processes are connected with TCP connections
to form a distributed ring. It is assumed that the files to be staged in are located on the server node, and that they are copied onto a local disk of the compute nodes before the user job starts. It is also assumed that the files to be staged out are located on the local disks of the compute nodes, and that they are copied onto the disk of the server node after the job finishes.

Fig. 2. Ring Distributed Process (the Catwalk server on the file server node and a Catwalk client on each compute node are connected into a ring of distributed processes over TCP/IP; every node has its own local disk)
The distributed ring structure is chosen so that the files on the server node can be broadcast to the compute nodes in a parallel manner. The data packets are relayed by the nodes in a bucket-relay scheme and pipelined to maximize the copying bandwidth from the single server node. Pipelined file copying is known to be the best algorithm for clusters[7].

B. On-Demand File Staging

Linux allows a shared library to be loaded and executed before any other shared libraries and the user program. Once the pathname of a shared library is set in the LD_PRELOAD environment variable, that shared library is loaded before any other shared library. If a function in the preloaded shared library has the same name as a function in glibc, then the preloaded function is called instead of the glibc function of the same name. In this way, there is no need to re-compile or re-link user programs to install the hook functions needed by Catwalk.

int open( const char *path, int flags )
{
    int ret = (*open_orig)( path, flags );
    if( flags & ( O_RDONLY | O_RDWR ) &&
        ret < 0 && errno == ENOENT &&
        ( ret = catwalk_stage_in( path ) ) == 0 ) {
        ret = (*open_orig)( path, flags );
    }
    return( ret );
}

Fig. 3. Code Skeleton of the Read Open Hook
Figure 3 shows the code skeleton of the read-open hook function in Catwalk. First, it calls the original glibc open() function. If the specified file cannot be found, Catwalk tries to stage the file in. If the stage-in succeeds, the file is reopened by the glibc function, and finally the hook function returns the opened file descriptor.
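For readers unfamiliar with this mechanism, the sketch below illustrates one common way such a preload library can obtain the original glibc function (called open_orig here, as in Figure 3). It is a minimal illustration of the general LD_PRELOAD technique, not the actual Catwalk source, and catwalk_stage_in() is only a stub.

/* hook.c -- minimal illustration of an LD_PRELOAD open() hook.
 * The names open_orig and catwalk_stage_in() follow Figure 3;
 * here catwalk_stage_in() is stubbed out. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <fcntl.h>

static int (*open_orig)( const char *, int, ... );

/* resolve the real glibc open() when the library is loaded */
__attribute__((constructor))
static void init_hook( void )
{
    open_orig = (int (*)(const char *, int, ...))dlsym( RTLD_NEXT, "open" );
}

static int catwalk_stage_in( const char *path )
{
    /* in Catwalk this would ask the local client process to stage
     * the file in; stubbed out here */
    (void)path;
    return( -1 );
}

int open( const char *path, int flags, ... )
{
    /* the optional mode argument of open() is omitted for brevity */
    int ret = (*open_orig)( path, flags );
    if( ( flags & O_ACCMODE ) != O_WRONLY &&    /* opened for reading */
        ret < 0 && errno == ENOENT &&
        catwalk_stage_in( path ) == 0 ) {
        ret = (*open_orig)( path, flags );
    }
    return( ret );
}

Such a library would be compiled as a shared object (for example, gcc -shared -fPIC hook.c -ldl -o libhook.so) and activated by setting LD_PRELOAD to its path before launching the job, so the user program needs no re-compilation or re-linking.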
In the current Catwalk implementation, there are hooks on open(), creat(), fopen(), stat() and the exec() family of system calls. Note that there is no hook function for the close() system call. A natural point to trigger Catwalk's stage-out would be after an output file is closed. However, if the output file happens to be a scratch file, the file may be deleted after the close() call, so triggering stage-out on the close event may result in needless stage-out. To avoid this situation, Catwalk starts copying the files to be staged out when the parallel job is finished.
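As a complementary, hypothetical illustration of the write side (again not the Catwalk source), a creat() hook only needs to record the path of the newly created file so that it can be staged out when the job ends; catwalk_register_stage_out() stands in for the message that would be sent to the local Catwalk client.

/* hypothetical creat() hook: queue the file for stage-out at job
 * termination instead of acting on the close() event */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

static int (*creat_orig)( const char *, mode_t );

static void catwalk_register_stage_out( const char *path )
{
    /* in Catwalk this would be a stage-out request sent to the
     * local client process; stubbed out here */
    (void)path;
}

int creat( const char *path, mode_t mode )
{
    if( creat_orig == NULL )
        creat_orig = (int (*)(const char *, mode_t))dlsym( RTLD_NEXT, "creat" );
    int ret = (*creat_orig)( path, mode );
    if( ret >= 0 )
        catwalk_register_stage_out( path );   /* staged out after the job */
    return( ret );
}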
C. Stage In Procedure

Figure 4 shows how the Catwalk stage-in takes place.

Fig. 4. Stage-IN Procedure ((1) open() by the user process, (2)(3) StageIn Req. relayed along the ring into the server's StageIn Queue, (4) StageIn Data relayed along the ring, (5) Write Data on each Catwalk client, (6) Notify to the user process)
1) When a user process calls the open() function, the Catwalk hook function is invoked, and if the open mode is READ, the open information is passed to the local Catwalk client process. The hook function waits for the reply.
2) The client process passes the stage-in request to its neighbor client process along the ring. The stage-in request is thus relayed by each client process until it reaches the Catwalk server process.
3) The server process puts the request into the stage-in queue. The requests in the queue are processed one by one.
4) The server process opens the requested file, and the data blocks are read and passed to the first client process of the ring. The data are relayed along the ring, and finally the target file is copied onto each compute node.
5) When EOF is encountered at the Catwalk server, the EOF information is also passed to each client process along the ring. Each client process closes the copied file and then lets the user process know the resulting status. If the stage-in succeeds, the Catwalk hook function re-opens the copied file and returns to the user program with the file descriptor of the staged-in file (see also Figure 3).

An extended double-buffer technique is used for the disk write operation in the above procedure. There is a pthread dedicated to the write operation and there are eight 32-KByte buffers, so that OS jitter can be absorbed.
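The following is a minimal sketch of such an extended double-buffering scheme (a ring of eight 32-KByte buffers drained by a dedicated writer pthread); the structure is ours and deliberately simplified, not the Catwalk implementation.

/* the receive path fills a ring of eight 32-KByte buffers and a
 * dedicated pthread drains them to disk, so that a slow write
 * (OS jitter) does not immediately stall the pipelined copy */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define NBUF   8
#define BUFSZ  ( 32 * 1024 )

static char   buf[NBUF][BUFSZ];
static size_t len[NBUF];
static int    head, tail, count, done;
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

/* called by the receive path for each chunk relayed along the ring */
void enqueue_chunk( const char *data, size_t n )
{
    pthread_mutex_lock( &lock );
    while( count == NBUF )                     /* all buffers busy */
        pthread_cond_wait( &not_full, &lock );
    memcpy( buf[head], data, n );
    len[head] = n;
    head = ( head + 1 ) % NBUF;
    count++;
    pthread_cond_signal( &not_empty );
    pthread_mutex_unlock( &lock );
}

/* called once after the last chunk of the file has been enqueued */
void finish_enqueue( void )
{
    pthread_mutex_lock( &lock );
    done = 1;
    pthread_cond_signal( &not_empty );
    pthread_mutex_unlock( &lock );
}

/* dedicated writer thread */
void *writer_thread( void *arg )
{
    FILE *fp = arg;
    for( ;; ) {
        pthread_mutex_lock( &lock );
        while( count == 0 && !done )
            pthread_cond_wait( &not_empty, &lock );
        if( count == 0 && done ) {
            pthread_mutex_unlock( &lock );
            break;
        }
        int i = tail;
        pthread_mutex_unlock( &lock );

        fwrite( buf[i], 1, len[i], fp );       /* disk write, no lock held */

        pthread_mutex_lock( &lock );
        tail = ( tail + 1 ) % NBUF;            /* release the slot */
        count--;
        pthread_cond_signal( &not_full );
        pthread_mutex_unlock( &lock );
    }
    return( NULL );
}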
D. Stage Out Procedure

Figure 5 shows how the Catwalk stage-out takes place.

Fig. 5. Stage-OUT Procedure ((1) create() by the user process, (2) StageOut Req. into the local StageOut Queue, (3) SIGCHLD on job termination, (4)(8) StageOut Token passed along the ring, (5) Read Data on the client, (6) StageOut Data sent to the server over a separate TCP connection, (7) Write Data on the server)

1) When a user process calls the creat() function, the Catwalk hook function is invoked, and if the open mode is WRITE, the open information is passed to the local Catwalk client process. Unlike the stage-in procedure above, the hook function simply returns to the user program with the return value of the original open() function in glibc.
2) The client process puts the stage-out request into the stage-out queue.
3) When the user processes on a node terminate, this termination event is reported to the Catwalk server process along the ring.
4) When the server process has received the termination events from all compute nodes, it sends a token to the next process in the ring.
5) The client process receiving the token starts processing the stage-out requests in its stage-out queue. First it creates a TCP connection to the server process, and the stage-out file information and data are sent over this connection.
6) When the server process receives the stage-out information, it creates a file according to the information and writes the received data.
7) When the queue of the client process becomes empty, the client process passes the token to its next neighbor. If the neighbor is another client process, it repeats the same procedure starting from step 5. If the neighbor is the server process, then the server terminates all client processes and then terminates itself.
The reason for creating a separate TCP connection from the client process to the server process is to save energy. The stage-out data could be passed along the ring structure, but in doing so the client processes on the ring path to the server process would do nothing but relay packets. By using a separate TCP connection, those relaying processes are not needed and the relaying energy can be saved.

III. STAGING MODEL

In the stage-in procedure of Catwalk, stage-in files are copied in a pipelined manner and distributed to each compute node. If this file distribution were done with the Linux rcp or rsync command, a file located on the server node would have to be copied N times, where N is the number of compute nodes. When a process running on a compute node reads its fraction of the file, i.e., 1/N of the file, the time T needed to stage in and read the file is expressed by expression (1),

T = S × N / min(B_R, B_W, B_N) + S / (N × B_R)    (1)

where S is the size of the file to be staged in, B_R is the disk read bandwidth, B_W is the disk write bandwidth, and B_N is the network bandwidth. The first term is the time needed to broadcast the file when the read operation, communication and write operation can be overlapped. The second term is the time to read the fraction of the copied file on each compute node.

By pipelining the file copying in the way Catwalk does, the time needed to broadcast a file is almost equal to the time of one remote copy, if the file is big enough to hide the pipeline filling latency and the network hardware allows full-duplex communication. Thus the stage-in time T_I of Catwalk is expressed by expression (2).

T_I = S / min(B_R, B_W, B_N) + S / (N × B_R)    (2)

Comparing expression (1) with expression (2), the second terms in both expressions are the same and can be ignored if N is large enough, and the Catwalk stage-in is N times faster than copying files one node at a time.

T_R = S / min(B_R, B_N)    (3)

Expression (3) shows the NFS read time, T_R, under the same conditions as the above stage-in case. Since no disk write operation is involved, the B_W variable does not appear in this expression. Unlike the stage-in expression above, there is no second term representing the local read time. Again, the second term in (2) can be neglected when N is large, and the first term dominates the time in expressions (2) and (3). Note that the bandwidth parameters of the disk, B_R and B_W, are not constant and vary widely depending on the file access pattern and the utilization of the page cache holding the file contents in the memory space of the Linux kernel. Nowadays, the disk bandwidth is smaller than the network bandwidth. Thus, expressions (2) and (3) indicate that a file access pattern maximizing the disk I/O bandwidth and utilizing the page cache of the Linux kernel is very important.

T_O = S_i / min(B_R, B_W, B_N) + S / (N × B_W)    (4)
T_W = S_i / min(B_W, B_N)    (5)

Expressions (4) and (5) show the stage-out time T_O and the NFS write time T_W, respectively. Each process writes S_i bytes. Here again, the second term of (4) can be ignored, and the file access pattern is important to minimize the time of NFS write or stage-out.

IV. EVALUATION

Catwalk was evaluated on the T2K (University of Tokyo) cluster[6] and compared with NFS. The specification of the nodes is listed in Table I, together with the disk I/O performance measured by Bonnie++[8], a disk benchmark program. One node is assigned as the file server and the other nodes, up to 16, are used as compute nodes where the evaluation programs run. The file server and the compute nodes are equivalent. They are connected with two NICs of Myrinet 10G and two NICs of 1 Gbps Ethernet. The evaluation programs used in the following sections are written with MPI, but call only the smallest set of MPI functions, such as MPI_Init() and MPI_Barrier(). One of the 1 Gbps Ethernet NICs is used for the file transfer in both Catwalk and NFS. In this section, Catwalk and NFS are evaluated with the same evaluation programs.

TABLE I
SPEC. OF T2K (TOKYO) NODE
  CPU:         AMD Barcelona, 2.3 GHz, 4 Cores
  # Sockets:   4
  Memory:      32 GB
  Local Disk:  SATA
  Ethernet:    Intel E1000 (1 Gbps) ×2, Myrinet 10G ×2
  OS:          RHELS 5.1
  File System: EXT3
  NFS:         Version 3, async export option, rsize=32768,wsize=32768 mount options
  Bonnie++:    Seq. Block Input: 49.52 MB/sec, Seq. Block Output: 39.76 MB/sec
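As a rough, illustrative plug-in of the numbers in Table I (this worked example is ours, not the authors'): taking B_R ≈ 49.5 MB/s and B_W ≈ 39.8 MB/s from the Bonnie++ rows, B_N ≈ 125 MB/s for 1 Gbps Ethernet, S = 1000 MB and N = 16, expressions (1) through (3) give approximately

\begin{align*}
T   &\approx \frac{16 \times 1000}{39.8} + \frac{1000}{16 \times 49.5} \approx 403\ \mathrm{s} && \text{(node-by-node copy, expression (1))}\\
T_I &\approx \frac{1000}{39.8} + \frac{1000}{16 \times 49.5} \approx 26\ \mathrm{s}            && \text{(pipelined stage-in, expression (2))}\\
T_R &\approx \frac{1000}{49.5} \approx 20\ \mathrm{s}                                           && \text{(NFS read, expression (3))}
\end{align*}

In this idealized model the pipelined stage-in is roughly N times faster than copying one node at a time and within about 30% of the NFS read time; the measurements below show that page-cache behaviour and the lack of scheduling of parallel accesses change this picture considerably in practice.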
A. Stage In

There are two access patterns used to evaluate Catwalk and to compare Catwalk with NFS (Figure 6). One is block access, where a large file is divided into N large blocks and each process reads its assigned block. The other is stride access, where each process reads a 1 KByte record, then advances the seek pointer by N KBytes, and repeats this until EOF is encountered. The bandwidth is calculated as the file size divided by the time from the beginning of the open() call to the end of the read() loop. Since the Catwalk stage-in takes place while in the open() function, with the actual file reading following it, the time of open() must be included in the bandwidth calculation. The memory region used to hold the read data is allocated with malloc() when the file is opened, so that it can hold the entire read data, to mimic the memory pressure of real applications.

Fig. 6. Access Pattern (block access: the file is divided into N blocks, Block#0, Block#1, ... assigned to Node#0 Proc#0, Node#0 Proc#1, and so on; stride access: 1-KByte records Rec#0, Rec#1, ... are assigned to the processes in a round-robin fashion)
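The following is an illustrative sketch (ours, not the authors' benchmark) of the two read loops; the command-line arguments, the REC_SIZE constant and the simplified buffer handling are assumptions made for the example.

/* illustrative MPI read benchmark: block access vs. stride access */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define REC_SIZE 1024                    /* 1 KByte record for stride access */

static void block_access( FILE *fp, long size, int rank, int nprocs )
{
    long blk = size / nprocs;            /* one large block per process */
    char *buf = malloc( blk );
    fseek( fp, (long)rank * blk, SEEK_SET );
    fread( buf, 1, blk, fp );
    free( buf );
}

static void stride_access( FILE *fp, long size, int rank, int nprocs )
{
    char rec[REC_SIZE];
    /* read record #rank, then skip nprocs records, until EOF */
    for( long off = (long)rank * REC_SIZE; off < size;
         off += (long)nprocs * REC_SIZE ) {
        fseek( fp, off, SEEK_SET );
        if( fread( rec, 1, REC_SIZE, fp ) == 0 ) break;
    }
}

int main( int argc, char **argv )
{
    int rank, nprocs;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &nprocs );

    MPI_Barrier( MPI_COMM_WORLD );                /* start together */
    double t0 = MPI_Wtime();
    FILE *fp = fopen( argv[1], "r" );             /* open() time is included */
    long size = atol( argv[2] );
    if( argc > 3 && strcmp( argv[3], "stride" ) == 0 )
        stride_access( fp, size, rank, nprocs );
    else
        block_access( fp, size, rank, nprocs );
    fclose( fp );
    MPI_Barrier( MPI_COMM_WORLD );
    if( rank == 0 )
        printf( "aggregate bandwidth: %.1f MB/s\n",
                size / ( MPI_Wtime() - t0 ) / 1e6 );
    MPI_Finalize();
    return( 0 );
}

Note that the real evaluation program allocates a buffer for the whole read data at open time to mimic memory pressure; that detail is omitted here.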
1) Cached File Access: The first evaluation measures the protocol overhead. In the Catwalk evaluation, tmpfs is used for the volumes where Catwalk does file I/O, so that no actual disk operation takes place. In Figure 7, the notation "8x16B", for example, means that the evaluation is done with 8 nodes, 16 processes on each node, and the block access pattern. The suffix characters "B" and "S" refer to block access and stride access, respectively. The bandwidths are constant at around 90 MB/sec, which is slightly lower than the theoretical maximum network bandwidth of 125 MB/sec. There might be room for improvement in the Catwalk protocol handling. The graph shows that the bandwidths are independent of the number of nodes, the number of processes per node, and the access pattern.

Fig. 7. Cached Stage-IN Performance (tmpfs)

Figure 8 shows the NFS performance with block access. This time, the evaluation program that opens and reads a file is invoked immediately after the file is created, so that the newly created file remains in the Linux page cache in memory. The bandwidth is around 110 MB/sec, almost independent of the file size and the process allocation. These bandwidths are better than the Catwalk bandwidths above. This is because most of the file transfer processing is done at the kernel level with NFS, while Catwalk is implemented as a user application, which introduces a larger number of process switches and memory copies across process boundaries.

Fig. 8. Cached NFS Read Performance (Block Access)

Figure 9 shows the NFS performance with stride access. This time it shows totally different behaviour from those of NFS block access and Catwalk. The bandwidths are relatively constant with respect to the file size. However, they depend heavily on the process allocation. For example, the bandwidths of the 2x1 cases, 2 nodes with one process on each node, are almost half of the theoretical bandwidth of the network, and the bandwidths of the 4x1 cases are almost a quarter of it. This is because of the NFS block size, which is 32 KBytes (Table I). In the 2x1 cases, each node requires 1 KByte of data on each read iteration, but the NFS server sends 32 KByte blocks. When an NFS block is read on a node, the user process consumes only half of the NFS block over 16 iterations, and the remaining 16 KBytes are simply discarded. In the 4x1 cases, three quarters of each NFS block is discarded, resulting in a quarter of the network bandwidth.

Fig. 9. Cached NFS Read Performance (Stride Access)
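To make the block-waste argument concrete (our arithmetic, assuming the theoretical network bandwidth B_N ≈ 125 MB/s is the bottleneck):

\[
B_{\mathrm{eff}} \approx \frac{\text{bytes consumed per 32-KByte block}}{32\ \mathrm{KB}} \times B_N,
\qquad
\text{2x1: } \frac{16}{32} \times 125 \approx 62\ \mathrm{MB/s},
\qquad
\text{4x1: } \frac{8}{32} \times 125 \approx 31\ \mathrm{MB/s},
\]

which matches the roughly one-half and one-quarter of the network bandwidth observed in Figure 9.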
The worst bandwidths, almost 22 MB/sec, are obtained by
the 8x1 cases, and the above block effect cannot explain this phenomenon. One possible explanation is the NFS read-ahead, which is not controllable by any option of the Linux mount or export commands. How and when the NFS read-ahead takes place is unclear, but it is easy to suppose that it does not work well on parallel file access, because the access patterns of parallel applications are very different from those of sequential programs. Going into more detail on NFS is out of the scope of this paper; however, it is clearly shown that the NFS file read bandwidth can vary depending on the file access pattern and the process allocation even if the file being read is cached in memory, whilst the stage-in bandwidth of Catwalk is independent of the file access pattern and the process allocation when a file is cached.

2) No-cached File Access: The previous evaluation shows the behaviour with no actual disk access. Here, the evaluation results of Catwalk and NFS accompanied by actual disk accesses are shown. To flush the page cache of the Linux kernel, the file system holding the target files is unmounted and then mounted again before the evaluation program starts.

Catwalk: Figures 10 and 11 show the bandwidths of block access and stride access for Catwalk, respectively. Both graphs are very similar. This means the stage-in bandwidth of Catwalk is independent of the file access pattern and the node allocation pattern. The bandwidth is 30-35 MB/sec when the file size is 8 GBytes or smaller, but decreases steeply when the file size is 16 GBytes.

Fig. 10. Stage-IN Performance (Block Access)

Fig. 11. Stage-IN Performance (Stride Access)

Evaluations on larger files, not shown in these graphs, showed the stage-in bandwidth decreasing much further, and eventually it is too slow to measure the bandwidth when the file size is 32 GBytes or larger. This is why the cases with larger files are not shown in those graphs. This slowness is caused by sporadic triggering of the pdflush daemon of Linux to flush the page cache on the compute nodes. It sometimes takes tens of seconds, enough to stall the pipelined copy of Catwalk. Once this happens, the situation gets even worse. The pdflush daemon is triggered by a timer, every 5 seconds by default. When a pipeline stall happens because the pdflush daemon is running on one node, the possibility of triggering the pdflush daemon on the other nodes gets higher. Thus the larger the file size and the number of nodes, the higher the possibility of a pipeline stall.

Fig. 12. Stage-IN Performance (Block Access, O_DIRECT)
Our experience shows that this phenomenon can be avoided by opening the files on the compute nodes with the O_DIRECT flag. The O_DIRECT flag instructs the Linux kernel to bypass the page cache, which results in less frequent scheduling of the pdflush daemon and shortens the time used for cache flushing. However, the O_DIRECT flag also has a negative impact on file access bandwidth. Figure 12 shows the stage-in, block access bandwidth with the O_DIRECT flag. The bandwidths are around 16 MB/sec, roughly half of the bandwidths without the O_DIRECT flag; however, they are relatively constant and independent of the file size and the node allocation pattern.
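For reference, O_DIRECT on Linux imposes alignment requirements on both the buffer address and the transfer size; the sketch below shows one way a client-side writer could honor them (an illustration under assumed constants, not the Catwalk source).

/* minimal O_DIRECT write sketch: buffer address and transfer size
 * must be aligned, typically to the logical block size */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGN  4096
#define CHUNK  ( 32 * 1024 )              /* a multiple of ALIGN */

int write_direct( const char *path, const char *data, size_t size )
{
    int fd = open( path, O_WRONLY | O_CREAT | O_DIRECT, 0644 );
    if( fd < 0 ) return( -1 );

    void *buf;
    if( posix_memalign( &buf, ALIGN, CHUNK ) != 0 ) {
        close( fd );
        return( -1 );
    }
    for( size_t off = 0; off < size; off += CHUNK ) {
        size_t n = ( size - off < CHUNK ) ? size - off : CHUNK;
        memcpy( buf, data + off, n );
        if( n < CHUNK )                   /* pad the last, partial chunk */
            memset( (char *)buf + n, 0, CHUNK - n );
        size_t aligned = ( n + ALIGN - 1 ) & ~(size_t)( ALIGN - 1 );
        if( write( fd, buf, aligned ) < 0 ) break;
    }
    ftruncate( fd, (off_t)size );         /* trim the padding again */
    free( buf );
    close( fd );
    return( 0 );
}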
Fig. 13. Stage-IN Performance (Stride Access, O_DIRECT)
Figure 13 shows the stage-in, stride access bandwidth with the O_DIRECT flag. In Figure 13, the bandwidths are independent of the file size; however, the lines are roughly categorized into two groups. In the upper group the bandwidths are around 12 MB/sec, while in the lower group they are around 6 MB/sec. The node allocation pattern of the lower group is "x16", i.e., 16 processes per node. The bandwidth of the lower group is almost half of the bandwidth of the upper group. One possible explanation for this is the read-ahead effect of the Linux kernel[9]. In the x4 cases, one disk block (the local disk block is 8 KBytes, while the NFS block size is 32 KBytes) is enough to cover the simultaneous accesses in a node, but in the x16 cases two disk blocks must be read to feed the 16 read requests. This contiguous block access may mislead the kernel into thinking the application is reading a file sequentially, and read-ahead is triggered. However, in the 16x16 case, for example, there is a 1 MByte gap between the current position and the next read position. Thus the file blocks brought in by the read-ahead are wasted, and the bandwidth is halved.

The most important point here is that the Catwalk stage-in bandwidths with the O_DIRECT flag are independent of the file size. Thus the Catwalk stage-in problem on larger files without the O_DIRECT flag can be avoided. The best way is to switch the O_DIRECT flag on or off according to the size of the file to be staged in.

NFS: Figures 14 and 15 show the bandwidths of NFS block access and stride access, respectively. The bandwidths in both cases are far below those of Catwalk stage-in, especially when the number of processes is large. Compared with the cached-access evaluation, disk access adds another factor that affects performance. In the block access cases, a file is accessed N ways in parallel, at a different position in each process. Thus, the larger the file and the larger the number of processes involved in the parallel file access, the more frequent the seek operations and the lower the bandwidth.
Fig. 14. NFS Read Performance (Block Access)
The bandwidth numbers of NFS stride access are less than 1 MB/sec in the cases of the node allocation patterns 8x16 and 16x16. There is a tendency that the larger the number of processes, the smaller the bandwidth. This situation is similar to the lower group of the Catwalk stage-in with the O_DIRECT flag (Figure 13), due to the read-ahead effect. However, the Catwalk bandwidths of the x16 cases with the O_DIRECT flag are much better than those of NFS.

What happens if the record size is larger than 1 KByte? For example, assume that the record length is 4 KBytes, the same as the NFS block size. There would be no block effect and no data in a block would be discarded, so better performance might be obtained. However, a larger number of blocks would be required by a compute node at a time. For example, in the x16 cases, 16 processes are running and require 16 blocks at a time. If the above discussion of the 8x16 and 16x16 cases holds, then the performance of the x16 cases could become worse. Thus a larger record size has both a good side and a bad side for performance. Since this paper focuses on the Catwalk performance, going into more detail on NFS performance is out of its scope.

Catwalk copies the stage-in files in a purely sequential way, so the disk seek operations are minimized. Further, the copied file can be expected to be in the page cache of the Linux kernel, and the read in the application program is done by memory copying without disk access when the O_DIRECT flag is not set. Indeed, most of the file access time in the cases of Figures 10 and 11 is the stage-in copying, which is hidden in the open() function call. With the O_DIRECT flag, the cache effect cannot be counted on and Catwalk exhibits lower bandwidths than without the flag; however, the stage-in performance of Catwalk is still better than that of NFS in the cases with a larger number of processes.
Fig. 15. NFS Read Performance (Stride Access)

B. Stage Out

The evaluation program creates a file with a different file name in each process. The stage-out bandwidth is calculated as the sum of the created file sizes divided by the sum of the program execution time and the execution time of the Linux sync command run on the server node after the evaluation program terminates. The reason for adding the sync time is fairness: since the NFS volume is exported with the async option, the NFS write time may be merely the time of writing data into the page cache of the Linux kernel and may not reflect the actual file I/O time. Unlike the Catwalk stage-in, the actual file transfer takes place after the application program terminates. In the stage-out performance evaluation, there is only one file access pattern, a sequential write in each process.

Fig. 16. Stage-OUT Performance

Fig. 17. NFS Write Performance

Bandwidths: Figures 16 and 17 show the performance of Catwalk stage-out and NFS write, respectively. The NFS write performance is better when the number of processes is small, and decreases as the number of processes gets larger. This is because the degree of parallelism corresponds to the frequency of seek operations: the larger the parallelism, the more frequent the disk head movements. Thus the NFS write performance decreases as the number of processes increases. In the Catwalk stage-out, as in the Catwalk stage-in, the file access is purely sequential and the seek operations are minimized on both the compute nodes and the server node. The local files on the compute nodes are expected to be in the page cache of the Linux kernel when the evaluation program terminates, and the time of the Catwalk stage-out is dominated by the time required for the file writes on the server node.
Server Load: The bandwidth comparison between Catwalk and NFS is not clear-cut; in some cases NFS is better, and in other cases Catwalk is better. However, while the evaluation program is running, the NFS server is heavily loaded and is often hard to use, whereas Catwalk does not load the file server nearly as much. Figure 18 shows a sampled series of the Linux load average (1 minute) with a 10 second sampling interval (thick line, left Y-axis). At the same time, the execution time of the sampling command, "cat /proc/loadavg", is also measured (thin line, right Y-axis). In this test, each process writes a 1 GByte file, running on 4 nodes with 16 processes on each node (4x16). As soon as the file writes begin, the load average reaches almost 40, and at the very end of the program execution it goes up to nearly 100. During the execution of the parallel write, a number of nfsd (NFS daemon) processes are waiting for disk operations (the "D" status in the ps command), and the value of the load average reflects the number of I/O-waiting processes and kernel threads. The execution time of the very simple cat /proc/loadavg command often takes several seconds. Each sample has a time stamp, and the sampled data show that there is an insensitive zone of almost 8 minutes (Figure 18). Thus the kernel seems to be very busy scheduling a large number of daemon processes, and the node is hard for other users to use.
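The load sampling described above can be reproduced with a trivial sampler; the sketch below (ours, not the authors' measurement script) reads /proc/loadavg every 10 seconds and records how long each read takes.

/* sample /proc/loadavg every 10 seconds and measure how long each
 * read takes; on a healthy node the read returns almost instantly */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double now( void )
{
    struct timespec ts;
    clock_gettime( CLOCK_MONOTONIC, &ts );
    return( ts.tv_sec + ts.tv_nsec / 1e9 );
}

int main( void )
{
    for( ;; ) {
        double t0 = now();
        char line[128] = "";
        FILE *fp = fopen( "/proc/loadavg", "r" );
        if( fp != NULL ) {
            if( fgets( line, sizeof( line ), fp ) == NULL )
                line[0] = '\0';
            fclose( fp );
        }
        /* elapsed sampling time [sec] and the load average line */
        printf( "%.3f %s", now() - t0, line );
        fflush( stdout );
        sleep( 10 );
    }
    return( 0 );
}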
Fig. 18. NFS Server Load (1-minute load average, left Y-axis, and the execution time of "time cat /proc/loadavg", right Y-axis, sampled over one hour; an 8-minute insensitive zone appears during the parallel write)
Figure 19 shows the same sampling for the same program but with Catwalk stage-out. The first few minutes are spent in the local write operations on each node, and the server is just idling. As soon as the Catwalk stage-out begins, the load average goes up, but it always stays below 3, and there is no insensitive zone like the one found in the NFS case above. The response time of the cat /proc/loadavg command is always less than 1 second. Thus Catwalk loads the file server much less than NFS.
Fig. 19. Catwalk Server Load (1-minute load average, left Y-axis, and the execution time of "time cat /proc/loadavg", right Y-axis; a local write phase is followed by the stage-out phase, with the load average staying below 3)
V. RELATED WORK

Various methods for file distribution on a cluster have been discussed in [7], which concluded that the pipeline method is the best. Dolly+[10] also uses pipelined file distribution. The nettee[11] utility runs an arbitrary Unix command along a pipeline. However, none of them has pointed out the sporadic sync problem discussed in Section IV-A2. The system call hooking technique used by Catwalk is also described for the PVFS parallel file system[12]. There are many parallel file systems for clusters, such as Lustre[4], GPFS[5], pNFS[13], Gfarm[14] and so on. However, those systems assume the existence of multiple file servers or multiple copies of the target files on different nodes, so that higher I/O bandwidth can be achieved, whilst Catwalk assumes a single file server. It is
not fair to compare Catwalk with those parallel file systems, which assume a rich hardware environment.

VI. CONCLUSIONS AND FUTURE WORK

We proposed an on-demand file staging system, Catwalk. There are two unique features in Catwalk: one is its on-demand operation, which frees users from writing staging scripts, and the other is pipelined file copying for stage-in. The performance of Catwalk was evaluated and compared with NFS. In the stage-in evaluation, one file is read by all MPI processes with block access and stride access. Catwalk exhibits bandwidths of around 30 MB/sec with both block and stride access patterns when the file size is not too large. In contrast, NFS exhibits much smaller bandwidths together with a lack of scalability. Catwalk performs well when the file fits in the page cache of the Linux kernel; however, its performance degrades rapidly when the file gets larger. It is also shown that this problem can be avoided by opening the output file on each compute node with the O_DIRECT flag. This flag degrades the stage-in performance of Catwalk, but Catwalk still exhibits better performance than NFS. Implementation and evaluation of an algorithm to decide when to switch the O_DIRECT flag is already on our to-do list. In the stage-out evaluation, Catwalk exhibits almost the same performance as NFS; however, Catwalk loads the file server much less than NFS.

Through the evaluations of Catwalk and the comparisons with NFS, we found that the most important things needed to maximize disk I/O performance are sequential access and utilization of the file cache. These techniques are well known for sequential programs, but the disk I/O of parallel programs is inherently non-sequential. What Catwalk does is serialize the file I/O of parallel programs. Therefore there is no mystery in the Catwalk performance shown in this paper.

The on-demand feature of Catwalk hides not only the staging description but also the staging mechanism. Thus users can use Catwalk as if they were using a distributed or parallel file system, with which remote files can be seen as if they were located on the local disk. Conversely, a file staged in by Catwalk can be thought of as a cache of the file on the server node. If a user program opens a file in the middle of program execution, the file can be prefetched by calling the open() function at the very beginning of the program execution to reduce the latency of the file copying. This explicit open of the stage-in file looks similar to the explicit description of file staging in a job script. However, even if the wrong file is prefetched, the user program can still run, because the proper file is automatically staged in by Catwalk, although this is not what the user intended.

One of the drawbacks of staging is that files larger than the local disk cannot be staged in. Catwalk cannot handle this situation even if a huge input file is divided into a number of small files. In the current Catwalk implementation, all the stage-in files are copied onto every compute node whether or not a node requires the file. Thus the total size of the stage-in files must be smaller than the free space of a local disk. This problem could be avoided by copying a file onto the local disk if and only if the file is demanded. Another option is to delete files that have been staged in but not yet opened, or already read, when the file system gets almost full. The problem of the former copy-on-demand policy is that there can be a skew among the opens on individual compute nodes, and it is difficult to avoid copying the same file multiple times. The latter deletion policy incurs some additional file deletion cost. Which way to go is left for future work. In any case, attacking the problem of how to handle huge files is one of the most important issues for Catwalk.

It has also been reported that there is a severe performance problem when a parallel file system tries to handle a large number of small files ([15], for example). This is because of the bottleneck at the meta-data server. Thus there can be situations where Catwalk may win over a parallel file system. The Catwalk performance when handling a large number of small files is also left for future work.

It is our belief that the number of potential Catwalk users who are using relatively small clusters, having no parallel file systems and suffering from the low performance of NFS, is large enough to be hard to ignore. It is also our belief that the techniques described in this paper can be applied to large clusters with parallel file systems that have trouble accessing a large number of small files. Catwalk is designed to be independent of the underlying file system. Thus it can work not only with any Linux file system but also with any distributed or parallel file system. It is our goal that Catwalk be a complement to a distributed or parallel file system, providing cluster users a better file I/O environment.

Catwalk is open source and freely available. It is built into the newest SCore cluster software package and can be downloaded from the PC Cluster Consortium's web page[16].
ACKNOWLEDGMENT

This research is partially supported by the "eScience project" of MEXT (Ministry of Education, Culture, Sports, Science, and Technology), Japan.

REFERENCES

[1] "SGE." [Online]. Available: http://gridengine.sunsource.net/howto/filestaging/
[2] "Torque." [Online]. Available: http://www.clusterresources.com/torquedocs21/6.3filestaging.shtml
[3] "HPC Market Trends," April 2008. [Online]. Available: http://www.hpcuserforum.com/presentations/Norfolk/IDC HPC Market Overview 4.14.2008.ppt
[4] "Lustre." [Online]. Available: http://www.lustre.org/
[5] "GPFS." [Online]. Available: http://www.ibm.com/systems/clusters/software/gpfs/
[6] "T2K." [Online]. Available: http://www.open-supercomputer.org/
[7] F. Rauch, C. Kurmann, and T. Stricker, "Partition Cast - Modelling and Optimizing the Distribution of Large Data Sets in PC Clusters," in Euro-Par 2000 - Parallel Processing, A. Bode and T. Ludwig, Eds. Munich, Germany: Springer, August-September 2000.
[8] "Bonnie++." [Online]. Available: http://www.coker.com.au/bonnie++/
[9] F. Wu, H. Xi, J. Li, and N. Zou, "Linux readahead: less tricks for more," in Proceedings of the Linux Symposium, June 2007, pp. 273-284.
[10] S. Takizawa, Y. Takamiya, H. Nakada, and S. Matsuoka, "A Scalable Multi-Replication Framework for Data Grid," in International Symposium on Applications and the Internet (SAINT 2005 Workshops), January 2005, pp. 310-315.
[11] "nettee." [Online]. Available: http://saf.bio.caltech.edu/nettee.html
[12] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur, "PVFS: A parallel file system for Linux clusters," in Proceedings of the 4th Annual Linux Showcase and Conference. USENIX Association, 2000, pp. 317-327.
[13] "pNFS." [Online]. Available: http://www.pnfs.com/
[14] O. Tatebe, N. Soda, Y. Morita, S. Matsuoka, and S. Sekiguchi, "Gfarm v2: A Grid file system that supports high-performance distributed and parallel data computing," in Proceedings of the 2004 Computing in High Energy and Nuclear Physics (CHEP04), September 2004.
[15] P. Carns, S. Lang, R. Ross, M. Vilayannur, J. Kunkel, and T. Ludwig, "Small-file access in parallel file systems," in Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium, April 2009.
[16] "SCore." [Online]. Available: http://www.pccluster.org/