Optimizations on the Parallel Virtual File System Implementation Integrated with Object-Based Storage Devices

Cengiz Karakoyunlu and John A. Chandy
Department of Electrical and Computer Engineering, University of Connecticut, Storrs, CT, USA
Email: {cengiz.k, john.chandy}@uconn.edu

Abstract—Object-Based Storage Devices (OSDs) offer an object-based data layout instead of the traditional block-based design. Object storage ideas are increasingly being used in parallel file systems and cloud storage. Previous work has implemented the Parallel Virtual File System (PVFS) on OSDs by moving the metadata and data operations to the OSDs. In this paper, we present two new methods to further improve the performance of the existing PVFS-OSD implementation: using object collections to represent traditional directories on OSDs, and delaying the creation of an OSD object until it is accessed by a write operation. The optimization methods are not particular to the PVFS-OSD implementation and could be used with other parallel file system and object storage integrations as well. Experimental evaluations show that using collections yields up to a 29-fold improvement in throughput, whereas delaying the creation of OSD objects provides a 25% throughput improvement.
I. INTRODUCTION
Object-based storage systems store and access data in objects: logical containers consisting of data and attributes, identified by unique numerical identifiers. Traditional storage systems have limited flexibility, since they store and access data in blocks. Object-based storage increases the intelligence of the storage system and enables moving data management operations from the user to the storage system. Having more intelligence at the storage increases the interaction between the user and the data and improves I/O performance and security. Increasing interest in object-based storage has led to the development of the OSD standard [19]. Parallel file systems, on the other hand, scale across networked computers and achieve high I/O throughput by striping data across I/O servers and accessing data in parallel. They typically have separate servers responsible for metadata and directory operations. PVFS [6], PanFS [21], Lustre [7] and GPFS [18] are some of the commonly used parallel file systems. In an effort to combine parallel file systems with object-based storage systems, the Ohio Supercomputer Center (OSC) integrated PVFS with a software implementation of object-based storage (PVFS-OSD) [12], [10], [8], [3], [2], [9]. Object-based storage gives users more control to manage the storage, an important feature for performance-critical applications,
since high-performance clusters overload the I/O servers by accessing and writing data frequently. Performance evaluations from OSC show that PVFS-OSD performs comparably to, and in some cases better than, stock PVFS. Although there have been many studies on the integration of PVFS with object-based storage, we believe more can be done to further improve these integrations. The existing open-source PVFS-OSD implementation is outdated and suffers from problems common to parallel file systems in general, such as reading multiple entries in a large directory and creating large numbers of files. We propose two optimization techniques, using OSD collections as traditional directories and postcreating objects, and implement them on the existing PVFS-OSD implementation. While we focus on PVFS, our optimizations can be used with any parallel file system. The rest of this paper is structured as follows: Sections II and III describe the optimization techniques, Section IV presents the experimental results, Section V introduces related work, and we conclude in Section VI.
II. COLLECTIONS AS DIRECTORIES
In this section, we propose our first optimization technique: representing traditional directories as collection objects on OSDs. We describe two different configurations of the existing PVFS-OSD implementation and explain how we apply our optimization technique to each of them.
A. Configuration 1: OSDs as I/O Servers

In this configuration, OSDs serve as I/O servers only; they store objects corresponding to regular files. PVFS metadata servers handle metadata and directory operations.
1) Current metadata operation methods in Configuration 1: In this section, we explain how various metadata operations (create file/directory, insert file, and stat directory) are currently implemented in Configuration 1.

a) Create a file and insert it into a directory: To create a new file, the PVFS client sends a create command to both the OSD I/O server and the PVFS metadata server. The OSD I/O server creates the data object(s), the PVFS metadata server creates a metadata object for the file, and both return their data and metadata identifiers to the client. To create a directory, the client sends a create directory command to the PVFS metadata server only. To insert a file into a directory, the PVFS metadata server performs a lookup on the directory path; if the directory object exists, its identifier is returned to the client, which in turn inserts the data and metadata identifiers into the directory object. Creating a file "sample" and inserting it into a directory "foo" is illustrated in Figure 1.
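To make the message flow concrete, the following minimal Python sketch mimics this create path. All names here (OSDServer, MetadataServer, create_and_insert) are our own illustration, not PVFS or OSD emulator APIs; the sketch only models who creates which object and which identifiers travel back to the client.

import itertools

class OSDServer:
    """Toy stand-in for an OSD I/O server that stores data objects."""
    def __init__(self):
        self._ids = itertools.count(1)
        self.data_objects = {}              # identifier -> object contents

    def create_data_object(self):
        oid = next(self._ids)
        self.data_objects[oid] = b""
        return oid                          # identifier returned to the client

class MetadataServer:
    """Toy stand-in for a PVFS metadata/directory server."""
    def __init__(self):
        self._ids = itertools.count(1)
        self.metadata_objects = {}          # identifier -> file metadata
        self.directories = {}               # path -> {name: (meta id, data ids)}

    def create_metadata_object(self, name):
        mid = next(self._ids)
        self.metadata_objects[mid] = {"name": name}
        return mid

    def lookup(self, path):
        # Returns the directory object (here, a dict) for a path.
        return self.directories.setdefault(path, {})

def create_and_insert(name, path, osd, mds):
    data_id = osd.create_data_object()          # create command to the OSD I/O server
    meta_id = mds.create_metadata_object(name)  # create command to the metadata server
    directory = mds.lookup(path)                # lookup on the directory path
    directory[name] = (meta_id, [data_id])      # client inserts both identifiers

osd, mds = OSDServer(), MetadataServer()
create_and_insert("sample", "/foo", osd, mds)   # the Figure 1 flow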
Fig. 1. Current method to create a file and insert it into a directory in Configuration 1

b) Stat the entries of a directory: If a client wants to stat the entries of a directory, the PVFS metadata server first performs a lookup on the directory path, as shown in Figure 2. As discussed in Section II-A1a, each directory object stores the data and metadata object identifiers corresponding to each directory entry. The client therefore reads the data and metadata identifiers stored in that directory object. For each metadata identifier, it contacts the PVFS metadata server to retrieve metadata, and for each data identifier, it contacts the OSD I/O server(s) to retrieve the logical size, which is stored in a specific attribute. This process is illustrated in Figure 2 for directory "foo".

Fig. 2. Current method to stat a directory in Configuration 1

2) Optimizing metadata operations by representing directories with collections: As can be inferred from Figure 2, communicating with the OSD I/O server each time to retrieve logical sizes is very inefficient. To alleviate this problem, we propose using OSD collections to represent directories on the OSD I/O servers, in addition to the directory objects on the PVFS metadata servers, and using the get_member_attributes primitive in the OSD standard [19] to
retrieve the logical sizes of all data objects in a collection at once. The OSD standard [19] includes four different types of data structures: the root object (descriptive data about the entire OSD), partitions (which enclose collections), collections (logical groupings of user objects) and user objects (any data or metadata object). The get_member_attributes primitive finds the data objects in a collection and retrieves the desired attributes (e.g., the logical size) from each object in a single hop, without returning to the client in between. Creating a file with our optimization is the same as the file creation procedure shown in Figure 1. Inserting a file into a directory, however, differs from Figure 1: the metadata and data object identifiers are inserted into the OSD collection object in addition to the directory object on the PVFS metadata server. Creating a directory "foo", inserting file "sample" into it and performing a stat on "foo" is shown in Figure 3.
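The benefit is easiest to see by counting round trips. The toy Python model below uses in-memory dictionaries as stand-ins for OSD objects; only the name get_member_attributes comes from the OSD standard [19], everything else is our own assumption.

class OSDIOServer:
    """Counts round trips so the stock and optimized stat paths can be compared."""
    def __init__(self):
        self.logical_size = {}      # data object id -> logical size attribute
        self.collections = {}       # collection id -> member data object ids
        self.requests = 0           # round trips seen by this server

    def get_attribute(self, oid):
        self.requests += 1          # stock path: one request per data object
        return self.logical_size[oid]

    def get_member_attributes(self, cid):
        self.requests += 1          # one request walks all members server-side
        return {oid: self.logical_size[oid] for oid in self.collections[cid]}

osd = OSDIOServer()
osd.collections["foo"] = []
for oid in range(100):              # directory "foo" with 100 files
    osd.logical_size[oid] = 1024
    osd.collections["foo"].append(oid)

# Stock: the client asks for each data object's logical size individually.
sizes = [osd.get_attribute(oid) for oid in osd.collections["foo"]]
print(osd.requests)                 # 100 round trips

osd.requests = 0
# Optimized: a single call on the directory's collection object.
sizes = osd.get_member_attributes("foo")
print(osd.requests)                 # 1 round trip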
Fig. 3. Proposed method to create a directory, insert a file into it and stat it in Configuration 1
B. Configuration 2: OSDs as I/O and Metadata Servers

In this configuration, OSDs serve as both I/O and metadata servers, while PVFS is used only at the client.

1) Current metadata operation methods in Configuration 2: In this section, we explain how various metadata operations (create file/directory, insert file and stat directory) are currently implemented in Configuration 2.
a) Create a file and insert it into a directory: To create a new file, the PVFS client sends a create command to the OSD I/O server, which in turn creates the data object(s) and returns the identifier(s) to the client. In Configuration 2, metadata can be stored on OSDs using two different methods: a dedicated OSD metadata server, or storing the metadata in the data objects themselves [3]. We chose the latter since it is more flexible and efficient. To create a directory, the client sends a create directory command to the OSD metadata server. The directory object attributes store the name and identifier of each entry, and the name and identifier can be inserted into the attributes using two different methods [2]. In the first method, the directory object is locked, the entry name and identifier are inserted into the corresponding attribute, and the lock is released. In the second method, an atomic compare-and-swap primitive is used to insert the name and identifier into an attribute [9]. We chose the former since it was more readily available. Creating a file "sample" and inserting it into directory "foo" is illustrated in Figure 4.
Fig. 4. Current method to create a file and insert it into a directory in Configuration 2

Previous work [2] shows that when OSDs serve as both I/O and metadata servers, they can store the directory entries in the directory object attributes. Each entry's name is hashed to map that entry to an attribute that stores the entry's name and identifier, as illustrated in Figure 5 for "file x", "file y" and a directory object with identifier n.

Fig. 5. Current method to store the entries of a directory in a directory object in Configuration 2
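To make the attribute layout of Figure 5 and the lock-based insertion concrete, here is a small Python sketch. The hash function, the attribute-space size and the lock granularity are our assumptions; only the overall scheme (hash the entry name to pick an attribute, then lock, insert and release) comes from the description above.

import threading, zlib

class DirectoryObject:
    NUM_ATTRS = 1 << 16                 # assumed size of the attribute space

    def __init__(self, identifier):
        self.identifier = identifier    # e.g. the directory object with identifier n
        self.attributes = {}            # attribute number -> (entry name, entry id)
        self.lock = threading.Lock()    # method 1: lock the whole directory object

    def insert_entry(self, name, entry_id):
        attr = zlib.crc32(name.encode()) % self.NUM_ATTRS   # hash("sample")
        with self.lock:                 # lock, insert into the attribute, release
            # A real implementation must resolve hash collisions; we simply
            # overwrite here to keep the sketch short.
            self.attributes[attr] = (name, entry_id)

foo = DirectoryObject(identifier=7)
foo.insert_entry("sample", entry_id=42)     # the Figure 4 insertion, in miniature

The compare-and-swap variant [9] would replace the whole-object lock with an atomic conditional update on the same attribute.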
b) Stat the entries of a directory: To stat the entries of a directory, the OSD metadata server first performs a lookup on the directory path and returns the directory object identifier to the client. The client then reads the attributes of the directory object. As discussed in Section II-B1a, each attribute stores a directory entry's name and identifier. Using these, the client retrieves the data object attributes from the OSD I/O server. This process is illustrated in Figure 6 for directory "foo".
Fig. 6. Current method to stat a directory in Configuration 2

2) Optimizing metadata operations by representing directories with collections: As can be inferred from Figure 6, the multiple communication steps between the OSD I/O server and the client are very inefficient. To alleviate this problem, we propose replacing the directory objects on the OSD metadata servers shown in Figure 5 with collections on the OSD I/O servers, and using the get_member_attributes primitive to retrieve the attributes of all objects in a collection at once. Since the directory objects on the OSD metadata servers are replaced with collections on the OSD I/O servers, the OSD metadata servers are no longer necessary. Creating a file with our optimization is the same as the file creation procedure shown in Figure 4. While inserting a file into a directory, the data object identifier is inserted into the OSD collection object. To stat the entries of a directory, the PVFS client calls get_member_attributes on that directory's collection object; as a result, the data object attributes can be retrieved from the OSD I/O server at once. Creating a directory "foo", inserting file "sample" into it and performing a stat on "foo" is shown in Figure 7.

Fig. 7. Proposed method to create a directory, insert a file into it and stat it in Configuration 2
III. POST-CREATING OBJECTS
Our second optimization improves performance when creating large numbers of files at once. Creating a file and inserting it into a directory is a slow process due to the steps shown in Figure 1: it requires at least 2n steps just to create n data objects, and if the input file is striped over k OSD I/O servers, 2n + k steps are needed. Previous studies [11], [5] tried to solve this problem by precreating data files in a parallel file system with a batch create operation. A large number of data objects are precreated during system initialization, with pre-allocated datafile handles that are used while creating metadata objects; the metadata server does not wait for newly created data handles, and the number of precreated objects is kept above a certain threshold. Precreating is particularly useful when creating many small files. We propose postcreating, which is similar to precreating and works as follows: create a metadata object, create a directory entry, then create and write to a data object. Our approach benefits from the create-and-write primitive in the OSD standard. In the stock case, each metadata object stores the corresponding data handles. Since we delay the creation of an object until it is accessed, metadata objects do not store any handles; this is how the PVFS client knows the data objects have not been created yet. As a result, the PVFS client calls create-and-write on the OSD I/O server and creates data objects only when they are needed. Postcreating dramatically reduces the communication overhead between the PVFS client and the OSD I/O servers. The creation of an object is delayed as long as possible, and creating n data objects requires at most n + k - 1 steps, where k is the number of stripes for a given file.
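A minimal sketch of the postcreate idea follows, under our own naming assumptions: the metadata object keeps no data handles, so the first write triggers a combined create-and-write. With a single stripe (k = 1), creating n files costs n messages, which matches the n + k - 1 bound above.

class OSDIOServer:
    def __init__(self):
        self.objects = {}
        self.steps = 0                      # client <-> I/O server messages

    def create_and_write(self, handle, data):
        # Stand-in for the OSD create-and-write primitive: one message both
        # creates the data object and writes to it.
        self.steps += 1
        self.objects[handle] = data

class PostcreatedFile:
    def __init__(self, name):
        self.name = name
        self.data_handles = []              # empty: no data objects created yet

    def write(self, osd, stripe, data):
        # An empty handle list tells the client the data object does not
        # exist yet, so it is postcreated on this first access.
        if stripe >= len(self.data_handles):
            handle = (self.name, stripe)
            osd.create_and_write(handle, data)
            self.data_handles.append(handle)

osd = OSDIOServer()
files = [PostcreatedFile(f"f{i}") for i in range(1000)]     # n = 1000 small files
for f in files:
    f.write(osd, stripe=0, data=b"x" * 1024)
print(osd.steps)        # 1000 messages: the create is folded into the first write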
IV. EXPERIMENTAL RESULTS
We performed our evaluations using an HPC cluster of sixteen 64-bit x86 nodes running Linux kernel 3.2, connected through a gigabit network switch. Each node has 4 GB of RAM and 70 GB of storage. The write caches were disabled on each node and the read caches were flushed during the tests to make the I/O operations touch the disks. OrangeFS (the new version of PVFS) 2.8.6 and OSC's PVFS-OSD emulator were used in the tests.
A. Post-create
1) Single client: We first evaluate the postcreate optimization. In this test case, there is one PVFS client, one PVFS metadata server, and the number of OSD I/O servers is varied. The PVFS client writes ten thousand 1 KB files with and without the postcreate optimization, and the average time per create is measured. As can be seen in Figure 8, performance improves by up to 10% for a single client. One could argue that a 1 KB file size is too small to evaluate the postcreate optimization; when this test is done with 1000 1 MB files on seven OSD I/O servers, postcreate offers little benefit, since writes dominate the total time. The postcreate optimization works better with small files, which are common in HPC systems [14], [15].
Fig. 8. Time per create for 10000 files (each 1 KB) with or without postcreate support versus the number of OSD I/O servers (single PVFS metadata & directory server and single PVFS client)
2) Multiple Clients: In this test case, multiple clients write one hundred, one thousand and ten thousand 1 KB files to eight OSD I/O servers with or without postcreate support, and the average time per create is measured. Figure 9 shows that, as the number of clients increases, the performance gain from the postcreate optimization increases up to 25%. This is comparable to the 19% improvement with the precreate optimization in [5], which does not time the actual object creates that we include, since objects are batch-created beforehand. We believe postcreate can be further improved using stuffing (reducing the number of data objects allocated for small files) and coalescing (queuing operations in intensive workloads). Figure 9 also shows that postcreate in a multiclient environment performs better with more OSD I/O servers, since it achieves a 25% improvement with eight OSD I/O servers.
Fig. 9. Time per create for 100, 1000 and 10000 files (each 1 KB) with or without postcreate support for various numbers of clients (single PVFS metadata & directory server and eight OSD I/O servers)
B. Get_Member_Attributes
1) Configuration 1: In this section, we evaluate the get_member_attributes optimization for Configuration 1. Currently, we use a single OSD I/O server for this test case, leaving tests with multiple OSD I/O servers as future work. Figure 10 shows the average time per stat of a 1 MB file with or without the get_member_attributes optimization. The results show that get_member_attributes can speed up stats by up to 29 times, and the more files are created, the bigger the speed-up. The stock case suffers from accessing the OSD I/O server for each stat, whereas with get_member_attributes, all attributes are retrieved in a single call.
"'10000'postcreate
+----.:-- -------- +--.::: � �:::>. � ------- ��� +----�o::: : ------
-1---""'--
Wi: � §
E-