SemFS: A Semantic approach to File Systems - Semantic Scholar

4 downloads 0 Views 217KB Size Report
driver also interacts with the daemon to populate vir- tual directories (described in Section 1 and Section 4). SemFS has been prototyped using the FUSE (File.
SemFS: A Semantic approach to File Systems Prashanth Mohan, Venkateswaran S, Raghuraman M, Arul Siromoney College of Engineering, Guindy (Anna University) Chennai, India {prashmohan, wenkat.s, raaghum2222}@gmail.com, [email protected]

Abstract Hard Disk capacity is increasing all the time. This has brought with it a new problem; the task of locating files is getting harder by the day. Retrieving a document from the myriad of files and directories is no longer an easy task. SemFS integrates searching of files based on metadata as a basic function of the file system. SemFS provides Virtual Directories which list the results of a query. The contents of the Virtual Directory are formed at runtime. Virtual Directories are used mainly to facilitate the searching of files. This can also be used by plugins to provide interfaces for accessing the functionality of other processes through the file system.

1. Introduction File and directory management is an essential and inevitable part of everyday computer usage. With hard disks growing in size by leaps, we are faced with the problem of locating files efficiently. Conventional file systems impose a hierarchical structure of storage on the user – a combination of its location and filename[3]. Features like ‘symbolic links’ allow a file to be accessed through more than one path. However, a strict enforcing of the path which does not necessarily depict the meaning of the file itself still exists for all files. Users rarely name files properly, however the human mind is more tuned to remember the contents of the file rather than its name itself[5]. SemFS overcomes this problem of a strictly hierarchical organization of files in the UNIX world by using ‘Virtual Directories’[7, 9, 2, 6]. virtual directories (described later in Section 4) are directory listing of files, however the directory itself does not exist on disk. The virtual directory is a conceptual entity that lives in memory and dies upon each restart of the machine. These virtual directories are populated based on their

names. The name of the virtual directory is expected to be a query (in a SemFS defined language) and contains links to a set of files that answer the query. Section 1 introduces the concept of semantic addressing of files. In section 3 we look at the design of our prototype file system – SemFS. The performance of SemFS is studied in Section 6. We look at how the semantic addressing of files can be extended to user space applications in Section 5. Section 2 looks at the related projects which aim at improving end user usability by allowing semantic addressing of files.

1.1

Hierarchical Storage Model

The problem with the current ‘hierarchical’ storage system is that the ‘semantic’ i.e. the meta information of the file is not given due importance. In a hierarchical storage system, the file draws its semantics only from the directory in which it is stored in. O. Gorter’s thesis[3] looks at this shortcoming. In the case of a file such as ‘/home/user/docs/univ/project/file’, the property associated with the file is that of ‘project’. Although ‘univ’ and ‘docs’ are also associated with the file, their association is much weaker when compared to the association of ‘project’ with the file. A listing of the /home directory does not list the file but a listing of the ‘/home/user/docs/univ/project’ directory lists the file. Hence, the file seems to lose the semantics of ‘docs’ and ‘univ’.

1.2

Semantic Storage Model

The ‘Semantic’ storage model[2, 10, 6] contrary to the above specified ‘Hierarchical’ storage model lets you access/locate a file based on the semantics of the file rather than the hierarchical location of the file. SemFS provides the user with the choice of a traditional hierarchical mode of access or a query based mechanism which will offer to the user ‘virtual directories’. The name of the directory itself will be a query

which the file system driver will parse and populate. The metadata of files will not be common. All files will possess the common property of ‘owner’ and ‘last modified date’, however a JPEG file would also contain EXIF data like ‘height’ and ‘width’ as it’s metadata, while an MP3 file would contain id3 data like ‘length’, ‘artist’, ‘album’, etc as it’s metadata. We thus infer that the only meta-information that is pervasive to all files are the file attributes themselves, and each file would contain special meta information that depends on the format in which it is stored.

2

Related Work

SpotlightFS is similar to SemFS. It also defines virtual directories similar to SemFS and populates them based on the output of Spotlight, an integrated desktop search client for the Mac OSX. BeagleFS is a FUSE based file system that represents a live Beagle query. TrackerFS is inspired from BeagleFS and represents queries on Tracker (another file indexer). The GNU/Linux Semantic Storage System (GLS3 ) provides an API to allow Developers to integrate searching and organization capabilities in their application, an extensible plugin-based Type System, shared schemas between applications through an API, a pseudo file system for backward compatibility, a web interface, incremental searching and more. SemFS is different in many ways compared to these file systems. It is an integrated entity complete with the indexing daemon, the file system driver and its plugins, whereas the other projects are different projects meshed together. The most important aspect that distinguishes SemFS from the others is the ability to import application specific namespaces into the file system namespace.

3

Design

The design of the file system is modular and uses a Client-Server architecture. The file system is defined distinctly as the module which provides the system call interfaces and the file storage functionality. The indexing daemon extracts the meta data from files and stores the index in a special database while the plugins provide the logic for extracting and querying the meta data from the files. The messaging module is responsible for signaling events between the driver and the indexing daemon.

Figure 1. FUSE Module in Linux Kernel Architecture

3.1

File System Driver

The file system driver is responsible for providing normal POSIX interfaces. Apart from this, the file system driver is also responsible for notifying the file indexing daemon about any file modifications. The driver also interacts with the daemon to populate virtual directories (described in Section 1 and Section 4). SemFS has been prototyped using the FUSE (File Systems in User Space) library and reuses a FUSE based implementation of the ext2 file system[1] which was created as part of the chunkfs project[4]. 3.1.1

Event Notification

The driver notifies the daemon about various events such as the creation of a new file, deletion of a file, write to a file, etc. The messaging module is responsible for the event notification between the driver and the daemon. In our prototype, we use TCP sockets for sending event notification messages. However, we could also use UNIX sockets, dbus, avahi or other messaging libraries. Events are cached at the driver and sent to the daemon bunched together. This helps improve performance of the file system. FUSE does not currently group together write calls in a manner the Linux VFS does. Hence, this results in removing redundant ‘write’ notifications. This could however result in inconsistent metadata index in the case that the driver is shut down without the cached details being sent to the daemon. This can be solved by using a write ahead journal containing the cached details.

• File Analysis • Query Processing and • Result Reporting

3.3

Plugin Modules

SemFS is designed with ‘extensibility’ in mind. Plugins can be added to the indexing daemon, which allows indexing of different files. For instance, there can be plugins for MP3 files which will extract the id3 data and plugins for JPEG file which extract the EXIF data, etc. The plugins also define the allowed keywords in the search query. The plugins are allowed to have their own persistent databases which can be used to create plugins to other desktop search engines. In the prototype, the libextract library is used to extract the metadata from different files. However, for a more optimized system, the metadata extraction can be built right into the daemon/plugins module. 3.3.1

Figure 2. Design of the SemFS Server

3.2

File Indexing daemon

The file indexing daemon, is the module which receives notifications about modifications, file additions, file deletions, etc from the driver. The index of the file’s metadata is stored in a custom database which follows the MBSM[11] storage model. In the MBSM storage model pages which contain the metadata of the files are oriented in a fashion such that similar metadata of different files are stored sequentially. The different metadata are then stored one after the other (again sequentially). Searching based on a particular field of the file is optimized, since the same field of all the files are grouped together in a mini page[11] thereby reducing the number of block accesses. A mini page is a logical separation of similar attributes in the table. It maximizes inter-record spatial locality within each column in the page, thereby eliminating unnecessary requests to main memory without incurring space penalty. The File Indexing daemon performs the following activities: • Message processing

Plugins for Desktop Search Engines

The generic nature of SemFS allows us to import the semantics of the Desktop search engines into the file system namespace. This is accomplished using the software API that the Desktop Search engine provides for accessing its query functions. This is done on the lines of BeagleFS and TrackerFS. The importing of the namespace[8] from its application specific domain to the global file system name space allows for easier access to the same from other applications.

4

Virtual Directories

Virtual Directories form the integral part of the file system. They form the basis for solving our problem of efficiently locating files based on its meta data in the global namespace[5, 8]. SemFS provides access to the queried files through virtual directories. The name of the virtual directory is a query (defined in Section 4.3), and the contents are symbolic links to files which answer to the above query. The implications of the virtual directory being a part of the file system namespace are huge. All applications will be able to access virtual directories. Many applications have been developed in recent years, which provide search mechanisms in their own domains, for e.g. a media library support searching for media files. However, the namespace in which applications work is limited to within itself. For a different application to be able to access the results of a query provided to the search application, the application would need to

connect to the search engine using an API. With the use of virtual directories, applications are not required to recompiled to make use of different APIs. All of this functionality is exported to the file system name space. This was one of the driving motivations behind the development of the Plan 9[7] operating system. The contents of the virtual directories are symbolic links to the files which belong to that particular query. A listing of the mounted directory will not list the virtual directories. However, one will be able to chdir into the virtual directory and list the files under it. The driver uses a set of Hash Tables in order to allow for efficient virtual directory Management

4.1

Dynamic Updation of Virtual Directory contents

A list of currently active virtual directories is maintained at the indexing daemon. Each time a file is modified or added, the file is indexed for it’s metadata and then the data itself is compared to each virtual directory query. In the case that the updated metadata values answer the query, the daemon should be able to notify this to the driver. This could be easily achieved by blocking the file system notification until the indexing of the file takes place. However, this could lead to deadlocks (if the driver isn’t multi threaded) as well as a decrease in performance. Hence, we take the middle way out by sending this information at the next event notification.

4.2

Software API

A software API is also provided with SemFS for new applications to interact with the file indexing daemon. This will restrict the semantic nature of files to the application’s namespace. The existence of the software API means that a new and fresh outlook can be given to the file system’s nature. The API provides for asynchronous notification to the application about updates to the virtual directory.

• The metadata name should be preceded by the symbol ‘#’ • The logical operators that can be used are ‘and’, ‘or’ even • The two parts are to be separated by the symbol ‘:’(semi-colon) A sample query is ‘type=mp3:#artist=Michael and #album=Heart Breaks’. Conjunction and Disjunction are allowed when specifying the query. Use of regular expressions are also supported.

5

Application of User Space Semantics

SemFS provides the mechanism for easily accessing files based on their metadata from all applications. Apart from this there are extensible features that can be added to SemFS to reveal its potential.

5.1

File Tagging

Tagging improves the user’s logical perception of the file. Although tags do not necessarily define the semantics of the data, they are interpreted by the end-user as being related to a subset of his data, a subset that he logically creates. It represents a relation between files. This way the end-user can retrieve all documents related to a specific tag. Tags can either be added to a file or a group of files. Multiple tags can be associated with each file. This overcomes the problem that a file can only be associated with the directory it is in (As described in Section 1.1).

5.2

File Versioning

Versioning is one of the features that SemFS supports: • Checkpointing and

4.3

Query Syntax • Versioning of independent files

The syntax of the virtual directory query follows the following rules. • SemFS Queries consist of two parts. First the type of the file is specified. e.g: ‘type = mp3’ • Next, The metadata of the file is specified. The operators allowed are =, , =, , ! =, (, )

The plugin can be used to notify a user space application about which sections of the file has been modified. This will allow the user space application to check only the changed data and re-index it. Although very precise consistency cannot be obtained due to lack of synchronisation, it can suffice for normal day to day activities of users. In this way SemFS can act as an archival file system too.

6

Performance

All readings were taken on a Pentium 4 (2.4 Ghz) Prescott processor with 512 MB of RAM and a 5200 rpm seagate barracuda 80 GB hard disk.

6.1

Comparison with normal driver

Figure 5. Write related system calls timing

Figure 3. Directory related system call timing

Figure 6. Comparison of time taken by SemFS Virtual Directories and Unix Utilities

of a UNIX socket for the inter process communication, event caching at the driver, use of the MBSM model of storage (the prototype was modeled using an SQLite based database), etc.

6.2

Comparison with UNIX utilities

Figure 4. Read related system calls timing A comparison of the time taken for system calls by the SemFS driver, fuse-ext2 driver and the kernel ext2 driver are given in figures 3, 4 and 5. The performance of SemFS is slightly worse compared to that of the fuse-ext2 driver. This is largely because of the event notifications that the driver has to do. The current graph is produced from the output of the iozone performance measurement framework. This has been done on a prototype version of SemFS and many more optimizations can be carried out. Some of them are the use

Section 6.1 shows that SemFS is not as fast as fuseext2 when processing normal system calls. However, the basis of this paper is that user spends a lot of time searching for files. Figure 6 shows the timing comparison for evaluating a simple query – searching for MP3 files based on the artist metadata. This clearly shows that the performance of SemFS virtual directories beats searching for files (using a combination of libextract and find) hands down. To be noted is that the performance advantage of SemFS will be even more apparent, if the same virtual

directory is accessed by different clients. Since the virtual directory is cached at the Driver, it will be seen that the time for populating the virtual directory will be lesser for subsequent accesses to the same virtual directory. The performance can also be improved by using the MBSM models of storage.

7

Conclusion

SemFS is a robust, modular, extensible and easy to use file system. SemFS is a new age file system built to meet the increasing demands of users around the world in keeping up with the ever increasing hardware functionality. SemFS also allows the importing of application-level semantics into the global file system namespace. SemFS has brought together useful features from earlier semantic file systems. Additional features are provided that overcome the difficulties found in earlier work. An innovative design philosophy has balanced performance with extensibility and neatness of design. Performance studies show that virtual directory querying is much faster than searching on ordinary Unix file systems. Write and read system calls are slower than ordinary Unix file systems, as is to be expected, but the performance degradation is not very much.

References [1] R. Card, T. Ts’o, and S. Tweedie. Design and implementation of the second extended filesystem. In Proceedings of the First Dutch International Symposium on Linux, Dec. 1994. [2] D. K. Gifford, P. Jouvelot, M. A. Sheldon, and J. W. O. Jr. Semantic file systems. In Proceedings of 13th ACM Symposium on Operating Systems Principles, pages 16–25. Association for Computing Machinery SIGOPS, Oct. 1991. [3] O. Gorter. Database file system - an alternative to hierarchy based file systems. Master’s thesis, University of Twente, August 2004. [4] V. Henson, A. van de Ven, A. Gud, and Z. Brown. Chunkfs: Using divide-and-conquer to improve file system reliability and repair. In Proceedings of HotDep ’06: Second Workshop on Hot Topics in System Dependability, Nov. 2006. [5] J. Nielsen. The death of file systems, Feb. 1996. [6] Y. Padioleau and O. Ridoux. A logic file system. In Proceedings of FAST 03: 2nd USENIX Conference on File and Storage Technologies. USENIX Association, Mar. 2003. [7] R. Pike, D. Presotto, and S. D. et al. Plan 9 from bell labs. Technical report, AT & T Bell Laboratories, Murray Hill, NJ, 1995.

[8] H. Reiser. Name spaces as tools for integrating the operating system rather than as ends in themselves, Jan. 2001. [9] A. Salama, A. Samih, A. Ramadan, and K. M. Yousef. GNU/Linux Semantic Storage System, 2006. [10] Z. Xu, M. Karlsson, C. Tang, and C. Karamanolis. Towards a semantic-aware file store. In HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, pages 145–150. USENIX Association, May 2003. [11] J. Zhou and K. A. Ross. A multi-resolution block storage model for database design. In Database Engineering and Applications Symposium, pages 22 – 31, 16 – 18, July 2003.

Suggest Documents