File metadata management in Embedded Linux
Simone Bolognini, Nicola Corriero, Vittoria Cozza
University of Bari, Department of Computer Science
Via Orabona 4, Bari, Italy
[email protected], ncorriero,
[email protected]
Abstract In embedded Linux systems, low power consumption and low memory usage are strict constraints. Managing a large quantity of file metadata, usually the most expensive task in such systems, is delegated to user space programs that are generally database based. In this work we propose dealing with metadata at the file system level with Hixosfs. Hixosfs is an ext2-based Linux filesystem designed to easily catalogue and retrieve e-mails, musical files and log files according to metadata selected by the final user. The core idea is to move the problem complexity from user space to kernel level to speed up the overall process. The implementation mainly required extending the ext2 file system structures to store file tags, which can then be accessed and modified directly by system calls. Analysis and comparison show this approach to be suitable for a large amount of data.
1
Introduction
Hixosfs is introduced here as an innovative way to manage file metadata, especially in embedded systems. By embedded system we mean an electronic system with a microprocessor, or a computer-based system, controlling a device that interacts with the physical world for a specific purpose. Generally speaking, such systems have strict constraints on the size of flash memory used, the central memory required, booting time and power consumption, and usually the whole system is optimized to save processor activity. By Embedded Linux we mean a Linux distribution for embedded devices. With respect to a standard distribution, the kernel is extended with some extra drivers or new functionality, it uses an embedded file system (fs), and the application layer includes special-purpose programs.
We analyse embedded Linux systems used to handle a large amount of data in a fast and safe way, for example at Internet service providers, in e-mail servers and in multimedia players. The files under analysis are log files, e-mails and musical files, which store metainformation in their content. In the literature there are several solutions for managing this kind of file; in the next sections we analyse the existing approaches and present the Hixosfs solution. We present the Hixosfs implementation in kernel space and introduce ad hoc user space programs created to access and modify the tags of hixosfs files, in the form of shell commands or Busybox applets[11]. Finally, we compare the Hixosfs approach, and its performance in terms of information access time, with the use of a server and a database in three scenarios.
2
State of the art
Every file can be indexed by tags related to its content, but only for a limited set of file types do we have tools to separate the content of a file from its metainformation. In the cases analysed here (e-mails, log files and musical files) we chose a well defined subset of metadata and extracted it with standard libraries or ad hoc programs in order to tag files. Management of the extracted metadata is then delegated to a database with a low memory footprint; for example, the lightweight sqlite3[3] is used for storing and efficiently retrieving information even in embedded systems. However, this approach requires installing the sqlite3 package and offering a user friendly interface to store and query the metadata within the database. The alternative approach proposed here is to store the extracted tags directly in the fs structure itself and then to access them with high priority kernel space tasks. Linux already offers a way to extend inode attributes with xattr. Here the attributes are stored outside the inode as attribute-value pairs, in the form of variable length strings. Extended attributes have so far been implemented for Linux fs such as ext2[8], ext3 and reiserfs. The approach is not well suited to a huge amount of queried data. Finally, a general purpose and highly customizable solution is offered by file systems based on the FUSE library[6, 9], which handle metadata in the user space layer. In both cases the obvious advantage of easily customizing the attributes becomes a disadvantage in terms of performance.
3
HIXOSFS: design and implementation
High efficiency in managing and searching meta information is offered by Hixosfs. The goal is achieved by adding tags describing a file's content to the struct inode of the Linux virtual file system (VFS), which usually contains all the information related to a file except its content. It is not enough to implement hixosfs as a kernel module under the VFS: because it changes the Linux headers and requires new system calls, the approach has been to patch the vanilla kernel, which must be recompiled to support hixosfs. In the beginning it was a patch to ext2; it then became a new ext2-based fs optimized to handle information related to file content, keeping compatibility with existing Linux fs through a common extended interface, as shown in Figure 1.
3.1
VFS
The Virtual File System (VFS) is the kernel layer that keeps file management independent of the specific filesystem. All the functions and structs used by each Linux fs have to be defined at this level. The inode structure is defined here, as are all the functions that deal with serializing data to disk. To implement hixosfs, say for the case of musical files, the header "linux/fs.h" must be modified so that the definition of the iattr struct holds the musical tags. In particular the function setattr, whose task is to propagate changes in the iattr struct to the inode, has been extended with the new tag definitions that will be stored inside the inode.
3.2
Inode
The kernel struct used to manage every type of file is the inode. In the case of musical files, hixosfs extends the inode definition with a struct tag:
    struct tag {
    #ifdef CONFIG_hixosfs_MUSIC
        char author[30];
        char title[30];
        char album[30];
        char year[4];
        unsigned int tag_valid;
    #endif
    };
The struct tag has four tag fields plus a validity flag, for a total of about 100 bytes of stored information. Theoretically an inode can be extended up to 4 KB, so it is possible to customize it with many tags for your purpose. It is convenient to choose the tags most often used in file searches to discriminate files by their content. Here we chose the tags that minimize the time needed to search musical files by the most commonly used criteria, such as album or author name. In the case of hixosfs for log files we have about 50 bytes:
Figure 1. Linux with Hixosfs
All the modifications to the vanilla kernel have been made in a non-invasive style; in fact the final user can choose at kernel compilation time whether to support hixosfs or not. It is not possible, however, to load hixosfs at runtime as a kernel module, since the modification of struct inode affects the VFS and the system calls too. What follows is a detailed description of the hixosfs module and of how the new fs design affects the kernel: the VFS, the Linux headers, and the new required system calls.
    struct tag {
    #ifdef CONFIG_hixosfs_LOG
        char srcip[16];
        char dstip[16];
        char date[9];
        unsigned int tag_valid;
    #endif
    };
Finally, in the case of hixosfs for e-mail files we have about 100 bytes:
    struct tag {
    #ifdef CONFIG_hixosfs_MAIL
        char sender[30];
        char receiver[30];
        char date[9];
        char subject[30];
        unsigned int tag_valid;
    #endif
    };
Even though it is possible to change the inode dimension, this is not done automatically: in the fs creation phase it was necessary to increase the inode dimension from 128 bytes to 256 bytes, introducing inside the inode definition a reference to struct tag, which is explicitly defined outside the inode struct. Besides the inode struct, the iattr struct has been changed; this is the struct that contains the inode fields that are modifiable and directly accessible by the user. Certain functions, such as hixosfs_read_inode, hixosfs_update_inode and hixosfs_new_inode, do not differ much from the corresponding read and update functions of the ext2 fs. In addition there is a new part for the management of the new content based file attributes in the iattr and inode structs.
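The size figures quoted above can be checked with a short user space sketch. It redeclares the three tag layouts under illustrative names (music_tag, log_tag and mail_tag, which do not appear in hixosfs) and tests each against the 128 bytes gained by growing the inode from 128 to 256 bytes. Treating that whole difference as available for the tag is an assumption made here for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Field widths copied from the three struct tag variants in the text.
 * The type names are illustrative only: hixosfs uses a single name,
 * struct tag, whose body is selected by #ifdef. */
struct music_tag {
    char author[30];
    char title[30];
    char album[30];
    char year[4];
    unsigned int tag_valid;
};

struct log_tag {
    char srcip[16];
    char dstip[16];
    char date[9];
    unsigned int tag_valid;
};

struct mail_tag {
    char sender[30];
    char receiver[30];
    char date[9];
    char subject[30];
    unsigned int tag_valid;
};

/* Assumption: all the space added by the larger inode is free for tags. */
enum { INODE_OLD_SIZE = 128, INODE_NEW_SIZE = 256 };

/* Returns 1 if a tag of tag_size bytes fits in the added space, else 0. */
int tag_fits(size_t tag_size)
{
    return tag_size <= (size_t)(INODE_NEW_SIZE - INODE_OLD_SIZE);
}
```

On common ABIs where unsigned int is 4 bytes with 4-byte alignment, sizeof yields 100, 48 and 104 bytes for the music, log and mail layouts respectively, in line with the approximate figures given in the text.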
3.3
Chtag and retag system calls
We implemented two new system calls to write and read the struct tag of a hixos inode:
• RETAG: reads the content of the struct tag inside the hixos inode; the struct tag is passed as a parameter to the system call, which fills it and returns it to user space;
• CHTAG: modifies one or more fields of the struct tag inside the hixos inode, selected by a flag.
    #ifdef CONFIG_HIXOSFS_MUSIC
    if (ia_valid & ATTR_TAG_SET) {
        /* each bit of tag_valid selects one field to overwrite */
        if ((attr->ia_tag).tag_valid & 2) {
            strcpy((inode->i_tag).author, (attr->ia_tag).author);
        }
        if ((attr->ia_tag).tag_valid & 4) {
            strcpy((inode->i_tag).title, (attr->ia_tag).title);
        }
        if ((attr->ia_tag).tag_valid & 8) {
            strcpy((inode->i_tag).album, (attr->ia_tag).album);
        }
        if ((attr->ia_tag).tag_valid & 16) {
            strcpy((inode->i_tag).year, (attr->ia_tag).year);
        }
    }
    #endif
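The flag mechanism of CHTAG can be illustrated outside the kernel. The following is a minimal user space sketch, not the hixosfs code: the bit values 2, 4, 8 and 16 are taken from the listing above, while the macro names, the music_tag type name and the helper copy_field are hypothetical. Unlike the listing, the sketch uses bounded copies, since a four-digit year leaves no room for a terminating NUL in a 4-byte field.

```c
#include <assert.h>
#include <string.h>

struct music_tag {
    char author[30];
    char title[30];
    char album[30];
    char year[4];            /* four digits, stored without a NUL */
    unsigned int tag_valid;
};

/* Bit values as in the CHTAG listing; the names are hypothetical. */
#define TAG_AUTHOR  2u
#define TAG_TITLE   4u
#define TAG_ALBUM   8u
#define TAG_YEAR   16u

/* Copy a string into a fixed buffer, always leaving it NUL-terminated. */
static void copy_field(char *dst, size_t dst_size, const char *src)
{
    strncpy(dst, src, dst_size - 1);
    dst[dst_size - 1] = '\0';
}

/* Overwrite in dst only the fields whose bit is set in src->tag_valid. */
void apply_tag_update(struct music_tag *dst, const struct music_tag *src)
{
    if (src->tag_valid & TAG_AUTHOR)
        copy_field(dst->author, sizeof dst->author, src->author);
    if (src->tag_valid & TAG_TITLE)
        copy_field(dst->title, sizeof dst->title, src->title);
    if (src->tag_valid & TAG_ALBUM)
        copy_field(dst->album, sizeof dst->album, src->album);
    if (src->tag_valid & TAG_YEAR)
        memcpy(dst->year, src->year, sizeof dst->year); /* fixed width */
}
```

A caller that wants to change only the author sets tag_valid to TAG_AUTHOR in the source struct; every other field of the destination is left untouched.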
Specific user space applications are needed to create, access and modify the tag values. The idea is to use a function in user mode that calls one or more system calls; these system calls execute on behalf of the calling program, but do so in supervisor mode since they are part of the kernel itself. In fact, using a system call implies executing kernel mode tasks, and the operating system gives priority to such tasks by taking control after a software interrupt (int 0x80 on x86) has been generated. The interrupt hands control to the kernel, which finds the syscall inside a table and executes it; finally it returns control and the result, if any, to user space, unblocking the execution of the other programs. The user space programs that hixosfs requires to read/write tags from/to inodes are chtag and stattag. So far all the modifications described have been made starting from the 2.6.23 kernel[10], but they can easily be integrated in newer kernel versions.
4
User Space programs within Busybox
In Linux embedded devices, utility programs are incorporated inside a single executable containing different applets for the different commands. This is the case of the highly customizable Busybox or of the smaller project Toybox[7]. BusyBox combines tiny versions of many common UNIX utilities into a single small executable, well suited for embedded systems where small space usage is a constraint. It provides replacements for most of the utilities you usually find in GNU fileutils, shellutils, and so on. The utilities in BusyBox generally have fewer options than their full-featured GNU cousins; however, the options that are included provide the expected functionality and behave very much like their GNU counterparts[11]. The hixosfs system includes user mode tools to manipulate the meta information of files at a higher level than kernel space. Hixosfs requires two kinds of programs: those that read/write tags from/to inodes, and those that interface with the tag based file management offered by hixosfs. To the first group belong programs such as stattag (think of the shell command stat) and chtag (think of chown and so on); to the second belong orderby, ls and find, with an intuitive meaning. Every program has been implemented as a simple shell program as well as an applet inside Busybox.
5
HIXOSFS: usage
Hixosfs code is a Linux kernel patch. Its implementations refer to log, music, or e-mail files. The final user can choose at kernel configuration time which hixosfs type to enable and then recompile the kernel. Users of embedded devices, as analysed in the next sections, can efficiently use this solution inside their systems. Changing the tag set means changing the itag struct definition inside the VFS. Implementing such an extension is a heavy process that includes kernel compilation and even the development of user space administration tools, and it cannot be delegated to someone who is not a Linux programmer. When the aim is to use hixosfs and at the same time to have a highly customizable solution, the hixosfs tag struct must be redesigned as a general purpose metadata struct. We could imagine an extended tag struct containing n tags of size m for a total of 4 KB, which is the upper bound on the inode size.
Figure 2. Hixosfs
Figure 3. orderby srcip date
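The general purpose tag struct with n tags of size m can be sketched as follows. This is only one possible layout under stated assumptions: the names generic_tag, max_tags, GENERIC_TAG_COUNT and GENERIC_TAG_SIZE are hypothetical, the values n = 32 and m = 64 are arbitrary, and the budget ignores the space that the standard inode fields already occupy.

```c
#include <assert.h>
#include <stddef.h>

enum { INODE_MAX_BYTES = 4096 };   /* upper bound quoted in the text */

/* How many tags of tag_size bytes fit in the budget once mask_size
 * bytes are reserved for a validity bitmask. Returns 0 on bad input. */
size_t max_tags(size_t tag_size, size_t mask_size)
{
    if (tag_size == 0 || mask_size > INODE_MAX_BYTES)
        return 0;
    return (INODE_MAX_BYTES - mask_size) / tag_size;
}

/* Example layout: n = 32 generic tags of m = 64 bytes each. */
#define GENERIC_TAG_COUNT 32
#define GENERIC_TAG_SIZE  64

struct generic_tag {
    unsigned int tag_valid;   /* bit i set means tags[i] holds a value */
    char tags[GENERIC_TAG_COUNT][GENERIC_TAG_SIZE];
};
```

With a single unsigned int as bitmask the design is limited to as many tags as that integer has bits, which is why the example stops at 32 tags even though more would fit in the byte budget.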
5.1
Files organization and retrieving
The user process interface has been extended with the two programs stattag and chtag to access or modify the new inode information for one file. Ad hoc user mode tools have been implemented to extract metadata and populate the whole fs in one step. One example is the command addmusic, based on the library TagLib[12], which extracts the author and the other tags from a musical file to fill the inode. In the case of e-mail, the tags are subject, sender, receiver and date, and they are extracted with ad hoc programs. As for wireless routers, for every transaction they store the Internet connection log header of the transmitted packets. Internet connection logs contain information of differing relevance. The minimal set of data needed to identify every packet has been identified as source IP, destination IP and date. In the following we show in detail how to use hixosfs, taking the case of log files as an example. To create a directory tree with tagged files, there is the command orderby, with syntax:
    orderby [-a tag1 | -b tag2]
The final user can choose any file organization with the orderby command. For example, using:
    #orderby srcip date
we obtain the tree of Figure 3, where 192.168.0.1, 12.01.2008 and 13.01.2008 are folders. Once you have a file tree, you need a specific user space routine to quickly search for a file inside the fs. The command scan -log finds the files inside the fs whose tags have a specific content, and has the syntax:
    scan -log -option [ value | ALL ]
Options are:
    -datestart DATE
    -dateend DATE
    -date DATE
    -srcip IP
    -destip IP
For example:
    scan -log -date all -srcip 192.168.0.5
is the command to find all activity of IP 192.168.0.5.
6
HIXOSFS: advantages and limitation
In Linux, file information is stored inside the struct inode of the VFS. Generally a Linux user can use the command stat to access some properties of each kind of file, such as file owner, group, and date of creation and modification. However, for certain kinds of files it can be useful to store extra information needed to label a file with respect to its content. Hixosfs[1] starts from the idea that allocating a greater amount of memory for the inode offers the chance to highly optimize file management, for example in the ordering phase. This obviously brings the disadvantages of higher memory consumption and a decrease in overall system performance.
Open Read vs Hixosfs
In this section we analyse how retrieval performance changes when tags are extracted from the file content with respect to having the tags in the hixosfs file structure. Open Read is an ad hoc program that opens a music file, extracts the tags and then shows them on the screen:
    TagLib_File* tlf = taglib_file_new(argv[i]);
    TagLib_Tag* tlt = taglib_file_tag(tlf);
    printf("\nAlbum: %s", taglib_tag_album(tlt));
Open Read uses the TagLib library, which works with the standard calls fopen and/or lseek, and prints on the screen with the standard C function printf. We compared the use of Open Read with the use of statmusic in hixosfs, which invokes the system call retag, reads the tags directly from the inode and shows them on the screen. Comparing the time spent by the two approaches, statmusic needed less than 1 second (0.03 sec), against the 4.19 minutes required by the ad hoc program. Open Read resembles the well known mp3info command, which, as the name says, works only with mp3 files, while we based our comparison on a musical dataset of heterogeneous file formats.
Drawback
Hixosfs is a straightforward but effective solution for managing metadata at an embedded level. Hixosfs was born in a non-embedded environment with the aim of better managing musical files[4]. The performance obtained seems to be comparable to that of other solutions[5]. However, it does not appear to be a good general solution, because memory is wasted: the cost of the larger inode (storage, disk cache copies and so on) is paid even for files without any tags and for other files such as system files. It nevertheless seems a very useful approach for a disk partition dedicated to a specific kind of data.
7
HIXOSFS Embedded
The application scenario of our fs is embedded devices dedicated to the specific purpose of managing files according to their metadata. We based our study on processor families already supported by the Linux kernel. We only need to patch the vanilla kernel with the hixosfs patch, and to download and compile the ad hoc hixos user space programs, either from source code or inside Busybox.
7.1
Internet provider
Hixosfs has been tested at a local Internet provider that manages several gigabytes of log data every year. One important task of an Internet provider, according to national law, is storing the Internet connection logs of its clients for some years. To give a numeric example, a small company with 300 clients can store 500 Gb of data every year. It is clear that searching inside a set of log files collected over a couple of years means performing a search in a very large database. Although such requests are very rare, the logs are very useful both for replying to legal authorities and as a dataset for mining analysis. The testing scenario was the router of a wireless Internet provider, a company that generally uses a server with a MySQL database to store information about logs. In our analysis the router was connected to a network hard disk, formatted with hixosfs, to store the logs. Anyone who wants information about the logs can mount this hard disk on another personal computer and use our user space programs (orderby and scan), with advantages in information access time and in power consumption, since no server needs to be turned on (Figure 4). In this scenario we use an ad hoc script to intercept all the packets of the router and to index the information with the inode data inside the network hard disk.
Figure 4. Internet provider schema
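The way orderby and scan use the log tags in this scenario can be illustrated with a small user space sketch. It is not the hixosfs implementation: the functions orderby_path and scan_match and the type name log_tag are hypothetical, the tags are held in an ordinary struct instead of being read from the inode, and the date format (YYYYMMDD, chosen to fit the 9-byte field) is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>

/* Same fields as the log variant of struct tag. */
struct log_tag {
    char srcip[16];
    char dstip[16];
    char date[9];
    unsigned int tag_valid;
};

/* Build the relative folder "<srcip>/<date>" that an ordering by srcip
 * and then date would use. Returns 0 on success, -1 if out is too small. */
int orderby_path(const struct log_tag *t, char *out, size_t out_size)
{
    int n = snprintf(out, out_size, "%s/%s", t->srcip, t->date);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}

/* A NULL pattern or the word ALL (any letter case) matches every value. */
static int field_matches(const char *pattern, const char *value)
{
    return pattern == NULL
        || strcasecmp(pattern, "ALL") == 0
        || strcmp(pattern, value) == 0;
}

/* scan-style filter on source IP and date; returns 1 on a match. */
int scan_match(const struct log_tag *t, const char *srcip, const char *date)
{
    return field_matches(srcip, t->srcip) && field_matches(date, t->date);
}
```

A query for all the activity of one address then corresponds to calling scan_match with that address as srcip and ALL as date on every tagged file.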
7.2
Multimedia player
Here we deal with systems holding a large amount of data that users need to catalogue and rearrange according to their needs; in this case hixosfs can speed up the information retrieval task. With our solution the Linux user can easily personalize the order of the data in the player with our user space programs, and can mount the filesystem in his own distribution with the standard mount command. Even if only at configuration and compilation time, the user can also personalize the data in the struct tag of the struct inode, choosing this information in the kernel configuration among author, title, year, album, playlist and so on. This information will be used to personalize the user space programs too. To test the hixosfs filesystem for music files we chose the Openmoko Freerunner[2], from an open source project that aims to port Linux to mobile devices. Linux already runs on the Freerunner, so we just had to patch the kernel with our modifications, and we chose to create a hixosfs partition on a miniSD card with 2 GB of stored music files. To extract the tags and import the data into hixosfs we offer an ad hoc program that reads the musical tags with TagLib and stores them in the struct tag inside the struct inode. Linux by default offers a similar tool called id3info, but we aimed to use libraries able to extract tags from different musical data formats[12].
7.3
Mail server
The management of e-mail by textual clients is not new; many Mail User Agents have been implemented with a textual interface, such as Pine and Mutt. These tools allow the user to create, send and receive e-mails and finally to catalogue them efficiently. However, dealing with e-mails as simple files can be useful and more efficient; the FUSE-based gmailfs, for example, allows Gmail messages to be handled at the user space fs level. When the need is fast computing time and high efficiency, the task can be delegated to kernel space: the files used to store the content of e-mails can be treated as another type of hixosfs file. From this type of file, in fact, we can extract metadata to identify every file in the fs. Usually an e-mail server holds a large amount of data that must be organized under different criteria. In the struct tag of the struct inode we chose to store information about the sender, receiver, date and subject of every e-mail, while the content of the file stores the content of the e-mail. We tested hixosfs on an LPC2468 board from Embedded Artists¹, on which we installed a small e-mail server. We configured the kernel to read hixosfs and connected an external hard disk to store the e-mails. A set of scripts has been implemented to read the content of an e-mail, to identify sender, receiver and subject, and to write this information inside the struct tag in the inode.
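The extraction step performed by these scripts can be sketched in C. This is an illustration under stated assumptions, not the authors' tooling: the functions header_value and extract_mail_tags and the type name mail_tag are hypothetical, the sender, receiver and subject are assumed to come from the From, To and Subject headers, folded (multi-line) headers are not handled, and the date field is left empty because its stored format is not specified in the text.

```c
#include <assert.h>
#include <string.h>
#include <strings.h>

/* Same fields as the mail variant of struct tag. */
struct mail_tag {
    char sender[30];
    char receiver[30];
    char date[9];
    char subject[30];
    unsigned int tag_valid;
};

/* Copy the value of header `name` into dst, truncated to fit.
 * Only the header block (up to the first empty line) is searched.
 * Returns 1 if the header was found, 0 otherwise. */
static int header_value(const char *msg, const char *name,
                        char *dst, size_t dst_size)
{
    size_t name_len = strlen(name);
    const char *line = msg;

    while (*line != '\0' && *line != '\n' && *line != '\r') {
        const char *end = strchr(line, '\n');
        size_t line_len = end ? (size_t)(end - line) : strlen(line);

        if (line_len > name_len && line[name_len] == ':'
            && strncasecmp(line, name, name_len) == 0) {
            const char *val = line + name_len + 1;
            size_t val_len = line_len - name_len - 1;

            /* trim leading blanks and trailing CR or blanks */
            while (val_len > 0 && (*val == ' ' || *val == '\t')) {
                val++;
                val_len--;
            }
            while (val_len > 0 && (val[val_len - 1] == '\r'
                                   || val[val_len - 1] == ' '))
                val_len--;
            if (val_len >= dst_size)
                val_len = dst_size - 1;
            memcpy(dst, val, val_len);
            dst[val_len] = '\0';
            return 1;
        }
        if (end == NULL)
            break;
        line = end + 1;
    }
    return 0;
}

/* Fill sender, receiver and subject; returns how many were found. */
int extract_mail_tags(const char *msg, struct mail_tag *tag)
{
    int found = 0;

    memset(tag, 0, sizeof *tag);
    found += header_value(msg, "From", tag->sender, sizeof tag->sender);
    found += header_value(msg, "To", tag->receiver, sizeof tag->receiver);
    found += header_value(msg, "Subject", tag->subject, sizeof tag->subject);
    return found;
}
```

Values longer than the 30-byte fields are silently truncated, which is a consequence of the fixed field sizes of the tag struct and may matter for long subjects or addresses.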
8
Conclusion and future works
Hixosfs is an alternative approach to better manage files inside embedded systems in all the cases where it is possible to identify a small set of metainformation from such files. The innovative idea is to handle metadata in kernel space, giving kernel task priority to metadata management. Designing such an fs required a deep analysis of the Linux VFS and of the ext2 fs. The main disadvantage of this approach is that the solution is not easy to customize: only at creation time can the designer choose which tags to store. As further development this work can be extended by:
• creating a common, tag independent layer;
• introducing a new file typology called hixosfs file in the VFS;
• extending the inode of Linux fs with the hixosfs tag without depending on the ext2 fs.
References
[1] Hixosfs file system. http://www.di.uniba.it/ hixos/hixosfs. Home page.
[2] Openmoko. http://www.openmoko.org, april 2009, March. Home page.
[3] Sqlite. http://www.sqlite.org/, June.
[4] C. Corriero. Hixosfs music: a filesystem in linux kernel space for musical files. MMEDIA 2009.
[5] C. Corriero. The hixosfs music approach vs common musical file management solutions, SIGMAP 2009.
[6] D. t. Z. Corriero, Cozza. A configurable linux file system for multimedia data, SIGMAP 2008.
[7] R. Landley. Toybox. http://landley.net/code/toybox/about.html, April 2009. Home page.
[8] S. T. Remy Card, Theodore Ts’o. http://e2fsprogs.sourceforge.net/.
[9] M. Szeredi. Fuse. http://fuse.sourceforge.net, March 2008. Home page.
[10] L. Torvalds. www.kernel.org.
[11] D. Vlasenko. Busybox. http://www.busybox.net/, May 2009. Home page.
[12] S. Wheeler. Taglib. http://developer.kde.org/wheeler/taglib.html, March 2008. Home page.
¹ http://www.embeddedartists.com/