Implementation of Concurrent Access to File Systems in USB Devices
Bogdan Vacaliuc
NGI Technology, LLC.
15 Pleasant St. Suite 9, Concord, NH 03301, USA
+1.603.226.0777
[email protected]
Introduction
The combination of a removable storage device with a communication channel and a computation node is a common architectural theme in many system-on-chip (SOC) designs. The systems in which many of these devices are placed produce or consume digital media content from their removable storage devices, which are formatted with industry-standard file systems. When the communication channel is USB, the systems exchange this media content with a host via the USB Mass Storage Device Class (MSDC) specification. It is useful to allow a connected host access to the removable storage device concurrently with the operation of the computation node, the embedded SOC processor. There are difficulties in supporting such concurrency. This paper presents a method of achieving concurrency based on serialization of access to file system management structures (metadata).
The Concurrency Problem
Shared file systems must provide correct operation in the presence of multiple entities performing I/O on the same files. When a device exposing a USB/MSDC interface is connected to a USB host, the host performs I/O operations to the storage media directly, and keeps cached copies of data from those I/O operations. Processes on the USB device can simultaneously perform file operations on the same media. If the caching performed by the host is not reconciled with the operations on the device, the file system on the media will become inconsistent, leading to possible software failure.
Terms
Client refers to either the host or the processes on the device that require access to the file system.
Server refers to the one process on the device that manages access to the device and file system.
Read Sharing occurs when multiple clients access the same file for read access only.
Sequential Write Sharing occurs when one client closes a file it had opened with write access before other clients open the same file for read access.
Concurrent Write Sharing occurs when writes are mixed with reads by either client on the same file. [1]
Models
The system model can be represented as a modification of a network file system, with two clients. In this variation, the clients are replaced by file system drivers, and the server is replaced by a media access controller.
Figure 1 System Model
[Diagram labels: Host: App ... App, File System (Client A), USB Interface. Device: App ... App, File System (Client B). USB Connect. Server (Device): USB Interface, Cache/Block Device, File System Mutex.]
Two clients share access to a raw block device on a server. The clients are each responsible for interpreting the server’s data in terms of a file system. The USB constitutes the network. See Figure 1, above. A cache is attached to the block device where it provides performance benefits to all clients and enforces concurrency.
Related Work
Distributed File Systems
Concurrency in network file systems has been approached in a variety of ways. NFS and its derivatives operate on the granularity of the file object. This addresses the issue of file system metadata concurrency by using the native file system drivers on the NFS server. In order to improve overall system performance, network file systems define some amount of file buffer caching on the clients. Shared access to files in the network file system model thus becomes a problem of cache consistency across clients.
NFS [2] and the Andrew File System (AFS) [3] maintain consistency for sequential write sharing. That is, they require a writer to flush all uncommitted writes to the server on a file close. AFS guarantees consistency if the open occurs after subsequent closures by other clients. In NFSv3, the modification timestamp of a cached file is compared with the server upon open in order to determine if any cached data for that file must be invalidated. AFS uses a server-to-client callback system to invalidate cached data on the client.
Concurrent write sharing is achieved in some variants of NFS. Sprite [1] disables client caching whenever write sharing is detected. Spritely NFS [4, 5] adds state information on the server, allows caching on the client, and uses server-to-client callbacks to inform clients that they must stop caching files. NQNFS [6] employs a system of Leases [7] in which a client obtains the right to perform one of four kinds of I/O: Read, Read Caching, Write, and Write Caching. There is a mechanism for ‘eviction’ which causes a client holding a write cache lease to flush back all of its dirty blocks. After completing that task, it must apply for another lease, and continue operations under a new lease contract. Similarly, the ECHO Distributed File System [8] uses a read and write token which must be obtained from the server before caching or writing to the file, respectively.
Shared Block Devices
Recent work in parallel file systems and network attached storage has focused on the use of shared block devices. These are access controllers which allow multiple entities (generally termed initiators) to perform I/O operations to the storage media. The General Parallel File System (GPFS) and the Global File System (GFS) both operate over shared block device servers.
In GPFS [9], a token management scheme is used to provide access control between the clients attached to a given device server. A centralized lock manager hands out lock tokens to the local lock managers in each file system node. In this way, lock/unlock operations are performed locally. A lock token may be revoked by the centralized lock manager as other nodes require access.
GFS and OpenGFS [10, 11] use an inter-node distributed lock system separate from the I/O of the data blocks. As the file system metadata or file data is referenced in one client, that client acquires locks to just the blocks that it has read and/or cached. When another client attempts to lock the data blocks, the block device server will call back to the client holding the lock and request that it flush its cache and release the lock. In this way, file operations can run concurrently on many OpenGFS clients.
Finally, the SCSI protocol provides a means to achieve block locking operations via the SCSI Persistent Reserve mechanism [12]. A SCSI device, upon receipt of the SCSI_RESERVE_OUT command, will note the initiator ID and (optionally) the range of logical blocks (LBN) to which the reservation applies. Once reserved in this way, the SCSI protocol defines a matrix of actions and responses to any command by any initiator. The reserving initiator is granted access, and other initiators’ commands return a RESERVATION_CONFLICT status. By this method, SCSI devices (and their file systems) may be shared among initiators at a level of granularity that is agreed upon among them.
A Solution
A good implementation for a real-world embedded USB computing device is subject to the following two constraints:
• It must be simple.
• It must operate with existing host MSDC drivers.
An embedded device operates with limited memory, compute and power resources. Implementing a complex software system for this would take up resources that are better utilized for the device’s primary function. This effectively rules out NFS and other network file system variations. What remains is to approach the problem from a shared block device perspective, leading us into the next constraint: interoperability.
Neither the block device model nor the MSDC enforces the use of any one file system. If data is to be shared, however, the device software must be capable of interpreting the underlying file system structure that the host has organized on the media. The lowest common denominator is the File Allocation Table (FAT) File System [13], which is commonly used in small to medium capacity removable media cartridges and flash memory.
The problem has thus become one of how to share a file system that is not explicitly designed for sharing. Our solution borrows concepts from OpenGFS and SCSI persistent reserve to define epochs of exclusive access to file system metadata. A general solution is to prevent the file system clients on the device from performing any operations on the volume until the host has written back cached metadata and invalidated its cache. Correspondingly, the host must not be allowed to load its cache until it has obtained exclusive access to the metadata structures.
The challenge is to do this in such a way that does not involve installing any new software into the host operating system, but merely uses the existing and available interfaces.
Figure 2 Serializing access during host transactions
[Flowchart labels: Host Command; START/STOP, LoEj=1 (Yes/No); Read/Write (Yes/No); Acquire Volume Lock; Release Volume Lock; Process Command; Perform Media Access; Done.]

We define implicit criteria for obtaining exclusive access to the media by the host. At any point where a media access is required (READ/WRITE), a lock on the volume must be in hand. Further, we define explicit criteria for the host releasing its lock. The host shall issue a START_STOP_UNIT command with the fields Start=0; LoEj=1 (Eject Media). Figure 2 describes the processing of host commands.
On the device side, the volume lock is held briefly during file operations. This implies that, depending on conditions in the host, the device operations may stall. Typically, the device software architecture is defined in such a way as to forward file system operations to a thread which is allowed to block.
The following are use cases for file sharing on embedded devices.

Client Creates New File
File creation involves access to file system metadata structures. In our solution, we require this access to be exclusive in order to present a consistent state to the file system code on any client.
Client Modifies File Contents
In this case, existing file data blocks are accessed or modified by a client and no change to the metadata structures is made. If caching is performed on the clients, written blocks will not be visible to another reader until the following occur:
1. The writer commits its modified blocks.
2. The reader invalidates any cached blocks it holds.
Client Extends File
This case is the amalgam of the two previous cases. Here, the file contents have been modified by the writer and it must have obtained exclusive access to the metadata in order to extend the file.

Figure 3 Timeline showing a typical session
[Timeline labels: Client A (Host); USB Server; Lock Manager; Client B (Device); Read/Write Data; Acquire, Try Later, Granted; Media Access; Success; Eject Media; Release, Acknowledge; Cache owned by USB; Cache owned by Device.]

Figure 3, above, shows a timeline of a host and device exchanging ownership of the cache. It shows how the lock serializes ownership of the cache and ensures that all file systems as interpreted by their clients are consistent.

Implementation Details
When a device reporting an MSDC interface is plugged into a USB port, the host enumerates it and begins to issue I/O command transactions to it. This section describes basic principles in the operation of the USB/MSDC, FAT File Systems and the specific requirements of our proposed implementation on both the device and the host software.

USB/MSDC Principles
A USB device consists of both hardware and software components. Under USB, the host is always the initiator of a data transfer, be it inbound (to the host) or outbound (to the device). The nomenclature is host-centric; out refers to host-to-device transfers, while in refers to device-to-host transfers. Data is exchanged between endpoints which are one of four types: CONTROL, INTERRUPT, BULK, and ISOCHRONOUS.
A USB device must identify itself by providing a series of data structures, on demand, to the host. These data structures, called descriptors, tell the host what device class it belongs to, what the vendor and product identification strings contain, what the power consumption requirements are, etc. This information collectively defines the device class.
The latest specification of the MSDC defines two types of endpoints. Endpoint 0 is always of type CONTROL. The control endpoint is used to exchange descriptors, to enumerate and select active configurations and to deliver class-specific commands to the device. Endpoints 1 and 2 are of type BULK. Hence, the term Bulk-Only Transport. All transactions for media access and control occur over these two endpoints. [15]
Command Block Wrapper (CBW)
At the heart of the MSDC is the notion that the USB exists solely to transport commands, status responses and data between the device and the host. It does this via the definition of a command block wrapper. A CBW takes an existing storage device protocol command and packages it for delivery through the BULK-out endpoint to the device. The size of the CBW is always 31 bytes. Once the 31 bytes of the CBW are received by the device, it must validate them and perform the command according to the specification of its supported protocol. The following two protocols are relevant: Transparent SCSI and the reduced block command set (RBC).
The transparent SCSI command set provides transport of any SCSI revision. The peripheral device type, the revision supported and other information are communicated to the host via the INQUIRY command. In this case, the device will declare itself to be a “removable drive” in the INQUIRY data.
The RBC command set was developed to provide a smaller set of required commands and options for storage devices. According to the MSDC overview, it is the preferred command set for flash and removable media devices. It is not supported by Microsoft Windows operating systems at this time, and therefore could not be considered for this project. [14]
After the validation of the command wrapper, any data which must be transferred is sent via the BULK endpoints. For example, a WRITE command will be followed immediately by the number of bytes specified in the command, which the device is expected to receive and write onto the media. At the end of the transfer, the USB device must send to the host via the BULK-in endpoint a status word. There are provisions in the protocol for what to do when there is less data transferred than was specified in the command. [15]
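The wrapper validation described above can be sketched in a few lines. The following is a minimal illustration in Python rather than the device's native code. The field layout and signature values follow the Bulk-Only Transport specification [15]; the function names parse_cbw and build_csw and the dictionary keys are hypothetical and are not taken from our implementation.

```python
import struct

CBW_SIGNATURE = 0x43425355  # appears as the bytes "USBC" on the wire
CSW_SIGNATURE = 0x53425355  # appears as the bytes "USBS" on the wire
CBW_FORMAT = "<IIIBBB16s"   # little-endian, 31 bytes in total
CSW_FORMAT = "<IIIB"        # little-endian, 13 bytes in total


def parse_cbw(raw):
    """Validate a received CBW and return its fields, or None if invalid."""
    if len(raw) != struct.calcsize(CBW_FORMAT):
        return None
    signature, tag, length, flags, lun, cb_len, cb = struct.unpack(CBW_FORMAT, raw)
    if signature != CBW_SIGNATURE or not 1 <= cb_len <= 16:
        return None
    return {
        "tag": tag,                     # echoed back in the status word
        "transfer_length": length,      # bytes the host expects to move
        "data_in": bool(flags & 0x80),  # bit 7 set: device-to-host data phase
        "lun": lun & 0x0F,
        "command": cb[:cb_len],         # the wrapped SCSI command block
    }


def build_csw(tag, residue, status):
    """Build the status word sent on BULK-in after the data phase."""
    return struct.pack(CSW_FORMAT, CSW_SIGNATURE, tag, residue, status)
```

A device loop would call parse_cbw on each 31-byte BULK-out packet, hand the "command" bytes to its SCSI layer, and answer with build_csw using the same tag.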
Command Status Word (CSW)
Upon the completion of every CBW, a 13 byte CSW is transmitted to the host. The CSW contains 4 items of information: a signature, a tag that matches the tag sent in the CBW to which this CSW pertains, the quantity of residual bytes not transferred, and a status code. The status code is either 0x00 (success), 0x01 (fail), or 0x02 (phase error). If the status is phase error, then the host performs a reset recovery on the device. Following the sending of the CSW, the device must be prepared to receive another CBW from the host via the BULK-out endpoint.
Sense Information
Upon sending a failure code to the host, the device will save information regarding the condition which led to the failure. This is called sense information. The host, upon receiving a failure code in the CSW, has the opportunity to send a REQUEST_SENSE command to the device as the very next command. This will retrieve the failure reason and inform the host regarding the condition of the device and of the media. Detailed information regarding sense codes is found in the SCSI Primary Command Reference. [12] Our implementation uses sense information to return a unit attention condition.

FAT File System Principles
The FAT on-disk format consists of the following four sections: Boot/Parameter Block (BPB), File Allocation Table (FAT), Root Directory Table (RDIR), and File Data Area (FDAT). The BPB, FAT and RDIR represent the file system’s metadata, which describe the organization of the file system.
The BPB contains information about the type, size and geometry of allocation units. Allocation units are the minimum chunk size that is apportioned to a file when it is created and extended. It is always a power-of-two multiple of the media block size. Every allocation unit in the FDAT is accounted for by a word in the FAT. This allocation unit is called a cluster. The total count of clusters determines whether a file system is a 12-bit (FAT12), 16-bit (FAT16), or 28-bit (FAT32) file system, and the file system type determines the word size of each cluster entry in the FAT (1.5 byte, 2 byte or 4 byte respectively). The words in the FAT area itself form a singly-linked list of clusters belonging to a certain file in the FDAT. [13]
The RDIR contains an array of 32-byte structures that contain information about a file. Most importantly, it contains the starting cluster number. A “directory” is just a file that contains a collection of these structures. In the case of FAT12/FAT16 the RDIR is in a determined location and size just before the FDAT. In FAT32, there is no such special treatment of the RDIR and it occupies space in the FDAT.
In our implementation, the host is likely to cache much of the BPB, FAT and RDIR metadata and some of the FDAT metadata.

Device-Side Requirements
Our implementation imposes a number of requirements on the device-side software. Most of these requirements are already part of good software architecture and the USB/MSDC/SCSI specification. We simply define the roles these components play within the system.
Block Device Interface
It is a common practice to create layers between physical devices and the software components that access them. A block device interface provides a layer of abstraction that allows different block device drivers to be interchanged. A block device driver moves data to/from its underlying media in groups of bytes called blocks. A minimal device driver exports the following interfaces:
init()    initialize the device driver
fini()    shut down the device driver
read()    transfer data from media to memory
write()   transfer data from memory to media
ioctl()   perform miscellaneous operations
In addition, we define our read() and write() interface to accept a synchronization object which will be signaled when the I/O is complete. This creates an asynchronous I/O system and follows a model used in much of the rest of the native code for the target system (a ChipWrights CW4512 processor). Using asynchronous I/O does not preclude synchronous operation; the software layers above the device driver decide if and when they need to wait on the synchronization object.
The ioctl() interface exists to provide a general method of making class-specific and driver-specific procedure calls. In our implementation, we require these functions for controlling and obtaining information from the device driver:
1. Reset
2. Media present
3. Media block size
4. Total number of blocks available on the media
5. Synchronize cache
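The interface above can be sketched as follows. This is an illustrative Python model, not the native CW4512 code: the class name RamBlockDevice, the Ioctl enumeration and the use of threading.Event as the synchronization object are all assumptions made for the example, and the toy driver completes each transfer immediately instead of from a hardware completion event.

```python
import threading
from enum import Enum, auto


class Ioctl(Enum):
    """The five control requests listed in the text (names are hypothetical)."""
    RESET = auto()
    MEDIA_PRESENT = auto()
    BLOCK_SIZE = auto()
    BLOCK_COUNT = auto()
    SYNC_CACHE = auto()


class RamBlockDevice:
    """Toy block device backed by a bytearray, for illustration only."""

    def __init__(self, block_size=512, block_count=64):
        self.block_size = block_size
        self.block_count = block_count
        self.media = None

    def init(self):
        self.media = bytearray(self.block_size * self.block_count)

    def fini(self):
        self.media = None

    def _span(self, lba, nbytes):
        # Translate a block address and byte count into a media byte range.
        start = lba * self.block_size
        if lba < 0 or nbytes % self.block_size or start + nbytes > len(self.media):
            raise ValueError("transfer is not block-aligned or is out of range")
        return start, start + nbytes

    def read(self, lba, buffer, done):
        """Copy whole blocks from media into buffer, then signal done."""
        start, end = self._span(lba, len(buffer))
        buffer[:] = self.media[start:end]
        done.set()  # a real driver would signal this when the hardware finishes

    def write(self, lba, data, done):
        """Copy whole blocks from data onto the media, then signal done."""
        start, end = self._span(lba, len(data))
        self.media[start:end] = data
        done.set()

    def ioctl(self, request):
        if request is Ioctl.RESET:
            self.init()
            return True
        if request is Ioctl.MEDIA_PRESENT:
            return self.media is not None
        if request is Ioctl.BLOCK_SIZE:
            return self.block_size
        if request is Ioctl.BLOCK_COUNT:
            return self.block_count
        if request is Ioctl.SYNC_CACHE:
            return True  # this toy driver keeps no internal buffering
        raise ValueError(request)
```

A caller that wants synchronous behavior simply waits on the event after issuing the request; a caller that wants overlap issues several requests and waits later.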
Synchronize cache exists in order to provide support for device drivers, such as NAND Flash, that maintain some amount of internal buffering as part of their operation. This functionality exists in the SCSI specification, where it is implicit upon the device entering a state which prevents access to the media [12].
Cache Interface
Ideally, a cache interface that exactly matches the block device interface is defined and implemented. This allows for the seamless transition between cached and non-cached operation. To provide a zero-copy API to the native file system driver, an ioctl() is used to expose a logical block ‘reference’ and ‘release’ operation.
It is important for the device file system operations to use this cache for two reasons: First, the cache speeds up the local file system operations. Second, the cache provides the common data area where the USB client data and the local client data are synchronized. The USB client and the local client use the same cache to guarantee that they reference identical contents.
File System Locking Interface
The native file system is designed to be accessible by multiple threads running within the context of the application. To make metadata modifications atomic, the file system serializes access to its structures with a mutual-exclusion lock. The file system exposes this locking mechanism to the USB thread so that it can perform atomic accesses as well. Any media access performed by the USB will require that the file system lock be held. Note that the lock will not be released until the START_STOP_UNIT (Start=0;LoEj=1) command is received. This lock will effectively signify a cache lock on the volume.
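The interplay between the shared cache and the file system lock can be modeled roughly as below. This is a sketch under stated assumptions: VolumeLock, TryLater, usb_access, usb_eject and device_access are hypothetical names, the cache is reduced to a dictionary of blocks, and contention is reported with an exception where a real device thread would simply block.

```python
import threading


class TryLater(Exception):
    """Raised when the other client currently owns the volume."""


class VolumeLock:
    """Tracks which client owns the shared cache: None, "usb" or "device"."""

    def __init__(self):
        self._guard = threading.Lock()
        self.owner = None

    def try_acquire(self, who):
        with self._guard:
            if self.owner in (None, who):
                self.owner = who
                return True
            return False

    def release(self, who):
        with self._guard:
            if self.owner == who:
                self.owner = None


def usb_access(lock, cache, lba, data=None):
    """Host READ/WRITE: acquire implicitly and keep the lock until eject."""
    if not lock.try_acquire("usb"):
        raise TryLater()
    if data is not None:
        cache[lba] = data
    return cache.get(lba)


def usb_eject(lock):
    """START_STOP_UNIT with Start=0, LoEj=1: the host gives up the volume."""
    lock.release("usb")


def device_access(lock, cache, lba, data=None):
    """Device file operation: hold the lock only for the operation itself."""
    if not lock.try_acquire("device"):
        raise TryLater()
    try:
        if data is not None:
            cache[lba] = data
        return cache.get(lba)
    finally:
        lock.release("device")
```

Because both clients go through the same cache dictionary, a block written by one side is the block seen by the other once ownership changes hands.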
Unit Attention Implementation
Unit attention is used by the SCSI protocol to inform the initiator that a service-affecting event has occurred on the target. Once the attention condition has been established, the next command, with the exception of INQUIRY and REQUEST_SENSE, must fail with a CHECK_CONDITION status. [16]
Our implementation uses unit attention to inform the host that the data on the media has changed. The host will, upon receiving the MEDIA_CHANGED sense qualifier in its (subsequent) REQUEST_SENSE command, invalidate any cached blocks it holds on the volume.
A unit attention will be set if, during the device’s ownership of the cache, the cache’s contents are modified via a write operation. This does not relate to the fact that the cache has dirty data blocks, although at least one block in the cache should be dirty. Rather, it signifies that since the last time the host released the file system lock, a cache write operation has taken place.
START_STOP_UNIT Implementation
The START_STOP_UNIT command is used to request that the target enable or disable the volume for media access. Two control bits in the command determine whether media should be loaded or ejected. Our implementation uses Eject Request (Start=0;LoEj=1) to signal the end of caching by the host. When this is received, the file system lock will be released, and the device will once again have an opportunity to perform I/O through the cache.
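The two mechanisms above can be combined in a small command handler. The sketch below is illustrative only: the Target class and its attribute names are hypothetical, lock contention with the device is left out, and the mapping of the MEDIA_CHANGED qualifier to additional sense code 0x28 is an assumption. The opcodes and the Start and LoEj bit positions are the standard SCSI values.

```python
INQUIRY = 0x12
REQUEST_SENSE = 0x03
START_STOP_UNIT = 0x1B
READ_10 = 0x28
WRITE_10 = 0x2A

SENSE_UNIT_ATTENTION = 0x06  # SCSI sense key
ASC_MEDIUM_CHANGED = 0x28    # assumed code for the MEDIA_CHANGED qualifier
CSW_PASS, CSW_FAIL = 0x00, 0x01


class Target:
    """Device-side handling of the commands that drive the volume lock."""

    def __init__(self):
        self.host_holds_lock = False
        self.unit_attention = False
        self.sense = (0, 0)     # (sense key, additional sense code)
        self.reported = (0, 0)  # what the last REQUEST_SENSE returned

    def device_wrote_cache(self):
        """Called when device file code writes through the cache."""
        self.unit_attention = True

    def handle(self, cdb):
        op = cdb[0]
        # A pending unit attention fails the next command, except the
        # two exempt ones, and is cleared once it has been reported.
        if self.unit_attention and op not in (INQUIRY, REQUEST_SENSE):
            self.unit_attention = False
            self.sense = (SENSE_UNIT_ATTENTION, ASC_MEDIUM_CHANGED)
            return CSW_FAIL
        if op == REQUEST_SENSE:
            self.reported, self.sense = self.sense, (0, 0)
            return CSW_PASS
        if op in (READ_10, WRITE_10):
            self.host_holds_lock = True  # implicit acquire on media access
            return CSW_PASS
        if op == START_STOP_UNIT:
            start = cdb[4] & 0x01
            loej = (cdb[4] >> 1) & 0x01
            if start == 0 and loej == 1:
                self.host_holds_lock = False  # explicit release on eject
            return CSW_PASS
        return CSW_PASS
```

In this model a host that reads after the device has written sees one failed command, fetches the sense data, and then retries with its cache invalidated.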
Host-Side Requirements
On the host side, we use only the native MSDC support to access the storage media. Due to OS behavioral differences which we describe below, we were forced to implement an explicit lock/release mechanism.
Eject Media Operation
Once the host has finished accessing the media, it performs a media-eject operation to release the mutex which it implicitly acquired by accessing the device. For a user, this can be accomplished using the published methods. For example, on Microsoft Windows, the drive letter in the Windows Explorer folder view has a context-sensitive menu, where ‘Eject’ is an option. There are also system calls that are available to perform this operation under software control. [17, 18]
Behavioral Differences between OS
We would have liked to use the SCSI command PREVENT_ALLOW_MEDIUM_REMOVAL (PAMR) or the SCSI RESERVE commands as the criteria for acquiring/releasing the mutex. Unfortunately, empirical analysis showed that under the Windows OS, RESERVE is never issued and PAMR is not a definitive indicator that the host will not issue additional media operations. We captured traces in which a PAMR (Lock=0, “allow removal”) was followed immediately by a WRITE! (See footnote 1.) We therefore sought an alternative, and found it in the explicit eject operation, which we implemented successfully.
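The reasoning about PAMR can be made concrete with a small trace check. The sketch below is illustrative: the function name and the two traces in the usage note are made up for the example and are not the traces we captured. It assumes each trace entry is a raw SCSI command block, with the PAMR opcode 0x1E and the prevent bit in bit 0 of byte 4.

```python
PREVENT_ALLOW_MEDIUM_REMOVAL = 0x1E
MEDIA_ACCESS_OPCODES = {0x28, 0x2A}  # READ(10) and WRITE(10)


def media_access_after_allow(trace):
    """Return True if a READ/WRITE follows a PAMR that allowed removal."""
    allowed = False
    for cdb in trace:
        if cdb[0] == PREVENT_ALLOW_MEDIUM_REMOVAL:
            # Bit 0 of byte 4 is the prevent flag; 0 means "allow removal".
            allowed = (cdb[4] & 0x01) == 0
        elif allowed and cdb[0] in MEDIA_ACCESS_OPCODES:
            return True
    return False
```

For a synthetic trace in which an "allow removal" is followed by a WRITE, the function returns True, which is the pattern that makes PAMR unusable as a release criterion. For a trace that ends with the "allow removal", it returns False.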
Footnote 1: Observed in Windows; the Linux USB MSDC correctly issues a PAMR (Lock=0) as the last command following an unmount operation.

Summary
This paper has presented the issues and software components needed to achieve device and host concurrency over the USB/MSDC. By using implicit locking at the USB media access and requiring an explicit release via the media eject feature, a robust and interoperable system has been implemented. No new drivers were added to the host operating system to support this, and the software requirements on the device, apart from the typical MSDC implementation, are minimal. Implementing a standards-based protocol for USB devices is a scalable method for quick deployment of embedded systems. This paper has shown how to implement a simple and effective method of maintaining file system consistency on devices which export content over their USB interface.

Acknowledgements
I would like to thank Franz Weller of Weller Engineering for his many contributions to the thought, designs and programs that led to this article, and Frank Boddeke of ChipWrights, Inc. for his guidance and a patient ear. I would also like to recognize Ville Miettinen of Hybrid Graphics, Ltd. for his inspirational stylistic comments.

Copyrights
Windows is Copyright ©Microsoft, Inc.
CW4512 is Copyright ©ChipWrights, Inc.
GFS is Copyright ©Red Hat, Inc.
GPFS is Copyright ©IBM Corporation

References
[1] M. Nelson, B. Welch, and J. Ousterhout. Caching in the Sprite Network File System. ACM Transactions on Computer Systems, 6(1): pp. 134-154, Feb. 1988.
[2] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon. Design and Implementation of the Sun Network Filesystem. Summer Usenix Conference Proceedings, pp. 119-130, June 1985.
[3] J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, R. Sidebotham, and M. West. Scale and Performance in a Distributed File System. ACM Transactions on Computer Systems, 6(1): pp. 51-81, Feb. 1988.
[4] V. Srinivasan and J. Mogul. Spritely NFS: Experiments with Cache Consistency Protocols. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, pp. 45-57, Dec. 1989.
[5] J. Mogul. Recovery in Spritely NFS. Computing Systems, 7(2): pp. 201-262, 1994.
[6] R. Macklem. Not Quite NFS, Soft Cache Consistency for NFS. Winter USENIX Conference Proceedings, USENIX Association, Berkeley, CA, Jan. 1994.
[7] C. Gray and D. Cheriton. Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency. Proc. the 12th ACM Symp. Operating Systems Principles, pp. 202-210, 1989.
[8] A. Birrell, A. Hisgen, C. Jerian, T. Mann, and G. Swart. The Echo Distributed File System. Technical Report 111, Digital Equipment Corp. Systems Research Center, Sep. 1993.
[9] F. Schmuck and R. Haskin. GPFS: A Shared-Disk File System for Large Computing Clusters. In Proc. of the First Conference on File and Storage Technologies (FAST), Jan. 2002.
[10] S. R. Soltis, T. M. Ruwart, and M. T. O'Keefe. The Global File System. Proceedings of the Fifth NASA Goddard Conference on Mass Storage Systems, pp. 319-342, 1996.
[11] B. Cahill, et al. “OpenGFS Locking Mechanism”, May 2004.
[12] Ralph O. Weber, et al. “SCSI Primary Commands – 2,” Project T10/1236-D, Revision 20, July 2001.
[13] Microsoft Corporation. FAT: General Overview of On-Disk Format, Version 1.03, Dec. 2000.
[14] Microsoft Corporation. “USB Storage - FAQ for Driver and Hardware Developers”, Nov. 2003.
[15] USB Implementers Forum. “Universal Serial Bus, Mass Storage Class, Bulk-Only Transport”, Revision 1.0, Sep. 1999.
[16] Ralph O. Weber, et al. “SCSI Architecture Model – 2,” Project T10/1157-D, Revision 24, Sep. 2002.
[17] Microsoft Knowledge Base, Article #165721, “How To Ejecting (sic) Removable Media in Windows NT/Windows 2000/Windows XP”.
[18] Jeff Tranter. “Eject 2.0.12”, Dec. 2002.

Additional Recommended Reading:
[19] J. Yin, L. Alvisi, M. Dahlin, and C. Lin. Volume Leases for Consistency in Large-Scale Systems. IEEE Transactions on Knowledge and Data Engineering, Jan. 1999.
[20] P. J. Braam. “The Coda Distributed File System”, Linux Journal #50, June 1998.