
Chapter 12. Parallel Filesystems

If you are certain that your cluster will only be used for computationally intensive tasks that involve very little interaction with the filesystem, you can safely skip this chapter. But increasingly, tasks that are computationally expensive also involve a large amount of I/O, frequently accessing either large data sets or large databases. If this is true for at least some of your cluster's applications, you need to ensure that the I/O subsystem you are using can keep up. For these applications to perform well, you will need a high-performance filesystem.

Selecting a filesystem for a cluster is a balancing act. There are a number of different characteristics that can be used to compare filesystems, including robustness, failure recovery, journaling, enhanced security, and reduced latency. With clusters, however, it often comes down to a trade-off between convenience and performance. From the perspective of convenience, the filesystem should be transparent to users, with files readily available across the cluster. From the perspective of performance, data should be available to the processor that needs it as quickly as possible. Getting the most from a high-performance filesystem often means programming with the filesystem in mind, typically a very "inconvenient" task. The good news is that you are not limited to a single filesystem.

The Network File System (NFS) was introduced in Chapter 4. NFS is strong on convenience. With NFS, you will recall, files reside in a directory on a single disk drive that is shared across the network. The centralized availability provided by NFS makes it an important part of any cluster. For example, it provides a transparent mechanism to ensure that binaries of freshly compiled parallel programs are available on all the machines in the cluster. Unfortunately, NFS is not very efficient. In particular, it has not been optimized for the types of I/O often needed with many high-performance cluster applications.
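
For example, sharing a directory of cluster-wide binaries takes only a couple of steps. Here is a minimal sketch; the host name head, the directory /usr/local/cluster, the network address, and the export options are all placeholders you'll need to adapt:

    # On the server, add a line like this to /etc/exports:
    /usr/local/cluster  192.168.1.0/24(rw,sync)

    # Re-export the filesystem and verify the export:
    exportfs -ra
    showmount -e localhost

    # On each client, mount the directory (or add a matching /etc/fstab entry):
    mount -t nfs head:/usr/local/cluster /usr/local/cluster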

High-performance filesystems for clusters are designed using different criteria, primarily to optimize performance when accessing large data sets from parallel applications. With parallel filesystems, a file may be distributed across a cluster, with different pieces of the file stored on different machines, allowing parallel access.
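
To make this concrete, consider round-robin striping: a file is chopped into fixed-size stripes that are dealt out to the I/O nodes in turn, so a little arithmetic determines which machine holds any given byte. The following C fragment is only an illustrative sketch of that mapping; the 64 KB stripe size and the four I/O nodes are assumptions made for the example, not properties of any particular filesystem.

    #include <stdio.h>

    #define STRIPE_SIZE (64 * 1024)   /* bytes per stripe (assumed) */
    #define NUM_IONODES 4             /* number of I/O servers (assumed) */

    /* Under round-robin striping, stripe k of a file lives on I/O node
       k modulo the number of servers. Map a byte offset to its node. */
    int ionode_for_offset(long offset)
    {
        long stripe = offset / STRIPE_SIZE;
        return (int)(stripe % NUM_IONODES);
    }

    int main(void)
    {
        long offsets[] = { 0L, 65536L, 262144L, 1000000L };
        int i;

        for (i = 0; i < 4; i++)
            printf("offset %8ld -> I/O node %d\n",
                   offsets[i], ionode_for_offset(offsets[i]));
        return 0;
    }

Because consecutive stripes live on different machines, a large read or write is spread across all of the I/O servers, several of which can be transferring data at the same time.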

A parallel filesystem might not provide optimal performance for serial programs or single tasks. Because high-performance filesystems are designed for a different purpose, they should not be thought of as replacements for NFS. Rather, they complement the functionality provided by NFS. Many clusters benefit from both NFS and a high-performance filesystem.

There's more good news. If you need a high-performance filesystem, there are a number of alternatives. If you have very deep pockets, you can go for hardware-based solutions. With network attached storage (NAS), a dedicated server is set up to service file requests for the network. In a sense, NAS owns the filesystem. Since serving files is its only role, a NAS server tends to be a highly optimized file server. But because a NAS device is still a traditional server, latency can be a problem.

The next step up is a storage area network (SAN). Typically, a SAN provides direct block-level access to the physical hardware. A SAN typically includes high-performance networking as well. Traditionally, SANs use fibre channel (FC) technology. More recently, IP-based storage technologies that operate at the block level have begun to emerge. This allows the creation of a SAN using more familiar IP-based technologies.

Because of the high cost of hardware-based solutions, they are outside the scope of this book. Fortunately, there are also a number of software-based filesystems for clusters, each with its own set of features and limitations. While not everything that follows qualifies as a high-performance filesystem, one of these packages may fit your needs. However, you should be very careful before adopting any of these. Like most software, these should be regarded as works in progress. While they may be ideal for some uses, they may be problematic for others. Caveat emptor! These packages are generally available as both source tarballs and as RPMs.
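
As a rough sketch of the two installation routes, using a hypothetical package name (the exact steps vary from package to package, so read each package's documentation first):

    # From an RPM:
    rpm -ivh examplefs-1.0-1.i386.rpm

    # From a source tarball (assuming the usual configure-and-make routine):
    tar -xzf examplefs-1.0.tar.gz
    cd examplefs-1.0
    ./configure && make && make install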


ClusterNFS

This is a set of patches for the NFS server daemon. The clients run standard NFS software. The patches allow multiple diskless clients to mount the same root filesystem by "reinterpreting" file names. ClusterNFS is often used with Mosix. If you are building a diskless cluster, this is a package you might want to consider (http://clusternfs.sourceforge.net/).
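
The "reinterpretation" works through tagged file names on the server. As a hypothetical sketch, assuming ClusterNFS's $$TAG$$ naming convention and made-up node names, per-client variants of a file sit side by side in the exported tree:

    /etc/HOSTNAME                    # what the server itself sees
    /etc/HOSTNAME$$HOST=node1$$      # what node1 sees as /etc/HOSTNAME
    /etc/HOSTNAME$$HOST=node2$$      # what node2 sees as /etc/HOSTNAME

When node1 requests /etc/HOSTNAME, the patched server silently returns the variant tagged for it, so every diskless client can share a single root filesystem while keeping private copies of the few files that must differ.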


Coda

Coda is a distributed filesystem developed at Carnegie Mellon University. It is derived from the Andrew File System. Coda has many interesting features, such as performance enhancement through client-side persistent caching, bandwidth adaptation, and robust behavior in the face of partial network failures. It is a well-documented, ongoing project. While it may be too early to use Coda with large, critical systems, this is definitely a distributed filesystem worth watching (http://www.coda.cs.cmu.edu/index.html).


InterMezzo

This distributed filesystem from CMU was inspired by Coda. InterMezzo is designed for use with high-availability clusters. Among other features, it offers automatic recovery from network outages (http://www.inter-mezzo.org/).


Lustre

Lustre is a cluster filesystem designed to work with very large clusters of up to 10,000 nodes. It was developed and is maintained by Cluster File Systems, Inc. and is available under the GPL. Since Lustre patches the kernel, you'll need to be running a 2.4.X kernel (http://www.lustre.org/).


OpenAFS

The Andrew File System was originally created at CMU and was subsequently developed and supported by IBM. OpenAFS is an open source fork released by IBM. It provides a scalable client-server architecture with transparent data migration. Consider OpenAFS a potential replacement for NFS (http://www.openafs.org/).


Parallel Virtual File System (PVFS)

PVFS provides a high-performance, parallel filesystem. The remainder of this chapter describes PVFS in detail (http://www.parl.clemson.edu/pvfs/).

This is only a partial listing of what is available. If you are looking to implement a SAN, you might consider Open Global File System (OpenGFS) (http://opengfs.sourceforge.net/). Red Hat markets a commercial, enterprise version of OpenGFS. If you are using IBM hardware, you might want to look into General Parallel File System (GPFS) (http://www-1.ibm.com/servers/eserver/clusters/software/gpfs.html). In this chapter we will look more closely at PVFS, an open source, high-performance filesystem available for both Rocks and OSCAR.
