The InterMezzo High Availability File System HOWTO

Peter J. Braam braam@cs.cmu.edu, Rob Simmonds, Gordon Matzigkeit gord@fig.org and Christopher Li chrisl@mountainviewdata.com

v1.8, March 2, 2001


This document explains the configuration and operation of the InterMezzo file system on Linux.

1. Disclaimer and License

InterMezzo is an experimental file system. It contains kernel code and daemons running with root permissions and is known to have bugs. Please back up all data when using or experimenting with InterMezzo.

InterMezzo is covered by the GPL. The GPL describes the warranties made to you, and can be found in the file COPYING.

Copyright on InterMezzo is held by Peter J. Braam, Stelias Computing, Carnegie Mellon University, Phil Schwan, Los Alamos National Laboratory, Red Hat, Inc., TurboLinux, Inc., Tacitus Systems, Inc. and Mountain View Data, Inc.

2. Introduction

2.1 What is InterMezzo?

InterMezzo is a file system that keeps replicas of folder collections, known as filesets, in sync across multiple computers. The computers that express an interest in a fileset are called the replicators of the fileset. Each fileset has one server, which plays an organizing role in exchanging updates with the replicators.

InterMezzo supports disconnected operation, i.e. it maintains a journal to remember all updates that need to be forwarded when a failed communication channel comes back. This is a best-effort synchronization, since conflicting updates are possible during disconnected operation.

InterMezzo uses an existing disk file system, in practice ext3, as the storage location for all data. When an ext3 file system is mounted as file system type InterMezzo instead of ext3, the InterMezzo software starts managing all access to the file system. It keeps the logs of modification records and negotiates permits to modify the disk file system, to avoid conflicting updates during connected operation.

2.2 Current Limitations

Security

Currently you should run InterMezzo only on trusted networks -- there is NO security built into the system yet. A good way to get a trusted network is to use IPSEC (see FreeSwan http://www.freeswan.org) or CIPE (see http://sites.inka.de/sites/bigred/devel/cipe.html).

Recovery

The system currently has journal recovery in combination with Ext3. After a system crash, the local disk file system, including the KML, LML and last_rcvd files which contain distributed state, will recover automatically. Recovery with peers will normally also be seamless. Even greater file content recovery is possible, and this will be implemented shortly.

Conflict Handling

The system currently has no conflict handlers and only crude conflict detection. More extensive conflict resolution tools are being developed and should be available with the next major release. The design of the system means that conflicts can only occur when reconnecting after a period of disconnected operation, and that they can only occur on a client.

Fetch on demand

At the moment InterMezzo replicates an entire filesystem. However, a fetch on demand system will appear in a future version, which will allow partial replication of a filesystem.

Serializing Fetches

Due to an unfortunate snag we presently serialize fetches of files. This is not good for concurrent access. We will fix this shortly using the Local Modification Log (LML).

3. Using InterMezzo

Here we describe how to set up a server and clients.

3.1 Prerequisite packages

InterMezzo uses several packages which need to be installed before it can be used. Here is a checklist of the required packages:

e2fsprogs-1.20.WIP.sct.tar.bz2

NOTE: This package is still considered ALPHA, and it is not on the e2fsprogs homepage. It can be downloaded from http://www.kernel.org/pub/linux/kernel/people/sct/ext3/e2fsprogs/ Please make sure that the e2fsprogs you download is new enough to have the -J (uppercase `j') option, which the mkizofs utility requires.

[root@chris e2fsprogs-1.20.WIP.sct]# mke2fs 
mke2fs 1.20-WIP, 17-Jan-2001 for EXT2 FS 0.5b, 95/08/09
Usage: mke2fs [-c|-t|-l filename] [-b block-size] [-f fragment-size]
        [-i bytes-per-inode] [-j] [-J journal-options] [-N number-of-inodes]
        [-m reserved-blocks-percentage] [-o creator-os] [-g blocks-per-group]
        [-L volume-label] [-M last-mounted-directory] [-O feature[,...]]
        [-r fs-revision] [-R raid_opts] [-s sparse-super-flag]
        [-qvSV] device [blocks-count]
[root@chris e2fsprogs-1.20.WIP.sct]# 

expat-1.95.1

This is a C library for XML parsing.

Perl packages:

All of these packages (except SetFS, which is in the InterMezzo tarball) can be found at ftp://ftp.inter-mezzo.org/pub/intermezzo/. To install them, cd into the unzipped package's top directory, then run:

perl Makefile.PL; make install

3.2 Installing the software

  1. Build the executables and prepare the software for installation. (Be sure only to run make from the top level directory, so that the installation defaults are set correctly):
    $ cd .../intermezzo
    $ make
    
    You may be prompted to install additional software dependencies. Follow the on-screen instructions and choose your preferred automatic installation method, or interrupt the process, install the dependencies manually, and run make again.
  2. Install the InterMezzo kernel module, daemon, and management software:
    $ su
    Password:
    # make install
    

3.3 Config files

Your default config directory is /etc/intermezzo. You may use the interactive inconfig command to generate the following configuration files, or manually create them.

The config files have changed significantly in the new version of InterMezzo. The new config files use XML instead of the older Perl-style format.

/etc/intermezzo/sysid

Holds the name of your system, the presto device name and the IP bind address. Suppose your server has the name muskox, with IP address 192.168.0.3, and your clients are clientA and clientB. The sysid file on each host contains that host's name, presto device and IP bind address. For example, on muskox the file would contain:

<sysid name="muskox" psdev="/dev/intermezzo0" bindaddr="192.168.0.3" />

Note that in early versions of InterMezzo, this file did not contain the name of the presto device; this field is now required.

/etc/intermezzo/serverdb

Holds a database of servers. Each server is described by an XML server element, as follows:

<serverdb>
  <server name="muskox" ipaddr="192.168.0.3" port="2222" 
    bindaddr="192.168.0.3" />
</serverdb>

The above contains a single server description for the server muskox with IP address "192.168.0.3". The port and bindaddr are optional; the default port is 2222. Without a bindaddr, the server listens on all interfaces for requests; with it, the server listens only on the bindaddr address. If you are running both a client and a server on the same system, you need to specify a different bindaddr for the server and the client(s).
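To make the optional attributes concrete, here is a minimal Python sketch that reads a serverdb and fills in the default port. It is only an illustration using the Python standard library: the function name parse_serverdb and the use of None for a missing bindaddr are assumptions, and Lento's own parser may behave differently.

```python
import xml.etree.ElementTree as ET

DEFAULT_PORT = 2222  # default port stated in the text above


def parse_serverdb(text):
    """Map server name -> dict with ipaddr, port and bindaddr (hypothetical helper)."""
    servers = {}
    for elem in ET.fromstring(text).findall("server"):
        servers[elem.get("name")] = {
            "ipaddr": elem.get("ipaddr"),
            # Fall back to the documented default when no port is given.
            "port": int(elem.get("port", DEFAULT_PORT)),
            # Assumption of this sketch: None stands for "listen on all interfaces".
            "bindaddr": elem.get("bindaddr"),
        }
    return servers


sample = """<serverdb>
  <server name="muskox" ipaddr="192.168.0.3" />
</serverdb>"""

print(parse_serverdb(sample))
```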

/etc/intermezzo/fsetdb

Holds a database of filesets. Each fileset is described by an XML fileset element, as follows:

<fsetdb>
<fileset name="yourfsetname" servername="muskox" >
<replicator>clientA</replicator>
<replicator>clientB</replicator>
</fileset>
</fsetdb>

The above contains a single fileset description for a fileset called yourfsetname which is served by muskox. The fileset is replicated on hosts clientA and clientB.
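As an illustration of how the fsetdb relates hosts to filesets, the following Python sketch parses the example above and reports whether a given host name is the server or a replicator of a fileset. The helper names parse_fsetdb and role_of are hypothetical, and matching the host against the name in its sysid file is an assumption about how the files fit together, not a description of Lento's code.

```python
import xml.etree.ElementTree as ET


def parse_fsetdb(text):
    """Map fileset name -> (servername, [replicator names])."""
    filesets = {}
    for fset in ET.fromstring(text).findall("fileset"):
        replicators = [(r.text or "").strip() for r in fset.findall("replicator")]
        filesets[fset.get("name")] = (fset.get("servername"), replicators)
    return filesets


def role_of(host, fileset):
    """Return 'server', 'replicator' or None for host in one parsed fileset."""
    servername, replicators = fileset
    if host == servername:
        return "server"
    if host in replicators:
        return "replicator"
    return None


sample = """<fsetdb>
<fileset name="yourfsetname" servername="muskox" >
<replicator>clientA</replicator>
<replicator>clientB</replicator>
</fileset>
</fsetdb>"""

filesets = parse_fsetdb(sample)
for host in ("muskox", "clientA", "other"):
    print(host, role_of(host, filesets["yourfsetname"]))
```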

/etc/fstab

To ease the mounting of InterMezzo filesets, add one of the following entries to the /etc/fstab file. For testing and development, using a loop device as the cache is easiest:

/tmp/cache  /izo0      InterMezzo loop,fileset=fsetname,mtpt=/izo0,prestodev=/dev/intermezzo0,cache_type=ext3,noauto 0 0

where /tmp/cache is a file associated with a loop device, /izo0 is a mount point (a directory), fsetname is the name of the fileset and /dev/intermezzo0 is the name of the presto device. The creation of the cache file and the presto device is explained in the examples at the end of this section. The kernel must be configured with loopback device support enabled to do this.
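Because an fstab entry has to sit on one line with the mount options as a single comma-separated field, it can help to generate the entry rather than type it. The Python sketch below does that; the function fstab_entry and its parameter names are hypothetical, and the option names are simply the ones used in the example above.

```python
def fstab_entry(cache, mtpt, fileset, prestodev, loop=True, cache_type="ext3"):
    """Build a one-line /etc/fstab entry for an InterMezzo cache (illustrative only)."""
    options = []
    if loop:
        # Only needed when the cache is a file rather than a block device.
        options.append("loop")
    options += [
        "fileset=" + fileset,
        "mtpt=" + mtpt,
        "prestodev=" + prestodev,
        "cache_type=" + cache_type,
        "noauto",
    ]
    # fstab fields are separated by whitespace, so the option field is joined
    # with commas only and must not contain spaces or line breaks.
    return "%s %s InterMezzo %s 0 0" % (cache, mtpt, ",".join(options))


print(fstab_entry("/tmp/cache", "/izo0", "fsetname", "/dev/intermezzo0"))
```

Passing loop=False produces the form of the entry used for a real block device.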

Using a genuine block device is a little easier, because you do not need to set up a loop device. To use the block device /dev/hda9, the /etc/fstab file should contain:

/dev/hda9  /izo0      InterMezzo fileset=fsetname,mtpt=/izo0,prestodev=/dev/intermezzo0,cache_type=ext3,noauto 0 0
NOTICE:
/izo0/.intermezzo/fsetname/kml

The kernel modification log (KML) keeps track of all of the changes made in an InterMezzo filesystem.

/izo0/.intermezzo/fsetname/last_rcvd

The last_rcvd file keeps track of the last record in the KML file that the kernel has handled. In the current release of InterMezzo, the KML and last_rcvd files need to be created (usually by running mkizofs) before first mounting an InterMezzo filesystem.

mkizofs -v fsetname /tmp/cache

See mkizofs -h for options, such as specifying the filesystem type. If you have already initialized your cache filesystem, then you must manually create the needed InterMezzo metadata files:

mount -o loop /tmp/cache /izo0
mkdir -p /izo0/.intermezzo/fsetname
touch /izo0/.intermezzo/fsetname/{kml,last_rcvd}
umount /izo0

These examples assume that we are using the loopback device with the /tmp/cache filesystem, and that the fileset will be called fsetname.

Let's consider three common system configurations. For each, we give the config files and the correct invocations to start the server/cache manager.

One client and one server (typical use: laptop - desktop syncing):

In this case we assume that the host muskox is serving the fileset shared and the host clientA is replicating the fileset. The following files are placed on both muskox and clientA.

/etc/intermezzo/serverdb

<serverdb>
  <server name="muskox" ipaddr="192.168.0.3" />
</serverdb>

/etc/intermezzo/fsetdb

<fsetdb>
<fileset name="shared" servername="muskox" >
<replicator>clientA</replicator>
</fileset>
</fsetdb>

/etc/intermezzo/sysid

On muskox this contains:

<sysid name="muskox" psdev="/dev/intermezzo0" bindaddr="192.168.0.3" />

On clientA this contains:

<sysid name="clientA" psdev="/dev/intermezzo0" bindaddr="192.168.0.20" />

/etc/fstab

The following line is added on both muskox and clientA:

/tmp/fs0 /izo0 InterMezzo loop,fileset=shared,prestodev=/dev/intermezzo0,mtpt=/izo0,cache_type=ext3,noauto 0 0

/tmp/fs0

This file and the filesystem in it are created using the following commands:

dd if=/dev/zero of=/tmp/fs0 bs=1024 count=10k
mkizofs -F /tmp/fs0

/izo0/.intermezzo/shared/kml

If we didn't run mkizofs above, we create the KML and last_rcvd files by first mounting the filesystem as ext3:

mkdir /izo0
mount -o loop /tmp/fs0 /izo0
mkdir -p /izo0/.intermezzo/shared
touch /izo0/.intermezzo/shared/{kml,last_rcvd}
umount /izo0

/dev/intermezzo0

This is created using the following commands:

mknod /dev/intermezzo0 c 185 0
chmod 700 /dev/intermezzo0

/etc/conf.modules

Your modules configuration file may also be called /etc/modules.conf. Add the lines:

alias char-major-185 presto
alias InterMezzo presto

Before starting lento, mount the cache:

mkdir /izo0; mount /izo0

Now lento can be started on both muskox and clientA by typing

lento

Two clients and one server (typical use: replicate a WWW server):

/etc/intermezzo/serverdb

This can be the same as for the one client and one server case above.

/etc/intermezzo/fsetdb

<fsetdb>
<fileset name="shared" servername="muskox" >
<replicator>clientA</replicator>
<replicator>clientB</replicator>
</fileset>
</fsetdb>

This is the same as in the first example, but clientB is added to the replicators list.

/etc/intermezzo/sysid

This is the same as in the first example for muskox and clientA, and on clientB contains the following:

<sysid name="clientB" psdev="/dev/intermezzo0" bindaddr="192.168.0.21" />

/etc/fstab

This is the same as used with the one client and one server case above.

One client and one server on same host (typical use: testing InterMezzo):

Suppose that we are running on the host muskox. To run multiple lentos on one host we need to use ip-aliasing; the ip-aliasing option must be compiled into your kernel (CONFIG_IP_ALIAS). This allows one interface to have more than one IP address associated with it. Suppose the name muskoxA1 and the IP address 192.168.0.100 are available.

/etc/hosts

Add the line:

192.168.0.100   muskoxA1        

Then add the ip-alias by typing:

    ifconfig eth0:1 muskoxA1 up

Then create two configuration files containing the following:

/etc/intermezzo/sysid

<sysid name="muskox" psdev="/dev/intermezzo0" bindaddr="192.168.0.3" />

/etc/intermezzo/sysid.muskoxA1

<sysid name="muskoxA1" psdev="/dev/intermezzo1" bindaddr="192.168.0.100" />

The latter file will act as a sysid file for the lento running on the aliased IP address. Note that because we are running both the client and the server on the same system, we have to specify different devices for each, namely /dev/intermezzo0 and /dev/intermezzo1.

/etc/intermezzo/fsetdb

<fsetdb>
<fileset name="shared" servername="muskox" >
<replicator>muskoxA1</replicator>
</fileset>
</fsetdb>

To run the second lento, a second presto device and loopback cache are required. These are made as follows:

mknod /dev/intermezzo1 c 185 1
chmod 700 /dev/intermezzo1
dd if=/dev/zero of=/tmp/fs1 bs=1024 count=10k
mkizofs -F /tmp/fs1

/etc/fstab

Note that two entries are needed here:

/tmp/fs0  /izo0      InterMezzo loop,fileset=shared,prestodev=/dev/intermezzo0,mtpt=/izo0,cache_type=ext3,noauto 0 0
/tmp/fs1  /izo1      InterMezzo loop,fileset=shared,prestodev=/dev/intermezzo1,mtpt=/izo1,cache_type=ext3,noauto 0 0

Now mount the two InterMezzo filesystems:

mount /izo0
mount /izo1

The lento acting as the server can be started as before:

lento

The lento acting as the replicator has to be told which sysid file to read (which tells it which presto device to use). The second lento is started as follows:

lento.pl --idfile=sysid.muskoxA1

3.4 Configuration Checking

This section is obsolete. The XML version of the config check is not ready yet.

A script is provided to perform simple checks on the configuration files. The script is called config_check and can be found in the .../intermezzo/tools directory.

If Lento is using the standard system id file, /etc/intermezzo/sysid, the script can be run without arguments. If a different system id file is being used the --idfile=my_idfile flag can be used to indicate this.

It is also possible to use a configuration directory other than /etc/intermezzo by using the --configdir=my_confdir flag.

4. Recovery from conflicts

The current version of InterMezzo has a built-in recovery mechanism to deal with most system crashes. Through configuration choices, conflicts, i.e. inconsistent updates to client and server caches, can be avoided.

However, during disconnected operation, conflicts can arise if the configuration does not explicitly prevent them by forcing the file system to be read-only. Where the client and server have inconsistent caches, the system can only be recovered manually.

The system can be recovered manually as follows:

  1. When a conflict happens, the lento which is reintegrating changes will die. This lento is receiving updates from its peer in this replicator, and typically the peer will have the latest updates. So we are going to synchronize from the lento that survived to the lento that died.
  2. Shut down the server and client(s), unmount the caches, and remove the presto module from the kernel:
    umountizo ; rmmod presto
  3. Mount each cache as an ext3 filesystem:
    mount -o loop /tmp/fs0 /izo0
  4. Use rsync, tar, or another tool to synchronize the caches on the clients and server. Make sure to remove files from the client that you don't have on the server; the caches need to be identical.
  5. Set the synced flag on the clients; this prevents the system from resyncing on startup. This is done using the command below, where SYSID is replaced with the client's sysid and FSETNAME is replaced with the name of the fileset:
    touch /var/intermezzo/SYSID/FSETNAME-synced
    For example, on client iclientA with fileset shared use:
    touch /var/intermezzo/iclientA/shared-synced
  6. The persistent databases will be out of sync at this point, so you must clear the KML and last_rcvd records on both the client and the server:
    cp /dev/null /izo0/.intermezzo/shared/kml
    cp /dev/null /izo0/.intermezzo/shared/last_rcvd
  7. Unmount the caches and mount them again as InterMezzo file systems. Restart Lento on the server and client.
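
Steps 5 and 6 above are easy to get wrong when typed by hand on several hosts. The Python sketch below performs the same two file operations (creating the synced flag and truncating the KML and last_rcvd files). The function names are hypothetical, the directory layout is taken from the commands above, and the demonstration deliberately runs against a scratch directory rather than /var/intermezzo and /izo0. Unlike touch, mark_synced also creates the SYSID directory if it is missing, which is a convenience of this sketch.

```python
import tempfile
from pathlib import Path


def mark_synced(var_dir, sysid, fsetname):
    """Step 5: create the FSETNAME-synced flag for this client."""
    flag = Path(var_dir) / sysid / (fsetname + "-synced")
    flag.parent.mkdir(parents=True, exist_ok=True)  # convenience, see lead-in
    flag.touch()
    return flag


def clear_records(mount_point, fsetname):
    """Step 6: truncate kml and last_rcvd, like `cp /dev/null FILE`."""
    meta = Path(mount_point) / ".intermezzo" / fsetname
    for name in ("kml", "last_rcvd"):
        # Opening a file for writing truncates it to zero length.
        (meta / name).open("w").close()


# Demonstration on a scratch directory standing in for the real paths.
with tempfile.TemporaryDirectory() as scratch:
    meta = Path(scratch, "izo0", ".intermezzo", "shared")
    meta.mkdir(parents=True)
    (meta / "kml").write_text("stale records")
    (meta / "last_rcvd").write_text("stale")
    clear_records(Path(scratch, "izo0"), "shared")
    flag = mark_synced(Path(scratch, "var"), "iclientA", "shared")
    print(flag.name, (meta / "kml").stat().st_size)
```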

This is cumbersome, but journaled recovery is on its way.

5. Debugging

To help us find bugs we need logging information. The logs come from two places: from the kernel in /var/log/messages, and from lento on stdout and stderr.

The kernel debugging log slows things down enormously and is activated with:

 
echo 4095 > /proc/sys/intermezzo/debug
echo 1 > /proc/sys/intermezzo/trace

The lento log can be captured from the terminal. Its verbosity is set with the --debuglevel=N option: N=1 already produces a lot of output, and N=100 produces all of it.

Mailing us the logs as well as a precise description of what you did to produce the bug might be enough to see what's happening.

6. Using the test framework for testing and debugging

Read the README file in the ../intermezzo/tests directory. The test framework can conveniently save all of this information for you, and it runs the client(s) and server on a single system.

7. How does InterMezzo work?

InterMezzo was heavily inspired by Coda, and its current cache synchronization protocol is one of the many protocols that Coda supports. It is likely not the best for every situation but it is as simple as we could make it.

InterMezzo's mechanisms are very different from those of Coda. We employ very different kernel code which maintains the cache in another file system (typically ext3/Reiserfs/XFS/JFS). The kernel code also uses the journaling support in the kernel to make transactional updates (with lazy commits) to the file space and update journals.

7.1 InterMezzo's protocol

The primary reason for keeping it simple is that we wanted to use it as soon as possible. It is also hoped that it will not be too confusing to the end user, as is frequently the case with advanced network file systems.

InterMezzo divides the file space up into filesets. Typically a fileset is much larger than a directory and smaller than a full disk partition. Good examples of filesets might be /usr or someone's home directory.

The typical event sequence for a fileset in InterMezzo is as follows:

Creation

The fileset is created on the server, either populated or empty. The file server and the kernel on the server are now aware of the fileset.

Client needs the fileset

A client which needs the fileset is told about the fileset and its server. The client is added to the server's list of replicators of the fileset and the server is made a replicator of the cache on the client. The fileset is mounted on the client, and the client cache manager and kernel know about it.

A replicator has the following state:

peer

replicator describes replication between this system and peer.

fsetname

replicator describes replication of fileset fsetname

next_to_expect

the next update record to expect from the peer in the peer's numbering sequence

next_to_send

the next update record to send to the peer in this system's numbering sequence
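
The replicator state listed above can be pictured as a small record with two counters. The following Python sketch is only an illustration: the class layout, the method names receive and sent, and the choice to start numbering at 0 are assumptions, not taken from Lento.

```python
from dataclasses import dataclass


@dataclass
class Replicator:
    peer: str                 # the system we replicate with
    fsetname: str             # the fileset being replicated
    next_to_expect: int = 0   # next record number expected from the peer
    next_to_send: int = 0     # next record number to send to the peer

    def receive(self, recno):
        """Accept record recno from the peer only if it is the one expected."""
        if recno != self.next_to_expect:
            return False
        self.next_to_expect += 1
        return True

    def sent(self):
        """Return the number of the record being sent and advance the counter."""
        recno = self.next_to_send
        self.next_to_send += 1
        return recno


rep = Replicator(peer="muskox", fsetname="shared")
print(rep.receive(0), rep.receive(5), rep.sent(), rep)
```

Keeping the two counters separate reflects that each side numbers its own records independently.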

Under normal operation a collection of replicators is connected to the server, while some replicators are in disconnected operation (the latter can also happen when the server fails).

The kernel and server/cache manager keep the log of updates to the fileset transactionally in sync with the contents of the file system, i.e. under all circumstances any update applied to the file system is also entered in the update database. Also, the systems transactionally update the next_to_expect counters as updates are entered in the cache.

All transactions have lazy commits. This means that in the case of a system failure, it is possible that not all data was saved to the disk, but it is guaranteed that what was written to disk is consistent between the filesystem and the KML/last_rcvd records. Upon system recovery missing data can be re-fetched, and the filesystem will be consistent.

The following rules govern the operations:

Permits

Before an update can be made to the file system, a permit is acquired. A permit acquisition consists of the following steps:

  1. The requester notifies the server of the request.
  2. The server revokes the permit from the current holder. The current permit holder will reintegrate its changes to the server before giving up its permit.
  3. The server propagates the changes to other synced replicators, and then grants the permit.
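
The three steps above can be sketched as a toy model of the server's bookkeeping. This is a simplified illustration under several assumptions: the class PermitServer and its methods are hypothetical, every replicator is treated as synced and connected, and "changes" are plain strings rather than real update records.

```python
class PermitServer:
    """Toy model of permit handling for a single fileset."""

    def __init__(self, replicators):
        self.replicators = list(replicators)
        self.holder = None                                   # current permit holder
        self.pending = {name: [] for name in self.replicators}
        self.log = []                                        # updates seen by the server

    def write(self, host, change):
        # An update is only allowed while holding the permit.
        if self.holder != host:
            raise PermissionError(host + " does not hold the permit")
        self.pending[host].append(change)

    def acquire(self, host):
        """Steps 1-3: request, revoke and reintegrate, propagate, then grant."""
        events = []
        if self.holder == host:
            return events
        if self.holder is not None:
            # Step 2: the old holder reintegrates before giving up the permit.
            changes, self.pending[self.holder] = self.pending[self.holder], []
            self.log.extend(changes)
            events.append(("reintegrate", self.holder, len(changes)))
            # Step 3: propagate to the other replicators before granting.
            for other in self.replicators:
                if other != self.holder:
                    events.append(("propagate", other, len(changes)))
        self.holder = host
        events.append(("grant", host))
        return events


server = PermitServer(["clientA", "clientB", "clientC"])
server.acquire("clientA")
server.write("clientA", "mkdir /izo0/docs")
print(server.acquire("clientB"))
```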

Read access

Read access on a synced fileset is unrestricted.

Disconnections

When a client or server notices that a peer is no longer available it does the following:

  1. The client grants itself a permit for the fileset.
  2. The server notices that a client has gone away and if that client held the permit on a fileset it grants itself the permit.

Reconnection

The reconnection protocol is the most complicated:

  1. When a client rediscovers a server, it binds a connection to the server.
  2. The client discards its permits for all filesets served by the peer.
  3. The server forwards its updates on the filesets to the client. The client tries to apply these updates but verifies that the versions to which the updates apply are correct. (If not the client declares a conflict, the handling of which we postpone.)
  4. The client verifies that its update journal is adjusted so as not to be in conflict with the current state propagated by the server.
  5. When no records are left to be reintegrated on the client, the client sends its update journal to the server.

Normal operation can now resume.
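
A rough Python sketch of steps 2, 3 and 5 of the reconnection protocol follows. The data model is an assumption made purely for illustration: each server update is a tuple (path, version it applies to, new version), the client keeps a dictionary of versions per path, and a missing path counts as version 0. Step 4 (adjusting the client journal) is not modelled.

```python
class Conflict(Exception):
    """Raised when a server update does not apply to the client's version."""


def reconnect(client_versions, client_journal, client_permits, server_updates):
    # Step 2: the client discards its permits for the server's filesets.
    client_permits.clear()
    # Step 3: apply the server's updates, verifying the version each applies to.
    for path, old_version, new_version in server_updates:
        if client_versions.get(path, 0) != old_version:
            raise Conflict(path)
        client_versions[path] = new_version
    # Step 5: nothing is left to reintegrate, so send the client journal.
    to_send = list(client_journal)
    client_journal.clear()
    return to_send


versions = {"/a": 1}
journal = ["create /b"]
permits = {"shared"}
print(reconnect(versions, journal, permits, [("/a", 1, 2)]), versions)
```

If the client had modified /a while disconnected, its version would no longer match the one the server's update applies to, and the sketch would raise Conflict instead.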

8. Internals

InterMezzo can send several types of packets across the network. The following four are used for requests and replies; in practice the MSG and EOR types are not used:

REQ

This type of packet is delivered by the connection to the Lento::ReqDispatcher session. This session in turn invokes a method in Lento::InterMezzo::ReqHandler. The sender of a REQ packet will be called the client, and the receiver of the packet the server, for the request.

The client sending a REQ includes the ctoken to allow replies to be dispatched to the same session. Following the REQ header is always the request type, leading to correct dispatch in the ReqDispatcher.

Any packet that is sent in response to a REQ will contain the ctoken, to allow the client to dispatch the packet correctly. It also contains an stoken which the server includes in the packet header. This allows the client to send further packets in the exchange (MSG packets) to the correct server session.

The tokens are included to find the session that is meant to handle the reply to the request, or to send further data to the server session which received the request.
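
The role of the ctoken can be illustrated with a small dispatch table. In this Python sketch packets are plain dictionaries and the names Client, send_req, dispatch and serve are hypothetical; the real wire format and session classes in Lento are not shown in this document.

```python
import itertools


class Client:
    def __init__(self):
        self._tokens = itertools.count(1)
        self.sessions = {}  # ctoken -> packets delivered to that session

    def send_req(self, reqtype):
        """Create a REQ packet carrying a fresh ctoken and the request type."""
        ctoken = next(self._tokens)
        self.sessions[ctoken] = []
        return {"type": "REQ", "ctoken": ctoken, "reqtype": reqtype}

    def dispatch(self, packet):
        """Deliver a response to the session that issued the matching REQ."""
        self.sessions[packet["ctoken"]].append(packet)


def serve(req, stoken):
    """Server side: echo the ctoken and add an stoken for follow-up packets."""
    return {"type": "REP", "ctoken": req["ctoken"], "stoken": stoken, "rc": 0}


client = Client()
first = client.send_req("reintegrate")
second = client.send_req("FetchFile")
client.dispatch(serve(second, stoken=7))
print(client.sessions)
```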

REP

These packets always indicate the final packet in an exchange, and are always sent from server to client, i.e. dispatched to the session that created the ctoken. They include an integer return code by default in their payload. The final packet may also be an EOD packet.

EOR

sent from server to client. Not currently in use.

MSG

sent from client to server, and dispatched using an stoken. Not currently in use.

XXX are EOR/MSG packets still present?

InterMezzo's bulk transfer uses three further packet types. This model is similar to that used in the RPC2 side effects in Coda:

DAT

packets are sent from source to sink

EOD

indicates no more DAT packets are following in an exchange, also sent from source to sink.

START

indicates to a client that it can start sending DAT packets. This is sent from server to client when the source of a bulk transfer is the client.

8.1 Reintegrate

The reintegrate protocol consists of the following exchange:

  1. The client sends a reintegrate request to the server. It is a REQ packet with a payload containing: fsetname, seq_no and an array of CML records.
  2. The server finds the replicator for the request based on the sysid of the connection and the fileset name.
  3. The server finds the first record it needs to reintegrate. It may declare an error if the client has skipped records (i.e. not sent the correct record number to the server, but omitted one or more), or the client may discard records it has already reintegrated.
  4. The server processes the CML records. When 8 records are done, or when no records are left to be done, an array with completed records is returned to the client in a DAT packet.
  5. When no records remain to be integrated, an EOD packet is sent to the client.
  6. When an error occurs, a REP packet is sent.
  7. The server appends the CML records it receives to the CML it maintains for that fileset, in order to propagate such records to other replicators.
  8. Forwarding the records received from the client to other replicators of the fileset is done when the EOD packet has been sent. The ReqHandler will instantiate a new session to handle this forwarding (a session of type send_CML).
  9. When the server receives a CLOSE CML record, it sends a FetchFile request to the client, through a yield to do_backfetch, (a backfetch is a term used to indicate that the server retrieves data from a client). This request runs in a separate Fetchfile session that calls back at the done_fetching event.
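
The batching in steps 3 to 6 can be sketched with a Python generator. This is an illustration only: records are modelled as (record number, payload) pairs, the function name reintegrate and the apply callback are hypothetical, and forwarding and backfetching (steps 7 to 9) are left out.

```python
BATCH = 8  # records acknowledged per DAT packet, as described above


def reintegrate(records, next_to_expect, apply):
    """Yield the (packet type, payload) pairs the server sends back."""
    # Step 3: drop records already reintegrated, then check for a gap.
    todo = [rec for rec in records if rec[0] >= next_to_expect]
    if todo and todo[0][0] != next_to_expect:
        yield ("REP", "skipped records")
        return
    done = []
    for recno, payload in todo:
        try:
            apply(payload)
        except Exception as exc:
            # Step 6: an error ends the exchange with a REP packet.
            yield ("REP", str(exc))
            return
        done.append(recno)
        if len(done) == BATCH:
            # Step 4: report a full batch of completed records.
            yield ("DAT", done)
            done = []
    if done:
        yield ("DAT", done)
    # Step 5: nothing left to integrate.
    yield ("EOD", None)


applied = []
records = [(n, "op%d" % n) for n in range(10)]
for packet in reintegrate(records, 0, applied.append):
    print(packet)
```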

9. Contact Information

The InterMezzo web site is http://www.inter-mezzo.org.

General questions about InterMezzo can be sent to intermezzo-discuss@lists.sourceforge.net. This list, along with other InterMezzo-related mailing lists, is archived on the InterMezzo web site, so it may be worth checking there to see if your question has already been answered.

Bug reports should be filed on sourceforge (XXX). Please include the version of InterMezzo you are using and a description of your system configuration and the problem observed.

It would be useful if you could run the config_check script on your hosts before sending the bug report to ensure that the system is configured correctly. The config_check script can be found in the ../intermezzo/tools directory. (XXX is this still working).

Also, please include all relevant logs: /var/log/messages, and the output of Lento (run with debugging) on server and clients.