braam@cs.cmu.edu, Rob Simmonds, Gordon Matzigkeit gord@fig.org and Christopher Li chrisl@mountainviewdata.com
InterMezzo is an experimental file system. It contains kernel code and daemons running with root permissions and is known to have bugs. Please back up all data when using or experimenting with InterMezzo.
InterMezzo is covered by the GPL. The GPL describes the warranties made to you, and can be found in the file COPYING.
Copyright on InterMezzo is held by Peter J. Braam, Stelias Computing, Carnegie Mellon University, Phil Schwan, Los Alamos National Laboratory, Red Hat, Inc., TurboLinux, Inc., Tacitus Systems, Inc. and Mountain View Data, Inc.
InterMezzo is a file system that keeps replicas of folder collections, called filesets, in sync across multiple computers. The computers that express an interest in a replica are called the replicators of the fileset. Each fileset has one server, which plays an organizing role in exchanging updates with the replicators.
InterMezzo has disconnected operation, i.e. it maintains a journal to remember all updates that need to be forwarded when a failed communication channel comes back. This is a best effort synchronization since during disconnected operation conflicting updates are possible.
InterMezzo uses an existing disk file system, in practice ext3, as the storage location for all data. When an ext3 file system is mounted as file system type InterMezzo instead of ext3, the InterMezzo software starts managing all access to the file system. It keeps the logs of modification records and negotiates permits to modify the disk file system, to avoid conflicting updates during connected operation.
Currently you should run InterMezzo only on trusted networks -- there is NO security built into the system yet. A good way to get a trusted network is to use IPSEC (see FreeSwan http://www.freeswan.org) or CIPE (see http://sites.inka.de/sites/bigred/devel/cipe.html).
The system currently has journal recovery in combination with Ext3. After a system crash, the local disk file system, together with the KML, LML and last_rcvd files, which contain distributed state, will recover automatically. Recovery with peers will normally also be seamless. Even greater file content recovery is possible, and this will be implemented shortly.
The system does not currently have conflict handlers and only crude conflict detection. More extensive conflict resolution tools are being developed and should be available with the next major release. The design of the system means that conflicts can only occur when reconnecting after a period of disconnected operation and that conflicts can only occur on a client.
At the moment InterMezzo replicates an entire filesystem. However, a fetch on demand system will appear in a future version, which will allow partial replication of a filesystem.
Due to an unfortunate snag we presently serialize fetches of files. This is not good for concurrent access. We will fix this shortly using the Local Modification Log (LML).
Here we describe how to set up a server and clients.
InterMezzo uses several packages which need to be installed before it can be used. Here is a checklist of the required packages:
NOTE: This package is still considered ALPHA, and it is not on the e2fsprogs homepage. It can be downloaded from http://www.kernel.org/pub/linux/kernel/people/sct/ext3/e2fsprogs/ Please make sure that the e2fsprogs you download is new enough to have the -J (uppercase `j') option, which the mkizofs utility requires.
[root@chris e2fsprogs-1.20.WIP.sct]# mke2fs
mke2fs 1.20-WIP, 17-Jan-2001 for EXT2 FS 0.5b, 95/08/09
Usage: mke2fs [-c|-t|-l filename] [-b block-size] [-f fragment-size]
[-i bytes-per-inode] [-j] [-J journal-options] [-N number-of-inodes]
[-m reserved-blocks-percentage] [-o creator-os] [-g blocks-per-group]
[-L volume-label] [-M last-mounted-directory] [-O feature[,...]]
[-r fs-revision] [-R raid_opts] [-s sparse-super-flag]
[-qvSV] device [blocks-count]
[root@chris e2fsprogs-1.20.WIP.sct]#
This is a C library for XML parsing.
All of these packages (except SetFS, which is in the InterMezzo tarball) can be found at ftp://ftp.inter-mezzo.org/pub/intermezzo/. To install them,
cd
perl Makefile.PL; make install
make
from the top level
directory, so that the installation defaults are set correctly):
$ cd .../intermezzo
$ make
You may be prompted to install additional software dependencies: just follow the online instructions and choose your preferred automatic installation method, or interrupt the process. You may install the dependencies manually, and try the make again.
$ su
Password:
# make install
Your default config directory is /etc/intermezzo. You may use the interactive inconfig command to generate the following configuration files, or manually create them.
The config files have changed significantly in the new version of InterMezzo. The new config files use XML instead of the earlier Perl-style format.
Holds the name of your system, the presto device name and the IP bind address. Suppose your server has the name muskox, with IP address 192.168.0.3, and your clients are clientA and clientB. The sysid file on each host would contain the host name, the presto device and the IP bind address. For example, on muskox the file would contain:
<sysid name="muskox" psdev="/dev/intermezzo0" bindaddr="192.168.0.3" />
Note that in early versions of InterMezzo, this file did not contain the name of the presto device; this field is now required.
Holds a database of servers. Each server is described by an XML server element, as follows:
<serverdb>
<server name="muskox" ipaddr="192.168.0.3" port="2222"
bindaddr="192.168.0.3" />
</serverdb>
The above contains a single server description for the server muskox with IP address 192.168.0.3. The port and bindaddr attributes are optional; the default port is 2222. Without a bindaddr the server listens on all interfaces for requests; with it, the server listens only on the bindaddr address. If you are running both a client and a server on the same system, you need to specify a different bindaddr for the server and the client(s).
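The defaults described above can be made concrete with a short Python sketch. It is an illustration only and not part of InterMezzo: the function name `read_serverdb` is hypothetical, and representing a missing bindaddr as `None` (meaning "listen on all interfaces") is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

DEFAULT_PORT = 2222  # default port stated in the text above


def read_serverdb(text):
    """Return {server name: settings} from the text of a serverdb file.

    Hypothetical helper for illustration; it is not InterMezzo code.
    """
    servers = {}
    for elem in ET.fromstring(text).findall("server"):
        servers[elem.attrib["name"]] = {
            "ipaddr": elem.attrib["ipaddr"],
            # port is optional and falls back to the documented default
            "port": int(elem.attrib.get("port", DEFAULT_PORT)),
            # assumption: None stands for "no bindaddr, all interfaces"
            "bindaddr": elem.attrib.get("bindaddr"),
        }
    return servers


EXAMPLE = """<serverdb>
  <server name="muskox" ipaddr="192.168.0.3" />
</serverdb>"""

print(read_serverdb(EXAMPLE)["muskox"])
```

With the short form of the server element used here, the port resolves to 2222 and no bind address is set.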
Holds a database of filesets. Each fileset is described by an XML fileset element, as follows:
<fsetdb>
<fileset name="yourfsetname" servername="muskox" >
<replicator>clientA</replicator>
<replicator>clientB</replicator>
</fileset>
</fsetdb>
The above contains a single fileset description for a fileset called yourfsetname which is served by muskox. The fileset is replicated on hosts clientA and clientB.
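As a rough illustration of how the fileset database relates to the server database, the following Python sketch reads an fsetdb and lists filesets whose server is not among a set of known server names. The helper names `read_fsetdb` and `unknown_servers` are hypothetical, and the cross-check is only an assumed sanity test; it is not a description of what InterMezzo or its config_check script actually does.

```python
import xml.etree.ElementTree as ET


def read_fsetdb(text):
    """Return {fileset name: (server name, [replicator names])}."""
    filesets = {}
    for fset in ET.fromstring(text).findall("fileset"):
        replicators = [(r.text or "").strip() for r in fset.findall("replicator")]
        filesets[fset.attrib["name"]] = (fset.attrib["servername"], replicators)
    return filesets


def unknown_servers(filesets, known_servers):
    """List the filesets whose servername is not in known_servers."""
    return sorted(name for name, (server, _) in filesets.items()
                  if server not in known_servers)


FSETDB = """<fsetdb>
  <fileset name="yourfsetname" servername="muskox" >
    <replicator>clientA</replicator>
    <replicator>clientB</replicator>
  </fileset>
</fsetdb>"""

filesets = read_fsetdb(FSETDB)
print(filesets["yourfsetname"])
print(unknown_servers(filesets, {"muskox"}))
```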
To ease the mounting of InterMezzo filesets, add one of the following to the /etc/fstab file. For testing and development, using a loop device as the cache is easiest:
/tmp/cache /izo0 InterMezzo loop,fileset=fsetname,mtpt=/izo0,prestodev=/dev/intermezzo0,cache_type=ext3,noauto 0 0
where /tmp/cache is a file associated with a loop device, /izo0 is a mount point (a directory), fsetname is the name of the fileset and /dev/intermezzo0 is the name of the presto device. The creation of the cache file and the presto device is explained in the examples at the end of this section. The kernel must be configured with loopback device support enabled to do this.
Using a genuine block device is a little easier, because you do not need to set up a loop device. To use the block device /dev/hda9, the /etc/fstab file should contain:
/dev/hda9 /izo0 InterMezzo fileset=fsetname,mtpt=/izo0,prestodev=/dev/intermezzo0,cache_type=ext3,noauto 0 0
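An fstab entry is a single line whose fields are separated by whitespace, so the option list has to be one comma-separated token with no spaces in it. The Python sketch below assembles such an entry from its parts. The function `izo_fstab_line` and its parameters are hypothetical and not part of InterMezzo; the option names are simply those used in the entries above.

```python
def izo_fstab_line(device, mountpoint, fileset, prestodev,
                   loop=False, cache_type="ext3"):
    """Build one /etc/fstab line for an InterMezzo cache (illustrative)."""
    options = ["loop"] if loop else []
    options += [
        "fileset=" + fileset,
        "mtpt=" + mountpoint,
        "prestodev=" + prestodev,
        "cache_type=" + cache_type,
        "noauto",
    ]
    fields = [device, mountpoint, "InterMezzo", ",".join(options), "0", "0"]
    # A stray space inside any field would be read as a field separator.
    for field in fields:
        if any(ch.isspace() for ch in field):
            raise ValueError("fstab field contains whitespace: %r" % field)
    return " ".join(fields)


print(izo_fstab_line("/dev/hda9", "/izo0", "fsetname", "/dev/intermezzo0"))
print(izo_fstab_line("/tmp/cache", "/izo0", "fsetname", "/dev/intermezzo0",
                     loop=True))
```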
NOTICE: The kernel modification log (KML) keeps track of all of the changes made in an InterMezzo filesystem. The last_rcvd file keeps track of the last record in the KML file that the kernel has handled. In the current release of InterMezzo, the KML and last_rcvd files need to be created (usually by running mkizofs) before first mounting an InterMezzo filesystem.
mkizofs -v fsetname /tmp/cache
See mkizofs -h for options, such as specifying the filesystem type. If you have already initialized your cache filesystem, then you must manually create the needed InterMezzo metadata files:
mount -o loop /tmp/cache /izo0
mkdir -p /izo0/.intermezzo/fsetname
touch /izo0/.intermezzo/fsetname/{kml,last_rcvd}
umount /izo0
This example assumes that we are using the loopback device with the /tmp/cache filesystem, and that the fileset will be called fsetname.
Let's consider three common system configurations. For each we give the config files and the correct invocations to start the server/cache manager.
In this case we assume that the host muskox is serving the fileset shared and the host clientA is replicating the fileset. The following files are placed on both muskox and clientA.
<serverdb>
<server name="muskox" ipaddr="192.168.0.3" />
</serverdb>
<fsetdb>
<fileset name="shared" servername="muskox" >
<replicator>clientA</replicator>
</fileset>
</fsetdb>
On muskox this contains:
<sysid name="muskox" psdev="/dev/intermezzo0" bindaddr="192.168.0.3" />
On clientA this contains:
<sysid name="clientA" psdev="/dev/intermezzo0" bindaddr="192.168.0.20" />
The following line is added on both muskox and clientA:
/tmp/fs0 /izo0 InterMezzo loop,fileset=shared,prestodev=/dev/intermezzo0,mtpt=/izo0,cache_type=ext3,noauto 0 0
This file and the filesystem are created using the following commands:
dd if=/dev/zero of=/tmp/fs0 bs=1024 count=10k
mkizofs -F /tmp/fs0
If we didn't run mkizofs above, we create the KML and last_rcvd files by first mounting the filesystem as ext3:
mkdir /izo0
mount -o loop /tmp/fs0 /izo0
mkdir -p /izo0/.intermezzo/shared
touch /izo0/.intermezzo/shared/{kml,last_rcvd}
umount /izo0
This is created using the following commands:
mknod /dev/intermezzo0 c 185 0
chmod 700 /dev/intermezzo0
Your modules configuration file may also be called /etc/modules.conf. Add the lines:
alias char-major-185 presto
alias InterMezzo presto
Before starting lento, mount the cache:
mkdir /izo0; mount /izo0
Now lento can be started on both muskox and clientA by typing
lento
This can be the same as for the one client and one server case above.
<fsetdb>
<fileset name="shared" servername="muskox" >
<replicator>clientA</replicator>
<replicator>clientB</replicator>
</fileset>
</fsetdb>
This is the same as in the first example, but clientB is added to the replicators list.
This is the same as in the first example for muskox and clientA, and on clientB contains the following:
<sysid name="clientB" psdev="/dev/intermezzo0" bindaddr="192.168.0.21" />
This is the same as used with the one client and one server case above.
Suppose that we are running on the host muskox. To run multiple lentos on one host we need to use ip-aliasing; the ip-aliasing option must be compiled into your kernel (CONFIG_IP_ALIAS). This allows one interface to have more than one IP address associated with it. Suppose the name muskoxA1 and the IP address 192.168.0.100
are available. In:
Add the line:
192.168.0.100 muskoxA1
Then add the ip-alias by typing:
ifconfig eth0:1 muskoxA1 up
Then create two configuration files containing the following:
<sysid name="muskox" psdev="/dev/intermezzo0" bindaddr="192.168.0.3" />
<sysid name="muskoxA1" psdev="/dev/intermezzo1" bindaddr="192.168.0.100" />
The latter file will act as a sysid file for the lento running on the aliased IP address. Note that because we are running both the client and the server on the same system, we have to specify different devices for each, namely /dev/intermezzo0 and /dev/intermezzo1.
<fsetdb>
<fileset name="shared" servername="muskox" >
<replicator>muskoxA1</replicator>
</fileset>
</fsetdb>
To run the second lento, a second presto device and loopback cache are required. These are made as follows:
mknod /dev/intermezzo1 c 185 1
dd if=/dev/zero of=/tmp/fs1 bs=1024 count=10k
mkizofs -F /tmp/fs1
chmod 700 /dev/intermezzo1
Note that two entries are needed here:
/tmp/fs0 /izo0 InterMezzo loop,fileset=shared,prestodev=/dev/intermezzo0,mtpt=/izo0,cache_type=ext3,noauto 0 0
/tmp/fs1 /izo1 InterMezzo loop,fileset=shared,prestodev=/dev/intermezzo1,mtpt=/izo1,cache_type=ext3,noauto 0 0
Now mount the two InterMezzo filesystems:
mount /izo0
mount /izo1
The lento acting as the server can be started as before:
lento
The lento acting as the replicator has to be told which sysid file to read (which tells it which presto device to use).
The second lento is started as follows:
lento.pl --idfile=sysid.muskoxA1
This section is obsolete. The XML version of the config check is not ready yet.
A script is provided to perform simple checks on the configuration files. The script is called config_check and can be found in the .../intermezzo/tools directory.
If Lento is using the standard system id file, /etc/intermezzo/sysid, the script can be run without arguments. If a different system id file is being used, the --idfile=my_idfile flag can be used to indicate this.
It is also possible to use a configuration directory other than /etc/intermezzo by using the --configdir=my_confdir flag.
The current version of InterMezzo has a built-in recovery mechanism to deal with most system crashes. Through configuration choices, conflicts (i.e. inconsistent updates to client and server caches) can be avoided.
However, during disconnected operation, conflicts can arise if the configuration does not explicitly prevent them by forcing the file system to be read-only. Where the client and server have inconsistent caches, only manual recovery can restore the system.
The system can be recovered manually as follows:
umountizo ; rmmod presto
mount -o loop /tmp/fs0 /izo0
touch /var/intermezzo/SYSID/FSETNAME-synced
e.g. on client iclientA with fileset shared use:
touch /var/intermezzo/iclientA/shared-synced
cp /dev/null /izo0/.intermezzo/shared/kml
cp /dev/null /izo0/.intermezzo/shared/last_rcvd
This is cumbersome, but journaled recovery is on its way.
To help us find bugs we need logging information. The logs come from two places: from the kernel in /var/log/messages, and from lento on stdout and stderr.
The kernel debugging log slows things down enormously and is activated with:
echo 4095 > /proc/sys/intermezzo/debug
echo 1 > /proc/sys/intermezzo/trace
The lento log can be captured from the terminal, and is activated using the --debuglevel=N option. With N=1 you get a lot of output; with N=100, all of it.
Mailing us the logs as well as a precise description of what you did to produce the bug might be enough to see what's happening.
Read the README file in the ../intermezzo/tests directory. This can save all information for you conveniently and runs the client(s) and server on a single system.
InterMezzo was heavily inspired by Coda, and its current cache synchronization protocol is one of the many protocols that Coda supports. It is likely not the best for every situation but it is as simple as we could make it.
InterMezzo's mechanisms are very different from those of Coda. We employ very different kernel code which maintains the cache in another file system (typically ext3/Reiserfs/XFS/JFS). The kernel code also uses the journaling support in the kernel to make transactional updates (with lazy commits) to the file space and update journals.
The primary reason for keeping it simple is that we wanted to use it as soon as possible. It is also hoped that it will not be too confusing to the end user, as is frequently the case with advanced network file systems.
InterMezzo divides the file space up into filesets. Typically a fileset is much larger than a directory and smaller than a full disk partition. Good examples of filesets might be /usr or someone's home directory.
The typical event sequence for a fileset in InterMezzo is as follows:
The fileset is created on the server, possibly populated, possibly empty. The file server and the kernel on the server are now aware of the fileset.
A client which needs the fileset is told about the fileset and its server. The client is added to the server's list of replicators of the fileset and the server is made a replicator of the cache on the client. The fileset is mounted on the client, and the client cache manager and kernel know about it.
A replicator has the following state:
replicator describes replication between this system and peer.
replicator describes replication of fileset fsetname
the next update record to expect from the peer in the peer's numbering sequence
the next update record to send to the peer in this system's numbering sequence
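The four items of replicator state listed above can be pictured as a small record. The Python sketch below is an illustration only, not InterMezzo's actual data structure. The field name `next_to_send`, the starting value of 0, the step of one per record, and the rule of refusing out-of-sequence records are all assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Replicator:
    """Illustrative replicator state for one (peer, fileset) pair."""
    peer: str                # the system we replicate with
    fsetname: str            # the fileset being replicated
    next_to_expect: int = 0  # next record expected, in the peer's numbering
    next_to_send: int = 0    # next record to send, in our own numbering

    def receive(self, seq_no):
        """Accept a record from the peer only if it is the expected one."""
        if seq_no != self.next_to_expect:
            return False     # assumption: gaps and repeats are refused
        self.next_to_expect += 1
        return True

    def sent(self):
        """Return the number of the record just sent and advance."""
        seq_no = self.next_to_send
        self.next_to_send += 1
        return seq_no


rep = Replicator(peer="clientA", fsetname="shared")
rep.receive(0)
rep.sent()
print(rep)
```

Keeping the two counters separate reflects the text's point that each side numbers its own updates independently.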
Under normal operation there is a collection of replicators that are connected to the server, and some replicators are in disconnected operation (the latter can also happen when the server fails).
The kernel and server/cache manager keep the log of updates to the fileset transactionally in sync with the contents of the file system, i.e. under all circumstances any update applied to the file system is also entered in the update database. Also, the systems transactionally update the next_to_expect counters as updates are entered in the cache.
All transactions have lazy commits. This means that in the case of a system failure, it is possible that not all data was saved to the disk, but it is guaranteed that what was written to disk is consistent between the filesystem and the KML/last_rcvd records. Upon system recovery missing data can be re-fetched, and the filesystem will be consistent.
The following rules govern the operations:
Before an update can be made to the file system, a permit is acquired. A permit acquisition consists of:
Read access on a synced fileset is unrestricted.
When a client or server notices that a peer is no longer available it does the following.
The reconnection protocol is the most complicated:
Normal operation can now resume.
InterMezzo can send six types of packets across the network. In practice the MSG and EOR types are not used:
This type of packet is delivered by the connection to the Lento::ReqDispatcher session. This session in turn invokes a method in Lento::InterMezzo::ReqHandler. The sender of a REQ packet is called the client, and the receiver of the packet the server, for the request.
The client sending a REQ includes the ctoken to allow replies to be dispatched to the same session. Following the REQ header is always the request type, leading to correct dispatch in the ReqDispatcher.
Any packet that is sent in response to a REQ will contain the ctoken, to allow the client to dispatch the packet correctly. It also contains an stoken which the server includes in the packet header. This allows the client to send further packets in the exchange (MSG packets) to the correct server session.
The tokens are included to find the session that is meant to handle the reply to the request, or to send further data to the server session which received the request.
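The role of the two tokens can be sketched in a few lines of Python. This is a toy model only: Lento itself is written in Perl, the `Endpoint` class and the dictionary "packets" are hypothetical, and nothing here reflects the real wire format or the real session classes.

```python
import itertools


class Endpoint:
    """Toy model of one side: a table of sessions keyed by its own tokens."""

    def __init__(self):
        self._tokens = itertools.count(1)
        self.sessions = {}

    def new_session(self, handler):
        """Register a session handler and return the token that names it."""
        token = next(self._tokens)
        self.sessions[token] = handler
        return token

    def dispatch(self, token, payload):
        """Hand a payload to the session that owns the token."""
        return self.sessions[token](payload)


client, server = Endpoint(), Endpoint()
replies = []

# Client: the requesting session gets a ctoken, which travels in the REQ.
ctoken = client.new_session(replies.append)
req = {"type": "REQ", "ctoken": ctoken, "request": "example"}

# Server: a session handles the request; its stoken goes back with the ctoken.
stoken = server.new_session(lambda payload: "server session got " + payload)
reply = {"ctoken": req["ctoken"], "stoken": stoken, "payload": "done"}

# Client: the ctoken in the reply finds the session that is waiting for it.
client.dispatch(reply["ctoken"], reply["payload"])
print(replies)
```

The stoken returned in the reply is what would let the client address further packets in the same exchange to the right server session.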
These packets always indicate the final packet in an exchange, and are always sent from server to client, i.e. dispatched to the session that created the ctoken. They include an integer return code by default in their payload. The final packet may also be an EOD packet.
Sent from server to client. Not currently in use.
Sent from client to server, and dispatched using an stoken. Not currently in use.
InterMezzo's bulk transfer uses three further packet types. This model is similar to that used in the RPC2 side effects in Coda:
packets are sent from source to sink
indicates no more DAT packets are following in an exchange, also sent from source to sink.
indicates to a client that it can start sending DAT packets. This is sent from server to client when the source of a bulk transfer is the client.
The reintegrate protocol consists of the following exchange:
REQ packet with a payload containing: fsetname, seq_no and an array of CML records.
send_CML).
CLOSE CML record, it sends a FetchFile request to the client, through a yield to do_backfetch (a backfetch is a term used to indicate that the server retrieves data from a client). This request runs in a separate Fetchfile session that calls back at the done_fetching event.
The InterMezzo web site is http://www.inter-mezzo.org.
General questions about InterMezzo can be sent to intermezzo-discuss@lists.sourceforge.net. This list, along with other InterMezzo-related mailing lists, is archived on the InterMezzo web site, so it may be worth checking there to see if your question has already been answered.
Bug reports should be filed on sourceforge (XXX). Please include the version of InterMezzo you are using and a description of your system configuration and the problem observed.
It would be useful if you could run the config_check script on your hosts before sending the bug report to ensure that the system is configured correctly. The config_check script can be found in the ../intermezzo/tools directory. (XXX is this still working).
Also, please include all relevant logs: /var/log/messages, and the output of Lento (run with debugging) on server and clients.