Linux Implementation of mumufs

Author: Sergey Satskiy
NCBI
Publication date: 07.12.2009
Version: 1.0

In memory of Roman Plekhanov, a good man

Introduction

At the time of writing, it was quite hard to find a detailed description of how to implement a file system for Linux. This article is the result of the author's search for such information and of implementing mumufs. Its purpose is to describe in detail how to develop a file system which resides in RAM and to introduce the readers to the mumufs internals. It is worth mentioning, however, that the author is not experienced in Linux kernel module development, so some inaccuracies are possible.

The next section describes mumufs briefly; more detailed information can be found in [1].

Mumufs Brief Description

For some tasks it could be convenient to use an inter process data exchange model with the following features:

  • Data exchange is done by blocks of an arbitrary size
  • Each block has a unique identifier
  • Only the last written block is stored for each identifier
  • Many writing and reading processes can operate simultaneously

The described model is well suited for systems that do not depend on the history of the exchanged data. An example of such a system is a data visualization application with data suppliers and consumers in separate processes. For the purpose of this article, imagine a hardware/software system with two independent wind speed sensors (each with its own measurement process), a process that displays the wind speed, and a process that performs calculations using the wind speed. Let's also assume that data blocks are identified by names, just like files in a file system. Then the communication between the parts of the system could be as shown in the figure below.

Figure 1. Example of a data exchange between processes

The wind speed sensors could be connected via different interfaces served by separate processes (let's leave aside the procedure that selects which sensor should provide the wind speed value at a particular moment). As soon as a new value is received from a sensor it is written into the data block with the /mnt/mumu/wind identifier. Process #3 might serve a user interface and should refresh it as soon as the value in the data block is updated. Process #4 might use the wind speed value for some calculations.

Each process which opens and reads /mnt/mumu/wind will get the current value. Neither the data suppliers nor the data consumers need to be constantly present for this to work.

Since the basic operations on a data block are reading and writing, the interface looks very much like the one provided by file systems. It is also convenient that file systems require file names to be unique, so file names can be used as data block identifiers. One more positive feature is that many standard UNIX utilities originally intended for traditional file systems, such as mkdir, ls, rm, cat, touch, and ln, can be used as-is.
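
To make the model more tangible, here is a small user-space sketch (it is not part of the original article; the /mnt/mumu/wind path and the text data format are just assumptions) of a supplier that writes the current wind speed and a consumer that reads the most recently written value:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

/* Supplier: writes the whole data block in a single call */
static void  write_wind( double  value )
{
    char    block[ 64 ];
    int     fd = open( "/mnt/mumu/wind", O_WRONLY | O_CREAT, 0644 );

    if ( fd < 0 )   return;
    snprintf( block, sizeof( block ), "%f", value );
    write( fd, block, strlen( block ) );
    close( fd );
}

/* Consumer: reads the most recently written block */
static void  read_wind( void )
{
    char        block[ 64 ];
    ssize_t     bytes;
    int         fd = open( "/mnt/mumu/wind", O_RDONLY );

    if ( fd < 0 )   return;
    bytes = read( fd, block, sizeof( block ) );
    if ( bytes > 0 )
        printf( "wind: %.*s\n", (int)bytes, block );
    close( fd );
}

int main( void )
{
    write_wind( 12.5 );
    read_wind();
    return 0;
}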

The Linux kernel, in turn, provides the VFS (virtual filesystem switch) subsystem for developing new file systems. This article describes an implementation of mumufs, a file system which supports block oriented data exchange between processes, resides in RAM, and allows many simultaneous data suppliers and consumers.

Linux Kernel Module Requirements

Let's discuss in more detail how the file system should look from a system administrator's point of view. The file system resides in RAM, so it would be nice to have a way to control the amount of memory used. The two values of most interest are the maximum size of a block and the maximum number of blocks. Since a file system can be mounted an arbitrary number of times, those values must apply to a mounted instance of mumufs, not to the module which implements the mumufs file system. In other words, a system administrator must be able to issue a command similar to the following:

mount -t mumufs nodev /mnt/mumu -o max_entries=2048,max_block_size=1024

The module in turn can export some information using the /proc file system. It would be nice to be able to get the module version as well as information about all the mounted instances of mumufs. The desired /proc entries structure is shown below:

/proc/mumu/version
          /<mounted mumufs #1>
          . . .
          /<mounted mumufs #n>/mountpoint
                              /maxentries
                              /maxblocksize
                              /entries

Unfortunately, it is possible to use the same mount point for many instances of mumufs (though only the last one mounted will be accessible). Therefore the <mounted mumufs> name cannot be just a decorated mount point path; something else, such as a sequence number, has to be added. The implementation, however, will use a string representation of a pointer to the structure which holds the mount point information as the <mounted mumufs> name.

Now we are ready to develop data structures.

Data Structures

The following structure is suitable for storing a single data block:

struct Storage
{
    unsigned char *      Buffer;         /* Data block             */

    unsigned int         BufferSize;     /* Buffer capacity        */
    unsigned int         BufferVersion;  /* Buffer version         */
    struct rw_semaphore  BufferLock;     /* RW lock for the buffer */

    wait_queue_head_t    BlockedReaders; /* Blocked processes list */
};

The Buffer field points to a data block whose size must not exceed the limit provided at the moment of mounting a file system. The BufferSize field stores the current data block size. Each mumufs transaction involves a whole block and each new block can have a different size than the previous one, so the last size must be stored. The BufferVersion field stores the data block version. Each write increments the version, while the value 0 indicates that there are no data at all.

To make updates of the above-mentioned fields safe, the BufferLock lock is used. A suitable lock type here is rw_semaphore, which gives priority to writers.

The BlockedReaders field is used to block reading when there are no data or when the current file descriptor has already read the available data block. A process must be blocked only if the file was opened in blocking mode. The BlockedReaders field stores a list of processes which are blocked on the read operation.
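
The article refers to the AllocateStorage() and FreeStorage() helpers later on but does not list them; a minimal sketch of what they might look like (the bodies below are an assumption) is:

/* A sketch of the Storage allocation/release helpers (assumed implementation) */
struct Storage *  AllocateStorage( void )
{
    struct Storage *    s = kmalloc( sizeof( struct Storage ), GFP_KERNEL );

    if ( s == NULL )    return NULL;

    s->Buffer        = NULL;    /* No data block has been written yet */
    s->BufferSize    = 0;
    s->BufferVersion = 0;       /* 0 means "no data at all"           */
    init_rwsem( & s->BufferLock );
    init_waitqueue_head( & s->BlockedReaders );
    return s;
}

void  FreeStorage( struct Storage *  s )
{
    if ( s->Buffer != NULL )
    {
        kfree( s->Buffer );
    }
    kfree( s );
}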

A second data structure is required to store information about a mounted instance of mumufs. The following structure will do the job:

struct mumu_mount_data
{
    struct vfsmount *        Mount;            /* fs mount point info                    */
    unsigned int             MaxBlockSize;     /* Maximum block size                     */

    unsigned int             MaxEntries;       /* Maximum number of files                */
    atomic_t                 NumberOfEntries;  /* The current number of files            */
    struct proc_dir_entry *  parent;           /* Used at the time when entry is removed */

};

The Mount field stores a pointer to the Linux kernel structure which describes the mount point. The vfsmount pointer is a rare guest in VFS calls; in particular, it appears at the time of mounting a file system. Without this structure it is impossible to know at what path the file system was mounted, and the path is required for the /proc entries. Therefore this pointer should be saved.

The MaxBlockSize and MaxEntries fields store the current parameters of a mounted mumufs instance. The NumberOfEntries field in turn stores the current number of data blocks on a mounted mumufs instance.

The last field – parent – is required for a correct destruction of entries in /proc which were created for a certain mumufs instance. The destruction should be done when a mumufs instance is unmounted.

All other important data structures are provided by VFS.

The Kernel Module

Starting to code a module for the 2.6 kernel series requires minimal effort. First define a few macros:

MODULE_LICENSE( "GPL" );
MODULE_AUTHOR( "Sergey Satskiy" );
MODULE_DESCRIPTION( "MuMu filesystem supports multiple writers and multiple readers" );

Then define the following two functions, which are called at the time of loading and unloading a module:

static int __init init_mumufs_fs( void )
{
    . . .
}

static void __exit exit_mumufs_fs( void )
{
    . . .
}

module_init( init_mumufs_fs );
module_exit( exit_mumufs_fs );

At the moment of loading a module mumufs has to be registered in the operating system, and at the moment of unloading a module it has to be unregistered. In addition, a root entry must be created/destroyed in /proc (/proc operations will be described later in a separate section).

A properly filled file_system_type structure is required to register mumufs:

static struct file_system_type      mumufs_fs_type =
{
    .owner      = THIS_MODULE,
    .name       = "mumufs",
    .get_sb     = mumufs_get_sb,
    .kill_sb    = mumufs_kill_super,
};

Then the register_filesystem() function has to be called:

register_filesystem( &mumufs_fs_type );

Assigning THIS_MODULE to the owner field allows the Linux kernel to count references to the module properly. The mumufs_get_sb() and mumufs_kill_super() functions respectively create and destroy a file system superblock.

The init_mumufs_fs() function can be implemented as follows:

static int __init init_mumufs_fs( void )
{
    int         RetVal;

    if ( InitialiseProcEntries() != 0 )
    {
        return -EFAULT;
    }

    RetVal = register_filesystem( &mumufs_fs_type );
    if ( RetVal != 0 )
    {
        RemoveProcEntries();
    }
    return RetVal;
}

The exit_mumufs_fs() function can be implemented as follows:

static void __exit exit_mumufs_fs( void )
{
    RemoveProcEntries();
    unregister_filesystem( &mumufs_fs_type );
}

Mounting and Unmounting mumufs

The mumufs_get_sb() and mumufs_kill_super() functions (pointers to which were provided at the moment of registering mumufs) will be used by VFS at the moment when a mumufs instance is mounted or unmounted.

To create a superblock it is possible to use the VFS get_sb_nodev() function, which is intended for the cases when a file system is not coupled with any device. The mumufs_get_sb() function should also store a pointer to the vfsmount structure, which is required for the /proc entries and does not appear in any later VFS calls. So the function could be implemented as follows:

int  mumufs_get_sb( struct file_system_type *      fs_type,
                    int                            flags,
                    const char *                   dev_name,
                    void *                         data,
                    struct vfsmount *              mnt )
{
    int     ret;

    ret = get_sb_nodev( fs_type, flags, data, mumufs_fill_super, mnt );
    if ( ret == 0 )
    {
        struct super_block *        sb = mnt->mnt_sb;

        /* Memorize the vfsmount pointer. It is required to display */
        /* the mount point in /proc/mumu/...                        */
        ((struct mumu_mount_data *)(sb->s_fs_info))->Mount = mnt;
    }
    return ret;
}

One of the get_sb_nodev() arguments is a pointer to a function which must fill the super_block structure. The mumufs_fill_super() function fills the super_block and also allocates a mumu_mount_data structure, a pointer to which is stored in the s_fs_info field of the super_block structure. The s_fs_info field is reserved for file-system-specific data.

The mumufs_fill_super() function could be implemented as follows (the error checking code is omitted to keep the code short):

static int mumufs_fill_super( struct super_block *      sb,
                              void *                    data,
                              int                       silent )
{
    struct inode *              inode;
    struct dentry *             root;
    struct mumu_mount_data *    mount_data;


    mount_data = kmalloc( sizeof( struct mumu_mount_data ), GFP_KERNEL );

    memset( mount_data, 0, sizeof( struct mumu_mount_data ) );

    mumufs_parse_opt( (char *)data, mount_data );

    sb->s_fs_info = mount_data;
    sb->s_magic = MUMUFS_MAGIC;
    sb->s_op = &mumufs_ops;

    inode = mumufs_get_inode( sb, S_IFDIR | 0755, 0 );

    root = d_alloc_root( inode );
    sb->s_root = root;

        /* It is the point when a file system is mounted */

    CreateMountInfo( sb );
    return 0;
}

First, the mumufs_fill_super() function allocates the mumu_mount_data structure. Then it parses the file system mounting parameters with the help of the mumufs_parse_opt() function and stores a pointer to the mumu_mount_data structure in the superblock's s_fs_info field. Then two important fields are filled in: the mumufs magic number and a pointer to the structure which describes operations on mumufs. After that, the function creates an inode for the root element of the mounted file system with the help of the mumufs_get_inode() function. The mumufs_fill_super() function also creates /proc entries with the help of the CreateMountInfo() function.

The mumufs operations structure and the magic number are as follows:

#define MUMUFS_MAGIC    0x710110

static struct super_operations              mumufs_ops =
{
    .statfs         = mumufs_statfs,
    .drop_inode     = generic_delete_inode,
    .put_super      = mumufs_put_super,
    .show_options   = mumufs_show_options,
};
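
The mumufs_statfs() function referenced in this structure is not listed in the article. For an in-memory file system it can be nearly trivial; the sketch below is an assumption which simply fills in the basic kstatfs fields:

static int  mumufs_statfs( struct dentry *  dentry, struct kstatfs *  buf )
{
    buf->f_type    = MUMUFS_MAGIC;  /* File system magic number */
    buf->f_bsize   = PAGE_SIZE;     /* Nominal block size       */
    buf->f_namelen = NAME_MAX;      /* Maximum file name length */
    return 0;
}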

How to parse the mounting parameters, create an inode, and work with the /proc entries is discussed later.

The mumufs_put_super() function frees the memory occupied by a mount point describing structure.

static void mumufs_put_super( struct super_block *  sb )
{
    RemoveMountInfo( sb );          /* /proc/ entry removal           */

    mumufs_umount_cleanup( sb );
    kfree( sb->s_fs_info );         /* mumu_mount_data removal        */
}

At the time mumufs is unmounted VFS calls the mumufs_kill_super() function. It is possible that a mumufs instance still has some files, i.e. there are allocated memory blocks associated with them. That memory must be freed, and this is done by the mumufs_kill_super() function.

Information about a data block is stored in the Storage structure (which is allocated at the moment of creating an ordinary inode) and a pointer to the Storage structure is stored in the i_private field of the inode structure. Since a superblock holds a list of all its inodes there is enough information to implement the mumufs_kill_super() function.

void  mumufs_kill_super( struct super_block *  sb )
{
    struct inode *      inode;
    struct inode *      next_i;

        /* Iterate over the fs inodes */

    list_for_each_entry_safe(inode, next_i, & sb->s_inodes, i_sb_list)
    {
        if ( S_ISREG( inode->i_mode ) )
        {
            if ( inode->i_private != NULL )
            {
                FreeStorage( (struct Storage *)( inode->i_private ) );
            }
        }
    }

    kill_litter_super( sb );
}

Mounting Parameters Parsing

To simplify parsing the mounting parameters, VFS offers a few helper functions. These functions need a table which describes the allowed parameters:

enum
{
    OPT_MAX_BLOCK_SIZE,     /* max block size limitation   */

    OPT_MAX_ENTRIES,        /* max # of entries limitation */
    OPT_ERROR               /* end of list marker          */
};


static match_table_t    tokens =
{
    {OPT_MAX_BLOCK_SIZE,    "max_block_size=%u"},
    {OPT_MAX_ENTRIES,       "max_entries=%u"},
    {OPT_ERROR,             NULL}
};

Now the mumufs_parse_opt() function can be implemented:

int mumufs_parse_opt( char *                      opt,
                      struct mumu_mount_data *    d )
{
    char *      p;

    d->MaxBlockSize = DEFAULTMAXBLOCKSIZE;
    d->MaxEntries   = DEFAULTMAXENTRIES;

    while ( (p = strsep( &opt, "," )) != NULL )
    {
        int             token;
        int             value;
        substring_t     args[ MAX_OPT_ARGS ];


        if ( !*p )      continue;

        token = match_token( p, tokens, args );
        switch ( token )
        {
            case OPT_MAX_BLOCK_SIZE:
                if ( match_int( &args[0], &value ) )    return 0;
                d->MaxBlockSize = value;
                break;

            case OPT_MAX_ENTRIES:
                if ( match_int( &args[0], &value ) )    return 0;
                d->MaxEntries = value;
                break;

            default:
                return 0;
        }
    }

    return 1;
}

The mounting parameters should be properly displayed when a user issues the ‘mount’ command. Therefore the mumufs_show_options() function should be implemented and a pointer to it has to be saved in the mumufs operations structure.

int mumufs_show_options( struct seq_file *    m,
                         struct vfsmount *    mnt )
{
    struct mumu_mount_data *    d = (struct mumu_mount_data *)
                                            ( mnt->mnt_sb->s_fs_info );

    seq_printf( m, ",max_block_size=%u", d->MaxBlockSize );
    seq_printf( m, ",max_entries=%u", d->MaxEntries );

    return 0;
}

/proc Entries

When the mumufs module is loaded, the /proc/mumu root entry needs to be created, together with the nested /proc/mumu/version entry. Reading from the /proc/mumu/version entry provides the mumufs module version. The creation of the entries is done by the InitialiseProcEntries() function:

static const char *             EntryPoint = "mumu";

static const char *             Version = "0.1.0";
static struct proc_dir_entry *  MainProcEntry = NULL;

int  InitialiseProcEntries( void )
{
    struct proc_dir_entry *     Entry;
    int                         Flags;

    MainProcEntry = proc_mkdir( EntryPoint, NULL );
    if ( MainProcEntry == NULL )
    {
        return -ENOMEM;
    }
    MainProcEntry->owner = THIS_MODULE;


    Flags = S_IFREG | S_IRUGO;              // Anybody can read the file

    Entry = create_proc_entry( "version", Flags, MainProcEntry);
    Entry->owner = THIS_MODULE;
    Entry->read_proc = ReadVersion;

    return 0;
}

Assigning THIS_MODULE to the owner field allows the kernel to count references to the module properly. (It is planned to remove the owner field from proc_dir_entry in future kernel versions, once reference counting for /proc entries is automated.) The read_proc field stores a pointer to a function which will be called when a user reads from the corresponding file.

Obviously, at the time of unloading the module it is necessary to destroy the /proc entries:

int RemoveProcEntries( void )
{
    remove_proc_entry( "version", MainProcEntry );
    remove_proc_entry( EntryPoint, NULL );

    return 0;
}

Creation and destruction of the /proc information about each mounted mumufs instance is done in a similar way.
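
The CreateMountInfo() function, which is called from mumufs_fill_super(), is not listed in the article. A possible sketch (the set of entries follows the /proc layout described earlier; the rest is an assumption) is:

void  CreateMountInfo( struct super_block *  sb )
{
    struct mumu_mount_data *    d = (struct mumu_mount_data *)( sb->s_fs_info );
    struct proc_dir_entry *     Dir;
    struct proc_dir_entry *     Entry;
    char                        Name[ 32 ];

        /* The directory name is a string representation of the pointer */
    snprintf( Name, sizeof( Name ), "%p", d );

    Dir = proc_mkdir( Name, MainProcEntry );
    if ( Dir == NULL )  return;
    Dir->owner = THIS_MODULE;
    d->parent = Dir;                /* Needed to remove the entries later */

    Entry = create_proc_entry( "mountpoint", S_IFREG | S_IRUGO, Dir );
    Entry->owner = THIS_MODULE;
    Entry->read_proc = ReadMountPoint;
    Entry->data = d;                /* Passed to ReadMountPoint() as Data */

        /* The maxentries, maxblocksize and entries files are created in  */
        /* the same way, each with its own read_proc callback (not shown) */
}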

Here is an implementation of the /proc/mumu/version entry reading function:

static int ReadVersion( char *  Page, char **  Start, 
                        off_t  Offset, int  Count, 
                        int *  Eof, void *  Data )
{
    return snprintf( Page, Count - 1, "%s\n", Version ) + 1;
}

The function which provides the mount point path when a user reads the corresponding /proc entry is a bit more complicated. It uses the vfsmount structure fields to identify the full path to the mount point with the help of the VFS d_path() function. The pointer to the vfsmount structure was saved when mumufs was mounted.

static int ReadMountPoint( char *  Page, char **  Start,
                           off_t  Offset, int  Count,
                           int *  Eof, void *  Data )
{
    char                        buf[ NAME_MAX*2 ];
    struct vfsmount *           mount = ((struct mumu_mount_data *)Data)->Mount;
    struct path                 p;

    if ( mount != NULL )
    {
        /* Fill the path structure fields to get the mount point path */

        p.mnt = mount->mnt_parent;
        p.dentry = mount->mnt_mountpoint;

        return snprintf( Page, Count - 1, "%s\n", d_path( &p, buf, sizeof( buf ) ) ) + 1;
    }

    return snprintf( Page, Count - 1, "Unknown\n" ) + 1;
}

Creating and Removing Directories, Symbolic Links and Files

When the process of mounting mumufs was discussed the mumufs_get_inode() function was mentioned. Its responsibility is to create an inode.

The inode structure has, among others, a field which points to the operations allowed for the inode. These operations differ depending on the type of the inode: a directory, a regular file or a symbolic link. mumufs uses the following inode operations structures:

static struct inode_operations              mumufs_file_inode_operations =
{
    .getattr    = simple_getattr,
    .setattr    = mumufs_inode_setattr,
};


static struct inode_operations              mumufs_dir_inode_operations =
{
    .create     = mumufs_create,
    .lookup     = simple_lookup,
    .link       = simple_link,
    .unlink     = mumufs_inode_unlink,
    .symlink    = mumufs_symlink,
    .mkdir      = mumufs_mkdir,
    .rmdir      = simple_rmdir,
    .mknod      = mumufs_mknod,
    .rename     = simple_rename,
};

There is no need to implement the functions whose names start with simple_. They are provided by VFS and their implementation is suitable for mumufs.

For symbolic links, a whole operations structure provided by VFS is used.
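
The mumufs_mknod(), mumufs_create() and mumufs_mkdir() callbacks are not listed in the article either. A minimal sketch following the usual ramfs-like pattern (an assumption about the actual mumufs code) could look as follows:

static int  mumufs_mknod( struct inode *  dir, struct dentry *  dentry,
                          int  mode, dev_t  dev )
{
    struct inode *      inode = mumufs_get_inode( dir->i_sb, mode, dev );

    if ( inode == NULL )    return -ENOSPC;

    d_instantiate( dentry, inode );     /* Attach the inode to the dentry */
    dget( dentry );                     /* Extra count - pin it in core   */
    dir->i_mtime = dir->i_ctime = CURRENT_TIME;
    return 0;
}

static int  mumufs_create( struct inode *  dir, struct dentry *  dentry,
                           int  mode, struct nameidata *  nd )
{
    return mumufs_mknod( dir, dentry, mode | S_IFREG, 0 );
}

static int  mumufs_mkdir( struct inode *  dir, struct dentry *  dentry, int  mode )
{
    int     RetVal = mumufs_mknod( dir, dentry, mode | S_IFDIR, 0 );

    if ( RetVal == 0 )
    {
        dir->i_nlink++;                 /* The new ".." entry in the child */
    }
    return RetVal;
}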

Using the operations structures described above, the mumufs_get_inode() function can be implemented as follows (the error checking code is omitted to keep the code short):


static struct inode *  mumufs_get_inode( struct super_block *  sb,
                                         int                   mode,
                                         dev_t                 dev )
{
    if ( S_ISREG( mode ) || S_ISDIR( mode ) || S_ISLNK( mode ) )
    {
        struct inode *      inode = new_inode( sb );
        if ( inode != NULL )
        {
            inode->i_mode = mode;
            inode->i_uid = current->fsuid;
            inode->i_gid = current->fsgid;
            inode->i_mapping->a_ops = &mumufs_aops;
            inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
            inode->i_private = NULL;

            switch ( mode & S_IFMT )
            {
                case S_IFREG:

                    {
                        int  Records = atomic_read( & ((struct mumu_mount_data *)
                                                       (sb->s_fs_info))->NumberOfEntries );
                        if ( Records >= ((struct mumu_mount_data *)
                                         (sb->s_fs_info))->MaxEntries )
                        {
                            iput( inode );
                            return NULL;
                        }
                    }

                    inode->i_op = & mumufs_file_inode_operations;
                    inode->i_fop = & mumufs_file_operations;

                        /* Alloc the node storage */

                    inode->i_private = AllocateStorage();

                        /* Increment # of entries */
                    atomic_inc( & ((struct mumu_mount_data *)
                                   (sb->s_fs_info))->NumberOfEntries );
                    break;
                case S_IFDIR:
                    inode->i_op = & mumufs_dir_inode_operations;
                    inode->i_fop = & simple_dir_operations;
                    /* dir inodes start off with i_nlink == 2 (for "." entry) */

                    inode->i_nlink++;
                    break;
                case S_IFLNK:
                    inode->i_op = & page_symlink_inode_operations;
                    break;
                default:
                    /* It is impossible - we have 'if' at the beginning */

                    break;
            }
        }
        return inode;
    }

    return NULL;
}

The function checks what inode type it was called for and then calls the VFS-provided new_inode() function. It then fills the fields of the created inode depending on its type. If the inode is for a regular file, the limit on the maximum number of data blocks is checked. If everything is OK, a data block descriptor is allocated with the help of the AllocateStorage() function and the counter of the current number of data blocks is incremented.

The inode structure has an important field called i_fop. It holds a pointer to a structure which describes file operations. The symbolic link and directory inodes can use existing VFS structures, while the regular files need their own:

static struct file_operations               mumufs_file_operations =
{
    .open       = mumufs_file_open,
    .read       = mumufs_file_read,
    .write      = mumufs_file_write,
    .release    = mumufs_file_release,
    .llseek     = mumufs_no_llseek,

    .poll       = mumufs_file_poll,     /* select & poll support */
};

The functions mentioned here are used for all operations with files, which correspond to operations with data blocks.
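
The mumufs_no_llseek() function is not listed in the article. Presumably it just rejects seeking, which makes no sense for block oriented exchange; a sketch based on that assumption (similar to the kernel's no_llseek()):

static loff_t  mumufs_no_llseek( struct file *  File, loff_t  Offset, int  Origin )
{
    return -ESPIPE;     /* Seeking is not supported */
}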

Opening and Closing of Files

The mumufs_file_open() and mumufs_file_release() functions are responsible for opening and closing regular files. An important detail here is that the kernel prepares the 'file' structure which corresponds to a file descriptor in user space. A pointer to the file structure is passed to all the functions which deal with the file. The file structure has the private_data field, which is intended for internal file system usage. mumufs needs to store the version of the last read data block, and this field is a good candidate for that. At the time of opening a file the field should be zeroed.

int     mumufs_file_open( struct inode *  iNode, struct file *  File )
{
    int     RetVal = generic_file_open( iNode, File );
    if ( RetVal != 0 )
    {
        return RetVal;
    }

        /*
         * private_data stores the last read version of a block.
         * 0 - nothing has been read yet
         */

    File->private_data = 0;
    return 0;
}

At the time of closing a file there is nothing to do. The corresponding function is given mostly for demonstration purposes.

int     mumufs_file_release( struct inode *  iNode, struct file *  File )
{
    struct Storage *        storage = (struct Storage *)(iNode->i_private);

    if ( storage == NULL )
    {
        /* It is basically a problem however there is nothing to do */

    }
    return 0;
}

Writing to a File

The function which writes to a file is quite simple. It should check that the new data block size does not exceed the limit set for the file system instance, copy the new data from user space to kernel space, grab the lock, update the pointer to the data block, increase the current data block version, release the lock, and notify all the blocked readers that new data are available. All the actions are performed sequentially (the error checking code is omitted to keep the code short):


ssize_t mumufs_file_write( struct file *  File, const char *  Buffer, 
                           size_t  Size, loff_t *  Offset )
{
    struct inode *              node = File->f_dentry->d_inode;
    struct Storage *            storage = (struct Storage *)(node->i_private);
    struct mumu_mount_data *    mount_data = (struct mumu_mount_data *)
                                                 (File->f_dentry->d_sb->s_fs_info);
    unsigned char *             KernelBuffer = NULL;

        /* I support block oriented read/write so the offset must be 0 */
    if ( *Offset != 0 )                     return -EINVAL;
    if ( Size > mount_data->MaxBlockSize )  return -EINVAL;

        /* Buffer is required because the copying could be interrupted */

    KernelBuffer = kmalloc( Size, GFP_KERNEL );
    if ( KernelBuffer == NULL )             return -ENOSPC;
    if ( copy_from_user( KernelBuffer, Buffer, Size ) != 0 )
    {
        kfree( KernelBuffer );
        return -EFAULT;
    }

    down_write( & storage->BufferLock );

    if ( storage->Buffer != NULL )
    {
        kfree( storage->Buffer );
    }
    storage->Buffer = KernelBuffer;
    storage->BufferSize = Size;

        /* Update the block version */

    if ( ++(storage->BufferVersion) == 0 )
    {
        ++(storage->BufferVersion);
    }

    node->i_size = Size;            /* File size update    */
    node->i_mtime = CURRENT_TIME;   /* Modification time   */
    node->i_bytes = Size;           /* Used bytes          */

    up_write( & storage->BufferLock );

        /* Wake up all the readers */
    wake_up_interruptible( & storage->BlockedReaders );
    return Size;
}

In addition to the actions described above, the function updates the current file size and the file modification time.

Reading from a File

The implementation of reading from a file is a bit more complicated. It is necessary to take into account the mode in which the file was opened. If the file was opened in blocking mode and there are no new data then the reading process should be blocked. If the file was opened in non-blocking mode and there are no new data then the process should not be blocked, but the -EAGAIN value should be returned as an indicator of the data absence.

ssize_t mumufs_file_read( struct file *  File, char *  Buffer, size_t  Size, loff_t *  Offset )
{
    struct inode *              node = File->f_dentry->d_inode;
    struct Storage *            storage = (struct Storage *)(node->i_private);

        /* I support block oriented read/write so the offset must be 0 */

    if ( *Offset != 0 )                     return -EINVAL;
    if ( Size < storage->BufferSize )       return -EINVAL;

  Loop:
    down_read( & storage->BufferLock );
    if ( (storage->BufferVersion != 0 ) &&       /* There are data available     */
         (File->private_data !=
             (void*)(storage->BufferVersion)) )  /* New data block is available  */
    {
        int     BytesCopied = storage->BufferSize;

        if ( copy_to_user( Buffer, storage->Buffer, BytesCopied ) != 0 )
        {
            BytesCopied = -EINVAL;
        }
        else
        {
                /* Save the version of the read block */
            File->private_data = (void*)(storage->BufferVersion);
            node->i_atime = CURRENT_TIME;
        }

        up_read( & storage->BufferLock );
        return BytesCopied;
    }


        /*
         * There is nothing to read because: - 0 bytes available
         *                                   - no "fresh" data
         */

    if ( File->f_flags & O_NONBLOCK )   /* Not required to block */
    {
        up_read( & storage->BufferLock );
        return -EAGAIN;
    }

        /* It is blocking I/O */

    up_read( & storage->BufferLock );
    if ( wait_event_interruptible( storage->BlockedReaders,
                        ((void*)(storage->BufferVersion) != File->private_data) ) != 0 )
    {
        return -ERESTARTSYS;
    }
    goto Loop;
}

At the very beginning the function checks that the user-provided buffer has enough space to hold the currently available data block. Then the function grabs the reading lock and checks whether new data are available. If they are, they are copied into the user-provided buffer, i.e. to user space, the reading lock is released and the job is done. If there are no new data then the file opening mode is checked. Depending on the mode the process is put on a wait queue or not. This queue is used by the file writing function to notify the readers when new data become available.

Select and Poll

Support for the poll and select operations is the last piece of functionality to be added to mumufs. The corresponding function should return a mask which indicates what file operations a process is able to perform without blocking.

mumufs allows non-blocking writing at any moment, while non-blocking reading is possible only if new data are available. Before returning the mask it is necessary to register a wait queue for the select() and poll() system calls with the help of the poll_wait() function.

unsigned int  mumufs_file_poll( struct file *  File,
                                struct poll_table_struct *  PollTable )
{
    unsigned int        mask = POLLOUT | POLLWRNORM;    /* Writing is always welcome */
    struct inode *      node = File->f_dentry->d_inode;
    struct Storage *    storage = (struct Storage *)(node->i_private);

    poll_wait( File, & storage->BlockedReaders, PollTable );

    down_read( & storage->BufferLock );
    if ( (storage->BufferVersion != 0 ) &&          /* There are data in the block       */
         (File->private_data !=
               (void*)(storage->BufferVersion)) )   /* New block is available            */
    {
        mask |= POLLIN | POLLRDNORM;
    }
    up_read( & storage->BufferLock );

    return mask;
}
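
To illustrate how a consumer can combine the poll support with reading, here is a user-space sketch (not from the article; the /mnt/mumu/wind path is an assumption) that waits for a new block with select() and then reads it:

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/select.h>

int main( void )
{
    char        buffer[ 1024 ];
    ssize_t     bytes;
    fd_set      readSet;
    int         fd = open( "/mnt/mumu/wind", O_RDONLY | O_NONBLOCK );

    if ( fd < 0 )   return 1;

    for ( ; ; )
    {
        FD_ZERO( & readSet );
        FD_SET( fd, & readSet );

        /* Blocks until mumufs_file_poll() reports POLLIN for this descriptor */
        if ( select( fd + 1, & readSet, NULL, NULL, NULL ) < 0 )    break;

        bytes = read( fd, buffer, sizeof( buffer ) );
        if ( bytes > 0 )
            printf( "Got %zd bytes\n", bytes );
    }

    close( fd );
    return 0;
}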

Performance Testing

It is reasonable to test the mumufs performance in comparison with another data exchange mechanism that already exists in the OS. There are at least two candidates: shared memory and message queues. At first glance, shared memory looks closer to mumufs because it does not limit the number of data suppliers and consumers and stores exactly one copy of the data. Shared memory, however, does not provide any kind of access synchronization. That means that the client code has to implement synchronization itself, which makes it more complex and increases the probability of errors. To make things worse, such synchronization is voluntary, and this may affect the software reliability. Bearing in mind the considerations above, message queues are closer to mumufs in terms of the similarity of the client code and the software reliability. To be more precise, the mumufs-based client code is simpler, but there is nothing closer than message queues, so the comparative performance testing is done for mumufs and message queues.

The message queue mechanism is not well suited for organizing data exchange between many suppliers and many consumers: each pair would require a dedicated queue, which leads to complicated test code. Therefore the test examples are developed for a model with one data supplier and many consumers. Obviously, message queues lose to mumufs in terms of simplicity of use for pure many-to-many communications, but the simplification keeps the test code reasonably compact and easy. It is still possible to draw conclusions about the performance of the many-to-many data exchange model from the test results for the one-to-many model.

The charts below show the time required for a data supplier to send (in the case of a message queue) or to write (in the case of mumufs) all the packets, and the time from sending/writing the first packet until the last packet has been received/read by all the consumers. The time is shown on the Y axis, so a smaller value means better performance. The number of packets is the same for all runs and is equal to 100000. The number of consumers varies from 1 to 64 and is shown on the X axis. Each experiment had 3 runs and the average value was calculated.

Figure 2. Comparative performance with a packet of 16 bytes

Figure 3. Comparative performance with a packet of 512 bytes

Figure 4. Comparative performance with a packet of 1024 bytes

Figure 5. Comparative performance with a packet of 2048 bytes

Figure 6. Comparative performance with a packet of 4096 bytes

Figure 7. Comparative performance with a packet of 8192 bytes

The chart of the performance with a packet of 16384 bytes is given for mumufs only. Message queues have a system-imposed size limit, 8192 bytes by default, so those test runs were not executed. The purpose of the chart is to have a look at the mumufs scalability.

Figure 8. mumufs performance with a packet of 16384 bytes

It can be noted that in the case of a single data supplier and a single data consumer mumufs loses to message queues; the loss can be up to 40%. As the number of consumers grows, however, mumufs starts to win; the win can be up to 120%. The required time grows with the packet size, although somewhat slower than the packet size itself.

From the memory consumption point of view mumufs consumes less memory at peak time than message queues. At peak, message queues store all the unread messages for every consumer, while mumufs stores only the last data block.

mumufs has some room to increase performance at the cost of memory consumption. Two buffers of the maximum size could be allocated for each data block at the time a file is created. This approach would avoid expensive memory allocation and deallocation for each write operation and would reduce locking contention. Such an implementation, however, is a matter for the future.

Finally, a few words about the hardware and the operating system used for development and testing. The hardware was based on an Intel Atom 330 CPU with 2 GB of RAM, running Fedora 10 (kernel 2.6.27). More details can be found in [5].

Further Reading

  1. Sergey Satskiy. Mumufs: Virtual File System to Support IPC of Type Many-to-many. http://satsky.spb.ru/articles/mumufs/mumufsArticleEng.php
  2. Claudia Salzberg Rodriguez, Gordon Fischer, and Steven Smolski. The Linux Kernel Primer. Prentice Hall, 2005
  3. Robert Love. Linux Kernel Development. 2nd Edition. Novell, 2005
  4. Daniel Bovet, Marco Cesati. Understanding the Linux Kernel. O’Reilly, 2005
  5. Sergey Satskiy. Cheap Home Linux Server on miniItx Board. http://satsky.spb.ru/articles/LinuxOnMiniITX/LinuxOnMiniITXEng.php
  6. LWN: Creating Linux virtual filesystems. 2002. http://lwn.net/Articles/13325/
  7. LWN: Creating Linux virtual filesystems. 2003. http://lwn.net/Articles/57369/
  8. LWN: lwnfs simpler source code. 2003. http://lwn.net/Articles/57373/
  9. Mumufs source code which corresponds to the article. http://satsky.spb.ru/articles/mumufs/mumufs-0.1.0.tar.bz2


Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved.

Last Updated: September 08, 2009