Installing
This section describes the install process for MarFS.
Overview
MarFS is a distributed parallel filesystem, so its setup is involved and relies on a number of different technologies and systems. In this guide we will go over the setup of an example system on one node. This guide assumes knowledge of ZFS, MPI, and general Linux operations.
Storage
MarFS stores data and metadata differently. Data is stored as erasure-coded objects. When data is written, it is broken up into N pieces. Erasure data is then calculated over those N pieces to create E erasure objects. The resulting N+E objects are mapped onto N+E identical filesystems, which we refer to as a “pod”. In our example cluster we will have four filesystems in a single pod, giving us a 3+1 erasure coding scheme. A cluster can have multiple pods; when it does, a pod is selected with a hash and the data is written to that pod. Data will never be written across multiple pods, so if you have four pods, each matching our single pod with its 3+1 scheme, the four objects of any given write will always land in the same pod.
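As a concrete illustration of the 3+1 scheme above: a write is split into three data pieces, one erasure piece is computed from them, and the resulting four objects are written one per filesystem within the selected pod. Any single filesystem in the pod can then be lost without losing data, since the remaining three objects are sufficient to reconstruct the original content.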
Metadata can be stored on any filesystem that supports extended attributes and sparse files. For scalability, a distributed filesystem is highly recommended. In our example system we will just use a directory, but our production clusters typically run the General Parallel File System (GPFS).
Data Access
With object data being stored across a number of pods, it is reasonable to provide a way to interact with the filesystem in a unified manner. Most users expect a single mount point they can look through for various tasks. This is provided through FUSE, allowing users to browse their data. In production systems this FUSE mount is read-only and exists so users can locate their files for parallel movement, although MarFS allows FUSE to be configured for read/write access.
Data Movement
Data is moved in parallel using PFTool. In our production systems, nodes running PFTool are called “File Transfer Agent” nodes, or FTAs.
Production Cluster Summary
In a typical production system, we have three types of nodes, each handling a different role. For the purpose of our example system, one node will handle all three roles.
Storage Nodes
Each storage node uses ZFS for MarFS block storage. Each node will have four zpools in a RAIDZ3 (17+3) configuration. We have multiple pods configured to use 10+2 erasure coding. Storage nodes must have a high-performance network such as InfiniBand.
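As an illustration only (device names are hypothetical and pool options are site-specific), one such RAIDZ3 (17+3) pool could be created with:
zpool create pool0 raidz3 /dev/sd{b..u}   # 20 hypothetical drives: 17 data + 3 parity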
Metadata Nodes
We use GPFS as metadata storage in production systems. Your GPFS cluster should already be set up and ready to create filesets. You should have a high-performance network such as InfiniBand when using GPFS.
File Transfer Nodes
These nodes will be used to move data in parallel from one place to another using PFTool. They must have a high-performance network such as InfiniBand. They will also be used to present MarFS to users through a FUSE mount.
MarFS abstractions
Recall the pod from the storage overview above. Beyond the pod, there are several logical data abstractions that will come up later when reading the configuration file. We will describe them briefly here first.
The Repository
A repo is where all the object data for a MarFS Filesystem lives; it’s a logical description of a MarFS object-store, with details on the number of storage servers, etc.
Data Abstraction Layer
The Data Abstraction Layer (DAL) is the interface through which MarFS interacts with its data objects; it is provided by LibNE in the erasureUtils repository. The most commonly used DAL type is “posix”, which backs multi-component storage by translating MarFS objects into posix-style files stored under a configured root directory.
The Namespace
A namespace in MarFS is a logical partition of the MarFS filesystem with a unique (virtual) mount point and attributes like permissions, similar to ZFS datasets. It also includes configuration details regarding MarFS metadata storage for that namespace. Each namespace in MarFS must be associated with a repo, and you can have multiple namespaces per repo. Both repos and namespaces are arbitrarily named.
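For example, with the config file used later in this guide, the root namespace is presented at the MarFS mount point /campaign, and a single subspace appears beneath it as /campaign/full-access-subspace.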
Capacity Units
Each capacity unit (cap) is a datastore, typically a ZFS zpool, on a block within a pod.
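With the posix DAL used in this guide, the pod, block, cap, and scatter abstractions map directly onto a directory path beneath the DAL root. For example, given the dir_template in the config file below, an object assigned to pod 0, block 2, cap 1, scatter 3 would be stored under a path like:
/marfs/dal-root/pod0/block2/cap1/scat3/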
Create Base Directory Structure
If we are building MarFS on a multi-node system, it is helpful to have a shared filesystem among all the nodes in the cluster, such as an NFS share mounted on all nodes. We will keep all our source code and other files that must be shared here. In our example we will use /opt/campaign. Before we start installing, let's create this directory and a subdirectory to function as an install target, and add its bin directory to our PATH environment variable.
mkdir -p /opt/campaign/install/bin
cd /opt/campaign
export PATH=$PATH:/opt/campaign/install/bin
We will also need a root directory for our metadata store
mkdir -p /marfs/mdal-root
and one for our data store
mkdir /marfs/dal-root
The last directory we need to create will be for our filesystem mount point.
mkdir /campaign
MarFS Config File
MarFS uses a config file to set up repositories and namespaces. We will use this example config file, which we will create at /opt/campaign/install/etc/marfs-config.xml:
<!--
Copyright (c) 2015, Los Alamos National Security, LLC
All rights reserved.
Copyright 2015. Los Alamos National Security, LLC. This software was produced
under U.S. Government contract DE-AC52-06NA25396 for Los Alamos National
Laboratory (LANL), which is operated by Los Alamos National Security, LLC for
the U.S. Department of Energy. The U.S. Government has rights to use, reproduce,
and distribute this software. NEITHER THE GOVERNMENT NOR LOS ALAMOS NATIONAL
SECURITY, LLC MAKES ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LIABILITY
FOR THE USE OF THIS SOFTWARE. If software is modified to produce derivative
works, such modified software should be clearly marked, so as not to confuse it
with the version available from LANL.
Additionally, redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of Los Alamos National Security, LLC, Los Alamos National
Laboratory, LANL, the U.S. Government, nor the names of its contributors may be
used to endorse or promote products derived from this software without specific
prior written permission.
THIS SOFTWARE IS PROVIDED BY LOS ALAMOS NATIONAL SECURITY, LLC AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL LOS ALAMOS NATIONAL SECURITY, LLC OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY,
OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
NOTE:
Although these files reside in a separate repository, they fall under the MarFS copyright and license.
MarFS is released under the BSD license.
MarFS was reviewed and released by LANL under Los Alamos Computer Code identifier:
LA-CC-15-039.
These erasure utilities make use of the Intel Intelligent Storage
Acceleration Library (Intel ISA-L), which can be found at
https://github.com/01org/isa-l and is under its own license.
MarFS uses libaws4c for Amazon S3 object communication. The original version
is at https://aws.amazon.com/code/Amazon-S3/2601 and under the LGPL license.
LANL added functionality to the original work. The original work plus
LANL contributions is found at https://github.com/jti-lanl/aws4c.
GNU licenses can be found at http://www.gnu.org/licenses/.
-->
<!-- Start MarFS Configuration Info -->
<marfs_config version="0.0001beta-ExampleVersion">
<!-- Mount Point
* This is the absolute path to a directory at which MarFS expects to be mounted.
* Tools which interpret absolute paths ( FUSE ) will only function if they operate below this location.
* -->
<mnt_top>/campaign</mnt_top>
<!-- Host Definitions ( CURRENTLY IGNORED )
* This is a placeholder for future implementation.
* The idea is to facilitate configuration of an entire MarFS cluster via
* host definitions within this config.
* -->
<hosts> ... </hosts>
<!-- Repo Definition
* A MarFS 'repo' is a definition of data and metadata structure. It is the uppermost conceptual abstraction of
* MarFS.
* All files written to MarFS must be uniquely and permanently associated with a repo, which defines where file
* information is placed and how that information is formatted.
* -->
<repo name="4+1_Repo">
<!-- Per-Repo Data Scheme
* Defines where data objects are placed, how they are formatted, and through what interface they are accessed
* -->
<data>
<!-- Erasure Protection
* Protection scheme for data objects of this repo.
* Defines a count of data blocks ( 'N' ), parity blocks ( 'E' ), and erasure stripe width ( 'PSZ' ) for
* each object.
* -->
<protection>
<N>4</N>
<E>1</E>
<PSZ>1024</PSZ>
</protection>
<!-- Packing
* This feature allows for the data content of multiple files to be 'packed' into a single data object.
* When enabled, groups of files written to a single client 'stream' may be packed together.
* Files may also be packed afterwards by the Resource Manager ( 'marfs-rman' ) in a process known as
* 'repacking'. Data from up to 'max_files' individual files may be packed into a single data object.
* It is *highly* recommended to enable this feature, as it should greatly increase the efficiency of
* reading+writing large numbers of small files while having no significant performance penalty for large
* files.
* -->
<packing enabled="yes">
<max_files>4096</max_files>
</packing>
<!-- Chunking
* WARNING : At present, running with 'chunking' DISABLED is considered EXPERIMENTAL ONLY. UNDEFINED
* BEHAVIOR MAY RESULT!
*
* This feature allows for the data content of files to be divided over multiple data objects.
* When enabled, only up to 'max_size' data bytes will be written to a single MarFS object. Any overflow
* will be stored to the subsequent data object in the same datastream. The 'max_size' value DOES include
* MarFS recovery info.
* It is *highly* recommended to enable this feature, as it should allow MarFS to avoid creating individual
* objects which exceed size limitations of the underlying data storage medium. Additionally, this feature
* provides an opportunity for client programs, such as pftool, to parallelize read/write of very large
* files.
* -->
<chunking enabled="yes">
<max_size>1G</max_size>
</chunking>
<!-- Object Distribution
* WARNING: NEVER ADJUST THESE VALUES FOR AN EXISTING REPO, as doing so will render all previously written
* data objects inaccessible!
*
* Defines how objects are hashed across available storage targets.
* Actual implementation of pod / cap / scatter divisions will depend upon DAL definition used. However,
* the general defs are:
* POD = Refers to the broadest set of data storage infrastructure; generally, a group of N+E storage servers
* CAP = Refers to the smallest set of data storage infrastructure; generally, a ZFS pool of disks
* SCATTER = Refers to a purely logical division of storage; generally, a single target directory or bucket
* For each distribution element, the following attributes can be defined:
* cnt = Total number of elements of that type ( i.e. pods cnt="10" indicates to use pod indices 0-9 )
* dweight = Default 'weight' value of each index ( assumed to be '1', if omitted )
* Particular distribution indices can be 'weighted' to increase or decrease the number of data objects hashed
* to those locations.
* The weight value represents a multiplicative increase in the likelihood of that index being selected as a
* storage target for a particular data object. Custom weight values can be defined for each index within the
* body of the respective element.
* The format for these custom weight strings is:
* '{IndexNum}={CustomWeightValue}[,{IndexNum}={CustomWeightValue}]*'
* If no custom value is specified, the default weight value is assumed ( see 'dweight', above ).
* Examples:
* <pods cnt="3" dweight="3">0=1,2=6</pods>
* = This defines pod values ranging from 0-2 such that
* Pod 0 will receive roughly 10% of data objects from this repo
* ( IndexWeight / TotalWeight = 1/(1+3+6) = 1/10 -> 10% )
* Pod 1 will receive roughly 30% of data objects from this repo
* Pod 2 will receive roughly 60% of data objects from this repo
* <caps cnt="4" dweight="100">1=200,0=0</caps>
* = This defines cap values ranging from 0-3 such that
* Cap 0 will never receive any data objects from this repo
* ( IndexWeight of zero results in removal as a storage tgt )
* Cap 1 will receive roughly 50% of data objects from this repo
* ( IndexWeight / TotalWeight = 200/400 -> 50% )
* Cap 2 will receive roughly 25% of data objects from this repo
* Cap 3 will receive roughly 25% of data objects from this repo
* -->
<distribution>
<pods cnt="1"></pods>
<caps cnt="4"></caps>
<scatters cnt="4"/>
</distribution>
<!-- DAL Definition ( ignored by the MarFS code, parsed by LibNE )
* This defines the configuration for the Data Abstraction Layer ( DAL ), which provides the interface
* through which MarFS interacts with data objects. Please see the erasureUtils repo ( LibNE ) for
* implementation details.
* In most contexts, use of the 'posix' DAL is recommended, which will translate MarFS objects into
* posix-style files, stored at paths defined by 'dir_template' below a root location defined by 'sec_root'.
* -->
<DAL type="posix">
<dir_template>pod{p}/block{b}/cap{c}/scat{s}/</dir_template>
<sec_root>/marfs/dal-root</sec_root>
</DAL>
</data>
<!-- Per-Repo Metadata Scheme
* Defines where file metadata is placed, how it is formatted, through what interface it is accessed, and what
* operations are permitted
* -->
<meta>
<!-- Namespace Definitions
* These define logical groupings of MarFS metadata, known as namespaces. They can be thought of as MarFS
* 'allocations' or 'datasets', each with tunable access permissions and quota values.
* Namespaces follow a posix tree structure, appearing as the uppermost directories of the filesystem,
* beginning with the 'root' NS.
* Namespaces can be children of other namespaces, but never children of a standard directory. Thus, every
* file and directory within the system is a child of a specific MarFS NS.
* Every MarFS file is uniquely and permanently associated with the NS to which it was originally written.
* Files cannot be linked or renamed between namespaces ( with the exception of Ghost Namespaces ).
*
* This node also defines the reference tree structure for all included namespaces.
* rbreadth = breadth of the reference tree
* rdepth = depth of the reference tree
* rdigits = minimum number of numeric digits per reference dir entry
* Example:
* rbreadth="3" rdepth="2" rdigits="4"
* = This would result in use of reference directories...
* "0000/0000", "0000/0001", "0000/0002"
* "0001/0000", "0001/0001", "0001/0002"
* "0002/0000", "0002/0001", "0002/0002"
*
* WARNING: Be VERY CAREFUL when adjusting the reference structure of an existing repo. As a rule of thumb,
* it should be avoided. However, it is theoretically safe to adjust, so long as the new reference
* paths are a superset of the old ( all old reference paths are still included as valid targets ).
* Note that, because reference files are only placed at the leaves of the reference tree, this
* means that increasing 'rbreadth' is the only adjustment which can safely be made to an active repo.
* -->
<namespaces rbreadth="1" rdepth="2" rdigits="2">
<!-- Namespace
* Definition of a standard MarFS Namespace. Each such NS can have subspaces defined by contained XML 'ns'
* subnodes. Every NS, with the exception of the single 'root' NS, must have a unique parent NS.
* -->
<ns name="root">
<!-- Quota Limits for this NS
* Quota limits can be defined in terms of number of files and quantity of data.
* A missing quota definition is interpreted as no quota limit on that value.
* -->
<quotas>
<files>10K</files> <!-- 10240 file count limit -->
<data>10T</data> <!-- 10 tebibyte data size limit -->
</quotas>
<!-- Permission Settings for this NS
* Permissions are defined independently for 'interactive' and 'batch' programs.
* Program type is determined by the client during MarFS initialization ( see marfs_init() API function ).
* Permissions are defined as: '{PermissionCode}[,{PermissionCode}]*'
* Permission codes are:
* RM = Read Metadata
* = Allows the client to perform ops which read, but do not modify, metadata content.
* = This confers the ability to perform ops such as 'readdir', 'stat', 'getxattr', 'open', etc.
* = NOTE : It could easily be argued that no useful operation can be performed without some kind of
* underlying metadata read ( such as directory traversal ).
* Additionally, operations such as 'rmdir' can provide information on metadata structure
* through brute force guessing of path names. For example, failure of 'rmdir' with
* ENOTDIR indicates existence of a non-directory filesystem entry at a guessed location.
* However, for the sake of understandability and "least-surprise", only 'WM' permission
* is needed for that operation.
* If you truly desire that a client remain completely ignorant of NS contents, disable ALL
* permissions and/or completely block access via posix permission sets.
* WM = Write Metadata
* = Allows the client to perform ops which modify, but do not explicitly read, metadata content.
* = This confers the ability to perform ops such as 'mkdir', 'chmod', 'setxattr', 'unlink', etc.
* = NOTE : Take care to note the inclusion of 'unlink' under this permission code. Only 'WM' is
* necessary to delete files / dirs.
* RD = Read Data
* = Allows the client to read the data content of MarFS files
* = This confers the ability, specifically, to read from a MarFS file handle.
* = NOTE : As 'open' access is controlled by 'RM', this permission code is rendered useless without it.
* WD = Write Data
* = Allows the client to create new MarFS files and write data content to them
* = This confers the ability to perform the 'creat' and 'write' operations.
* -->
<perms>
<!-- no data write for interactive programs -->
<interactive>RM,WM,RD</interactive>
<!-- full batch program access -->
<batch>RM,WM,RD,WD</batch>
</perms>
<!-- Subspace Definition -->
<ns name="full-access-subspace">
<!-- no quota definition implies no limits -->
<!-- full permissions for all clients -->
<perms>
<interactive>RM,WM,RD,WD</interactive>
<batch>RM,WM,RD,WD</batch>
</perms>
</ns>
</ns>
</namespaces>
<!-- Direct Data
* Enables the reading of files stored directly to the MDAL storage location, with no associated MarFS data
* objects.
* This is useful if an admin wants to drop in a small, temporary file without actually writing out a full
* data object.
* NOTE : At present, direct write of files is unimplemented and may remain that way.
* -->
<direct read="yes"/>
<!-- MDAL Definition
* Defines the interface for interacting with repo metadata.
* In most contexts, the use of the 'posix' MDAL is recommended, which will store MarFS metadata in the form
* of posix files + xattrs.
* -->
<MDAL type="posix">
<ns_root>/marfs/mdal-root</ns_root>
</MDAL>
</meta>
</repo>
</marfs_config>
We now need to export the config path as an environment variable so it can be found by the MarFS binaries:
export MARFS_CONFIG_PATH=/opt/campaign/install/etc/marfs-config.xml
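MARFS_CONFIG_PATH (like the PATH addition earlier) must be set in every shell that runs MarFS tools, and on every node of a multi-node system. One way to persist it, assuming bash, is:
echo 'export MARFS_CONFIG_PATH=/opt/campaign/install/etc/marfs-config.xml' >> ~/.bashrc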
Dependencies
Depending on your environment you may need a slightly different set of packages. To install and make use of MarFS you will need the following tools. Fortunately, many dependencies can be acquired through a package manager:
* FUSE (and development packages)
* G++
* Git
* Libattr (and development packages)
* Libtool
* Libxml2 (and development packages)
* Make
* Nasm
* Open MPI (and development packages)
* Open SSL (and development packages)
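For example, on a RHEL-like system most of these can be installed in one step (package names are approximate and will vary by distribution):
yum install -y fuse fuse-devel gcc-c++ git libattr-devel libtool libxml2-devel make nasm openmpi openmpi-devel openssl-devel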
Others can be obtained from source.
git clone https://github.com/01org/isa-l.git
git clone https://github.com/mar-file-system/erasureUtils.git
git clone https://github.com/mar-file-system/marfs.git
git clone https://github.com/pftool/pftool.git
A quick description of tools acquired from source:
* ISA-L: Intel’s Intelligent Storage Acceleration Library
* ErasureUtils: The erasure coding layer used for Multi-Component storage
* MarFS: The core MarFS libraries
* PFTool: A tool for parallel data movement
You will need yasm 1.2.0 or later for ISA-L.
You may also need to ensure that MPI is in your $PATH environment variable and that Open MPI's library directory is in your $LD_LIBRARY_PATH environment variable.
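ErasureUtils depends on ISA-L, so ISA-L should be built and installed into the install prefix first; note that the cd ../erasureUtils step below assumes we are still inside the isa-l directory. A sketch, assuming the repositories were cloned into /opt/campaign and ISA-L's standard autotools build:
cd /opt/campaign/isa-l
./autogen.sh
./configure --prefix=/opt/campaign/install/
make
make install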
ErasureUtils
cd ../erasureUtils
autoreconf -i
./configure --prefix=/opt/campaign/install/ CFLAGS="-I/opt/campaign/install/include" LDFLAGS="-L/opt/campaign/install/lib"
make
make install
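MarFS itself is built next, into the same install prefix. Exact options can vary between releases (check the repository's README); a sketch assuming the same autotools flow as ErasureUtils:
cd ../marfs
autoreconf -i
./configure --prefix=/opt/campaign/install/ CFLAGS="-I/opt/campaign/install/include" LDFLAGS="-L/opt/campaign/install/lib"
make
make install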
Starting MarFS
We now have all our utilities built and are almost ready to boot the filesystem. On a simple one-node system like this one, which does not leverage GPFS or ZFS, the process is as simple as launching the FUSE mount. On more complex systems, you will have to initialize the meta and data stores and any network mounts. These tasks can be automated using tools such as Ansible, but that is beyond the scope of this document.
Verifying the Configuration
Before we start the filesystem, we first need to verify that our configuration file is valid and that the meta and data spaces match the structure specified in the config file. We do this by running
marfs-verifyconf -a
which will print “Config Verified” to the console if the system is ready to boot.
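If the startup is scripted, the exit status can be checked instead of the printed message (assuming marfs-verifyconf returns nonzero on failure):
marfs-verifyconf -a || echo "MarFS config verification failed"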
Launching FUSE
To launch FUSE (and our filesystem), we execute:
marfs-fuse -o allow_other,use_ino,intr /campaign
If everything goes according to plan, marfs-fuse will start in the background, and you should be able to access the MarFS filesystem through the mount at /campaign. If the marfs-fuse call returned a nonzero value, something went wrong; in that case, reconfigure MarFS with the --enable-debugALL flag.
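When debugging, it also helps to run the FUSE daemon in the foreground so its output is visible. A sketch, assuming the autotools rebuild described above and FUSE's standard -f (foreground) option:
cd /opt/campaign/marfs
./configure --prefix=/opt/campaign/install/ --enable-debugALL CFLAGS="-I/opt/campaign/install/include" LDFLAGS="-L/opt/campaign/install/lib"
make
make install
marfs-fuse -f -o allow_other,use_ino,intr /campaign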
See if it works!
$ ls /campaign
full-access-subspace
$ echo test > /campaign/full-access-subspace/test
$ cat /campaign/full-access-subspace/test
test
Build and run PFTool
In the previous section we mounted MarFS through FUSE. It is possible to use this FUSE mount for all access to MarFS, but performance is limited by going through a single host (and through FUSE), even though in multi-node systems writes to the underlying storage servers are performed in parallel. Therefore, for a large datacenter, it may make more sense to use FUSE solely for metadata access (rename, delete, stat, ls, etc.). This can be achieved by changing the “perms” for every namespace in the config file.
We recommend using pftool for the “heavy lifting” of transferring big datasets. PFTool provides fast, parallel MarFS data movement, and is preferred over using the FUSE mount.
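If PFTool was not built earlier, it is built much like the other components. A sketch, assuming the repository was cloned into /opt/campaign and an autotools-style build (check the pftool README for the exact steps, for example whether MPI compiler wrappers must be specified):
cd /opt/campaign/pftool
./configure --prefix=/opt/campaign/install/
make
make install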
The PFTool binary can be invoked directly, but the preferred method is through the provided helper scripts (pfls, pfcm, pfcp).
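The helper scripts mirror their familiar single-node counterparts (ls, cp, and so on). For example, listing a namespace and copying a hypothetical directory into it might look like the following; consult the pftool documentation for the exact options:
pfls /campaign/full-access-subspace
pfcp /scratch/mydata /campaign/full-access-subspace/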