Installing
**********

This section describes the install process for MarFS.

.. contents::
   :depth: 2
   :local:

Overview
========

MarFS is a distributed parallel filesystem, so its setup is involved and relies on a number of different technologies and systems. In this guide we will go over the setup of an example system on one node. This guide assumes knowledge of ZFS, MPI, and general Linux operations.

Storage
-------

MarFS stores data and metadata differently. Data is stored as erasure-coded objects. When data is written it is broken up into N pieces. Erasure data is calculated across those N objects to create E erasure objects. The N+E objects are then mapped onto N+E identical filesystems. We refer to these N+E filesystems as a "pod". In our example cluster we will have four filesystems in a single pod, giving us a 3+1 erasure coding scheme: each write produces three data objects plus one erasure object, one per filesystem.

You can have multiple pods in a cluster. When there are multiple pods, a hash selects the pod and the data is written to that pod. Data is never written across multiple pods, so if you have four pods, each using our 3+1 scheme, the four objects from a given write will always land in the same pod.

Metadata can be stored on any filesystem that supports extended attributes and sparse files. For scalability a distributed filesystem is highly recommended. In our example system we will just use a directory, but our production clusters typically run General Parallel Filesystem (GPFS).

Data Access
-----------

With object data stored across a number of pods, it is reasonable to provide a way to interact with the filesystem in a unified manner. Most users expect a single mount point they can look through for various tasks. This is provided through FUSE, allowing users to look at their data. In production systems this FUSE mount is read-only and exists so users can locate their files for parallel movement, although MarFS allows FUSE to be configured for read/write access.

Data Movement
-------------

Data is moved in parallel using PFTool. In our production systems, nodes running PFTool are called "File Transfer Agent" nodes, or FTAs.

Production Cluster Summary
==========================

In a typical production system we have three types of nodes, each handling a different role. For the purpose of our example system, one node will handle all three roles.

Storage Nodes
-------------

Each storage node uses ZFS for MarFS block storage. Each node will have four zpools in a RAIDZ3 (17+3) configuration. We have multiple pods configured to use 10+2 erasure coding. Storage nodes must have a high performance network such as Infiniband.

Metadata Nodes
--------------

We use GPFS as metadata storage in production systems. Your GPFS cluster should already be set up and ready to create filesets. You should have a high performance network such as Infiniband when using GPFS.

File Transfer Nodes
-------------------

These nodes will be used to move data in parallel from one place to another. We will use `PFTool <https://github.com/pftool/pftool>`_ for this. You must have a high performance network such as Infiniband. These nodes will also be used to present MarFS to users through a FUSE mount.

MarFS Abstractions
==================

Earlier we talked about the pod, and there is more to understand about it. There are also other logical data abstractions we will see later in the configuration file. We will talk about them briefly here first.

The Repository
--------------

A repo is where all the object data for a MarFS filesystem lives; it is a logical description of a MarFS object-store, with details on the number of storage servers, etc.

Data Abstraction Layer
----------------------

The Data Abstraction Layer (DAL) describes how a repo's object data is actually stored, for example the multi-component (erasure-coded) storage described above, where objects are spread across the blocks and capacity units of a pod.

Metadata Abstraction Layer
--------------------------

The Metadata Abstraction Layer (MDAL) describes how and where a namespace's metadata is stored (in our example, a plain directory; in production, typically GPFS).

The Namespace
-------------

A namespace in MarFS is a logical partition of the MarFS filesystem with a unique (virtual) mount point and attributes like permissions, similar to ZFS datasets. It also includes configuration details regarding MarFS metadata storage for that namespace. Each namespace in MarFS must be associated with a repo, and you can have multiple namespaces per repo. Both repos and namespaces are arbitrarily named.

Pods
----

A collection of storage nodes.

Blocks
------

A storage node in a pod.

Capacity Units
--------------

Each capacity unit (cap) is a datastore on a ZFS zpool on a block in a pod.
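To make the pod, block, and cap terms concrete, here is a small sketch of how a single 3+1 pod could be laid out as plain directories on a one-node system like ours. The path and directory names are purely illustrative and this snippet is not part of the install steps; the real on-disk layout is determined by the data abstraction layer settings in the config file we create below.

.. code-block:: bash

   # Illustration only: one pod containing four blocks (3+1), each with a
   # single capacity unit, laid out as plain directories. The real layout
   # comes from the DAL definition in marfs-config.xml, not this snippet.
   mkdir -p /tmp/marfs-layout-example/pod0/block{0..3}/cap0
   find /tmp/marfs-layout-example -type d | sort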
Create Base Directory Structure
===============================

If you are building MarFS on a multi-node system, it is helpful to have a shared filesystem among all the nodes in the cluster, such as an NFS share mounted on every node. We will keep all our source code and other files that must be shared here. In our example we will use :code:`/opt/campaign`.

Before we start installing, let's create this directory and a subdirectory to function as an install target, and add it to our :code:`PATH` environment variable.

.. code-block:: bash

   mkdir -p /opt/campaign/install/bin
   cd /opt/campaign
   export PATH=$PATH:/opt/campaign/install/bin

We will also need a root directory for our metadata store:

.. code-block:: bash

   mkdir -p /marfs/mdal-root

and one for our data store:

.. code-block:: bash

   mkdir /marfs/dal-root

The last directory we need to create will be for our filesystem mount point.

.. code-block:: bash

   mkdir /campaign

MarFS Config File
-----------------

MarFS uses a config file to set up repositories and namespaces. We will use this example config file, which we will create at :code:`/opt/campaign/install/etc/marfs-config.xml`:

.. literalinclude:: _static/new_config.xml
   :language: xml

We now need to export the config path as an environment variable so it can be found by the MarFS binaries:

.. code-block:: bash

   export MARFS_CONFIG_PATH=/opt/campaign/install/etc/marfs-config.xml

Dependencies
============

Depending on your environment you may need different packages, but to install and make use of MarFS you will need the following tools. Fortunately, many dependencies can be acquired through a package manager.

* FUSE (and development packages)
* G++
* Git
* Libattr (and development packages)
* Libtool
* Libxml2 (and development packages)
* Make
* Nasm
* Open MPI (and development packages)
* Open SSL (and development packages)

..
   yum install libattr-devel curl-devel \
   curl openssl-devel openssl

Others can be obtained from source.

.. code-block:: bash

   git clone https://github.com/01org/isa-l.git
   git clone https://github.com/mar-file-system/erasureUtils.git
   git clone https://github.com/mar-file-system/marfs.git
   git clone https://github.com/pftool/pftool.git

A quick description of the tools acquired from source::

   ISA-L: Intel’s Intelligent Storage Acceleration Library
   ErasureUtils: The erasure coding layer used for Multi-Component storage
   MarFS: The core MarFS libraries
   PfTool: A tool for parallel data movement

You will need yasm 1.2.0 or later for ISA-L.
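Before building the packages from source, it can save time to confirm that the tools listed above are actually available on the node. The following check is only a convenience sketch (binary and package names vary by distribution), not an official install step.

.. code-block:: bash

   # Report any build tool from the list above that is not on the PATH.
   # Binary and package names vary by distribution.
   for tool in g++ git make libtool nasm mpicc xml2-config openssl; do
       command -v "$tool" >/dev/null || echo "missing: $tool"
   done
   # ISA-L requires yasm 1.2.0 or later
   command -v yasm >/dev/null && yasm --version | head -n 1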
You may also need to ensure that MPI is in your :code:`$PATH` environment variable and that OpenMPI's library directory is in your :code:`$LD_LIBRARY_PATH` environment variable.

ISA-L
-----

.. code-block:: bash

   cd isa-l
   ./autogen.sh
   ./configure --prefix=/opt/campaign/install
   make
   make install

ErasureUtils
------------

.. code-block:: bash

   cd ../erasureUtils
   autoreconf -i
   ./configure --prefix=/opt/campaign/install/ \
       CFLAGS="-I/opt/campaign/install/include" \
       LDFLAGS="-L/opt/campaign/install/lib"
   make
   make install

MarFS
-----

.. code-block:: bash

   cd ../marfs
   autoreconf -i
   ./configure --prefix=/opt/campaign/install/ \
       CFLAGS="-I/opt/campaign/install/include" \
       LDFLAGS="-L/opt/campaign/install/lib"
   make
   make install

PFTool
------

.. code-block:: bash

   cd ../pftool
   ./autogen
   ./configure --prefix=/opt/campaign/install/ \
       CFLAGS="-I/opt/campaign/install/include" \
       CXXFLAGS="-I/opt/campaign/install/include" \
       LDFLAGS="-L/opt/campaign/install/lib" \
       --enable-marfs
   make
   make install

Starting MarFS
==============

We now have all our utilities built and are almost ready to boot the filesystem. On a simple one-node system like this one, which does not leverage GPFS or ZFS, the process is as simple as launching the FUSE mount. On more complex systems you will have to initialize the metadata and data stores and any network mounts. These tasks can be automated using tools such as Ansible, but that is beyond the scope of this document.

Verifying the Configuration
---------------------------

Before we start the filesystem, we first need to verify that our configuration file is valid and that the metadata and data spaces match the structure specified in the config file. We do this by running

.. code-block:: bash

   marfs-verifyconf -a

which will print ":code:`Config Verified`" to the console if the system is ready to boot.

Launching FUSE
--------------

To launch FUSE (and our filesystem), we execute:

.. code-block:: bash

   marfs-fuse -o allow_other,use_ino,intr /campaign

If everything goes according to plan, marfs-fuse will start in the background. You should now be able to access our MarFS filesystem through the mount at :code:`/campaign`. If the :code:`marfs-fuse` call returned a nonzero value, something went wrong. In that case, reconfigure MarFS with the :code:`--enable-debugALL` flag.

See if it works!
----------------

.. code-block:: bash

   $ ls /campaign
   full-access-subspace
   $ echo test > /campaign/full-access-subspace/test
   $ cat /campaign/full-access-subspace/test
   test

Build and run PFTool
--------------------

In the previous section we mounted MarFS through FUSE. It is possible to use this FUSE mount for all accesses to MarFS, but performance is limited by going through a single host (and through FUSE), even though in multi-node systems writes to the underlying storage servers are performed in parallel. Therefore, for a large datacenter it may make more sense to use FUSE solely for metadata access (rename, delete, stat, ls, etc.). This can be achieved by changing the "perms" for every namespace in the config file. We recommend using pftool for the "heavy lifting" of transferring big datasets. PFTool provides fast, parallel MarFS data movement and is preferred over using the FUSE mount. The PFTool binary can be invoked directly, but the preferred method is through the provided helper scripts (:code:`pfls`, :code:`pfcm`, :code:`pfcp`), as in the sketch below.
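As a rough example of the helper scripts (the source path below is a placeholder, and the exact options accepted depend on how pftool was built and configured at your site), a recursive parallel copy into the namespace we exercised above might look like this:

.. code-block:: bash

   # Hypothetical parallel copy of a directory tree into MarFS via the
   # pfcp helper; /source/dataset is a placeholder path.
   pfcp -r /source/dataset /campaign/full-access-subspace/

Unlike a copy through the FUSE mount, a transfer like this is spread across the MPI ranks pftool is launched with, which is why we built it against Open MPI above.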