Containers

Containers are encapsulated environments that include an operating system, libraries, and software. For example, if you have a host machine running CentOS, you can run an isolated container with Ubuntu 18.04. At a high level, it's useful to think of a container as a program or binary.

To promote reproducibility and portability, it's considered best practice to define a container for each WDL task to run in; this ensures that running the same task on a different system uses the exact same software.

Docker images are the most common container format, but running Docker itself is not advisable on certain systems (such as shared HPC clusters), so Cromwell can be configured to support a number of alternatives.

Prerequisites

This tutorial page relies on completing the previous tutorials:

Goals

At the end of this tutorial, you'll be familiar with container technologies and know how to configure Cromwell to use them on their own or with job schedulers.

Specifying Containers in your Workflow

Containers are specified at the task level; in WDL, this is done by adding a docker attribute to the task's runtime section. For example, the following script will run in the ubuntu:latest container:

task hello_world {
    String name = "World"
    command {
        echo 'Hello, ${name}'
    }
    output {
        File out = stdout()
    }
    runtime {
        docker: 'ubuntu:latest'
    }
}

workflow hello {
    call hello_world
}

Docker

Docker is a popular container technology that is natively supported by Cromwell and WDL.

Docker on a Local Backend

On a single machine (a laptop or server), no extra configuration is needed for Docker-based tasks to run, provided Docker is installed.

You can install Docker for Linux, Mac, or Windows from Docker Hub.

Docker on Cloud

It is strongly advised that you provide a Docker image to tasks that will run on Cloud backends, and in fact most Cloud providers require it.

It might be possible to use an alternative container engine, but this is not recommended if Docker is supported.

Docker on HPC

Docker can allow its users to gain superuser privileges on the host, a risk known as the Docker daemon attack surface. In HPC and other multi-user environments, Docker recommends that "only trusted users should be allowed to control your Docker Daemon".

For this reason, this tutorial also explores other technologies that preserve the reproducibility and simplicity of running workflows that use Docker containers: Singularity and udocker.


Singularity

Singularity is a container technology designed in particular for use on HPC systems, providing a level of security that Docker cannot.

Installation

Before you can configure Cromwell on your HPC system, you will have to install Singularity, which is documented here. To gain access to the full set of features in Singularity, it is strongly recommended that Singularity be installed by root with the setuid bit enabled, as documented here. This likely means you will have to ask your sysadmin to install it for you. Because Singularity ideally needs setuid, your admins may have some qualms about granting this privilege; if that is the case, you might consider forwarding this letter to your admins.

If you are not able to get Singularity installed with these privileges, you can attempt a user install. In that case, you will have to alter your Cromwell configuration to work in "sandbox" mode, which is explained in this part of the documentation.

Configuring Cromwell for Singularity

Once Singularity is installed, you'll need to modify the config block inside backend.providers in your Cromwell configuration. In particular, this block contains a key called submit-docker, whose script is run whenever a job that uses a Docker image needs to be submitted. If the job does not specify a Docker image, the regular submit block is used instead.

As the configuration will require more knowledge about your execution environment, see the local and job scheduler sections below for example configurations.

Local environments

On local backends, you have to configure Cromwell to use a submit-docker script that starts Singularity instead of Docker. Singularity requires Docker image references to be prefixed with docker://.

Using containers isolates the filesystem that the script is allowed to interact with; for that reason we bind the current working directory into the container as ${docker_cwd}, and use the container-specific script path ${docker_script}.

An example submit script for Singularity is:

singularity exec --containall --bind ${cwd}:${docker_cwd} docker://${docker} ${job_shell} ${docker_script}

As the singularity exec command does not emit a job ID, we must include the run-in-background option within the provider section in addition to the submit-docker script. Because Cromwell watches for the existence of the rc file, the run-in-background option carries a caveat: the Singularity container must complete successfully, otherwise the workflow might hang indefinitely.

To ensure reproducibility and an isolated environment inside the container, the --containall flag is important. By default, Singularity mounts the user's home directory and imports the user's environment, along with some other conveniences for interactive use. Unfortunately, settings in the home directory and the user's environment may affect the outcome of the tools being run, which means different users may get different results. Therefore, to ensure reproducibility while using Singularity, the --containall flag should be used: it cleans the environment and does not mount the HOME directory.
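As a rough, Singularity-free illustration of why a cleaned environment matters, env -i runs a command with an empty environment, so values leaking from the calling user's shell no longer influence the result (GREETING here is a hypothetical variable standing in for any setting a tool might read):

```shell
# A value from the calling user's environment:
export GREETING="Hello from my environment"

# Without cleaning, the child process inherits it:
sh -c 'echo "${GREETING:-default greeting}"'

# With a cleaned environment (roughly what --containall ensures inside
# the container), the tool falls back to its own default:
env -i sh -c 'echo "${GREETING:-default greeting}"'
```

The first command prints the leaked value while the second prints the default, which is the kind of user-dependent divergence --containall prevents.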

Putting this together, we have an example base configuration for a local environment:

include required(classpath("application"))

backend {
    default = singularity
    providers {
        singularity {
            # The backend custom configuration.
            actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"

            config {
                run-in-background = true
                runtime-attributes = """
                  String? docker
                """
                submit-docker = """
                  singularity exec --containall --bind ${cwd}:${docker_cwd} docker://${docker} ${job_shell} ${docker_script}
                """
            }
        }
    }
}

Job schedulers

To run Singularity on a job scheduler, the singularity command needs to be passed to the scheduler as a wrapped command.

For example, with SLURM we can use the normal SLURM configuration as explained in the SLURM documentation, adding a submit-docker block that executes when a task is tagged with a docker container.

When constructing this block, there are a few things to keep in mind:

  • Make sure Singularity is loaded (and on PATH). If environment modules are installed and the cluster admin has made a Singularity module available, you can call module load Singularity. Alternatively, you can alter the PATH variable directly, or simply use /path/to/singularity in the config.
  • Treat worker nodes as if they do not have stable access to the internet or build access, so pull the container before the task is submitted to the cluster.
  • It's a good idea to use a Singularity cache so that identical images only have to be pulled once. Make sure you set the SINGULARITY_CACHEDIR environment variable to a location on the filesystem that is reachable by the worker nodes!
  • If we are using a cache, we need to ensure that submit processes started by Cromwell do not pull to the same cache at the same time, which may corrupt the cache. We can prevent this by implementing a file lock with flock and pulling the image before the job is submitted. The flock-and-pull command needs to be placed before the submit command so that all pull commands are executed on the same node; this is necessary for the file lock to work.
  • As mentioned above, the --containall flag is important for reproducibility.

submit-docker = """
    # Make sure the SINGULARITY_CACHEDIR variable is set. If not use a default
    # based on the users home.
    if [ -z "$SINGULARITY_CACHEDIR" ];
        then CACHE_DIR=$HOME/.singularity/cache
        else CACHE_DIR=$SINGULARITY_CACHEDIR
    fi
    # Make sure cache dir exists so lock file can be created by flock
    mkdir -p $CACHE_DIR  
    LOCK_FILE=$CACHE_DIR/singularity_pull_flock
    # Create an exclusive filelock with flock. --verbose is useful
    # for debugging, as is the echo command. These show up in `stdout.submit`.
    flock --verbose --exclusive --timeout 900 $LOCK_FILE \
    singularity exec --containall docker://${docker} \
    echo "successfully pulled ${docker}!"

    # Submit the script to SLURM
    sbatch \
      [...]
      --wrap "singularity exec --containall --bind ${cwd}:${docker_cwd} docker://${docker} ${job_shell} ${docker_script}"
  """
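You can try the flock pattern above in isolation; this sketch (using a throwaway temporary lock file rather than the Singularity cache) shows the same take-lock-then-run behaviour:

```shell
# Create a throwaway lock file, analogous to $LOCK_FILE in the config
LOCK_FILE=$(mktemp)

# The exclusive lock is held for the duration of the wrapped command,
# then released
flock --exclusive "$LOCK_FILE" echo "first pull would run here"

# A later submit process obtains the lock once it is free; --timeout
# bounds how long it will wait, like the 900-second timeout above
flock --exclusive --timeout 5 "$LOCK_FILE" echo "second pull would run here"

rm -f "$LOCK_FILE"
```

Because every submit process serialises on the same lock file inside the cache directory, only one of them can be pulling an image at any moment.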

Putting this all together, a complete SLURM + Singularity config might look like this:

backend {
  default = slurm

  providers {
    slurm {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"                                                                                     
      config {
        runtime-attributes = """
        Int runtime_minutes = 600
        Int cpus = 2
        Int requested_memory_mb_per_core = 8000
        String? docker
        """

        submit = """
            sbatch \
              --wait \
              -J ${job_name} \
              -D ${cwd} \
              -o ${out} \
              -e ${err} \
              -t ${runtime_minutes} \
              ${"-c " + cpus} \
              --mem-per-cpu=${requested_memory_mb_per_core} \
              --wrap "/bin/bash ${script}"
        """

        submit-docker = """
            # Make sure the SINGULARITY_CACHEDIR variable is set. If not use a default
            # based on the users home.
            if [ -z "$SINGULARITY_CACHEDIR" ];
                then CACHE_DIR=$HOME/.singularity/cache
                else CACHE_DIR=$SINGULARITY_CACHEDIR
            fi
            # Make sure cache dir exists so lock file can be created by flock
            mkdir -p $CACHE_DIR  
            LOCK_FILE=$CACHE_DIR/singularity_pull_flock
            # Create an exclusive filelock with flock. --verbose is useful
            # for debugging, as is the echo command. These show up in `stdout.submit`.
            flock --verbose --exclusive --timeout 900 $LOCK_FILE \
            singularity exec --containall docker://${docker} \
            echo "successfully pulled ${docker}!"

            # Submit the script to SLURM
            sbatch \
              --wait \
              -J ${job_name} \
              -D ${cwd} \
              -o ${cwd}/execution/stdout \
              -e ${cwd}/execution/stderr \
              -t ${runtime_minutes} \
              ${"-c " + cpus} \
              --mem-per-cpu=${requested_memory_mb_per_core} \
              --wrap "singularity exec --containall --bind ${cwd}:${docker_cwd} docker://${docker} ${job_shell} ${docker_script}"
        """

        kill = "scancel ${job_id}"
        check-alive = "squeue -j ${job_id}"
        job-id-regex = "Submitted batch job (\\d+).*"
      }
    }
  }
}
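Cromwell applies job-id-regex to the stdout of the submit command to capture the job ID. The effect of the pattern above can be sketched with grep (which uses [0-9]+ in place of Java's \\d+):

```shell
# sbatch prints a line like this on successful submission:
submit_output="Submitted batch job 12345"

# Extract the job ID the way job-id-regex's capture group does:
job_id=$(grep -oE 'Submitted batch job [0-9]+' <<< "$submit_output" | grep -oE '[0-9]+$')
echo "$job_id"
```

The captured ID is then interpolated as ${job_id} into the kill and check-alive commands.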

Without Setuid

In addition, if you or your sysadmins were not able to give setuid permissions to singularity, you'll have to modify the config further to ensure the use of sandbox images:

submit-docker = """
    [...]

    # Build the Docker image into a singularity image
    # We don't add the .sif file extension because sandbox images are directories, not files
    DOCKER_NAME=$(sed -e 's/[^A-Za-z0-9._-]/_/g' <<< ${docker})
    IMAGE=${cwd}/$DOCKER_NAME
    singularity build --sandbox $IMAGE docker://${docker}

    # Now submit the job
    # Note the use of --userns here
    sbatch \
      [...]
      --wrap "singularity exec --userns --bind ${cwd}:${docker_cwd} $IMAGE ${job_shell} ${docker_script}"
"""
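The sed substitution above replaces every character that is unsafe in a filename with an underscore, so a Docker image reference can be used directly as a directory name. Its effect on a digest-pinned reference looks like this:

```shell
# An image reference as Cromwell would interpolate into ${docker}
docker_image="ubuntu@sha256:7a47ccc3bbe8a451b500d2b53104868b46d60ee8f5b35a24b41a86077c650210"

# Replace anything outside A-Za-z0-9._- with an underscore
DOCKER_NAME=$(sed -e 's/[^A-Za-z0-9._-]/_/g' <<< "$docker_image")

# Prints the reference with @ and : replaced by _
echo "$DOCKER_NAME"
```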

Singularity Cache

By default, Singularity will cache the Docker images you pull in ~/.singularity, inside your home directory.

However, if you are sharing your Docker images with other users or have limited space in your user directory, you can redirect this caching location by exporting the SINGULARITY_CACHEDIR variable in your .bashrc or at the start of the submit-docker block.

export SINGULARITY_CACHEDIR=/path/to/shared/cache
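Outside a Cromwell config string, the same default-or-override logic can be written with shell parameter expansion. (Inside a submit-docker HOCON string, ${...} is Cromwell's interpolation syntax, which is presumably why the configs above spell the default out with an if/else instead.)

```shell
# Simulate a user who has not set the cache variable
unset SINGULARITY_CACHEDIR

# Fall back to the default cache location when the variable is unset or empty
CACHE_DIR=${SINGULARITY_CACHEDIR:-$HOME/.singularity/cache}
echo "$CACHE_DIR"
```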

For further information on the Singularity Cache, refer to the Singularity 2 caching documentation (this hasn't yet been updated for Singularity 3).


udocker

udocker is a tool designed to "execute simple docker containers in user space without requiring root privileges".

In essence, udocker provides a command line interface that mimics docker, and implements the commands using one of four different container backends:

  • PRoot
  • Fakechroot
  • runC
  • Singularity

Installation

udocker can be installed without any kind of root permissions. Refer to udocker's installation documentation here for more information.

Configuration

(As of 2019-02-18) udocker does not support looking up Docker containers by digest, so you'll have to ensure hash-lookup is disabled. Refer to this section for more detail.

To configure udocker in a local environment, set run-in-background in the provider's configuration and update submit-docker to use udocker:

run-in-background = true
submit-docker = """
    udocker run -v ${cwd}:${docker_cwd} ${docker} ${job_shell} ${docker_script}
"""

With a job queue like SLURM, you just need to wrap this command in an sbatch submission, as we did with Singularity:

submit-docker = """
    # Pull the image using the head node, in case our workers don't have network access
    udocker pull ${docker}

    sbatch \
      -J ${job_name} \
      -D ${cwd} \
      -o ${cwd}/execution/stdout \
      -e ${cwd}/execution/stderr \
      -t ${runtime_minutes} \
      ${"-c " + cpus} \
      --mem-per-cpu=${requested_memory_mb_per_core} \
      --wrap "udocker run -v ${cwd}:${docker_cwd} ${docker} ${job_shell} ${docker_script}"
"""

Caching

udocker caches images in a single directory, which defaults to ~/.udocker, meaning that caching is done on a per-user basis. However, like Singularity, if you want to share a cache with other users in your project, you can override the location of the udocker cache directory using either:

  • A config file, described here, containing a line such as topdir = "/path/to/cache"
  • The environment variable $UDOCKER_DIR


Configuration in Detail

The behaviour of Cromwell with containers can be modified using a few other options.

Enforcing container requirements

You can enforce the use of a container by not including the submit block in the provider section.

However, note that some interpolated variables (${stdout}, ${stderr}) differ between these two blocks.
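As a sketch, a provider config that defines submit-docker but omits submit gives tasks without a docker attribute no way to run, so they fail rather than execute unisolated (reusing the Singularity submit line from earlier; the provider name is illustrative):

```hocon
backend {
  providers {
    singularity {
      config {
        # No `submit` key: tasks without a docker attribute cannot be submitted
        submit-docker = """
          singularity exec --containall --bind ${cwd}:${docker_cwd} docker://${docker} ${job_shell} ${docker_script}
        """
      }
    }
  }
}
```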

Docker Digests

Each Docker repository has a number of tags that can be used to refer to the latest image of a particular type. For instance, when you run docker run image, it will actually run image:latest, the latest tag of that image.

However, by default Cromwell requests and runs images by their SHA-256 digest rather than by tag. This strategy is preferable because it ensures every execution of the task or workflow uses the exact same version of the image, but some engines, such as udocker, don't support this feature.

If you are using udocker or want to disable the use of hash-based image references, you can set the following config option:

docker.hash-lookup.enabled = false

NB: With hash-lookup disabled, call caching will not work for any container using a floating tag.

Docker Root

If you want to change the root directory inside your containers, where the task places input and output files, you can edit the following option:

backend {
  providers {
    LocalExample {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {

        # Root directory where Cromwell writes job results in the container. This value
        # can be used to specify where the execution folder is mounted in the container.
        # it is used for the construction of the docker_cwd string in the submit-docker
        # value above.
        dockerRoot = "/cromwell-executions"
      }
    }
  }
}

Docker Config Block

Further Docker configuration options that can be placed in your config file are as follows. For the latest list of parameters, refer to the example configuration file and specific backend provider examples.

docker {
  hash-lookup {
    # Set this to match your available quota against the Google Container Engine API
    #gcr-api-queries-per-100-seconds = 1000

    # Time in minutes before an entry expires from the docker hashes cache and needs to be fetched again
    #cache-entry-ttl = "20 minutes"

    # Maximum number of elements to be kept in the cache. If the limit is reached, old elements will be removed from the cache
    #cache-size = 200

    # How should docker hashes be looked up. Possible values are "local" and "remote"
    # "local": Lookup hashes on the local docker daemon using the cli
    # "remote": Lookup hashes on docker hub, gcr, gar, quay
    #method = "remote"
  }
}

Best Practices

Image Versions

When choosing the image version for your pipeline stages, it is highly recommended that you use a digest rather than a tag, for the sake of reproducibility. For example, in WDL, you could do this:

runtime {
    docker: 'ubuntu:latest'
}

But what you should do is this:

runtime {
    docker: 'ubuntu@sha256:7a47ccc3bbe8a451b500d2b53104868b46d60ee8f5b35a24b41a86077c650210'
}

You can find the sha256 digest of an image using docker images --digests.
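For illustration, docker images --digests prints one line per image with the digest in its own column; the line below is a made-up example in that layout (REPOSITORY, TAG, DIGEST, IMAGE ID, CREATED, SIZE), and combining repository and digest with @ yields the value for the runtime block:

```shell
# A made-up line in the layout printed by `docker images --digests`
line="ubuntu  latest  sha256:7a47ccc3bbe8a451b500d2b53104868b46d60ee8f5b35a24b41a86077c650210  abcdef012345  2 weeks ago  77MB"

# The digest is the third whitespace-separated column
digest=$(awk '{print $3}' <<< "$line")
echo "ubuntu@$digest"
```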

Notes

How does Cromwell know when a job or container has completed?

Cromwell uses the presence of the rc (return code) file to determine whether a task has succeeded or failed. This rc file is generated by the script in the execution directory, which Cromwell assembles at runtime. This matters because if the script executes successfully but the container doesn't terminate, Cromwell will continue executing the workflow while the container persists, hogging system resources.

Within the configurations above:

  • singularity: the exec mode does not run the container in the background

Cromwell: Run-in-background

By enabling Cromwell's run-in-background mode, you remove the need for the kill, check-alive and job-id-regex options, which disables some safety checks when running workflows:

  • If there is an error starting the container or executing the script, Cromwell may not recognise this error and hang. For example, this may occur if the container attempts to exceed its allocated resources (runs out of memory); the container daemon may terminate the container without completing the script.
  • If you abort the workflow (by attempting to close Cromwell or issuing an abort command), Cromwell does not have a reference to the container execution and will not be able to terminate the container.

Run-in-background mode is only necessary in local environments where there is no job manager to provide this control; if your container technology can emit an identifier to stdout, you can remove the run-in-background flag.

Next Steps

Congratulations for improving the reproducibility of your workflows! You might find the following cloud-based tutorials interesting to test your workflows (and ensure the same results) in a completely different environment: