Cromwell provides a generic way to configure a backend relying on most High Performance Computing (HPC) frameworks, and with access to a shared filesystem.
The two main features that are needed for this backend to be used are a way to submit a job to the compute cluster and to get its status through the command line. You can find example configurations for a variety of those backends here:
FileSystems
Shared FileSystem
HPC backends rely on being able to access and use a shared filesystem to store workflow results.
Cromwell is configured with a root execution directory which is set in the configuration file under backend.providers.<backend_name>.config.root
. This is called the cromwell_root
and it is set to ./cromwell-executions
by default. Relative paths are interpreted as relative to the current working directory of the Cromwell process.
When Cromwell runs a workflow, it first creates a directory <cromwell_root>/<workflow_uuid>
. This is called the workflow_root
and it is the root directory for all activity in this workflow.
Each call
has its own subdirectory located at <workflow_root>/call-<call_name>
. This is the <call_dir>
.
Any input files to a call need to be localized into the <call_dir>/inputs
directory. There are different localization strategies that Cromwell will try until one works:
hard-link
- This will create a hard link to the filesoft-link
- Create a symbolic link to the file. This strategy is not applicable for tasks which specify a Docker image and will be ignored.copy
- Make a copy the filecached-copy
An experimental feature. This copies files to a file cache in<workflow_root>/cached-inputs
and then hard links them in the<call_dir>/inputs
directory.
cached-copy
is intended for a shared filesystem that runs on multiple physical disks, where docker containers are used.
Hard-links don't work between different physical disks and soft-links don't work with docker. Copying uses a lot of
space if a multitude of tasks use the same input. cached-copy
copies the file only once to the physical disk containing
the <workflow_root>
and then uses hard links for every task that needs the input file. This can save a lot of space.
The default order in reference.conf
is hard-link
, soft-link
, copy
Shared filesystem localization is defined in the config
section of each backend. The default stanza for the Local and HPC backends looks like this:
filesystems {
local {
localization: [
"hard-link", "soft-link", "copy"
]
}
}
Additional FileSystems
HPC backends (as well as the Local backend) can be configured to be able to interact with other type of filesystems, where the input files can be located for example.
Currently the only other filesystem supported is Google Cloud Storage (GCS). See the Google section of the documentation for information on how to configure GCS in Cromwell.
Once you have a google authentication configured, you can simply add a gcs
stanza in your configuration file to enable GCS:
backend.providers.MyHPCBackend {
filesystems {
gcs {
# A reference to a potentially different auth for manipulating files via engine functions.
auth = "application-default"
}
}
}
Exit code timeout
If the cluster forcefully kills a job, it is unable to write its exit code anymore.
To address this the option exit-code-timeout-seconds
can be used.
Cromwell will check the aliveness of the job with the check-alive
script, every exit-code-timeout-seconds
(polling).
When a job is no longer alive and another exit-code-timeout-seconds
seconds have passed without an RC file being made, Cromwell can mark the job as failed.
If retries are enabled the job is submitted again.
This option will enable polling with the check-alive
option, this could cause high load on whatever system check-alive
calls.
When the option exit-code-timeout-seconds
is not set cromwell will only execute the check-alive
option after a restart of a cromwell server.
backend {
providers {
<backend name> {
config {
exit-code-timeout-seconds = 120
# other config options
}
}
}
}