Getting started on HPC clusters
Prerequisites
This tutorial page relies on completing the previous tutorials.
Goals
At the end of this tutorial you'll have set up Cromwell to run against your HPC cluster. We'll use SGE as an example, but the same approach applies equally to LSF and other clusters.
Let's get started!
Telling Cromwell the type of backend
Start by defining your new backend configuration under the backend section. For now, we'll give your backend the name SGE, but you can use any name you would like.
backend {
providers {
SGE {
actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
config {
# to be filled in
}
}
}
}
The actor-factory above tells Cromwell to use the config section to determine how to submit jobs, abort jobs, and so on.
You'll likely also want to change the default backend to your new backend, by setting this configuration value:
backend.default = SGE
Specifying the runtime attributes for your HPC tasks
In the config section for your backend, you can define the different runtime attributes that your HPC tasks will support. Any runtime attribute configured here will be read from the WDL tasks, and then passed into the command line used to submit jobs to the HPC cluster.
All runtime attributes must be defined in a single multi-line block. The syntax of this block is the same as defining the inputs for a WDL task.
backend.providers.SGE.config {
runtime-attributes = """
Int cpu = 1
Float? memory_gb
String? sge_queue
String? sge_project
"""
}
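With these attributes declared, a WDL task can set any of them in its runtime section. Here is a hypothetical task using all four (the values are made up for illustration; note the task uses the standard memory attribute, which is matched to the memory_gb declaration as described below):

```wdl
task hello {
  command { echo hello }
  runtime {
    cpu: 2
    memory: "4 GB"
    sge_queue: "short"
    sge_project: "my_project"
  }
}
```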
In the example above, we have defined four WDL variables: cpu, memory_gb, sge_queue, and sge_project. Below you will find more information on cpu and memory, and on adding custom runtime attributes like sge_queue and sge_project.
cpu
When you declare a runtime attribute with the name cpu, it must be an Int. This integer will be validated to always be >= 1.
backend.providers.SGE.config {
runtime-attributes = """
Int cpu = 1
# ...
"""
}
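For example, a task that needs four CPUs (a made-up illustration):

```wdl
task hello {
  command { echo hello }
  runtime { cpu: 4 }
}
```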
memory
When running a workflow, the memory runtime attribute in the task specifies the amount and units of memory. For example, this job specifies that it only needs 512 megabytes of memory when running.
task hello {
command { echo hello }
runtime { memory: "512 MB" }
}
However, it's possible that when submitting jobs to your HPC cluster you want to specify the units in gigabytes.
To specify the memory units that the submit command should use, append the units to the name of the memory runtime attribute. For example:
backend.providers.SGE.config {
runtime-attributes = """
Float? memory_gb
# ...
"""
}
Now, no matter what unit of memory is used within the task, the value will be converted into gigabytes before it is passed to your submit command.
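As a sketch of that conversion (this is illustrative Python, not Cromwell's actual implementation; it assumes 1 GB = 1024 MB, which matches the half-a-gigabyte reading of "512 MB" later in this tutorial):

```python
# Illustration only: convert a task's declared memory to gigabytes,
# the unit implied by the memory_gb attribute name.
UNIT_TO_GB = {"KB": 1 / 1024 ** 2, "MB": 1 / 1024, "GB": 1.0, "TB": 1024.0}

def memory_to_gb(amount: float, unit: str) -> float:
    return amount * UNIT_TO_GB[unit]

print(memory_to_gb(512, "MB"))  # 0.5
```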
custom attributes
You can also declare other runtime attributes that a WDL task may use. For example, suppose you would like to allow the WDL to specify an sge queue in a task, like:
task hello {
command { echo hello }
runtime { sge_queue: "short" }
}
You declare the runtime attribute in your config by adding it to the runtime-attributes section:
backend.providers.SGE.config {
runtime-attributes = """
String? sge_queue
# ...
"""
}
In this case, we've stated that sge_queue is optional. This allows us to reuse WDLs from other pipeline authors who may not have set an sge_queue.
Alternatively, you can also set a default for the declared runtime attributes.
backend.providers.SGE.config {
runtime-attributes = """
String sge_queue = "short"
# ...
"""
}
Call Caching based on runtime attributes
The rules for call caching in HPC backends are:
* docker: Will be considered when call caching.
* Memory options: Will not be considered when call caching.
* CPU options: Will not be considered when call caching.
* Custom Attributes: Will not be considered when call caching (by default).
Although custom attributes will not be considered when call caching by default, you can override this in a runtime-attributes-for-caching section. For example:
backend.providers.SGE.config {
runtime-attributes = """
String sge_queue = "short"
String singularity_image
# ...
"""
runtime-attributes-for-caching {
sge_queue: false
singularity_image: true
}
}
- Note: Only custom attributes can be altered like this. Memory, CPU, and docker will always have their default cache-consideration behavior.
- Note: Unlike the memory, cpu, and docker attributes, which inherit validation and hash-lookup behavior, custom attributes are compared as simple primitives. For example, a docker attribute is cached by looking up docker hashes against a docker repository, but a custom singularity attribute would be a primitive string match.
How Cromwell should start an HPC job
When Cromwell runs a task, it will fill in a template for the job using the declared runtime attributes. This specific template will vary depending on the requirements of your HPC cluster. For example, say you normally submit jobs to SGE using:
qsub -terse -V -b y -N my_job_name \
-wd /path/to/working_directory \
-o /path/to/stdout.qsub \
-e /path/to/stderr.qsub \
-pe smp 1 -l mem_free=0.5g -q short \
/usr/bin/env bash myScript.bash
For this particular SGE cluster, the above sets the working directory, the stdout and stderr paths, the number of CPUs to 1, and the memory to half a gigabyte, and runs the job on the short queue.
Converting this into a template using our runtime attributes requires defining submit as one would a WDL task command:
backend.providers.SGE.config {
submit = """
qsub \
-terse \
-V \
-b y \
-N ${job_name} \
-wd ${cwd} \
-o ${out}.qsub \
-e ${err}.qsub \
-pe smp ${cpu} \
${"-l mem_free=" + memory_gb + "g"} \
${"-q " + sge_queue} \
${"-P " + sge_project} \
/usr/bin/env bash ${script}
"""
}
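The ${"-q " + sge_queue} expressions above are WDL optional interpolations: when the optional attribute is unset, the entire expression renders as empty and the flag is omitted. A rough Python sketch of that behavior (illustrative only, not Cromwell's code):

```python
def optional_flag(prefix: str, value):
    # Mimics WDL's ${"-q " + sge_queue}: an unset optional makes the
    # whole expression disappear from the rendered command line.
    return prefix + value if value is not None else ""

print(optional_flag("-q ", "short"))  # -q short
print(optional_flag("-q ", None))    # prints an empty line
```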
When the job finishes submitting, Cromwell needs to retrieve the job id so that it can abort the job later if necessary. The job id should be written to stdout on submission, where Cromwell will read it. Because the job id may be surrounded by other text, a custom regular expression is used to capture the actual job id. Because the submit above uses -terse, the job id will be the entire contents of stdout, and should consist only of digits:
backend.providers.SGE.config {
job-id-regex = "(\\d+)"
}
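To see what Cromwell does with that pattern, here is an illustrative Python sketch (the job id shown is made up): with -terse, the submit stdout is just the job id, and the regex's first capture group extracts it.

```python
import re

job_id_regex = r"(\d+)"
submit_stdout = "4195742\n"  # what `qsub -terse` might print (made-up id)

# Cromwell applies job-id-regex to the submit command's stdout and
# keeps the first capture group as the job id.
job_id = re.search(job_id_regex, submit_stdout).group(1)
print(job_id)  # 4195742
```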
How Cromwell should abort an HPC job
When aborting an HPC job, Cromwell runs the command configured under the key kill, passing in the WDL variable job_id:
backend.providers.SGE.config {
kill = "qdel ${job_id}"
}
How Cromwell checks if an HPC job is alive
Whenever Cromwell restarts, it checks whether a job has completed by searching for a return code in a file called rc. If this file isn't available, Cromwell runs an extra check to make sure the job is still alive. You can configure the command used for this check via:
backend.providers.SGE.config {
check-alive = "qstat -j ${job_id}"
}
Other backend settings
On some systems, administrators may limit the number of HPC jobs a user may run at a time. To respect this limit, set concurrent-job-limit:
backend.providers.SGE.config {
concurrent-job-limit = 100
}
Putting the config section all together
Combining the sections above gives a completely working HPC backend.
backend {
default = SGE
providers {
SGE {
actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
config {
concurrent-job-limit = 100
runtime-attributes = """
Int cpu = 1
Float? memory_gb
String? sge_queue
String? sge_project
"""
submit = """
qsub \
-terse \
-V \
-b y \
-N ${job_name} \
-wd ${cwd} \
-o ${out}.qsub \
-e ${err}.qsub \
-pe smp ${cpu} \
${"-l mem_free=" + memory_gb + "g"} \
${"-q " + sge_queue} \
${"-P " + sge_project} \
/usr/bin/env bash ${script}
"""
job-id-regex = "(\\d+)"
kill = "qdel ${job_id}"
check-alive = "qstat -j ${job_id}"
}
}
}
}
Running Cromwell with this in our configuration file will now submit jobs to SGE!
Next steps
You might find the following tutorials interesting to tackle next:
- Persisting Data Between Restarts
- Server Mode
- If you'd like to configure Cromwell to use a local scratch device, see the instructions in HPCSlurmWithLocalScratch.md