The 'Checkpoint File' Optimization
Overview
Available in Cromwell 55 and higher.
This optimization aims to address the problem of a worker VM being preempted 9 hours and 55 minutes into a 10-hour job, leaving no option but to re-run the entire computation.
Description
Specifying a checkpointFile value in a task's runtime section designates a checkpoint file that is copied to cloud storage every 10 minutes.
This checkpoint file will then be restored automatically on subsequent attempts if the job is interrupted.
Note: Although the checkpoint file is deleted if the task succeeds, additional charges may accrue from storing the checkpoint file while the task is running, from leaving it in storage if the task is aborted or otherwise stopped externally, and from transferring it between the VM and the cloud. These costs should be minor, especially balanced against the performance and cost benefits of being able to restore from the checkpoint when preemptible VMs are interrupted.
Effect on Call Caching
The presence or absence of the checkpointFile attribute is not considered when determining whether to call cache.
Example
The following WDL demonstrates the use of the checkpointFile optimization. It has a command which is checkpoint-aware:
- It starts by attempting to restore state from the my_checkpoint file (or starts at 1 if the checkpoint is empty).
- Then it counts up to 100, printing out the current counter value and a date timestamp at each value.
To make the checkpointing work, the runtime section specifies checkpointFile: "my_checkpoint".
version 1.0

workflow count_wf {
  call count { input: count_to = 100 }
}

task count {
  input {
    Int count_to
  }

  command <<<
    # Note: Cromwell will stage the checkpoint file on recovery attempts.
    # This task checks the 'my_checkpoint' file for a counter value, or else
    # initializes the counter at '1':
    FROM_CKPT=$(cat my_checkpoint | tail -n1 | awk '{ print $1 }')
    FROM_CKPT=${FROM_CKPT:-1}

    echo '--' >> my_checkpoint
    for i in $(seq $FROM_CKPT ~{count_to})
    do
      echo $i $(date) >> my_checkpoint
      sleep 4
    done
  >>>

  runtime {
    docker: "ubuntu:latest"
    preemptible: 3
    # Note: This checkpointFile attribute is what signals to Cromwell to save
    # the designated checkpoint file:
    checkpointFile: "my_checkpoint"
  }

  output {
    # Note: This task also uses the checkpoint as its output. This is not
    # required for checkpointing to work:
    Array[String] out = read_lines("my_checkpoint")
  }
}
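The restore step in the command above reads the first field of the last line of my_checkpoint. If that last line happens to be the '--' separator, or the file does not exist yet, the value it reads is not a usable counter. The following is a minimal sketch of a more defensive restore step, not part of Cromwell itself. The function name resume_from is hypothetical and introduced only for illustration. It assumes the same file layout as the example: lines of the form "<counter> <date>" interleaved with '--' separators.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not a Cromwell feature): print the counter value
# to resume from, given the path to a checkpoint file.
# Falls back to 1 when the file is missing, empty, or has no numeric line.
resume_from() {
  local ckpt="$1"
  local last=""
  if [ -s "$ckpt" ]; then
    # Remember the first field of every line that starts with a whole
    # number, and print the last one seen. '--' separator lines are skipped.
    last=$(awk '$1 ~ /^[0-9]+$/ { v = $1 } END { print v }' "$ckpt")
  fi
  echo "${last:-1}"
}
```

Inside the task's command, this could stand in for the two FROM_CKPT assignments, for example FROM_CKPT=$(resume_from my_checkpoint). As in the original example, the loop would then resume at the last recorded value rather than the one after it.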
Backend Support
Cromwell supports the checkpointFile attribute on the following backends:
- The Google PAPIv2 (alpha1) backend
- The Google Life Sciences (beta) backend