A basic job file looks like this:
#!/bin/bash
#SBATCH --partition=gpu_a100     # partition with the A100 GPUs
#SBATCH --gpus=1                 # one GPU
#SBATCH --cpus-per-task=18       # CPU cores for data loading etc.
#SBATCH --job-name=train
#SBATCH --ntasks=1               # a single task (one process)
#SBATCH --time=10:00:00          # wall-clock limit of 10 hours
source venv/bin/activate         # activate your Python environment
python train.py
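Assuming you save this script as, say, train.sh (the name is just an example), you submit it with sbatch, which prints the ID of the queued job:
sbatch train.sh
# Submitted batch job 123456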
Log everything
When you launch experiments, it is very important to keep track of what you launched (the launch script) and of its output (both standard output and standard error). I like to store everything in a folder named after the job ID (accessed with %A
in the SLURM directives, and as SLURM_JOB_ID in the bash script). Then, I keep track of all the job IDs in a spreadsheet.
#SBATCH --output=logs/%A/stdout.txt
#SBATCH --error=logs/%A/stderr.txt
mkdir -p logs/${SLURM_JOB_ID}           # make sure the folder exists
cp $0 logs/${SLURM_JOB_ID}/script.sh    # keep a copy of the launch script
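One caveat: SLURM does not create the logs/<job ID> folder for you, and it has to exist by the time the job starts writing its output. A small submit wrapper can create it right after submission (as long as the job waits in the queue for a moment) and, at the same time, append the job ID to a CSV file that plays the role of the spreadsheet. A sketch, where submit.sh and jobs.csv are made-up names:
#!/bin/bash
# submit.sh <jobfile> -- hypothetical wrapper around sbatch
job_id=$(sbatch --parsable "$1")                    # --parsable makes sbatch print only the job ID
mkdir -p "logs/${job_id}"                           # create the log folder before the job starts
echo "${job_id},$1,$(date -Iseconds)" >> jobs.csv   # log the job ID, script and submission time
echo "Submitted job ${job_id}"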
Parallelize everything
If your task is parallelizable, split it up. Launching many small jobs is better than launching one big job: they can run in parallel, and the scheduler will favour shorter jobs.
Array jobs
In some cases, it is useful to launch a bunch of similar jobs. For example, you might want to generate lots of images to compute the FID score of a diffusion model.
#SBATCH --output=logs/%A_%a/stdout.txt
#SBATCH --error=logs/%A_%a/stderr.txt
#SBATCH --array=1-10             # launch 10 tasks, with indices 1 to 10
mkdir -p logs/${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
cp $0 logs/${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}/script.sh
source venv/bin/activate
python inference.py \
--seed ${SLURM_ARRAY_TASK_ID} \
--output ./samples-${SLURM_ARRAY_TASK_ID}
Notice that in the SBATCH directives, %A and %a correspond to SLURM_ARRAY_JOB_ID and SLURM_ARRAY_TASK_ID in the bash script. Notice also that we are using SLURM_ARRAY_JOB_ID (which is the same for all the tasks of the array), not SLURM_JOB_ID.
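The task ID is also a convenient handle for small sweeps: put one configuration per line in a text file and let each task pick its own line. A sketch, where params.txt and its contents are made up:
#SBATCH --array=1-4
source venv/bin/activate
# params.txt (hypothetical) has one setting per line, e.g. "--lr 1e-4 --batch-size 64"
PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)   # pick the line matching the task index
python train.py ${PARAMS}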
Dependencies
Some jobs need to wait for another one to finish before they can start. In those cases, you can do something like this:
#SBATCH --dependency=afterok:123456
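Hard-coding the job ID only works if you already know it. If you submit both jobs from the same shell, you can capture the first ID with sbatch --parsable and pass it to the second submission (generate.sh and fid.sh are made-up names):
gen_id=$(sbatch --parsable generate.sh)         # e.g. the array job that generates the samples
sbatch --dependency=afterok:${gen_id} fid.sh    # starts only once generate.sh has finished successfully
Besides afterok, SLURM also supports afterany (start whatever the exit status), afternotok (start only if the first job failed) and singleton (wait until no other job with the same name and user is running).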
Get notified
It can be useful to get an email when your job fails (you might get an OOM exception, for example). Instead of ALL, you can use BEGIN, END, FAIL, TIME_LIMIT, TIME_LIMIT_80 (reached 80 percent of the time limit)…
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user1@mail.com,user2@mail.com
Resource monitoring
You can check how many credits you have left with accinfo, how many credits have been used by other users in your account with accuse, and the status of the nodes with sinfo --summarize.
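For individual jobs, standard SLURM accounting is also useful (assuming it is enabled on your cluster): sacct shows, among other things, how long a job ran and how much memory it actually used. For example:
sacct -j 123456 --format=JobID,JobName,Elapsed,State,MaxRSS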