
SLURM Tips and Tricks

SLURM Documentation
Local UNH cluster SLURM documentation

Workflow

  1. Program your software to handle a single instance (one experimental run).
  2. Generate a file with one command per line; together, the commands make up the full experiment (see the sketch after this list).
  3. Create a SLURM array job to distribute and run those commands.
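
For step 2, something like the following could generate the commands file. This is a minimal sketch: `./solver`, its flags, and the `instances/` directory are hypothetical placeholders for your own program and inputs.

#!/bin/bash
# Sketch: write one command per line into commands.run.
# Replace ./solver, its arguments, and instances/ with your own program and inputs.
rm -f commands.run
for instance in instances/*; do
    for algorithm in astar wastar; do
        echo "./solver --alg $algorithm --instance $instance" >> commands.run
    done
done
echo "$(wc -l < commands.run) commands written to commands.run"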

Example

          
#!/bin/bash
#SBATCH --job-name=foo                      # Job name
#SBATCH --mail-type=NONE                    # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=your.email@cs.unh.edu   # Where to send mail
#SBATCH --ntasks=1                          # Run a single task
#SBATCH --array=1-100                       # Array range
#SBATCH --time=0:10:00                      # Time limit (here, 10 minutes per array task)
#SBATCH --mem=1G                            # Memory limit
#SBATCH --no-kill                           # Do not automatically terminate the job if an allocated node fails
#SBATCH -p compute                          # Partition to submit to

# Run the SLURM_ARRAY_TASK_ID-th line of the commands file.
eval "$(head -${SLURM_ARRAY_TASK_ID} /path/to/commands.run | tail -1)"

 

Note that the array range should be 1-N, where N is the number of commands (lines) in the commands file.

Run with `sbatch script.sh'.
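
Rather than editing the `--array` range in the script every time the number of commands changes, it can also be set at submission time; options given on the sbatch command line override the corresponding #SBATCH directives in the script. This assumes commands.run contains exactly one command per line with no trailing blank line:

sbatch --array=1-$(wc -l < commands.run) script.sh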

See /home/aifs2/group/lib/slurm/hostname.sh for another example.

 

`-t D-HH:MM` sets a time limit.

`--mem=63G` sets a memory limit.  Please make sure your job doesn't swap!
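
If accounting is enabled on the cluster, `sacct` can report how much memory and time a finished job actually used, which helps when choosing limits that avoid swapping. The job ID below is a placeholder:

sacct -j 1234567 --format=JobID,Elapsed,MaxRSS,State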

 

A Note on Concurrency, Accuracy of Timing, and Memory Limits

Setting a memory limit (`--mem`) of less than one half of a node's total RAM allows (though does not force) multiple jobs to run simultaneously on the same node, even when `--ntasks=1`. If the number of tasks (cells in the job array) exceeds the number of (available?) nodes, Slurm will allocate more than one task per node. Anecdotal evidence suggests that, in this case, our Slurm configuration by default allocates no more than one task per core.

 

Anecdotal evidence also suggests that, if a job without a `--mem` limit is already running, submitting a second job with a `--mem` limit of less than half a node's total RAM will *not* cause the second job's tasks to be scheduled on the unused cores of the machines on which the first job is already running.

If you want to avoid running multiple tasks on the same node at the same time, for example because you want to minimize timing inaccuracies due to cache and memory-bus interference between processes, then set `--mem` close to the node's total RAM (anecdotally, Slurm has rejected `--mem=63G` on our 64 GB machines but has accepted `--mem=62G`).
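
Following that note, the header of a timing-sensitive job on one of our 64 GB nodes might look like the sketch below; adjust the figure to your nodes' actual RAM and to whatever Slurm will accept.

#SBATCH --ntasks=1
#SBATCH --mem=62G    # large enough that no second task fits on a 64 GB node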

 

FIX: what if we want to allow more than one task per node, but limit to one task per core (i.e., prevent hyperthreading -- current machines have two threads per core).
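
One possibility, untested on our cluster and offered only as an assumption, is Slurm's `--hint=nomultithread` option, which asks that only one hardware thread per physical core be used; combined with one CPU per task, it may give the desired one-task-per-core behavior.

#SBATCH --cpus-per-task=1       # one CPU per array task
#SBATCH --hint=nomultithread    # untested here: use only one hardware thread per physical core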

 

Helpful commands

          
squeue # Shows pending and running jobs (includes runtime).  Also useful as `watch squeue'.
scontrol update ArrayTaskThrottle=n JobId=ddddddd # Limits a running array job to at most n tasks running at once.
scontrol update ArrayTaskThrottle=0 JobId=ddddddd # Removes the throttle so the array job can run as many tasks as resources allow.
sinfo # Shows partitions and their availability. 
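
To monitor only your own jobs, squeue can be filtered by user and combined with watch (the 10-second refresh interval is arbitrary):

watch -n 10 squeue -u $USER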

        
