GPU queue/partitions
| Partitions | GPUs | Nodes | CPUs | Memory | Time limit |
|---|---|---|---|---|---|
| qTRDGPU[H/M/L] | gpu:V100:8 | dgx001-dgx004 | 40 | 512GB | 5d 8h |
| qTRDGPU[H/M/L] | gpu:V100:4 | gn001-gn002 | 40 | 192GB | 5d 8h |
| qTRDGPU[H/M/L] | gpu:A100:8 | dgxa001 | 40 | 1TB | 5d 8h |
| qTRDGPU | gpu:RTX:1 | agn001-agn020 | 64 | 512GB | 5d 8h |
GPU partitions preemption rule
| Partitions | Priority | Limitations | Preemption |
|---|---|---|---|
| qTRDGPUH | high | Max 4 GPUs per user | N/A |
| qTRDGPUM | medium | Max 8 GPUs per user | suspend |
| qTRDGPUL | low | N/A | suspend |
| qTRDGPU | N/A | N/A | N/A |
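The priority tier only needs to change at submission time, since command-line options override the matching #SBATCH directives in a script. A minimal sketch, assuming the JobSubmit.sh job-array script shown below and a placeholder account code:

# Submit to the medium-priority partition instead of qTRDGPUH; these jobs
# may be suspended while higher-priority qTRDGPUH jobs occupy the GPUs.
sbatch -p qTRDGPUM -A <slurm_account_code> --array=1-8%2 JobSubmit.sh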
Special nodes
| Nodes | CPUs | Memory | GPUs | Purpose |
|---|---|---|---|---|
| arctrdgndev101.rs.gsu.edu | 4 | 62GB | TITAN X:2 | GPU development & testing |
| trendsagn019.rs.gsu.edu | 64 | 512GB | gpu:RTX:1 | GPU development & testing |
Allocating GPUs in SLURM
When allocating GPUs in SLURM, use the value in the GPUs column of the tables above as the --gres parameter. See examples below.
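For instance, the directives below are a sketch with illustrative GPU counts (use only one --gres line per job); each requests a GPU type by the identifier listed in the GPUs column:

#SBATCH --gres=gpu:V100:1   # one V100 (dgx001-dgx004 or gn001-gn002)
#SBATCH --gres=gpu:A100:2   # two A100s (dgxa001)
#SBATCH --gres=gpu:RTX:1    # one RTX (agn001-agn020, qTRDGPU partition)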
Job array with multiple tasks on each GPU
The JobSubmit.sh file:
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 10
#SBATCH --mem=50g
#SBATCH --gres=gpu:V100:1
#SBATCH -p qTRDGPUH
#SBATCH -t 4-00
#SBATCH -J <job name>
#SBATCH -e error%A-%a.err
#SBATCH -o out%A-%a.out
#SBATCH -A <slurm_account_code>
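# Allow this allocation to share node resources with other running jobs
# (exactly what can be shared depends on the cluster configuration)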
#SBATCH --oversubscribe
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email address>
sleep 10s
echo $HOSTNAME >&2
source <path to conda installation>/bin/activate <name/path of conda environment>
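# Run two copies of the worker script in the background on the same allocated GPU,
# each with the array task ID as an argument; wait blocks until both finish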
python script.py --arg1 $SLURM_ARRAY_TASK_ID &
python script.py --arg1 $SLURM_ARRAY_TASK_ID &
wait
sleep 10s
Submitting the job
sbatch --array=1-8%2 JobSubmit.sh
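Here --array=1-8%2 creates array tasks 1 through 8 and runs at most two of them at a time. A sketch with different, purely illustrative values:

# 20 array tasks, at most 4 running concurrently
sbatch --array=1-20%4 JobSubmit.sh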
Start interactive mode/terminal/bash on GPU worker/compute nodes
The following command can be run from the login node:
# Start interactive mode on a GPU worker node
$ srun -p qTRDGPUH -A <slurm_account_code> -v -n1 --pty --mem=10g --gres=gpu:V100:1 /bin/bash
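Once the shell starts, you are on the allocated GPU node. A quick sanity check and clean exit (assuming nvidia-smi is available on the GPU nodes, as is typical):

# Inside the interactive shell
$ nvidia-smi   # confirm the allocated V100 is visible
$ exit         # release the allocation when finished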