Anatomy of a Job script

A job script commonly consists of two parts:

  1. Scheduler-specific options to manage resources and configure the job environment
  2. Job-specific shell commands (configuring software environment, specifying your binary/executable)

Here is a simple example of what a job script looks like:

#!/bin/bash

#SBATCH --job-name="Hello World"
#SBATCH --partition=peregrine-cpu
#SBATCH --qos=cpu_debug
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G
#SBATCH --time=00:01:00
#SBATCH --output=my-job.out
#SBATCH --error=my-job.err
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
#SBATCH --mail-user=<First.Last>@colostate.edu

module load python/anaconda
srun python3 hello_world.py

A short description of each command appears in the table below; more details follow after the table.

Command                                          Description
-----------------------------------------------  ------------------------------------------------
#!/bin/bash                                      Specifies the Unix shell to be used
#SBATCH --job-name="Hello World"                 A name for your job
#SBATCH --partition=peregrine-cpu                Partition to which the job should be submitted
#SBATCH --qos=cpu_debug                          QoS type
#SBATCH --nodes=1                                Node count
#SBATCH --ntasks=1                               Total number of tasks across all nodes
#SBATCH --cpus-per-task=1                        CPU cores per task (>1 for multi-threaded tasks)
#SBATCH --mem-per-cpu=2G                         Memory per CPU core
#SBATCH --time=00:01:00                          Total run time limit (HH:MM:SS)
#SBATCH --output=my-job.out                      Output log file
#SBATCH --error=my-job.err                       Error log file
#SBATCH --mail-type=begin                        Send email when the job begins
#SBATCH --mail-type=end                          Send email when the job ends
#SBATCH --mail-user=<First.Last>@colostate.edu   Email address to be notified
module load python/anaconda                      Loads the latest Anaconda module
srun python3 hello_world.py                      Runs the Slurm job

In the scheduler section (the lines that begin with #SBATCH), one specifies a series of SBATCH directives that set the resource requirements and other parameters of the job. The example above is a short-running CPU job: it is submitted to the peregrine-cpu partition and requests the cpu_debug QoS.

Warning

Specifying a QoS is mandatory.

The script above requests 1 CPU core (via --cpus-per-task=1), 2 GB of memory (--mem-per-cpu=2G), and a wall time of 1 minute (--time=00:01:00).

Then we specify where the output and error messages are written, using the --output and --error options. In this example, we write the output to a file named my-job.out and the error messages to my-job.err. If you do not specify the output files, SLURM writes both stdout and stderr to a file named slurm-XXXX.out, where XXXX is the job id.
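If you prefer one pattern that works for every job, Slurm also expands filename patterns in these options: %j becomes the job id and %x the job name. A fragment illustrating this (not part of the example script above):

```shell
#SBATCH --output=%x-%j.out   # e.g. "Hello World-1234.out"
#SBATCH --error=%x-%j.err
```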

You can specify an email address (--mail-user=) and the events for which you would like to be notified. In the above example, we have specified to alert us at the beginning and at the end of the job.

In the second section, we have the job specific commands. Any environment modules that are needed for utilizing software should be loaded at this stage. Just like on other CS machines, we use environment modules to use software available under /usr/local. More information on using environment modules can be found on the Environment Modules page. Here we are loading the Anaconda Python module, to make use of Python. Note that you may use other modules which provide Python as well.

And finally, the last line specifies the actual work to be done, which in this example is the execution of a Python script. The executable (here the Python interpreter) is usually launched using the srun Slurm command.

Samples

The sub-sections below provide example scripts for several types of jobs:

Serial Jobs

Serial jobs use only a single CPU-core.

First, let us write a simple Python program.

# This program prints Hello, world!

print('Hello, World!')

Save it as hello.py.

Now we will write a SLURM script to run our serial Python code as a job:

#!/bin/bash
#SBATCH --job-name="Hello World" 	# a name for your job
#SBATCH --partition=peregrine-cpu	# partition to which job should be submitted
#SBATCH --qos=cpu_debug			    # qos type
#SBATCH --nodes=1                	# node count
#SBATCH --ntasks=1               	# total number of tasks across all nodes
#SBATCH --cpus-per-task=1        	# cpu-cores per task
#SBATCH --mem-per-cpu=2G         	# memory per cpu-core
#SBATCH --time=00:01:00          	# total run time limit (HH:MM:SS)


module purge
module load python/anaconda
srun python3 hello.py

Save it as helloworld-python.sh and submit using the command

sbatch helloworld-python.sh

The result will be saved in a file named slurm-####.out and should look like

Hello, World!
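After submitting with sbatch, standard Slurm commands let you check on the job. A quick sketch (the job id 1234 is a placeholder; substitute the id that sbatch printed):

```shell
squeue -u $USER                              # list your pending and running jobs
sacct -j 1234 --format=JobID,State,Elapsed   # state and runtime of a finished job
```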

Multithreaded Jobs

Much modern software, such as MATLAB and NumPy, ships with libraries that can use multiple CPU cores via shared-memory parallel programming techniques like OpenMP or pthreads.

For such applications, one can use the --cpus-per-task parameter to tell Slurm to run the job using multiple CPU cores.

Note that the product of --ntasks and --cpus-per-task should not exceed the number of CPU cores allowed on a partition/QoS.
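As a quick sanity check, the total core count is simply the product of the two settings. A sketch with hypothetical values (2 tasks of 4 cores each):

```shell
# Hypothetical allocation: 2 tasks x 4 cores per task
SLURM_NTASKS=2
SLURM_CPUS_PER_TASK=4
total_cores=$((SLURM_NTASKS * SLURM_CPUS_PER_TASK))
echo "This job would occupy $total_cores CPU cores in total"   # prints 8
```

Compare this number against the per-partition/QoS limits before submitting.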

Warning

Using larger values of cpus-per-task will not magically speed up your job. Doing so wastes resources and might even cause your job to be assigned a lower priority. So, make sure your application uses multithreaded libraries or your code has been explicitly written to use multiple threads.

We provide examples for multithreaded:

  • MATLAB
  • OpenMP
  • Python

MATLAB

MATLAB jobs work well as serial (single-threaded) jobs. But if your application/code uses MATLAB’s Parallel Computing Toolbox (e.g., parfor) or MATLAB’s multithreaded BLAS libraries, then you can script your jobs to run on multiple CPU cores.

Warning

At present, multi-node MATLAB jobs are not possible, so your Slurm script should always use #SBATCH --nodes=1.

Here we take an example from the MathWorks website that uses multiple cores in a for loop.

for_loop.m

poolobj = parpool;
fprintf('Number of workers: %g\n', poolobj.NumWorkers);

tic
n = 200;
A = 500;
a = zeros(n);
parfor i = 1:n
    a(i) = max(abs(eig(rand(A))));
end
toc

We then use the following SLURM script to run the above MATLAB code via the scheduler.

#!/bin/bash
#
#SBATCH --job-name="Matlab" 	    # a name for your job
#SBATCH --partition=peregrine-cpu	# partition to which job should be submitted
#SBATCH --qos=cpu_debug				# qos type
#SBATCH --nodes=1                	# node count
#SBATCH --ntasks=1               	# total number of tasks across all nodes
#SBATCH --cpus-per-task=4        	# cpu-cores per task 
#SBATCH --mem-per-cpu=4G         	# memory per cpu-core

module purge
module load matlab

matlab -nodisplay -nosplash -r for_loop

Save the script as matlab.sh and submit it as

sbatch matlab.sh

The output would go to a file slurm-######.out, named after the job id.
It should look like:

                            < M A T L A B (R) >
                  Copyright 1984-2022 The MathWorks, Inc.
                  R2022a (9.12.0.1884302) 64-bit (glnxa64)
                             February 16, 2022

 
To get started, type doc.
For product information, visit www.mathworks.com.
 
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 4).
Number of workers: 4
Elapsed time is 9.388137 seconds.

Notice the time taken to finish the task in the last line. You can change the value of --cpus-per-task and see how this time changes!

OpenMP

In this example, we will use a simple OpenMP C++ program and run it via SLURM. Here is the C++ code we will be using:

#include <iostream>
#include <omp.h>

int main(int argc, char* argv[]) {
  using namespace std;
 
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    int numthrds = omp_get_num_threads();
    cout << "Hello from thread " << id << " of " << numthrds << endl;
  }
  return 0;
}

Save the code as omp.cpp.

Now compile it into a binary named omp using g++:

g++ -fopenmp -o omp omp.cpp

We will now run the binary omp using the following SLURM script

#!/bin/bash
#
#SBATCH --job-name="Hello World OMP" 	# a name for your job
#SBATCH --partition=peregrine-cpu	# partition to which job should be submitted
#SBATCH --qos=cpu_debug				# qos type
#SBATCH --nodes=1                	# node count
#SBATCH --ntasks=1               	# total number of tasks across all nodes
#SBATCH --cpus-per-task=8        	# cpu-cores per task
#SBATCH --mem-per-cpu=2G         	# memory per cpu-core
#SBATCH --time=00:01:00          	# total run time limit (HH:MM:SS)

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
module purge

./omp

Note that we set --cpus-per-task to 8, so the job uses 8 CPU cores.
Save the script as omp.sh and submit the job by running


sbatch omp.sh

The output should go to a file named slurm-####.out. Because all eight threads write to stdout concurrently, the lines may appear interleaved:

Hello from thread Hello from thread Hello from thread 6 of 8Hello from thread Hello from thread 
0 of 8
4 of 8
53 of 8
Hello from thread 1 of 8
Hello from thread 2 of 8
Hello from thread 7 of 8
 of 8
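The export line in the script ties the OpenMP thread count to Slurm's allocation via the SLURM_CPUS_PER_TASK variable. A more defensive variant (an assumption, not part of the original script) falls back to one thread when the variable is unset, so the same script also behaves sensibly outside a Slurm job:

```shell
unset SLURM_CPUS_PER_TASK              # simulate running outside a Slurm job
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "$OMP_NUM_THREADS"                # prints 1: the fallback kicked in
```

Inside a real job, Slurm sets SLURM_CPUS_PER_TASK and the fallback is never used.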

Python

In this example we’ll use the numpy library in Python to demonstrate a multi-threaded Python job.

Save the following code as numpy-demo.py

import os
num_threads = int(os.environ['SLURM_CPUS_PER_TASK'])

import mkl
mkl.set_num_threads(num_threads)

N = 2000
num_runs = 5

import numpy as np
np.random.seed(42)

from time import perf_counter
x = np.random.randn(N, N).astype(np.float64)
times = []
for _ in range(num_runs):
  t0 = perf_counter()
  u, s, vh = np.linalg.svd(x)
  elapsed_time = perf_counter() - t0
  times.append(elapsed_time)

print("execution time: ", min(times))
print("threads: ", num_threads)

Now save the following SLURM script as numpy-demo.sh

#!/bin/bash
#
#SBATCH --job-name="NumPY Demo" 	# a name for your job
#SBATCH --partition=peregrine-cpu	# partition to which job should be submitted
#SBATCH --qos=cpu_debug           # qos type
#SBATCH --nodes=1                	# node count
#SBATCH --ntasks=1               	# total number of tasks across all nodes
#SBATCH --cpus-per-task=8        	# cpu-cores per task
#SBATCH --mem-per-cpu=1G         	# memory per cpu-core
#SBATCH --time=00:01:00          	# total run time limit (HH:MM:SS)
#
module purge
module load python/anaconda

srun python numpy-demo.py

Then submit the job as

sbatch numpy-demo.sh

The output should be in a file named slurm-####.out and should look like:

execution time:  1.5503690890036523
threads:  8
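To see how the runtime scales with the thread count, you can override the in-script directives on the sbatch command line rather than editing the script each time. A sketch (each submission is a separate job):

```shell
# Submit the same script with 1, 2, 4, and 8 cores per task;
# command-line options override the #SBATCH directives in the script.
for c in 1 2 4 8; do
    sbatch --cpus-per-task=$c numpy-demo.sh
done
```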

MPI Jobs

MPI (Message Passing Interface) enables distributed-memory parallelism: an MPI-enabled code can use multiple CPU cores across multiple nodes. Here is the C++ code we will be using:

#include <iostream>
#include <mpi.h>

int main(int argc, char** argv) {
  using namespace std;
  
  MPI_Init(&argc, &argv);

  int world_size, world_rank;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  // Get the name of the processor
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int name_len;
  MPI_Get_processor_name(processor_name, &name_len);

  // Print off a hello world message
  cout << "Process " << world_rank << " of " << world_size
       << " says hello from " << processor_name << endl;
  
  // uncomment next line to make CPU-cores work (infinitely)
  // while (true) {};

  MPI_Finalize();
  return 0;
}

Save the code as mpi.cpp.

Now compile it into a binary named mpi using the MPI compiler:

module load compilers/mpi/openmpi-slurm
mpicxx -o mpi mpi.cpp

We will now run the binary mpi using the following SLURM script

#!/bin/bash
#
#SBATCH --job-name="MPI Demo" 		# name of your job
#SBATCH --partition=peregrine-cpu	# partition to which job should be submitted
#SBATCH --qos=cpu_short				# qos type
#SBATCH --nodes=2                	# node count
#SBATCH --ntasks=16               	# total number of tasks across all nodes
#SBATCH --cpus-per-task=1        	# cpu-cores per task
#SBATCH --mem-per-cpu=1G         	# memory per cpu-core
#SBATCH --time=00:01:00          	# total run time limit (HH:MM:SS)
#
module purge
module load compilers/mpi/openmpi-slurm 

srun ./mpi

Save the script as mpi.sh and submit the job as

sbatch mpi.sh

The result will be saved in a file named slurm-####.out and should look like

Process 15 of 16 says hello from peregrine1
Process 1 of 16 says hello from peregrine0
Process 2 of 16 says hello from peregrine0
Process 3 of 16 says hello from peregrine0
Process 4 of 16 says hello from peregrine0
Process 5 of 16 says hello from peregrine0
Process 6 of 16 says hello from peregrine0
Process 7 of 16 says hello from peregrine0
Process 8 of 16 says hello from peregrine0
Process 9 of 16 says hello from peregrine0
Process 10 of 16 says hello from peregrine0
Process 11 of 16 says hello from peregrine0
Process 12 of 16 says hello from peregrine0
Process 13 of 16 says hello from peregrine0
Process 0 of 16 says hello from peregrine0
Process 14 of 16 says hello from peregrine1

Hybrid Jobs

One can combine multithreading and multi-node parallelism using a hybrid OpenMP/MPI approach. Let us use the following C++ code, which uses both MPI and OpenMP:

#include <iostream>
#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv) {
  using namespace std;
  
  MPI_Init(&argc, &argv);

  int world_size, world_rank;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  // Get the name of the processor
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int name_len;
  MPI_Get_processor_name(processor_name, &name_len);

  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    int nthrds = omp_get_num_threads();
    cout << "Hello from thread " << id << " of " << nthrds
         << " on MPI process " << world_rank << " of " << world_size
         << " on node " << processor_name << endl;
  }

  MPI_Finalize();
  return 0;
}

Save it as hybrid.cpp and compile it via the command

module load compilers/mpi/openmpi-slurm
mpicxx -fopenmp -o hybrid hybrid.cpp

Below is a SLURM job script for our code:

#!/bin/bash
#
#SBATCH --job-name="Hybrid Demo" 	# a name for your job
#SBATCH --partition=peregrine-cpu	# partition to which job should be submitted
#SBATCH --qos=cpu_debug				# qos type
#SBATCH --nodes=2                	# node count
#SBATCH --ntasks-per-node=2      	# total number of tasks per node
#SBATCH --cpus-per-task=4        	# cpu-cores per task
#SBATCH --mem-per-cpu=1G         	# memory per cpu-core
#SBATCH --time=00:01:00          	# total run time limit (HH:MM:SS)

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
module purge
module load compilers/mpi/openmpi-slurm 

srun ./hybrid

Notice how we ask for two nodes, 2 tasks per node, and 4 CPU cores per task, so our code now runs across two nodes. Save the script as hybrid.sh and submit it as

sbatch hybrid.sh

The result will be saved in a file named slurm-####.out. Because all threads on all MPI ranks write to stdout concurrently, the lines may appear interleaved:

Hello from thread Hello from thread 0 of 4 on MPI process 1 of 4 on node peregrine03 of 4 on MPI process 1 of 4 on node peregrine0
Hello from thread 2 of 4 on MPI process 1 of 4 on node peregrine0

Hello from thread Hello from thread 3 of 4 on MPI process 3 of 4 on node peregrine1Hello from thread 1 of 4 on MPI process 3 of 4 on node 2peregrine1 of 4 on MPI process 3 of 4 on node peregrine1


Hello from thread 1 of 4 on MPI process 0 of 4 on node peregrine0
Hello from thread 1 of 4 on MPI process 1 of 4 on node peregrine0
Hello from thread 2 of 4 on MPI process 0 of Hello from thread 4 on node peregrine0
Hello from thread 0 of 4 on MPI process 0 of 4 on node peregrine0
3 of 4 on MPI process 0 of 4 on node peregrine0
Hello from thread Hello from thread 3 of 4 on MPI process 2 of 4 on node peregrine11 of 4 on MPI process 2 of 4 on node peregrine1

Hello from thread 2 of 4 on MPI process 2 of 4 on node peregrine1
Hello from thread 0 of 4 on MPI process 3 of 4 on node peregrine1
Hello from thread 0 of 4 on MPI process 2 of 4 on node peregrine1
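The resource arithmetic behind this hybrid script can be sketched as follows (the values mirror the #SBATCH directives above):

```shell
# 2 nodes x 2 MPI tasks per node x 4 OpenMP threads per task
nodes=2
tasks_per_node=2
cpus_per_task=4
total=$((nodes * tasks_per_node * cpus_per_task))
echo "MPI ranks: $((nodes * tasks_per_node)), total cores: $total"   # 4 ranks, 16 cores
```

This matches the output above: four MPI processes, each reporting four OpenMP threads.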

GPU Jobs

GPUs are available on the peregrine and kestrel nodes through the peregrine-gpu and kestrel-gpu partitions. There are three types of GPUs on these nodes:

peregrine-gpu

  • Nvidia A100 80GB – 6 available
  • Nvidia A100 40GB – 4 available

kestrel-gpu

  • Nvidia GeForce RTX 3090 24GB – 12 available

How to use GPUs

To use GPUs in your SLURM job:

  • Add an additional SBATCH statement, #SBATCH --gres=gpu:<type>:<number_of_gpus>, to your job script.
    • For A100 80GB, use
      #SBATCH --gres=gpu:a100-sxm4-80gb:1
    • For A100 40GB, use
      #SBATCH --gres=gpu:nvidia_a100_3g.40gb:1
    • For RTX 3090 24GB, use
      #SBATCH --gres=gpu:3090:1
  • Submit to the peregrine-gpu partition for A100s or kestrel-gpu partition for the 3090s.
  • Note that the number at the end of the --gres statement is the quantity of GPUs. In the statements above, we have requested 1 GPU.
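Inside a running GPU job, you can verify that the allocation worked. A sketch (these commands only make sense on a GPU node; Slurm typically exports CUDA_VISIBLE_DEVICES for jobs that requested a GPU):

```shell
echo "Allocated GPU(s): $CUDA_VISIBLE_DEVICES"   # device indices assigned by Slurm
nvidia-smi                                       # show the GPU(s) visible to the job
```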

Warning

Adding the --gres option to a Slurm script for a CPU-only code WILL NOT magically speed up your code.
Only software/code that has been explicitly written to run on GPUs can benefit from GPUs.
Requesting a GPU for a CPU-only code wastes resources and may also lower the priority of your future jobs.

Info

The GPU type must be specified in the SLURM script.
It is not possible to mix and match GPU types in a single job.

Warning

Do not ask for multiple GPUs if your code is only written to use a single GPU.
Doing so wastes resources and may also lower the priority of your future jobs.

CPU-GPU ratio

On the peregrine nodes, the ratio of CPUs to GPUs is 6:1, so your job can request up to 6 CPU cores per GPU.
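For example, a job requesting one A100 80GB can pair it with up to six cores. A fragment, reusing the gres string from the section above:

```shell
#SBATCH --gres=gpu:a100-sxm4-80gb:1   # one A100 80GB
#SBATCH --cpus-per-task=6             # 6:1 CPU-to-GPU ratio on peregrine
```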

Monitor GPU Usage

After you submit your GPU job via the sbatch command, you can monitor the usage, including memory usage, of the GPU(s) assigned to your job. Use the following command:

sgpu <your-jobid-here>

The above command runs within your job’s resource allocation. Although the resources required for this task are low and should not impact your job’s performance, it is recommended to use it on an as-needed basis, and not in a script that runs it in a loop.

CUDA

Here is the CUDA code we will be using. It defines two vectors and adds them.


#include "stdio.h"
#include <cuda.h>

#define N 1000

__global__ void add(int *a, int *b, int *c)
{
    int tID = blockIdx.x;
    if (tID < N)
    {
        c[tID] = a[tID] + b[tID];
    }
}

int main()
{
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    cudaMalloc((void **) &dev_a, N*sizeof(int));
    cudaMalloc((void **) &dev_b, N*sizeof(int));
    cudaMalloc((void **) &dev_c, N*sizeof(int));

    // Fill arrays
    for (int i = 0; i < N; i++)
    {
        a[i] = i;
        b[i] = 1;
    }

    cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N*sizeof(int), cudaMemcpyHostToDevice);

    // Launch one block per element
    add<<<N,1>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++)
    {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    return 0;
}

Save the code as vector-add.cu.

We will now compile and run the code using the following SLURM script cuda.sh:

#!/bin/bash

#SBATCH --job-name=cuda-add       		 # job name
#SBATCH --partition=peregrine-gpu 		 # partition to which job should be submitted
#SBATCH --qos=gpu_debug			  		 # qos type
#SBATCH --nodes=1                 		 # node count
#SBATCH --ntasks=1                		 # total number of tasks across all nodes
#SBATCH --cpus-per-task=1         		 # cpu-cores per task
#SBATCH --mem=4G                  		 # total memory per node
#SBATCH --gres=gpu:nvidia_a100_3g.40gb:1 # Request 1 GPU (A100 40GB)
#SBATCH --time=00:05:00 				 #  wall time

module load cuda
nvcc vector-add.cu -o vector-add
srun ./vector-add

Submit the job as

sbatch cuda.sh

The result will be saved in a file named slurm-####.out and should look like

0 + 1 = 1
1 + 1 = 2
2 + 1 = 3
3 + 1 = 4
4 + 1 = 5
5 + 1 = 6
6 + 1 = 7
---------
---------
996 + 1 = 997
997 + 1 = 998
998 + 1 = 999
999 + 1 = 1000

MATLAB

MATLAB has some built-in routines that can take advantage of a GPU. The sample code below performs a matrix decomposition using MATLAB GPU functions.

gpu = gpuDevice();
fprintf('Using a %s GPU.\n', gpu.Name);
disp(gpuDevice);

X = gpuArray([1 0 2; -1 5 0; 0 3 -9]);
whos X;
[U,S,V] = svd(X)
fprintf('trace(S): %f\n', trace(S))
quit;

Save the code above as svd.m.

We will now use the following SLURM script matlab-gpu.sh to run the code:

#!/bin/bash

#SBATCH --job-name="Matlab-GPU-Demo"	 # job name
#SBATCH --partition=peregrine-gpu 		 # partition to which job should be submitted
#SBATCH --qos=gpu_debug			  		 # qos type
#SBATCH --nodes=1                 		 # node count
#SBATCH --ntasks=1                		 # total number of tasks across all nodes
#SBATCH --cpus-per-task=1         		 # cpu-cores per task
#SBATCH --mem=4G                  		 # total memory per node
#SBATCH --gres=gpu:nvidia_a100_3g.40gb:1 # Request 1 GPU (A100 40GB)
#SBATCH --time=00:05:00 				 #  wall time

module purge
module load matlab

matlab -singleCompThread -nodisplay -nosplash -r svd

Submit the job as

sbatch matlab-gpu.sh

The result will be saved in a file named slurm-####.out and should look like

                            < M A T L A B (R) >
                  Copyright 1984-2022 The MathWorks, Inc.
                  R2022a (9.12.0.1884302) 64-bit (glnxa64)
                             February 16, 2022

 
To get started, type doc.
For product information, visit www.mathworks.com.
 
Using a NVIDIA A100-SXM4-80GB MIG 3g.40gb GPU.
  CUDADevice with properties:

                      Name: 'NVIDIA A100-SXM4-80GB MIG 3g.40gb'
                     Index: 1
         ComputeCapability: '8.0'
            SupportsDouble: 1
             DriverVersion: 11.7000
            ToolkitVersion: 11.2000

-----------------------------------------------------------
-------------------TRUNCATED-------------------------------
-----------------------------------------------------------

V =

    0.0403    0.1761   -0.9835
   -0.3974   -0.9003   -0.1775
    0.9168   -0.3980   -0.0337

trace(S): 15.718392

PyTorch

In this example, we’ll use the PyTorch MNIST example. Get the source code from https://github.com/pytorch/examples/tree/main/mnist and save the Python code as mnist.py

We will now use the following SLURM script pytorch-gpu.sh to run the code:

#!/bin/bash

#SBATCH --job-name="PyTorch-GPU-Demo"	 # job name
#SBATCH --partition=peregrine-gpu 		 # partition to which job should be submitted
#SBATCH --qos=gpu_debug			  		 # qos type
#SBATCH --nodes=1                 		 # node count
#SBATCH --ntasks=1                		 # total number of tasks across all nodes
#SBATCH --cpus-per-task=1         		 # cpu-cores per task
#SBATCH --mem=4G                  		 # total memory per node
#SBATCH --gres=gpu:nvidia_a100_3g.40gb:1 # Request 1 GPU (A100 40GB)
#SBATCH --time=00:05:00 				 #  wall time

module purge
module load python/anaconda

python mnist.py --epochs=3

Submit the job as

sbatch pytorch-gpu.sh

The result will be saved in a file named slurm-####.out and should look like

Train Epoch: 1 [0/60000 (0%)]	Loss: 2.299824
Train Epoch: 1 [640/60000 (1%)]	Loss: 1.733667
Train Epoch: 1 [1280/60000 (2%)]	Loss: 0.933156
Train Epoch: 1 [1920/60000 (3%)]	Loss: 0.623502
Train Epoch: 1 [2560/60000 (4%)]	Loss: 0.357575
Train Epoch: 1 [3200/60000 (5%)]	Loss: 0.315663


-----------------------------------------------------------
-------------------TRUNCATED-------------------------------
-----------------------------------------------------------

Train Epoch: 3 [55680/60000 (93%)]	Loss: 0.009016
Train Epoch: 3 [56320/60000 (94%)]	Loss: 0.241464
Train Epoch: 3 [56960/60000 (95%)]	Loss: 0.004863
Train Epoch: 3 [57600/60000 (96%)]	Loss: 0.004337
Train Epoch: 3 [58240/60000 (97%)]	Loss: 0.109445
Train Epoch: 3 [58880/60000 (98%)]	Loss: 0.038164
Train Epoch: 3 [59520/60000 (99%)]	Loss: 0.014446

Test set: Average loss: 0.0333, Accuracy: 9887/10000 (99%)

TensorFlow

In this example, we’ll use a small example along the lines of https://www.tensorflow.org/tutorials/keras/classification. Save the following Python code as mnist.py

import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=10)
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)

We will now use the following SLURM script tf-gpu.sh to run the code:

#!/bin/bash

#SBATCH --job-name="TensorFlow-GPU-Demo"	 # job name
#SBATCH --partition=peregrine-gpu 		 # partition to which job should be submitted
#SBATCH --qos=gpu_debug			  		 # qos type
#SBATCH --nodes=1                 		 # node count
#SBATCH --ntasks=1                		 # total number of tasks across all nodes
#SBATCH --cpus-per-task=1         		 # cpu-cores per task
#SBATCH --mem=4G                  		 # total memory per node
#SBATCH --gres=gpu:nvidia_a100_3g.40gb:1 # Request 1 GPU (A100 40GB)
#SBATCH --time=00:05:00 				 #  wall time

module purge
module load python/anaconda

python mnist.py

Submit the job as

sbatch tf-gpu.sh

The result will be saved in a file named slurm-####.out and should look like

Epoch 1/10
1875/1875 [==============================] - 2s 958us/step - loss: 0.2587 - accuracy: 0.9252
Epoch 2/10
1875/1875 [==============================] - 2s 955us/step - loss: 0.1135 - accuracy: 0.9660
Epoch 3/10
1875/1875 [==============================] - 2s 956us/step - loss: 0.0772 - accuracy: 0.9764

-----------------------------------------------------------
-------------------TRUNCATED-------------------------------
-----------------------------------------------------------

1875/1875 [==============================] - 2s 956us/step - loss: 0.0285 - accuracy: 0.9910
Epoch 8/10
1875/1875 [==============================] - 2s 956us/step - loss: 0.0245 - accuracy: 0.9920
Epoch 9/10
1875/1875 [==============================] - 2s 955us/step - loss: 0.0184 - accuracy: 0.9943
Epoch 10/10
1875/1875 [==============================] - 2s 956us/step - loss: 0.0172 - accuracy: 0.9942
313/313 - 0s - loss: 0.0815 - accuracy: 0.9771 - 330ms/epoch - 1ms/step

Test accuracy: 0.9771000146865845

Interactive Jobs

Certain applications require direct user input via a terminal. For these, one can use interactive jobs in SLURM, which make it possible to run applications and commands in a shell on a compute node. SLURM offers two ways to run interactive jobs: the srun command and the salloc command.

Warning

Interactive jobs are intended for very short-running and very specific applications/commands.
Do not use interactive jobs for long-running jobs or for regular applications.
Please use the sbatch command to submit such jobs.

SRUN

Using the srun command, interactive jobs can be run within a scheduled shell.

Here is an example:

srun --nodes=1 --ntasks=1 --mem=4G --time=00:05:00 --pty /bin/bash

Notice how the prompt changes indicating that a new shell has been spawned on one of the compute nodes:

peregrine0:~$ 

You can now run your interactive application/command and after you are done, just type exit at the command prompt to quit the shell and delete the SLURM job.
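The same resource options used in batch scripts work here too. For example (an assumed variant combining the GPU partition, QoS, and gres string described earlier), an interactive shell with one A100 40GB GPU:

```shell
srun --partition=peregrine-gpu --qos=gpu_debug \
     --gres=gpu:nvidia_a100_3g.40gb:1 \
     --mem=4G --time=00:05:00 --pty /bin/bash
```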

SALLOC

For situations where you would like to come back to your interactive session (after disconnecting from it), you can use SLURM’s salloc command to allocate resources up front and keep the job running. The process looks like this:

  • Use salloc to create the resource allocation up front
  • Use srun to connect to it, as many times as needed during the job time frame.

Run the command below to allocate resources:

salloc --nodes=1 --ntasks=1 --mem=4G --time=00:20:00

Here we are allocating 4GB of memory and one CPU on a node for 20 minutes. The command will display a job id number. Keep a note of it, as you will need that to connect to the interactive shell.

salloc: Granted job allocation 235
salloc: Waiting for resource configuration
salloc: Nodes peregrine0 are ready for job

Notice that this time the prompt did not change: since salloc only allocates resources to your job, it does not start a shell. To connect to an interactive shell in your job, use the srun command and specify the job id (which you noted in the salloc step):

srun --jobid=235 --pty /bin/bash

You will now land in an interactive shell on a compute node.

peregrine0:~$

Now you can exit from the shell and connect again later, using the srun command with the same job id number. To finally delete your job, use the scancel command.

scancel 235
salloc: Job allocation 235 has been revoked.
Hangup
