Scheduler
This section describes how the scheduler is configured on the Falcon cluster and how to use it.
A scheduler, also known as a task manager or a batch-queuing system, acts as a resource manager, giving users access to the cluster’s resources (such as CPUs, GPUs, and memory) in a fair and efficient manner. There are many different resource managers available; on Falcon, we use SLURM.
Jobs
The way a program is run on an HPC cluster differs from how it is run on a typical workstation. When you log in to the cluster, you communicate only with the login node(s). However, since the compute nodes are where the cluster’s actual computing power is located, programs need to run on them, not on the login nodes.
Since users cannot log in to the compute nodes, you must ask the cluster’s scheduling system to run your program there. To do this, you submit a script with instructions on how to run your program on the compute nodes. The scheduler then runs this script on the compute nodes as a job.
Partitions
Partitions group nodes into logical sets of resources. They can be thought of as queues for jobs, each with its own set of resource limits. Users submit their jobs to the partition that best suits their job’s resource requirements.
Refer to the sub-sections on the left for detailed information on partitions, QoS, submitting jobs, and more.
More on Handling Jobs
- Partitions
- Resource Limits
- Submitting Jobs
- Monitoring Jobs
- Deleting Jobs
- Check-pointing
- Job Priority
Partitions
The Falcon cluster provides different CPU and GPU architectures, as well as different queues for different job priorities and limits.
We use the following control techniques for specifying queues and resources:
- Partitions
- Quality of Service (QoS)
At present there are 3 partitions; peregrine-cpu is the default:

| Partition | Job Type (CPU/GPU) | CPU Memory / GPU Memory |
|---|---|---|
| peregrine-cpu | single- and multi-core | 1 TB / NA |
| peregrine-gpu | GPU | 1 TB / 640 GB |
| kestrel-gpu | GPU | 360 GB / 288 GB |
Quality of Service (QoS)
We use QoS to classify jobs based on limits. Each partition has a default QoS, and each QoS has specific limits:
| Partition | QoS | Time Limit |
|---|---|---|
| peregrine-cpu | cpu_debug | 30 min |
| peregrine-cpu | cpu_short | 24 hours |
| peregrine-cpu | cpu_medium | 3 days |
| peregrine-cpu | cpu_long | 10 days |
| peregrine-gpu, kestrel-gpu | cpu_debug | 30 min |
| peregrine-gpu, kestrel-gpu | cpu_short | 24 hours |
| peregrine-gpu, kestrel-gpu | cpu_medium | 3 days |
| peregrine-gpu, kestrel-gpu | cpu_long | 10 days |
Resource Limits
CPU Partition Limits
| Partition | QoS | Time Limit | Cores Limit | Jobs (running) | Jobs (waiting) |
|---|---|---|---|---|---|
| peregrine-cpu | cpu_debug | 30 min | 20 | 1 | 2 |
| peregrine-cpu | cpu_short | 24 hours | 20 | 1 | 2 |
| peregrine-cpu | cpu_medium | 3 days | 20 | 1 | 2 |
| peregrine-cpu | cpu_long | 10 days | 20 | 1 | 2 |
GPU Partition Limits
At present, the resource limits on the GPU partitions are:
| Partition | GPUs | CPUs | Nodes | Jobs (running) | Jobs (waiting) |
|---|---|---|---|---|---|
| peregrine-gpu (A100) | 2 | 20 | 1 | 2 | 1 |
| kestrel-gpu (3090) | 3 | 32 | 1 | 2 | 2 |
Submitting Jobs
To run a job on the cluster, you need to allocate resources and then run your executable on them. This is usually done by writing a job script, in which you specify a partition, a QoS, and the resources your job needs (CPUs, memory, GPUs, etc.).
Refer to the Sample Job Scripts section for help on writing job scripts.
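As a rough sketch of what such a script looks like (the job name, resource values, and program name below are placeholders, not recommendations; the Sample Job Scripts section remains the authoritative reference):

```shell
#!/bin/bash
#SBATCH --job-name=my_job          # a name for the job (placeholder)
#SBATCH --partition=peregrine-cpu  # partition to submit to
#SBATCH --qos=cpu_short            # QoS (determines the time limit)
#SBATCH --ntasks=1                 # number of tasks
#SBATCH --cpus-per-task=4          # CPU cores per task
#SBATCH --mem=8G                   # memory for the whole job
#SBATCH --time=02:00:00            # walltime limit (hh:mm:ss)

# The commands to run on the compute node go below the directives.
./my_program
```

The `#SBATCH` lines are directives read by the scheduler; everything after them is an ordinary shell script executed on the allocated compute node.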
Submit your Job
After you have your job script written in a file (say job.sh), submit the job via the sbatch command:

```
sbatch job.sh
```

On successful submission, you will see output similar to:

```
Submitted batch job 1025
```

where 1025 is the job ID. The job ID is needed for troubleshooting, so make a note of it.
You can then monitor the status of your job, refer to the Monitoring Jobs section.
Monitoring Jobs
Job Progress
Once you submit your job, it goes through several states. The most common states are:
- PENDING
- RUNNING
- SUSPENDED
- COMPLETING, and
- COMPLETED
Below is a listing of all the states, with their short codes:
| Short Code | State |
|---|---|
| PD | Pending. Job is waiting for resource allocation |
| R | Running. Job has an allocation and is running |
| S | Suspended. Execution has been suspended and resources have been released for other jobs |
| CA | Cancelled. Job was explicitly cancelled by the user or the system administrator |
| CG | Completing. Job is in the process of completing. Some processes on some nodes may still be active |
| CD | Completed. Job has terminated all processes on all nodes with an exit code of zero |
| F | Failed. Job has terminated with non-zero exit code or other failure condition |
Slurm provides commands which you can use to monitor your jobs. You can also use the Live Cluster Status web page for a quick glance at all jobs. And you can specify your email address within your job script to be alerted at specific job events, see the Sample Job Scripts section for help configuring email alerts.
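For example, email alerts can be requested with SLURM’s mail directives inside the job script (the address below is a placeholder):

```shell
#SBATCH --mail-type=BEGIN,END,FAIL   # job events that trigger an email
#SBATCH --mail-user=you@example.com  # where to send the alerts
```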
Monitoring Commands
squeue
The command squeue provides an overview of jobs in the scheduling queue (state information, allocated resources, runtime, etc.).
Syntax
```
squeue [options]
```

Common options
| Option | Description |
|---|---|
| --user=<user[,user[,…]]> | Request jobs from a comma-separated list of users |
| --jobs=<job_id[,job_id[,…]]> | Request specific jobs to be displayed |
| --partition=<part[,part[,…]]> | Request jobs from a comma-separated list of partitions |
| --states=<state[,state[,…]]> | Display jobs in specific states. Comma-separated list or “all”. Default: “PD,R,CG” |
The default output format is as follows:
```
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
```
where
| Keyword | Description |
|---|---|
| JOBID | Job or step ID. For array jobs, the job ID format will be of the form <job_id>_<index> |
| PARTITION | Partition of the job/step |
| NAME | Name of the job/step |
| USER | Owner of the job/step |
| ST | State of the job/step. See above for a description of the most common states |
| TIME | Time used by the job/step. Format is days-hours:minutes:seconds (days,hours only printed as needed) |
| NODES | Number of nodes allocated to the job or the minimum amount of nodes required by a pending job |
| NODELIST (REASON) | For pending jobs: Reason why pending. For failed jobs: Reason why failed. For all other job states: List of allocated nodes. |
Examples
List all jobs of user foo:

```
squeue --user=foo
```

List jobs of user foo in partition bar that are in the running state:

```
squeue --user=foo --partition=bar --states=R
```

sfqueue
The sfqueue command provides the queue status, including the number of CPUs and GPUs in use. The output is similar to that displayed on the live status webpage.
Example
```
sfqueue
```

sgpu
To see the GPU memory usage of a job, use the sgpu command.
Example
```
sgpu <your-jobid-here>
```

scontrol
The scontrol command provides detailed information about jobs and job steps.
Syntax
```
scontrol [options] [command]
```

Examples

Show detailed information about the job with ID 1536:

```
scontrol show jobid 1536
```

Show even more detailed information about the job with ID 1396 (including the job script):

```
scontrol -dd show jobid 1396
```

sstat
The command sstat provides detailed usage information about running jobs.
Syntax
```
sstat [options] -j <job(.stepid)>
```

Examples

Show detailed information about the job with ID 1536:

```
sstat -j 1536
```

Show even more detailed information about the job with ID 1396:

```
sstat -v -j 1396
```

Deleting Jobs
You may use the scancel command to delete active jobs.
Syntax
The command scancel can be applied to a category of jobs, using the following options:
| Option | Description |
|---|---|
| -u, --user $USER | jobs of the current user |
| -A, --account | jobs under this charge account |
| -n, --jobname | jobs with this job name |
| -p, --partition | jobs in this partition |
| -t, --state | jobs in this state |
Examples
Delete a specific job with ID 1536:

```
scancel 1536
```

Delete all of your running jobs:

```
scancel --state=R
```

Delete all of your jobs:

```
scancel --user $USER
```

Check-pointing
Check-pointing means saving the state of your work periodically, so that it can be resumed later.
Need for Check-pointing
Imagine your code has been doing heavy computations for hours without saving its state when, suddenly, the computer crashes or shuts down. You would lose your hours of computational work and have to start all over again. But if you save your work every ten minutes, you will lose at most ten minutes of computation if something like that happens.
Check-pointing your code is always good practice, whether you are running it on an HPC system or not. On HPC systems, check-pointing becomes even more important, since there are several reasons a job could get aborted:
- Job exceeds the time limit
- Job exceeds the allocated memory
- Preemption by other jobs
- Node failure (rare, but possible)
In addition to the above job failure reasons, check-pointing is also good for debugging and monitoring your job progress.
How to check-point
You can make your code checkpoint-able by saving its state periodically, and looking for a state file when it starts. Here are some general steps you need to follow:
- Look for a state file (which contains a previously saved state).
- If such a state file is found, restore the state, otherwise start from the beginning.
- Periodically save the state (this could include the intermediate results or the variable values, etc.).
The way you implement these steps will vary based on your code.
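The steps above can be sketched in a few lines of shell. This is only an illustration (the state file name and the counting loop are stand-ins for real work):

```shell
#!/bin/bash
STATE_FILE=state.txt

# Step 1 and 2: look for a state file; restore the saved state if found,
# otherwise start from the beginning.
if [ -f "$STATE_FILE" ]; then
    i=$(cat "$STATE_FILE")
else
    i=0
fi

while [ "$i" -lt 100 ]; do
    # ... one unit of real work would go here ...
    i=$((i + 1))
    # Step 3: periodically save the state (here: after every iteration).
    echo "$i" > "$STATE_FILE"
done

echo "finished at i=$i"
```

If the script is killed partway through, rerunning it resumes from the last saved counter instead of starting over from zero.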
Job Priority
The job scheduler (SLURM) on Falcon assigns a priority to each job in order to determine which jobs to schedule and when. Priority determines a job’s position in the pending queue relative to other jobs, and hence the order in which pending jobs will run. The priority is an integer value calculated from a number of factors.
Priority Calculation
On the Falcon cluster, the job priority is calculated as a weighted sum of the following factors:
- Age: Length of time your job has been pending in the queue, eligible to be scheduled. The job priority increases with the job’s age.
- Size: The size of your job in terms of resources requested (CPUs, memory, GPUs). At present, this is not used to calculate priority.
- QoS: Priority based on the job’s requested run-time: debug, short, medium, or long. At present, this is not used to calculate priority.
- Fairshare: Based on your historical usage (explained in the next section). Job priority decreases as your resource usage increases.
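Schematically, since the size and QoS factors currently carry zero weight, the weighted sum reduces to something like the following (the weights are cluster-configured constants, shown here only for illustration):

```
priority = w_age * age_factor + w_fairshare * fairshare_factor
```

Both factors are normalized values, so the weights determine how strongly queue wait time and historical usage each influence a job’s position in the queue.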
Fair Share
We use the concept of “fair-share” to promote balanced resource usage among users. The scheduler deprioritizes users with excessive resource utilization, ensuring that users who haven’t used the cluster much get higher priority for their jobs, while heavy users cannot monopolize it.
Each user is assigned a fractional number between 0 and 1 based on their past usage. This number changes over time with your usage and with the total number of users on the system, and it enters the job priority calculation as one of the factors.
Warning
If you have been using a lot of resources, your fair-share number will keep decreasing, and the priority of your subsequent jobs will drop accordingly.