This section describes how the scheduler is configured on the Falcon cluster and how to use it.

Scheduler

A scheduler, also known as a task manager or a batch-queuing system, acts as a resource manager that provides users access to the cluster's resources (such as CPUs, GPUs, and memory) in a fair and efficient manner. There are many different resource managers available; on Falcon we use SLURM.

Jobs

The way a program is run on an HPC cluster differs from how it is run on a typical workstation. When we log into the cluster, we communicate only with the login node(s). However, since the compute nodes are where the cluster's actual computing power is located, programs need to run on them and not on the login nodes.

Since users cannot log in to the compute nodes, you must ask the cluster's scheduling system to run your program there. To do this, you submit a script with instructions on how to run your program. The scheduler then runs this script on the compute nodes as a job.

Partitions

Partitions group nodes into logical sets of resources. They can be thought of as queues for jobs, each with its own set of resource limits. Users submit their jobs to the specific partition that best suits their job's resource requirements.
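As a quick way to see which partitions exist and what state their nodes are in, you can use SLURM's standard sinfo command (shown here as a generic SLURM example; the output on Falcon will reflect the partitions described below):

```shell
# One-line summary per partition: availability, time limit, node counts.
sinfo -s
```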

Refer to the sub-sections listed below for detailed information on partitions, QoS, submitting jobs, and more.

More on Handling Jobs

  1. Partitions
  2. Resource Limits
  3. Submitting Jobs
  4. Monitoring Jobs
  5. Deleting Jobs
  6. Check-pointing
  7. Job Priority

Partitions

The Falcon cluster provides different CPU and GPU architectures, as well as different queues for different job priorities and limits.

We use the following control techniques for specifying queues and resources:

  1. Partitions
  2. Quality of Service (QoS)

At present there are three partitions; peregrine-cpu is the default:

Partition      Job Type               CPU Memory / GPU Memory
peregrine-cpu  single and multi-core  1 TB / NA
peregrine-gpu  GPU                    1 TB / 640 GB
kestrel-gpu    GPU                    360 GB / 288 GB

Each partition enforces per-QoS limits:

Partition      QoS         Time Limit  Cores Limit  Jobs (running)  Jobs (waiting)
peregrine-cpu  cpu_debug   30 min      20           1               2
peregrine-cpu  cpu_short   24 hours    20           1               2
peregrine-cpu  cpu_medium  3 days      20           1               2
peregrine-cpu  cpu_long    10 days     20           1               2

Quality of Service (QoS)

We use QoS to classify jobs based on limits. Each partition has a default QoS, and each QoS has specific limits:

Partition                QoS         Time Limit
peregrine-cpu            cpu_debug   30 min
peregrine-cpu            cpu_short   24 hours
peregrine-cpu            cpu_medium  3 days
peregrine-cpu            cpu_long    10 days
[peregrine,kestrel]-gpu  cpu_debug   30 min
[peregrine,kestrel]-gpu  cpu_short   24 hours
[peregrine,kestrel]-gpu  cpu_medium  3 days
[peregrine,kestrel]-gpu  cpu_long    10 days

Resource Limits

CPU Partition Limits

Partition      QoS         Time Limit  Cores Limit  Jobs (running)  Jobs (waiting)
peregrine-cpu  cpu_debug   30 min      20           1               2
peregrine-cpu  cpu_short   24 hours    20           1               2
peregrine-cpu  cpu_medium  3 days      20           1               2
peregrine-cpu  cpu_long    10 days     20           1               2

GPU Partition Limits

At present, the resource limits on the GPU partitions are:

Partition             GPUs  CPUs  Nodes  Jobs (running)  Jobs (waiting)
peregrine-gpu (A100)  2     20    1      2               1
kestrel-gpu (3090)    3     32    1      2               2

Submitting Jobs

To submit a job on a cluster, you allocate resources and then run your executable on those resources. This is usually done by writing a job script, in which you specify a partition, a QoS, and the resources your job needs (CPUs, memory, GPUs, etc.).
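As a minimal sketch only (the partition and QoS names come from the tables above; the resource values and program name are placeholders, not recommendations), a job script might look like:

```shell
#!/bin/bash
#SBATCH --partition=peregrine-cpu   # partition to submit to
#SBATCH --qos=cpu_short             # QoS within that partition
#SBATCH --ntasks=1                  # number of tasks (processes)
#SBATCH --cpus-per-task=4           # CPU cores for the job
#SBATCH --mem=8G                    # memory for the job
#SBATCH --time=02:00:00             # wall-time limit (hh:mm:ss)

./my_program                        # placeholder for your executable
```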

Refer to the Sample Job Scripts section for help on writing job scripts.

Submit your Job

After you have your job script written in a file (say job.sh), submit the job via the sbatch command as shown below:

sbatch job.sh

On successful submission, you will see output similar to

Submitted batch job 1025

The number 1025 is the job ID. The job ID is needed for monitoring and troubleshooting, so make a note of it.
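If you want to capture the job ID in a script (for example, to pass it to monitoring commands later), sbatch's --parsable option prints only the ID:

```shell
jobid=$(sbatch --parsable job.sh)   # prints just the job ID, e.g. 1025
echo "Submitted job $jobid"
```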

You can then monitor the status of your job; refer to the Monitoring Jobs section.

Monitoring Jobs

Job Progress

Once you submit your job, it goes through several states. The most common states are:

  • PENDING
  • RUNNING
  • SUSPENDED
  • COMPLETING
  • COMPLETED

Below is a listing of all the states, with their short codes:

Short Code  State
PD          Pending. Job is waiting for resource allocation
R           Running. Job has an allocation and is running
S           Suspended. Execution has been suspended and resources have been released for other jobs
CA          Cancelled. Job was explicitly cancelled by the user or the system administrator
CG          Completing. Job is in the process of completing; some processes on some nodes may still be active
CD          Completed. Job has terminated all processes on all nodes with an exit code of zero
F           Failed. Job has terminated with a non-zero exit code or other failure condition

Slurm provides commands which you can use to monitor your jobs. You can also use the Live Cluster Status web page for a quick glance at all jobs. And you can specify your email address within your job script to be alerted at specific job events, see the Sample Job Scripts section for help configuring email alerts.

Monitoring Commands

squeue

The command squeue provides an overview of jobs in the scheduling queue (state information, allocated resources, runtime, etc.).

Syntax
squeue [options]

Common options

Option                           Description
--user=<user[,user[,…]]>         Request jobs from a comma-separated list of users
--jobs=<job_id[,job_id[,…]]>     Request specific jobs to be displayed
--partition=<part[,part[,…]]>    Request jobs from a comma-separated list of partitions
--states=<state[,state[,…]]>     Display jobs in specific states. Comma-separated list or "all". Default: "PD,R,CG"

The default output format is as follows:

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

where

Keyword           Description
JOBID             Job or step ID. For array jobs, the job ID format is <job_id>_<index>
PARTITION         Partition of the job/step
NAME              Name of the job/step
USER              Owner of the job/step
ST                State of the job/step. See above for a description of the most common states
TIME              Time used by the job/step. Format is days-hours:minutes:seconds (days and hours are printed only as needed)
NODES             Number of nodes allocated to the job, or the minimum number of nodes required by a pending job
NODELIST(REASON)  For pending jobs: reason why pending. For failed jobs: reason why failed. For all other job states: list of allocated nodes

Examples

List all jobs of user foo:

squeue --user=foo

List jobs of user foo in partition bar that are in the running state:

squeue --user=foo --partition=bar --states=R
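squeue also supports a custom output format via -o/--format; for example, to show only the job ID, short state code, and elapsed time for user foo (the field widths here are arbitrary):

```shell
# %i = job ID, %t = short state code, %M = time used by the job
squeue --user=foo --format="%.10i %.4t %.12M"
```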

sfqueue

The sfqueue command provides the queue status including number of CPUs and GPUs used. The output is similar to that displayed on the live status webpage.

Example
sfqueue

sgpu

To see the GPU memory usage, use the sgpu command.

Example
sgpu <your-jobid-here>

scontrol

The scontrol command provides detailed information about jobs and job steps.

Syntax
scontrol [options] [command]
Examples

Show detailed information about the job with ID 1536:

scontrol show jobid 1536

Show even more detailed information about the job with ID 1396 (including the job script):

scontrol -dd show jobid 1396

sstat

The command sstat provides detailed usage information about running jobs.

Syntax
sstat [options] -j <job(.stepid)>
Examples

Show detailed information about the job with ID 1536:

sstat -j 1536

Show even more detailed information about the job with ID 1396:

sstat -v -j 1396
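sstat output can be narrowed with --format; for example, to check the peak memory and average CPU time of a running job (JobID, MaxRSS, and AveCPU are standard sstat fields):

```shell
sstat --format=JobID,MaxRSS,AveCPU -j 1536
```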

Deleting Jobs

You may use the scancel command to delete active jobs.

Syntax

The command scancel can be applied to a category of jobs, using the following options:

Option             Description
-u, --user $USER   Jobs of the current user
-A, --account      Jobs under this charge account
-n, --jobname      Jobs with this job name
-p, --partition    Jobs in this partition
-t, --state        Jobs in this state

Examples

Delete specific job with ID 1536:

scancel 1536

Delete all of your running jobs:

scancel --state=R

Delete all of your jobs:

scancel --user $USER
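The filter options can be combined; for example, to delete only your pending jobs in a given partition (partition name taken from the tables above):

```shell
scancel --user=$USER --state=PENDING --partition=peregrine-cpu
```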

Check-pointing

Check-pointing means saving the state of your work periodically, so that it can be resumed later.

Need for Check-pointing

Imagine your code is doing some heavy computation and has been running for hours without saving when, suddenly, the computer crashes or shuts down. You would likely lose hours of computational work and need to start all over again. But if you save your work every ten minutes, then if something like the above happens, you will lose at most ten minutes of computation.

Check-pointing your code is always good practice, regardless of whether you run it on an HPC system or not. On HPC systems, check-pointing becomes even more important, as there are several reasons a job could get aborted:

  • Job exceeds the time limit
  • Job exceeds the allocated memory
  • Preemption by other jobs
  • Node failure (rare, but possible)

Beyond guarding against the job failures above, check-pointing is also useful for debugging and for monitoring your job's progress.

How to check-point

You can make your code checkpoint-able by saving its state periodically and looking for a state file when it starts. The general steps are:

  1. Look for a state file (which contains a previously saved state).
  2. If such a state file is found, restore the state, otherwise start from the beginning.
  3. Periodically save the state (this could include the intermediate results or the variable values, etc.).

How you implement these steps will vary based on your code.
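As a purely illustrative sketch (the file name, loop, and state format are made up, not a Falcon convention), the three steps above could look like this in a shell script:

```shell
#!/bin/bash
# Hypothetical example of a resumable loop using a simple state file.
STATE_FILE=state.txt

# Steps 1-2: restore saved progress if a state file exists, else start fresh.
if [ -f "$STATE_FILE" ]; then
    start=$(cat "$STATE_FILE")
else
    start=1
fi

for (( i=start; i<=10; i++ )); do
    : # ... one unit of real work would go here ...
    # Step 3: record progress after each unit so a restart resumes at i+1.
    echo "$((i + 1))" > "$STATE_FILE"
done
echo "all iterations complete"
```

If the job is killed mid-loop, resubmitting the same script picks up from the last recorded iteration instead of iteration 1.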

Job Priority

The job scheduler (SLURM) on Falcon assigns a priority to each job in order to determine which jobs to schedule and when. The priority is an integer calculated from a number of factors; it determines a job's position in the pending queue relative to other jobs, and hence the order in which pending jobs run.

Priority Calculation

On the Falcon cluster, the job priority is calculated as a weighted sum of the following factors:

  • Age: Length of time your job has been pending in the queue, eligible to be scheduled. Job priority increases with the job's age.
  • Size: The size of your job in terms of resources requested (CPUs, memory, GPUs). At present, this is not used to calculate priority.
  • QoS: Priority based on the job's requested run-time: debug, short, medium, or long. At present, this is not used to calculate priority.
  • Fairshare: Based on your historical usage (explained in the next section). Job priority decreases as resource usage increases.

Fair Share

We use the concept of "fair-share" to promote balanced resource usage among users. The scheduler deprioritizes users with excessive resource utilization: users who haven't used the cluster as much get higher priority for their jobs, while heavy users are prevented from overusing it.

A fractional number between 0 and 1 is assigned to each user based on their past usage. This number changes with your usage and with the total number of users in the system, and the job priority calculation uses it as one of its factors.
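On SLURM systems, the standard sshare command reports per-user fair-share factors; assuming it is available on Falcon, you can check your own value with:

```shell
# FairShare is the last column of the default output.
sshare --users=$USER
```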

Warning

If you have been using a lot of resources, your fair-share number will keep going down, and the priority of your subsequent jobs will drop.
