Scheduler
This section describes how the scheduler is configured on the Falcon cluster and how to use it.
A scheduler, also known as a task manager or a batch-queuing system, acts as a resource manager, giving users access to the cluster’s resources (such as CPUs, GPUs, and memory) in a fair and efficient manner. There are many different resource managers available; on Falcon, we use SLURM.
Jobs
The way a program is run on an HPC cluster differs from how it is run on a typical workstation. When you log in to the cluster, you communicate only with the login node(s). However, since the compute nodes are where the cluster’s actual computing power is located, programs need to run on them, not on the login nodes.
Since users cannot log in to the compute nodes, you must ask the cluster’s scheduling system to run your program there. To do this, you submit a script with instructions on how to run your program on the compute nodes. The scheduler then runs this script on the compute nodes as a job.
Partitions
Partitions group nodes into logical sets of resources. They can be thought of as queues for jobs, each with its own set of resource limits. Users submit their jobs to the partition that best suits their job’s resource requirements.
Refer to the sub-sections on the left for detailed information on partitions, QoS, submitting jobs, and more.
More on Handling Jobs
- Partitions
- Resource Limits
- Submitting Jobs
- Monitoring Jobs
- Deleting Jobs
- Check-pointing
- Job Priority
Partitions
The Falcon cluster provides different CPU and GPU architectures, as well as different queues for different job priorities and limits.
We use the following control techniques for specifying queues and resources:
- Partitions
- Quality of Service (QoS)
At present there are 3 partitions; peregrine-cpu is the default:

| Partition | Job Type (CPU/GPU) | CPU Memory / GPU Memory |
|---|---|---|
| peregrine-cpu | single- and multi-core | 1 TB / NA |
| peregrine-gpu | GPU | 1 TB / 640 GB |
| kestrel-gpu | GPU | 360 GB / 288 GB |
Quality of Service (QoS)
We use QoS to classify jobs based on limits. Each partition has a default QoS, and each QoS has specific limits:
| Partition | QoS | Time Limit |
|---|---|---|
| peregrine-cpu | cpu_debug | 30 min |
| peregrine-cpu | cpu_short | 24 hours |
| peregrine-cpu | cpu_medium | 3 days |
| peregrine-cpu | cpu_long | 10 days |
| peregrine-gpu, kestrel-gpu | cpu_debug | 30 min |
| peregrine-gpu, kestrel-gpu | cpu_short | 24 hours |
| peregrine-gpu, kestrel-gpu | cpu_medium | 3 days |
| peregrine-gpu, kestrel-gpu | cpu_long | 10 days |
Resource Limits
CPU Partition Limits
| Partition | QoS | Time Limit | Cores Limit | Jobs (running) | Jobs (waiting) |
|---|---|---|---|---|---|
| peregrine-cpu | cpu_debug | 30 min | 20 | 1 | 2 |
| peregrine-cpu | cpu_short | 24 hours | 20 | 1 | 2 |
| peregrine-cpu | cpu_medium | 3 days | 20 | 1 | 2 |
| peregrine-cpu | cpu_long | 10 days | 20 | 1 | 2 |
GPU Partition Limits
At present, the resource limits on the GPU partitions are:
| Partition | GPUs | CPUs | Nodes | Jobs (running) | Jobs (waiting) |
|---|---|---|---|---|---|
| peregrine-gpu (A100) | 2 | 20 | 1 | 2 | 1 |
| kestrel-gpu (3090) | 3 | 32 | 1 | 2 | 2 |
Submitting Jobs
To run a job on the cluster, you need to allocate resources and then run your executable on them. This is usually done by writing a job script, in which you specify a partition, a QoS, and the resources your job needs (CPUs, memory, GPUs, etc.).
Refer to the Sample Job Scripts section for help on writing job scripts.
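As a rough sketch of what such a script looks like (the job name, resource values, and program name below are placeholders, not recommendations; the Sample Job Scripts section remains the authoritative reference):

```shell
#!/bin/bash
#SBATCH --job-name=my_job          # a name for the job (placeholder)
#SBATCH --partition=peregrine-cpu  # partition to submit to
#SBATCH --qos=cpu_short            # QoS (determines the time limit)
#SBATCH --ntasks=1                 # number of tasks
#SBATCH --cpus-per-task=4          # CPU cores per task
#SBATCH --mem=8G                   # memory for the whole job
#SBATCH --time=02:00:00            # walltime limit (hh:mm:ss)

# The commands to run on the compute node go below the directives.
./my_program
```

The `#SBATCH` lines are directives read by the scheduler; everything after them is an ordinary shell script executed on the allocated compute node.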
Submit your Job
After you have your job script written in a file (say job.sh), submit the job via the sbatch command:

```
sbatch job.sh
```

On successful submission, you will see output similar to:

```
Submitted batch job 1025
```

where 1025 is the job ID. The job ID is needed for troubleshooting, so make a note of it.
You can then monitor the status of your job, refer to the Monitoring Jobs section.
Monitoring Jobs
Job Progress
Once you submit your job, it goes through several states. The most common states are:
- PENDING
- RUNNING
- SUSPENDED
- COMPLETING, and
- COMPLETED
Below is a listing of all the states, with their short codes:
| Short Code | State |
|---|---|
| PD | Pending. Job is waiting for resource allocation |
| R | Running. Job has an allocation and is running |
| S | Suspended. Execution has been suspended and resources have been released for other jobs |
| CA | Cancelled. Job was explicitly cancelled by the user or the system administrator |
| CG | Completing. Job is in the process of completing. Some processes on some nodes may still be active |
| CD | Completed. Job has terminated all processes on all nodes with an exit code of zero |
| F | Failed. Job has terminated with non-zero exit code or other failure condition |
Slurm provides commands which you can use to monitor your jobs. You can also use the Live Cluster Status web page for a quick glance at all jobs. And you can specify your email address within your job script to be alerted at specific job events, see the Sample Job Scripts section for help configuring email alerts.
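For example, email alerts can be requested with SLURM’s mail directives inside the job script (the address below is a placeholder):

```shell
#SBATCH --mail-type=BEGIN,END,FAIL   # job events that trigger an email
#SBATCH --mail-user=you@example.com  # where to send the alerts
```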
Monitoring Commands
squeue
The command squeue provides an overview of jobs in the scheduling queue (state information, allocated resources, runtime, etc.).
Syntax
```
squeue [options]
```

Common options
| Option | Description |
|---|---|
| --user=<user[,user[,…]]> | Request jobs from a comma-separated list of users |
| --jobs=<job_id[,job_id[,…]]> | Request specific jobs to be displayed |
| --partition=<part[,part[,…]]> | Request jobs from a comma-separated list of partitions |
| --states=<state[,state[,…]]> | Display jobs in specific states. Comma-separated list or “all”. Default: “PD,R,CG” |
The default output format is as follows:
```
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
```
where
| Keyword | Description |
|---|---|
| JOBID | Job or step ID. For array jobs, the job ID format will be of the form <job_id>_<index> |
| PARTITION | Partition of the job/step |
| NAME | Name of the job/step |
| USER | Owner of the job/step |
| ST | State of the job/step. See above for a description of the most common states |
| TIME | Time used by the job/step. Format is days-hours:minutes:seconds (days,hours only printed as needed) |
| NODES | Number of nodes allocated to the job or the minimum amount of nodes required by a pending job |
| NODELIST (REASON) | For pending jobs: Reason why pending. For failed jobs: Reason why failed. For all other job states: List of allocated nodes. |
Examples
List all jobs of user foo:

```
squeue --user=foo
```

List jobs of user foo in partition bar that are in the running state:

```
squeue --user=foo --partition=bar --states=R
```

sfqueue
The sfqueue command provides the queue status, including the number of CPUs and GPUs in use. The output is similar to that displayed on the live status webpage.
Example
```
sfqueue
```

sgpu
To see the GPU memory usage of a job, use the sgpu command.
Example
```
sgpu <your-jobid-here>
```

scontrol
The scontrol command provides detailed information about jobs and job steps.
Syntax
```
scontrol [options] [command]
```

Examples

Show detailed information about the job with ID 1536:

```
scontrol show jobid 1536
```

Show even more detailed information about the job with ID 1396 (including the job script):

```
scontrol -dd show jobid 1396
```

sstat
The command sstat provides detailed usage information about running jobs.
Syntax
```
sstat [options] -j <job(.stepid)>
```

Examples

Show detailed information about the job with ID 1536:

```
sstat -j 1536
```

Show even more detailed information about the job with ID 1396:

```
sstat -v -j 1396
```

Deleting Jobs
You may use the scancel command to delete active jobs.
Syntax
The command scancel can be applied to a category of jobs, using the following options:
| Option | Description |
|---|---|
| -u, --user $USER | jobs of the current user |
| -A, --account | jobs under this charge account |
| -n, --jobname | jobs with this job name |
| -p, --partition | jobs in this partition |
| -t, --state | jobs in this state |
Examples
Delete a specific job with ID 1536:

```
scancel 1536
```

Delete all of your running jobs:

```
scancel --state=R
```

Delete all of your jobs:

```
scancel --user $USER
```

Check-pointing
Check-pointing means saving the state of your work periodically, so that it can be resumed later.
Need for Check-pointing
Imagine your code has been doing heavy computations for hours without saving its state when, suddenly, the computer crashes or shuts down. You would lose your hours of computational work and have to start all over again. But if you save your work every ten minutes, you will lose at most ten minutes of computation if something like that happens.
Check-pointing your code is always good practice, whether you are running it on an HPC system or not. On HPC systems, check-pointing becomes even more important, since there are several reasons a job could get aborted:
- Job exceeds the time limit
- Job exceeds the allocated memory
- Preemption by other jobs
- Node failure (rare, but possible)
In addition to the above job failure reasons, check-pointing is also good for debugging and monitoring your job progress.
How to check-point
You can make your code checkpoint-able by saving its state periodically, and looking for a state file when it starts. Here are some general steps you need to follow:
- Look for a state file (which contains a previously saved state).
- If such a state file is found, restore the state, otherwise start from the beginning.
- Periodically save the state (this could include the intermediate results or the variable values, etc.).
The way you implement these steps will vary based on your code.
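The steps above can be sketched in a few lines of shell. This is only an illustration (the state file name and the counting loop are stand-ins for real work):

```shell
#!/bin/bash
STATE_FILE=state.txt

# Step 1 and 2: look for a state file; restore the saved state if found,
# otherwise start from the beginning.
if [ -f "$STATE_FILE" ]; then
    i=$(cat "$STATE_FILE")
else
    i=0
fi

while [ "$i" -lt 100 ]; do
    # ... one unit of real work would go here ...
    i=$((i + 1))
    # Step 3: periodically save the state (here: after every iteration).
    echo "$i" > "$STATE_FILE"
done

echo "finished at i=$i"
```

If the script is killed partway through, rerunning it resumes from the last saved counter instead of starting over from zero.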
Job Priority
The job scheduler (SLURM) on Falcon assigns a priority to each job in order to determine which jobs to schedule and when. Priority determines a job’s position in the pending queue relative to other jobs, and hence the order in which pending jobs will run. The priority is an integer value calculated from a number of factors.
Priority Calculation
On the Falcon cluster, the job priority is calculated as a weighted sum of the following factors:
- Age: Length of time your job has been pending in the queue, eligible to be scheduled. The job priority increases with the job’s age.
- Size: The size of your job in terms of resources requested (CPUs, memory, GPUs). At present, this is not used to calculate priority.
- QoS: Priority based on the job’s requested run-time: debug, short, medium, or long. At present, this is not used to calculate priority.
- Fairshare: Based on your historical usage (explained in the next section). Job priority decreases as your resource usage increases.
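Schematically, since the size and QoS factors currently carry zero weight, the weighted sum reduces to something like the following (the weights are cluster-configured constants, shown here only for illustration):

```
priority = w_age * age_factor + w_fairshare * fairshare_factor
```

Both factors are normalized values, so the weights determine how strongly queue wait time and historical usage each influence a job’s position in the queue.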
Fair Share
We use the concept of “fair-share” to promote balanced resource usage among users. The scheduler deprioritizes users with excessive resource utilization, ensuring that users who haven’t used the cluster much get higher priority for their jobs, while heavy users cannot monopolize it.
Each user is assigned a fractional number between 0 and 1 based on their past usage. This number changes over time with your usage and with the total number of users on the system, and it enters the job priority calculation as one of the factors.
Warning
If you have been using a lot of resources, your fair-share number will keep decreasing, and the priority of your subsequent jobs will drop accordingly.