This page is a short guide on how to set job parameters to achieve good computational efficiency and optimize resource usage, which should also result in shorter queue times. The guide aims to be as generic as possible and doesn't cover all possible cases; if your application has specific requirements, feel free to experiment beyond the suggestions included here. If in doubt, contact the PLGrid helpdesk or consult the Slurm documentation for an in-depth explanation of the topics and options discussed on this page:

https://slurm.schedmd.com/quickstart.html

https://slurm.schedmd.com/sbatch.html

Motivation

Choosing a good, close to optimal, job configuration has many benefits, which include:

  1. Jobs with precisely specified configurations are easier for the scheduler to allocate and thus have shorter queue times.
  2. More jobs can run at the same time, which leads to a shorter "time to result" if you have a set of jobs.
  3. Optimal usage of assigned resources. Some applications can utilize many cores, while others achieve the best results with fewer cores.

Consult the hardware

We need to know the underlying hardware to choose a proper job configuration. You can find the hardware configuration of cluster nodes in the manual for the particular cluster, or ask Slurm how a node is configured with the "scontrol show node <nodename>" command. In the case of Ares, a CPU node has 48 cores and 184GB of memory available to the user. Note that there is a specific ratio of memory per CPU; similarly, on GPU nodes there is a certain number of CPUs and amount of memory per GPU.
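
For example, a minimal way to check a node's resources from the command line; the node name is a placeholder, and the exact field names may vary slightly between Slurm versions:

Checking node hardware
# Show the configuration of a particular node (replace <nodename> with a real node name)
scontrol show node <nodename>
# The output includes fields such as CPUTot (number of cores) and RealMemory (memory in MB),
# from which you can work out the memory-per-CPU ratio of that node type.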

Let's start with the simplest example:

Single core job
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=3850M
# ... other configuration directives

In the above script, we allocate a single task with one core on a single node (duh!) and 3850MB of memory. These parameters and values are the defaults for jobs in the plgrid queue on Ares, so specifying them is optional, but it is good practice to state them explicitly, especially when a job uses more than one core.

If you are unsure whether your application can benefit from a whole node, allocating a fraction of a node is the best option. For example, if you are migrating from Prometheus to Ares and your application was running on a whole node on Prometheus (24 cores), we can try to replicate this configuration on Ares. Half of an Ares computing node has 24 cores with 92GB of memory, which looks like a good fit. In the Slurm script, this would look like the following example:

Half node job
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --mem=92G
# ... other configuration directives

Note that the script explicitly requests 1 node, 1 task, and 24 cores for the task. The memory declaration requests half of the node's memory. Why not allocate 120GB? Because then we would allocate more than half of the node's total memory, and therefore more than half of the node. In such a case, two such jobs wouldn't fit on one node, which makes scheduling more difficult and results in suboptimal resource usage.

Additionally, if the job exceeds the memory-per-CPU ratio (3.85GB per CPU on Ares), the accounting system takes this into account and charges the grant as if the job had used more CPUs! An example of such a job would be a request for 24 CPUs and 120GB of memory, which results in the job being billed as if it used 32 cores.
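
The arithmetic behind the 32 cores can be sketched as follows, assuming the billed CPU count is the larger of the requested CPUs and the memory request divided by the 3850MB-per-CPU ratio, rounded up to a whole core (the variable names below are only illustrative):

Estimating billed CPUs
#!/bin/bash
mem_mb=122880    # 120GB requested, expressed in MB
cpus=24          # CPUs requested
ratio_mb=3850    # memory-per-CPU ratio on Ares
billed=$(( (mem_mb + ratio_mb - 1) / ratio_mb ))   # ceil(120GB / 3.85GB) = 32
if [ "$billed" -lt "$cpus" ]; then billed=$cpus; fi
echo "billed as $billed cores"                     # prints 32, not 24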

If your job can utilize many cores, allocating a whole node or multiple whole nodes is best. Avoid jobs where individual tasks are scattered across several nodes instead of being packed onto whole nodes. Such jobs usually perform poorly, as communication within a single node is much faster than communication between machines. A sample job script for a multi-node job is shown below:

Multi node job
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1
#SBATCH --mem=184G
# ... other configuration directives
# ... some MPI application

This script specifies that the job will use 4 nodes and 192 cores in total, and each node will allocate 184GB of memory. Please note that the job explicitly states that it will use 4 whole nodes, with 48 tasks requested on each node. This ensures that tasks are placed close to each other.
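
Inside such a script, the MPI application is typically started with srun, which launches one process per allocated task. A minimal sketch, where the module name and binary are placeholders for your own environment and program:

Launching the MPI application
# srun picks up the task layout (4 nodes x 48 tasks each) from the sbatch directives,
# so no extra -N/-n options are needed here.
module load <mpi_module>   # load the MPI environment required by your application
srun ./my_mpi_app          # placeholder binary name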

How to determine if a job has good efficiency

There are some general guidelines applicable to most cases, which include the following:

  1. Consult the output of the "hpc-jobs-history" command, which includes the "efficiency" column. This column is a rough estimate of how much of the allocated CPU time was actually used. Low values suggest that the application is not using all the cores, or that time is spent on things other than computation, such as IO or memory allocation.
  2. Note the duration of your jobs; wall-clock time is a universal measure of how fast the computations are performed.
  3. Test a chosen set of job configurations and determine the performance/CPU ratio. This will allow you to pick the best number of CPUs for the job (see the sketch after this list).
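
A minimal sketch of such a scaling test: submit the same job script with a few different core counts and compare the resulting runtimes. The script name scaling_test.sh is hypothetical, and the memory is scaled to stay at the 3850MB-per-CPU ratio discussed above:

Scaling test sketch
#!/bin/bash
# Submit the same job with different core counts; command-line options
# override the corresponding directives inside the script.
for cpus in 1 4 12 24 48; do
    sbatch --nodes=1 --ntasks-per-node=1 --cpus-per-task=$cpus \
           --mem=$(( cpus * 3850 ))M \
           --job-name=scaling-$cpus scaling_test.sh
done
# Afterwards, compare runtimes (e.g. in hpc-jobs-history) and compute the
# performance per CPU to find where adding cores stops paying off.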

Of course, the above guidelines are not definitive. E.g., if a single-core job has the best performance-to-CPU ratio but takes a week to finish, it is not an optimal configuration. If the job can scale with reasonable efficiency up to a point where it takes 1 day to execute, that is the better option. As a general guideline, it is best to keep job runtimes between 1 hour and 3 days; jobs that are too short might incur significant scheduling overhead.

If your application reports a specific performance metric, you're in luck! Examples of such applications include NAMD and Gromacs. In such cases, we have a clear indication of performance right from the start of the calculations, and determining the optimal job configuration boils down to testing a few possibilities and estimating the cost of running the job against the computation steps performed.

Optimize the queue time

You can ask the scheduler to provide the estimated start time of your job by issuing the "sbatch --test-only script.sh" command. This command doesn't submit the job; it returns a pessimistic estimate of when the job would be started. Keep in mind that it is just an estimate, and in most cases the actual queue time will be shorter.
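
For example (script.sh stands for your own batch script; the squeue options shown are a standard way to check estimates for jobs that are already queued):

Checking estimated start times
# Print an estimated start time without actually queueing the job
sbatch --test-only script.sh
# For jobs already in the queue, ask the scheduler for its current estimates
squeue --start -u $USER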

To shorten the queue times, we can apply the following methods:

  1. If you make a job easier to schedule, it might be started faster. This includes:
    1. Applying suggestions from previous points and requesting fractions of nodes or full nodes.
    2. Setting a realistic estimate of the job runtime (see the example after this list). This way, the scheduler might squeeze your job into a free spot more easily.
  2. Choose an optimal configuration for your job. If your job doesn't benefit from more resources, reduce the job size. Smaller jobs might take longer to execute, but they usually start faster thanks to backfilling and available resources.
  3. Plan your work! Sometimes there is no way around the long queue, so one way to mitigate the wait times is to plan your work ahead of time and submit jobs in advance.
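
A minimal example of stating a runtime limit; the 6-hour value is only an illustration, so pick a limit based on your own measured runtimes plus a reasonable safety margin:

Job with a realistic time limit
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --mem=92G
#SBATCH --time=06:00:00   # realistic estimate with a safety margin, instead of the partition maximum
# ... other configuration directives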