The queue system is responsible for scheduling, running, monitoring and acting
on a set of realizations.
This page documents the configuration options available for each queue system.
Some of the options apply to all systems, others only to specific queue systems.
LOCAL queue
Let’s create local_queue.ert
with the following content:
JOBNAME queue_test_%d
QUEUE_SYSTEM LOCAL
QUEUE_OPTION LOCAL MAX_RUNNING 50
RUNPATH local_testing/realization-<IENS>/iter-<ITER>
NUM_REALIZATIONS 100
MIN_REALIZATIONS 1
INSTALL_JOB QUEUE_TEST QUEUE_TEST
SIMULATION_JOB QUEUE_TEST
In addition to this config, we’ll also need a forward model config QUEUE_TEST:
EXECUTABLE queue_test_forward_model.py
As well as the actual forward model, queue_test_forward_model.py:
#!/usr/bin/env python
import socket

# Print the name of the host this realization ran on
print(socket.gethostname())
Running ERT with this configuration, you can find the hostname of your machine
in the STDOUT
of the run.
Note that the test experiment always runs on the LOCAL queue,
regardless of what your configuration says.
There is only one queue option for the local queue system: MAX_RUNNING.
MAX_RUNNING
The queue option MAX_RUNNING controls the maximum number of simultaneously
submitted and running realizations, where n is a non-negative integer:
QUEUE_OPTION LOCAL MAX_RUNNING n
If n is zero (the default), there is no limit, and all realizations
will be started as soon as possible.
LSF systems
IBM’s Spectrum LSF software
is a common queue management system in high-performance computing environments.
The following example configuration makes some assumptions:
- Passwordless ssh access to the compute cluster.
- The mr LSF queue exists (check available queues with bqueues).
- The runpath (i.e. a folder named lsf_testing inside the current
  working directory) is accessible from the LSF server.
Note that the QUEUE_TEST
forward model config file and
queue_test_forward_model.py
remain the same as before.
JOBNAME queue_test_%d
NUM_CPU 2
QUEUE_SYSTEM LSF
QUEUE_OPTION LSF MAX_RUNNING 1
QUEUE_OPTION LSF LSF_SERVER be-grid01 -- Change this to a server you have access to
QUEUE_OPTION LSF LSF_QUEUE mr
QUEUE_OPTION LSF PROJECT_CODE user:$USER
RUNPATH lsf_testing/realization-<IENS>/iter-<ITER>
NUM_REALIZATIONS 1
MIN_REALIZATIONS 1
INSTALL_JOB QUEUE_TEST QUEUE_TEST
SIMULATION_JOB QUEUE_TEST
It is possible to set LSF options in the site-config, which is a site-wide
configuration that affects all users.
The following is a list of available LSF configuration options:
SUBMIT_SLEEP
Determines how long the system will sleep between submitting jobs.
Default: 0. To change it to 1 second:
QUEUE_OPTION LSF SUBMIT_SLEEP 1
LSF_SERVER
This option tells ERT which server should be used when submitting.
So when your configuration file has the setting:
QUEUE_OPTION LSF LSF_SERVER be-grid01
ERT will use ssh to submit your jobs using shell commands on the server
be-grid01. For this to work you must have passwordless ssh access to the
server.
LSF_QUEUE
The name of the LSF queue you wish to send simulations to. The parameter
will be passed as bsub -q name_of_queue (assuming bsub is the submit
command you are using); see the LSF documentation for details. Usage:
QUEUE_OPTION LSF LSF_QUEUE name_of_queue
LSF_RESOURCE
A resource requirement string describes the resources that a job needs.
LSF uses resource requirements to select hosts for remote execution and
job execution. Resource requirement strings can be simple (applying to the
entire job) or compound (applying to the specified number of slots).
See the LSF documentation for details.
Passed as the -R option to bsub. For example:
QUEUE_OPTION LSF LSF_RESOURCE rusage[mem=512MB:swp=1GB]
LSF_RSH_CMD
This option sets the remote shell command, which defaults to /usr/bin/ssh.
To use another command, pass the full path:
QUEUE_OPTION LSF LSF_RSH_CMD /opt/bin/ssh
LSF_LOGIN_SHELL
Equates to the -L parameter of e.g. bsub. Useful if you need to force
the bsub command to use e.g. /bin/csh; see the LSF documentation for
details. For example:
QUEUE_OPTION LSF LSF_LOGIN_SHELL /bin/csh
PROJECT_CODE
Equates to the -P parameter of e.g. bsub; see the LSF documentation for
details. For example, to register jobs in the foo project:
QUEUE_OPTION LSF PROJECT_CODE foo
EXCLUDE_HOST
Comma-separated list of hosts to be excluded. The LSF system will pass this
list of hosts to the -R argument of e.g. bsub with the criterion
hname!=<excluded_host_1>. For example:
QUEUE_OPTION LSF EXCLUDE_HOST host1,host2
MAX_RUNNING
The queue option MAX_RUNNING controls the maximum number of simultaneous jobs
submitted to the queue when using (in this case) the LSF queue system, where n
is a non-negative integer:
QUEUE_OPTION LSF MAX_RUNNING n
If n is zero (the default), it is set to the number of realizations.
TORQUE and PBS systems
TORQUE is a distributed resource manager providing control over batch jobs
and distributed compute nodes; it implements the API of the Portable Batch
System (PBS), so it is compatible with systems using OpenPBS or Altair’s
PBS Professional.
ERT offers several options specific to the TORQUE/PBS queue system, controlling
how it submits jobs. Currently, these options only work when the machine
you are logged into has direct access to the queue system; ERT then submits
directly with no further configuration.
To instruct ERT to use a TORQUE/PBS queue system, use the following
configuration:
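QUEUE_SYSTEM TORQUE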
The following is a list of all queue-specific configuration options:
QSUB_CMD, QSTAT_CMD, QDEL_CMD
By default ERT will use the shell commands qsub, qstat and qdel
to interact with the queue system, i.e. whatever binaries are first in your
PATH will be used. For fine-grained control of the shell-based submission
you can tell ERT which programs to use:
QUEUE_SYSTEM TORQUE
QUEUE_OPTION TORQUE QSUB_CMD /path/to/my/qsub
QUEUE_OPTION TORQUE QSTAT_CMD /path/to/my/qstat
QUEUE_OPTION TORQUE QDEL_CMD /path/to/my/qdel
QSTAT_OPTIONS
Options to be supplied to the qstat command. This defaults to -x,
which tells the qstat command to include exited processes.
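A usage sketch, assuming the option string is passed verbatim to qstat:
QUEUE_OPTION TORQUE QSTAT_OPTIONS -x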
CLUSTER_LABEL
The name of the cluster you are running simulations in. This might be a
label covering several clusters, or a single cluster, as in this example:
QUEUE_OPTION TORQUE CLUSTER_LABEL baloo
MAX_RUNNING
The queue option MAX_RUNNING controls the maximum number of simultaneous jobs
submitted to the queue when using the queue system, where n is a non-negative
integer:
QUEUE_OPTION TORQUE MAX_RUNNING n
If n is zero (the default), it is set to the number of realizations.
NUM_NODES, NUM_CPUS_PER_NODE
When using TORQUE/PBS systems, you can specify how many nodes a single job
should use, and how many CPUs per node, via the options NUM_NODES and
NUM_CPUS_PER_NODE. The default setup in ERT will use one node and one CPU.
If the numbers specified are higher than the cluster supports (e.g. you ask
for 32 CPUs, but no node has more than 16), the job will not start.
If you wish to increase these numbers, the program being run will usually
also have to be told to use correspondingly more processing units (e.g. for
ECLIPSE, use the keyword PARALLEL).
The following should allow 3 × 8 = 24 CPUs for processing realizations:
QUEUE_SYSTEM TORQUE
QUEUE_OPTION TORQUE NUM_NODES 3
QUEUE_OPTION TORQUE NUM_CPUS_PER_NODE 8
MEMORY_PER_JOB
You can specify the amount of memory you will need for running your
job. This will ensure that not too many jobs run on a single
shared-memory node at once, possibly crashing the compute node if it
runs out of memory.
You can get an indication of the memory requirement by watching the
course of a local run using the htop utility. Whether you should set
the peak memory usage as your requirement or a lower figure depends on
how closely in step the jobs run, i.e. whether their memory peaks are
likely to coincide.
The value supplied will be passed as a string in the qsub argument.
You must specify the unit, either gb or mb, as in the example:
QUEUE_OPTION TORQUE MEMORY_PER_JOB 16gb
By default, this value is not set.
KEEP_QSUB_OUTPUT
Sometimes the error messages from qsub
can be useful, if something is
seriously wrong with the environment or setup. To keep this output (stored
in your home folder), use this:
QUEUE_OPTION TORQUE KEEP_QSUB_OUTPUT 1
SUBMIT_SLEEP
To avoid stressing the TORQUE/PBS system you can instruct the driver to sleep
between submit requests. The argument to SUBMIT_SLEEP is the number of
seconds to sleep between each submit, which can be a fraction like 0.5:
QUEUE_OPTION TORQUE SUBMIT_SLEEP 0.5
QUEUE_QUERY_TIMEOUT
The driver allows the backend TORQUE/PBS system to be flaky, i.e. it may
intermittently fail to respond and give error messages when submitting jobs
or asking for job statuses. The timeout (in seconds) determines how long
ERT will wait before it gives up. It applies to job submission (qsub)
and job status queries (qstat). Default is 126 seconds.
ERT will do exponential sleeps, starting at 2 seconds, and the provided
timeout is a maximum. Let the timeout be the sum of a series like
2+4+8+16+32+64 in order to be explicit about the number of retries:
setting it to zero disallows flakiness, 2 allows one re-attempt, and 6
gives two re-attempts. Example allowing seven re-attempts:
QUEUE_OPTION TORQUE QUEUE_QUERY_TIMEOUT 254
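As a sketch of the arithmetic only (not ERT’s actual implementation), the
hypothetical helper below counts how many re-attempts fit in a given timeout
budget, assuming the 2, 4, 8, ... second sleep series described above:
# Sketch only: count how many re-attempts fit in a QUEUE_QUERY_TIMEOUT
# budget, assuming exponential sleeps of 2, 4, 8, ... seconds.
def retries_within_timeout(timeout):
    elapsed, sleep, retries = 0, 2, 0
    while elapsed + sleep <= timeout:
        elapsed += sleep  # total time slept so far
        sleep *= 2        # each sleep doubles
        retries += 1
    return retries

print(retries_within_timeout(126))  # 6 re-attempts (2+4+8+16+32+64, the default)
print(retries_within_timeout(254))  # 7 re-attempts (adds a final 128 s sleep)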
Slurm systems
Slurm is an open source queue system with many
of the same capabilities as LSF. The Slurm support in ERT assumes that the
computer you are running on is part of the Slurm cluster; no capabilities
for ssh forwarding, selecting the shell, and so on are provided.
The Slurm support in ERT interacts with the Slurm system by issuing sbatch,
sinfo, squeue and scancel commands, and parsing the output from
these commands. By default the Slurm driver will assume that the commands
are in PATH, i.e. the command to submit will be the equivalent of:
bash% sbatch submit_script.sh
But you can configure which binary should be used by using the
QUEUE_OPTION SLURM
configuration command, for example:
QUEUE_OPTION SLURM SBATCH /path/to/special/sbatch
QUEUE_OPTION SLURM SINFO /path/to/special/sinfo
QUEUE_OPTION SLURM SQUEUE /path/to/special/squeue
QUEUE_OPTION SLURM SCANCEL /path/to/special/scancel
The Slurm queue management tool offers very fine-grained control; only the
most necessary options have been added to ERT.
SBATCH
Command used to submit the jobs, default sbatch. To change the executable
to, for example, /opt/bin/sbatch, do this:
QUEUE_OPTION SLURM SBATCH /opt/bin/sbatch
SCANCEL
Command used to cancel the jobs, default scancel.
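To change the executable, mirroring the SBATCH example above (the path here
is just an illustration):
QUEUE_OPTION SLURM SCANCEL /opt/bin/scancel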
SCONTROL
Command to modify configuration and state, default scontrol.
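To change the executable (illustrative path):
QUEUE_OPTION SLURM SCONTROL /opt/bin/scontrol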
SQUEUE
Command to view information about the queue, default squeue.
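To change the executable (illustrative path):
QUEUE_OPTION SLURM SQUEUE /opt/bin/squeue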
PARTITION
Partition/queue in which to run the jobs, for example to use foo:
QUEUE_OPTION SLURM PARTITION foo
MAX_RUNTIME
Specify the maximum runtime (in seconds) a job is allowed to run, for
example:
QUEUE_OPTION SLURM MAX_RUNTIME 100
MEMORY
Memory (in MiB) required per node, for example:
QUEUE_OPTION SLURM MEMORY 16000
MEMORY_PER_CPU
Memory (in MiB) required per allocated CPU, for example:
QUEUE_OPTION SLURM MEMORY_PER_CPU 4000
INCLUDE_HOST
Specific host names to use when running the jobs. It is possible to add multiple
hosts separated by space or comma in one option call, e.g.:
QUEUE_OPTION SLURM INCLUDE_HOST host1,host2
EXCLUDE_HOST
Specific host names to exclude when running the jobs. It is possible to add multiple
hosts separated by space or comma in one option call, e.g.:
QUEUE_OPTION SLURM EXCLUDE_HOST host3,host4
MAX_RUNNING
The queue option MAX_RUNNING controls the maximum number of simultaneous jobs
submitted to the queue when using the queue system, where n is a non-negative
integer:
QUEUE_OPTION SLURM MAX_RUNNING n
If n is zero (the default), it is set to the number of realizations.