Queue system
The queue system is responsible for scheduling, running, monitoring and acting
on a set of realizations.
A number of queue systems are pre-installed with ERT:
LOCAL: run locally, on your machine. See LOCAL queue.
LSF: send computation to an LSF cluster. See LSF systems.
TORQUE: send computation to a TORQUE or PBS cluster. See TORQUE and PBS systems.
SLURM: send computation to a Slurm cluster. See Slurm systems.
Select the system using the QUEUE_SYSTEM keyword. For example, the
following line in your ERT configuration file specifies that ERT should use the
LOCAL system:
QUEUE_SYSTEM LOCAL
This page documents the configuration options available for each queue system.
Some of the options apply to all systems, others only to specific queue systems.
Options that affect all queue systems
In addition to the queue-specific settings, some options affect
all queue systems. These are documented in List of keywords.
On Crash or Exit
If ERT crashes or exits early, realizations that were already submitted will keep on running.
Results in a runpath can be loaded manually if enough realizations completed.
Normally ERT will do this automatically at the end of each iteration.
See also Restarting ES-MDA.
LOCAL queue
Let’s create local_queue.ert with the following content:
JOBNAME queue_test_<IENS>
QUEUE_SYSTEM LOCAL
QUEUE_OPTION LOCAL MAX_RUNNING 50
RUNPATH local_testing/realization-<IENS>/iter-<ITER>
NUM_REALIZATIONS 100
MIN_REALIZATIONS 1
INSTALL_JOB QUEUE_TEST QUEUE_TEST
FORWARD_MODEL QUEUE_TEST
In addition to this config, we’ll also need a forward model config,
QUEUE_TEST:
EXECUTABLE queue_test_forward_model.py
As well as the actual forward model, queue_test_forward_model.py:
#!/usr/bin/env python
import socket
print(socket.gethostname())
Running ERT with this configuration, you can find the hostname of your machine
in the STDOUT of the run.
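The three files above can also be generated with a short script, which is handy when repeating the test on several machines. The following is a minimal sketch and not part of ERT: the helper name write_local_queue_example is hypothetical, and it assumes that the forward model script needs the executable bit set before it can be run.

```python
import stat
import tempfile
from pathlib import Path

ERT_CONFIG = """\
JOBNAME queue_test_<IENS>
QUEUE_SYSTEM LOCAL
QUEUE_OPTION LOCAL MAX_RUNNING 50
RUNPATH local_testing/realization-<IENS>/iter-<ITER>
NUM_REALIZATIONS 100
MIN_REALIZATIONS 1
INSTALL_JOB QUEUE_TEST QUEUE_TEST
FORWARD_MODEL QUEUE_TEST
"""

FORWARD_MODEL_CONFIG = "EXECUTABLE queue_test_forward_model.py\n"

FORWARD_MODEL_SCRIPT = """\
#!/usr/bin/env python
import socket

print(socket.gethostname())
"""


def write_local_queue_example(directory: Path) -> list[Path]:
    """Write the three example files into `directory` (hypothetical helper)."""
    files = {
        "local_queue.ert": ERT_CONFIG,
        "QUEUE_TEST": FORWARD_MODEL_CONFIG,
        "queue_test_forward_model.py": FORWARD_MODEL_SCRIPT,
    }
    written = []
    for name, content in files.items():
        path = directory / name
        path.write_text(content, encoding="utf-8")
        written.append(path)
    # Assumption: the script must be executable for the forward model to run.
    script = directory / "queue_test_forward_model.py"
    script.chmod(script.stat().st_mode | stat.S_IXUSR)
    return written


# Demonstrate in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    created = write_local_queue_example(Path(tmp))
    print(sorted(path.name for path in created))
```

Pointing the helper at a real working directory instead of a temporary one leaves the files in place, ready to be passed to ERT.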
Note that running the test experiment will always run on the LOCAL queue,
no matter what your configuration says.
There is only one queue option for the local queue system: MAX_RUNNING.
MAX_RUNNING
The queue option MAX_RUNNING controls the maximum number of simultaneously
submitted and running realizations, where n is a non-negative integer:
QUEUE_OPTION LOCAL MAX_RUNNING n
If n is zero (the default), then there is no limit, and all realizations
will be started as soon as possible.
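To illustrate what such a cap means in practice, the sketch below runs a number of dummy realizations under a concurrency limit and reports the highest number that were active at once. It is only an illustration of the idea using asyncio; it does not show how ERT schedules realizations, and the names peak_concurrency and _run_all are hypothetical.

```python
import asyncio


async def _run_all(num_realizations: int, max_running: int) -> int:
    # A limit of 0 means "no limit", so allow every realization at once.
    slots = max_running if max_running > 0 else max(num_realizations, 1)
    semaphore = asyncio.Semaphore(slots)
    running = 0
    peak = 0

    async def realization() -> None:
        nonlocal running, peak
        async with semaphore:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0)  # stand-in for the real forward model
            running -= 1

    await asyncio.gather(*(realization() for _ in range(num_realizations)))
    return peak


def peak_concurrency(num_realizations: int, max_running: int) -> int:
    """Return the highest number of realizations active at the same time."""
    return asyncio.run(_run_all(num_realizations, max_running))


print(peak_concurrency(10, 3))  # never more than 3 at once
print(peak_concurrency(10, 0))  # 0 means no limit
```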
LSF systems
IBM’s Spectrum LSF software
is a common queue management system in high-performance computing environments.
The following example configuration makes some assumptions:
Passwordless SSH access to the compute cluster.
The mr LSF queue exists (check available queues with bqueues).
The runpath (i.e. a folder named lsf_testing inside the
current working directory) is accessible from the LSF server.
Note that the QUEUE_TEST forward model config file and
queue_test_forward_model.py remain the same as before.
JOBNAME queue_test_<IENS>
NUM_CPU 2
QUEUE_SYSTEM LSF
QUEUE_OPTION LSF MAX_RUNNING 1
RUNPATH lsf_testing/realization-<IENS>/iter-<ITER>
NUM_REALIZATIONS 1
MIN_REALIZATIONS 1
INSTALL_JOB QUEUE_TEST QUEUE_TEST
FORWARD_MODEL QUEUE_TEST
It is possible to set LSF options in the site-config, which is a site-wide
configuration that affects all users.
The following is a list of available LSF configuration options:
SUBMIT_SLEEP
Determines how long, in seconds, the system will sleep between submitting jobs.
Default: 0. To change it to 1 second:
QUEUE_OPTION LSF SUBMIT_SLEEP 1
LSF_QUEUE
The name of the LSF queue you wish to send simulations to. The parameter
will be passed as bsub -q name_of_queue (assuming bsub is the
submit command you are using). See the LSF documentation for bsub.
Usage:
QUEUE_OPTION LSF LSF_QUEUE name_of_queue
LSF_RESOURCE
A resource requirement string describes the resources that a job needs.
LSF uses resource requirements to select hosts for remote execution and
job execution. Resource requirement strings can be simple (applying to the
entire job) or compound (applying to the specified number of slots).
The value is passed without units; the unit depends on the cluster’s configuration,
so ask your cluster administrator which unit is in use.
See the LSF documentation on resource requirement strings.
Passed as the -R option to bsub. For example, this will
request approximately 15 gigabytes when the default unit is megabytes:
QUEUE_OPTION LSF LSF_RESOURCE rusage[mem=15000]
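Because the value carries no unit, it can help to make the conversion explicit when preparing a configuration. The sketch below builds the rusage string from a size in gigabytes. The function name rusage_mem and its units_per_gigabyte parameter are hypothetical; the default of 1000 assumes the cluster's unit is megabytes, as in the example above, and should be changed if your administrator reports a different unit.

```python
def rusage_mem(gigabytes: float, units_per_gigabyte: int = 1000) -> str:
    """Build an LSF rusage string for a memory request.

    Assumption: the cluster counts memory in megabytes, so one gigabyte is
    roughly 1000 units. Adjust `units_per_gigabyte` for other cluster setups.
    """
    if gigabytes <= 0:
        raise ValueError("memory request must be positive")
    amount = round(gigabytes * units_per_gigabyte)
    return f"rusage[mem={amount}]"


# Reproduces the configuration line shown above.
print(f"QUEUE_OPTION LSF LSF_RESOURCE {rusage_mem(15)}")
```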
PROJECT_CODE
String identifier used to map hardware resource usage to a project or account.
The project or account does not have to exist.
Equates to the -P parameter of, for example, bsub.
See the LSF documentation for bsub.
For example, to register jobs in the foo project:
QUEUE_OPTION LSF PROJECT_CODE foo
If the option is not set in the config file and the forward model section contains
any of the simulator jobs RMS, FLOW, ECLIPSE100 or ECLIPSE300,
a default will be set. For example:
FORWARD_MODEL RMS <args>
FORWARD_MODEL ECLIPSE100 <args>
This will set the PROJECT_CODE option to rms+eclipse100.
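The default described above can be pictured as a small function over the forward model names. This is a sketch that reproduces the documented example, not ERT's implementation: the name default_project_code is hypothetical, and the lower-casing, the "+" separator, the order of first appearance and the collapsing of duplicates are assumptions inferred from the single example rms+eclipse100.

```python
from typing import Iterable, Optional

# Simulator jobs that trigger a default project code, as listed above.
SIMULATOR_JOBS = ("RMS", "FLOW", "ECLIPSE100", "ECLIPSE300")


def default_project_code(forward_model_names: Iterable[str]) -> Optional[str]:
    """Derive a default PROJECT_CODE from the forward model job names.

    Assumptions: names are lower-cased, joined with "+", kept in order of
    first appearance, and duplicates are ignored.
    """
    tags = []
    for name in forward_model_names:
        tag = name.lower()
        if name in SIMULATOR_JOBS and tag not in tags:
            tags.append(tag)
    return "+".join(tags) if tags else None


print(default_project_code(["RMS", "ECLIPSE100"]))
```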
EXCLUDE_HOST
Comma-separated list of hosts to be excluded. The LSF system will pass this
list of hosts to the -R argument of e.g. bsub with the criteria
hname!=<excluded_host_1>. For example:
QUEUE_OPTION LSF EXCLUDE_HOST host1,host2
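As a rough picture of how the host list becomes selection criteria, the sketch below turns the option value into one hname!= term per host. The function name exclude_host_criteria is hypothetical, and joining the terms with && is an assumption based on LSF's resource requirement syntax; the exact string ERT passes to -R is not shown here.

```python
def exclude_host_criteria(option_value: str) -> str:
    """Turn a comma-separated host list into hname!= criteria.

    Assumption: the individual criteria are combined with "&&".
    """
    hosts = [host.strip() for host in option_value.split(",") if host.strip()]
    return " && ".join(f"hname!={host}" for host in hosts)


print(exclude_host_criteria("host1,host2"))
```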
MAX_RUNNING
The queue option MAX_RUNNING controls the maximum number of simultaneous jobs
submitted to the queue when using (in this case) the LSF queue system, where n
is a non-negative integer:
QUEUE_OPTION LSF MAX_RUNNING n
If n is zero (the default), then it is set to the number of realizations.
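The rule for n can be written as a small function. This is a sketch of the behaviour described above rather than ERT's own code; effective_max_running is a hypothetical name.

```python
def effective_max_running(n: int, num_realizations: int) -> int:
    """Return the job limit that applies for a given MAX_RUNNING value."""
    if n < 0:
        raise ValueError("MAX_RUNNING must not be negative")
    # Zero (the default) falls back to the number of realizations.
    return num_realizations if n == 0 else n


print(effective_max_running(0, 100))
print(effective_max_running(10, 100))
```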
TORQUE and PBS systems
TORQUE
is a distributed resource manager providing control over batch jobs and
distributed compute nodes; it implements the API of the Portable Batch System
(PBS), so it is compatible with systems using OpenPBS
or Altair’s PBS Professional.
ERT offers several options specific to the TORQUE/PBS queue system, controlling
how it submits jobs. Currently, this only works when the machine
you are logged into has direct access to the queue system. ERT then submits
directly with no further configuration.
To instruct ERT to use a TORQUE/PBS queue system, use the following
configuration:
QUEUE_SYSTEM TORQUE
The following is a list of all queue-specific configuration options:
QSUB_CMD, QSTAT_CMD, QDEL_CMD
By default ERT will use the shell commands qsub, qstat and qdel
to interact with the queue system, i.e. whatever binaries are first in your
PATH will be used. For fine-grained control of the shell-based submission
you can tell ERT which programs to use:
QUEUE_SYSTEM TORQUE
QUEUE_OPTION TORQUE QSUB_CMD /path/to/my/qsub
QUEUE_OPTION TORQUE QSTAT_CMD /path/to/my/qstat
QUEUE_OPTION TORQUE QDEL_CMD /path/to/my/qdel
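Before overriding the commands, it can be useful to check which binaries would be picked up from PATH, and whether an override actually points at an executable. The sketch below uses shutil.which from the Python standard library for this. The helper resolve_command is hypothetical and is not how ERT itself locates the commands.

```python
import shutil
from typing import Optional


def resolve_command(name: str, override: Optional[str] = None) -> Optional[str]:
    """Return the executable that would be run, or None if none is found.

    With no override, the first match for `name` in PATH is returned.
    With an override, the given path is checked instead.
    """
    return shutil.which(override if override else name)


for command in ("qsub", "qstat", "qdel"):
    print(command, "->", resolve_command(command))
```

A result of None for an override suggests the path is wrong or the file is not executable.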
CLUSTER_LABEL
The name of the cluster you are running simulations in. This
may be a label covering several clusters, or the name of a single cluster, as in this example:
QUEUE_OPTION TORQUE CLUSTER_LABEL baloo
MAX_RUNNING
The queue option MAX_RUNNING controls the maximum number of simultaneous jobs
submitted to the queue when using the TORQUE queue system, where n is a
non-negative integer:
QUEUE_OPTION TORQUE MAX_RUNNING n
If n is zero (the default), then it is set to the number of realizations.
KEEP_QSUB_OUTPUT
Sometimes the error messages from qsub can be useful, if something is
seriously wrong with the environment or setup. To keep this output (stored
in your home folder), use this:
QUEUE_OPTION TORQUE KEEP_QSUB_OUTPUT 1
SUBMIT_SLEEP
To avoid stressing the TORQUE/PBS system you can instruct the driver to sleep
after every submit request. The argument to SUBMIT_SLEEP is the number of
seconds to sleep for every submit, which can be a fraction like 0.5:
QUEUE_OPTION TORQUE SUBMIT_SLEEP 0.5
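The effect of SUBMIT_SLEEP can be sketched as a loop that pauses after each submission. This is an illustration only: submit_with_sleep and its parameters are hypothetical names, the submit callable stands in for whatever actually calls qsub, and pausing after (rather than before) each submission is an assumption.

```python
import time
from typing import Callable, Iterable, List, TypeVar

Job = TypeVar("Job")
Result = TypeVar("Result")


def submit_with_sleep(
    jobs: Iterable[Job],
    submit: Callable[[Job], Result],
    submit_sleep: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Result]:
    """Submit jobs one by one, pausing `submit_sleep` seconds after each."""
    results = []
    for job in jobs:
        results.append(submit(job))
        if submit_sleep > 0:
            sleep(submit_sleep)  # give the queue system time to breathe
    return results


# Example with a dummy submit function and a very short pause.
print(submit_with_sleep(["real-0", "real-1"], str.upper, submit_sleep=0.01))
```

Passing the sleep function as a parameter keeps the sketch easy to test without real delays.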
PROJECT_CODE
String identifier used to map hardware resource usage to a project or account.
The project or account does not have to exist.
Equates to the -A parameter of qsub;
see the qsub documentation.
For example, to register jobs under the foo account:
QUEUE_OPTION TORQUE PROJECT_CODE foo
If the option is not set in the config file and the forward model section contains
any of the simulator jobs RMS, FLOW, ECLIPSE100 or ECLIPSE300,
a default will be set. For example:
FORWARD_MODEL RMS <args>
FORWARD_MODEL ECLIPSE100 <args>
This will set the PROJECT_CODE option to rms+eclipse100.
Slurm systems
Slurm is an open source queue system with many
of the same capabilities as LSF. The Slurm support in ERT assumes that the
computer you are running on is part of the Slurm cluster; no capabilities
for ssh forwarding, choosing which shell to use and so on are provided.
The Slurm support in ERT interacts with the Slurm system by issuing sbatch,
sinfo, squeue and scancel commands, and parsing the output from
these commands. By default the Slurm driver will assume that the commands are in
PATH, i.e. the command to submit will be the equivalent of:
bash% sbatch submit_script.sh
But you can configure which binary should be used by using the
QUEUE_OPTION SLURM configuration command, for example:
QUEUE_OPTION SLURM SBATCH /path/to/special/sbatch
QUEUE_OPTION SLURM SINFO /path/to/special/sinfo
QUEUE_OPTION SLURM SQUEUE /path/to/special/squeue
QUEUE_OPTION SLURM SCANCEL /path/to/special/scancel
Slurm offers very fine-grained control over how jobs are run. In ERT
only the most necessary options have been added.
SBATCH
Command used to submit the jobs, default sbatch. To change the executable
to, for example, /opt/bin/sbatch, do this:
QUEUE_OPTION SLURM SBATCH /opt/bin/sbatch
SCANCEL
Command used to cancel the jobs, default scancel.
SCONTROL
Command to modify configuration and state, default scontrol.
SACCT
Command used when scontrol fails, default sacct.
SQUEUE
Command to view information about the queue, default squeue.
PARTITION
Partition/queue in which to run the jobs, for example to use foo:
QUEUE_OPTION SLURM PARTITION foo
MAX_RUNTIME
Specify the maximum time, in seconds, that a job is allowed to run, for
example:
QUEUE_OPTION SLURM MAX_RUNTIME 100
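Since the value is given in seconds, longer limits are easier to get right when converted from a duration. The sketch below does this with datetime.timedelta; the helper name max_runtime_option is hypothetical and simply formats the configuration line shown above.

```python
from datetime import timedelta


def max_runtime_option(limit: timedelta) -> str:
    """Format a Slurm MAX_RUNTIME line from a duration, in whole seconds."""
    seconds = int(limit.total_seconds())
    if seconds <= 0:
        raise ValueError("runtime limit must be at least one second")
    return f"QUEUE_OPTION SLURM MAX_RUNTIME {seconds}"


# One minute and forty seconds is the 100 seconds used above.
print(max_runtime_option(timedelta(minutes=1, seconds=40)))
print(max_runtime_option(timedelta(hours=2)))
```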
INCLUDE_HOST
Specific host names to use when running the jobs. It is possible to add multiple
hosts separated by space or comma in one option call, e.g.:
QUEUE_OPTION SLURM INCLUDE_HOST host1,host2
EXCLUDE_HOST
Specific host names to exclude when running the jobs. It is possible to add multiple
hosts separated by space or comma in one option call, e.g.:
QUEUE_OPTION SLURM EXCLUDE_HOST host3,host4
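Both INCLUDE_HOST and EXCLUDE_HOST accept hosts separated by space or comma. A minimal sketch of splitting such a value into individual host names is shown below; parse_hosts is a hypothetical helper and not ERT's own parser.

```python
import re


def parse_hosts(option_value: str) -> list[str]:
    """Split a host list on commas and/or whitespace, dropping empty entries."""
    return [host for host in re.split(r"[,\s]+", option_value) if host]


print(parse_hosts("host3,host4"))
print(parse_hosts("host1 host2, host5"))
```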
MAX_RUNNING
The queue option MAX_RUNNING controls the maximum number of simultaneous jobs
submitted to the queue when using the Slurm queue system, where n is a
non-negative integer:
QUEUE_OPTION SLURM MAX_RUNNING n
If n is zero (the default), then it is set to the number of realizations.
PROJECT_CODE
String identifier used to map hardware resource usage to a project or account.
The project or account does not have to exist.
Equates to the -A parameter of sbatch;
see the sbatch documentation.
For example, to register jobs under the foo account:
QUEUE_OPTION SLURM PROJECT_CODE foo
If the option is not set in the config file and the forward model section contains
any of the simulator jobs RMS, FLOW, ECLIPSE100 or ECLIPSE300,
a default will be set. For example:
FORWARD_MODEL RMS <args>
FORWARD_MODEL ECLIPSE100 <args>
This will set the PROJECT_CODE option to rms+eclipse100.
GENERIC queue options
A number of queue options are valid for all queue systems, and for those we can use
the GENERIC keyword. For example:
QUEUE_SYSTEM LSF
QUEUE_OPTION GENERIC MAX_RUNNING 10
QUEUE_OPTION GENERIC SUBMIT_SLEEP 2
Is equivalent to:
QUEUE_SYSTEM LSF
QUEUE_OPTION LSF MAX_RUNNING 10
QUEUE_OPTION LSF SUBMIT_SLEEP 2
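The equivalence above amounts to substituting the selected queue system for the GENERIC keyword. The sketch below shows that substitution on a list of (system, option, value) tuples. The function expand_generic is hypothetical, and it deliberately says nothing about which setting wins if the same option is given both as GENERIC and for a specific system, since that is not described here.

```python
from typing import List, Tuple

QueueOption = Tuple[str, str, str]  # (system, option name, value)


def expand_generic(queue_system: str, options: List[QueueOption]) -> List[QueueOption]:
    """Replace GENERIC with the selected queue system, keeping the order."""
    return [
        (queue_system if system == "GENERIC" else system, name, value)
        for system, name, value in options
    ]


generic = [("GENERIC", "MAX_RUNNING", "10"), ("GENERIC", "SUBMIT_SLEEP", "2")]
for system, name, value in expand_generic("LSF", generic):
    print(f"QUEUE_OPTION {system} {name} {value}")
```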