Runner#
In jobflow-remote the Runner refers to one or more processes that handle the whole
execution of the jobflow workflows, including the interaction with the worker and the
writing of the outputs in the JobStore
.
The way the Job states change based on the action of the Runner has already been described in the introductory Working principles section. This section will instead focus on the technical aspects of the runner execution.
Setup#
As explained in the Setup options section, and exemplified in the figure below,
the Runner process must have access to all required resources as specified in the project
configuration. In particular all the workers, the MongoDB database defined in the
queue
section and the output JobStore
.
Runner processes#
The Runner
performs different tasks, mainly divided in
checking out jobs from database to start a Flow execution
updating the states of the Jobs in the
queue
databaseinteracting with the worker hosts to upload/download files and check the job status
inserting the output data in the output
JobStore
While all these can be executed in a single process, to speed up the execution of the jobs, the default option is to start different daemonized processes, each of which takes care of one of the actions listed above. In addition, if many Jobs need to be dealt with simultaneously, it is also possible to start multiple independent processes that will deal with the tasks 3 and 4. This can allow to increase the throughput.
Note
This means that multiple instances of the Runner
process are simultaneously
running on the machine, each one requiring a certain amount of memory. If the
system memory is a limiting factor, all the actions can be executed in a single
process by starting the daemon with the --single
option.
When activated from the CLI, these run as daemonized processes, that are handled by
Supervisor. The CLI, through the DaemonManager
object, will provide an interface to interact with the daemon and start, stop and kill
the Runner
processes.
Process management#
Start#
The Runner
is usually started using the CLI command:
jf runner start
This will start the Supervisor process, that will then spawn the single or
multiple Runner processes. Note that the command will not wait for all the processes
to start, so the successful completion of the command does not necessarily imply
that all the Runner processes are active.
The number of runner processes can only be managed at start time. The --single
option will run all the actions described in the previous section in a single
process, instead that in multiple ones, which is the default. The --transfer
and --complete
options allow to increase the number of processes dedicated to
the steps 3 and 4.
Warning
The Runner
reads the project configurations when the processes is
started and does not attempt to refresh them during the execution. Whenever
the project configuration is changed the Runner
needed the runner needs
to be restarted.
Stop#
Executing the stop command:
jf runner stop
relies on Supervisor to send a SIGTERM
signal (a termination signal that allows
the process to exit cleanly) to all the Runner
processes.
In this case the supervisor process will remain active. Unless the --wait
option
is specified, the completion of the command will not imply that all the Runner
processes have been terminated.
Warning
The Runner
is designed to recognize the signal and wait for the completion of
the action being performed, before actually exiting.
Note
Since the supervisor process remains active, when starting the runner again after a stop it is not possible to switch from a single process to a split configuration or the other way round. It is necessary to shut down the whole daemon in that case.
Shutdown#
Shutting down the runner with the command:
jf runner shutdown
is equivalent to the Stop, except that also the Supervisor process will be stopped.
Kill#
It is possible to directly kill all the processes, without sending the SIGTERM
signal and thus without waiting for the current action to be completed with the
command:
jf runner kill
Warning
If an was action was being performed, it is possible that the database may be left in an inconsistent state and/or that the Job that was being processed will be locked, as the runner puts an explicit lock on the document while working on a Job and/or on a Flow. See the Runner errors and Locked jobs section for how to handle these cases.
Information#
It is possible to get an overall state of the runner daemon executing:
jf runner status
This returns a custom global state defined in jobflow-remote. Typical
values are shut_down
, stopped
and running
. A partially_running
state means that some of the daemonized processes are active, while other
are either not yet started or have been stopped.
To get more details about the single processes it is possible to run:
jf runner info
This prints a table like:
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━┓
┃ Process ┃ PID ┃ State ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━┩
│ supervisord │ 12305 │ RUNNING │
│ runner_daemon_checkout:run_jobflow_checkout │ 90127 │ RUNNING │
│ runner_daemon_complete:run_jobflow_complete0 │ 90128 │ RUNNING │
│ runner_daemon_queue:run_jobflow_queue │ 90129 │ RUNNING │
│ runner_daemon_transfer:run_jobflow_transfer0 │ 90130 │ RUNNING │
└──────────────────────────────────────────────┴───────┴─────────┘
providing the state of each individual daemon process and its system process ID.
Running daemon check#
There are no strict limitations to which machine should be used to execute the Runner
as a daemon, and, as explained in the Setup options section, there are several
possible configurations. It is thus possible for a user to mistakenly start
the runner daemon on two different locations. While this should not corrupt
the database, thanks to the locking mechanism, it may still generate errors
since the Job outputs could be downloaded from one of the runners and another one
may try to complete it, but without having access to the downloaded files.
This can be confusing as a user may be unaware that a runner is already active on some machine.
To mitigate the possibility of this to occur, jobflow-remote also adds information
in the database about the machine where a Runner
daemon is started. The
code will then prevent the system to start a daemon on a different machine. All the
commands will instead be allowed if the information about the machine where are
executed match those in the database.
If a machine where a Runner
was previously active was switched off without
explicitly stopping it, the database will still consider that daemon to be active.
To start the daemon on another machine, if it is certain that Runner
is not
active anymore, it is possible to clean the database reference to the previous
process with the command:
jf runner reset
A new Runner
daemon can then be started anywhere.
Warning
This procedure is applied only for Runner
processes started as a daemon.
No check is done and no data is added to the database if the Runner
is
started directly. See the Direct execution section below.
Backoff algorithm#
While performing its actions on the Jobs, the Runner
processes may incur in some
issues. For example a connection error may occur. In order to avoid overloading
the processes and/or the resources, when such an error occurs the process will not
immediately retry to execute the action, but wait an increasingly larger amount of time
before retrying. After three failures, the Job will be set to the REMOTE_ERROR
state. See the Remote errors for more details.
Direct execution#
It is not the standard usage, but in some cases, for example during development
or debugging, it may be useful to run the Runner
processes directly and not as
a daemon. The simplest option to do that would be to run:
jf runner run
This will start a single Runner
process performing all the actions.
Similarly, it is also possible to execute this from the python API with the code
below