Dealing with errors#
Given the several operations required to execute a Job, it is normal that errors arise. In particular, errors may happen while the Runner manages the Job, or inside the Job itself. In this section we explain how to distinguish the different kinds of errors, how to get information about them and, where possible, how to fix them.
Types of errors#
When a Job fails to complete, the first task is to understand the source of the error. In jobflow-remote, errors fall mainly into two categories:
Remote errors: errors happening while the Runner deals with a Job. These include, for example, errors copying files to and from a worker, or interacting with the queue manager.
Job errors: errors during the execution of a Job on the worker, e.g. failures of the code being executed or bad inputs provided by the user.
Each of these kinds of errors may come with different kinds of messages and solutions.
Note
In some cases the distinction between the two types of errors may be less clear. An error related to the code execution may lead, for example, to a missing output file. In this case the system will just highlight that an output file is missing. A more in-depth analysis of the outputs should reveal that the source of the error is indeed in the executed calculation.
Remote errors#
If a problem arises during the processing of a Job, the runner will typically retry multiple times before setting the Job to the error state. This is because some of these failures are expected to be temporary or transient: for example, a minor network issue preventing the output files from being downloaded, or an overload of the worker's queueing system, which may not respond in time. By default the code will retry 3 times at increasing time intervals. After that the Job state will be set to REMOTE_ERROR.
Note
In some cases the runner may be able to determine that the error will not be solved even by retrying at a later time, and will directly set the state to REMOTE_ERROR.
Even before the Job has been set to the REMOTE_ERROR state, it is possible to know that an error occurred and that the operation will be retried after some time, because the Job state will be highlighted in orange when running jf job list. For example:
┏━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ DB id ┃ Name ┃ State ┃ Job id (Index) ┃ Worker ┃ Last updated [CET] ┃
┡━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ 1 │ job 1 │ CHECKED_OUT │ dc398a00-61b3-43e5-b4c5-f9c2ef8713b5 (1) │ worker1 │ 2024-02-27 15:16 │
│ 2 │ job 2 │ WAITING │ 3b2fdeae-750b-4b77-9395-971a88545f3e (1) │ worker1 │ 2024-02-27 15:16 │
└───────┴───────────┴─────────────┴───────────────────────────────────────────┴───────────┴────────────────────┘
It is also possible to visualize the time at which it will be retried using the -v option, i.e. jf job list -v.
Note
The state reported in jf job list corresponds to the state in which the Job currently is. This means that in the example above the Job already reached the CHECKED_OUT state and the error occurred while switching to the next state, in this case during the upload of the files.
At this point there might already be some actions that can be taken. Running jf job info 1 will show, among other information, an error message or the stack trace of the error in the remote.error field. For example:
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ created_on = '2024-02-27 15:16' │
│ db_id = '1' │
│ index = 1 │
│ metadata = {} │
│ name = 'job 1' │
│ parents = [] │
│ priority = 0 │
│ remote = { │
│ 'step_attempts': 2, │
│ 'retry_time_limit': '2024-02-27 16:51', │
│ 'error': Remote error: Could not create remote directory /path/to/run/folder/dc/39/8a/dc398a00-61b3-43e5-b4c5-f9c2ef8713b5_1 for db_id 1 │
│ } │
│ state = 'CHECKED_OUT' │
│ updated_on = '2024-02-27 15:16' │
│ uuid = 'dc398a00-61b3-43e5-b4c5-f9c2ef8713b5' │
│ worker = 'worker1' │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
If, for example, this was due to a problem with the connection to the worker that has already been resolved, it is possible to avoid waiting for the job to be retried. This can be done by running:
jf job retry 1
If instead the problem persists and the runner exhausts its retries while uploading the files, the Job will end up in the REMOTE_ERROR state:
┏━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ DB id ┃ Name ┃ State ┃ Job id (Index) ┃ Worker ┃ Last updated [CET] ┃
┡━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ 1 │ job 1 │ REMOTE_ERROR │ dc398a00-61b3-43e5-b4c5-f9c2ef8713b5 (1) │ worker1 │ 2024-02-27 16:05 │
│ 2 │ job 2 │ WAITING │ 3b2fdeae-750b-4b77-9395-971a88545f3e (1) │ worker1 │ 2024-02-27 15:16 │
└───────┴───────────┴──────────────┴───────────────────────────────────────────┴───────────┴────────────────────┘
In a similar way, the Job info (jf job info 1) will contain the details about the failure:
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ created_on = '2024-02-27 15:16' │
│ db_id = '1' │
│ index = 1 │
│ metadata = {} │
│ name = 'job 1' │
│ parents = [] │
│ previous_state = 'CHECKED_OUT' │
│ priority = 0 │
│ remote = { │
│ 'step_attempts': 3, │
│ 'retry_time_limit': '2024-02-27 16:51', │
│ 'error': Remote error: Could not create remote directory /path/to/run/folder/dc/39/8a/dc398a00-61b3-43e5-b4c5-f9c2ef8713b5_1 for db_id 1 │
│ } │
│ state = 'REMOTE_ERROR' │
│ updated_on = '2024-02-27 16:05' │
│ uuid = 'dc398a00-61b3-43e5-b4c5-f9c2ef8713b5' │
│ worker = 'worker1' │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
It can be noticed that the state that was reached before the failure is now shown in the previous_state value. Again, if the problem was temporary and has been fixed, the Job can be brought back to the previous_state (CHECKED_OUT in this case) with the command jf job retry 1.
If for any reason the Job needs to be restarted from scratch, i.e. brought back to the READY state, this can be achieved by running:
jf job rerun 1
Note
The jf job rerun and jf job retry commands have several options to select multiple Jobs at once. For example, jf job retry -s REMOTE_ERROR will retry all the Jobs in the REMOTE_ERROR state. Check jf job retry -h for the full list of available options.
Warning
It is impossible to provide an exhaustive list of potential issues that could lead to a REMOTE_ERROR state, so, except in some well-defined cases, the error message will mainly consist of the stack trace of the error.
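As a quick recap, the commands used above when dealing with a remote error are:
jf job info 1    # inspect the remote.error field and the step attempts
jf job retry 1   # retry from the previous state once the problem is fixed
jf job rerun 1   # or restart the Job from scratch, back to the READY state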
Job errors#
Errors may of course also arise during the execution of a Job on the worker. In this case the runner cannot tell right away that an error has occurred: it first needs to download the output of the Job and extract the error from it. The Job will thus first reach the DOWNLOADED state, and then become either COMPLETED or FAILED, depending on whether the Job completed successfully or not.
The kinds of errors that can lead to this include:
- issues in the code executed by the Job
- bad input parameters
- an unsuccessful calculation in the Job
- insufficient resources allocated on the worker
- a bug in the Job's code or in jobflow-remote
but the possible issues strictly depend on the Jobs being executed.
As for the Remote errors, the Jobs in this state can be identified from the Job list with the CLI (jf job list):
┏━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ DB id ┃ Name ┃ State ┃ Job id (Index) ┃ Worker ┃ Last updated [CET] ┃
┡━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ 5 │ job raise │ FAILED │ 1f871d18-8d0d-4720-bc41-d647027fa5ec (1) │ local_shell │ 2024-02-27 17:26 │
│ 6 │ job 2 │ WAITING │ dc4ebf43-b0b4-46f8-b578-90c433ceb714 (1) │ local_shell │ 2024-02-27 17:25 │
└───────┴───────────┴─────────┴───────────────────────────────────────────┴─────────────┴────────────────────┘
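If many Jobs are present, it may be convenient to filter the list by state. As a hedged example, assuming jf job list accepts the same -s state selection shown above for jf job retry (check jf job list -h to confirm):
jf job list -s FAILED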
The details of the error can then be obtained with the jf job info 5 command:
╭────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ created_on = '2024-02-27 23:25' │
│ db_id = '5' │
│ end_time = '2024-02-27 23:26' │
│ error = Traceback (most recent call last): │
│ File "/python3.11/site-packages/jobflow_remote/jobs/run.py", line 58, in run_remote_job │
│ response = job.run(store=store) │
│ ^^^^^^^^^^^^^^^^^^^^ │
│ File "/python3.11/site-packages/jobflow/core/job.py", line 583, in run │
│ response = function(*self.function_args, **self.function_kwargs) │
│ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ File "/path/to/job/file.py", line 24, in raise_job │
│ raise RuntimeError("A generic error") │
│ RuntimeError: An error for a and b │
│ index = 1 │
│ metadata = {} │
│ name = 'add_raise' │
│ parents = [] │
│ priority = 0 │
│ remote = {'step_attempts': 0, 'process_id': '35379'} │
│ run_dir = '/run_jobflow/1f/87/1d/1f871d18-8d0d-4720-bc41-d647027fa5ec_1' │
│ start_time = '2024-02-27 17:26' │
│ state = 'FAILED' │
│ updated_on = '2024-02-27 17:26' │
│ uuid = '1f871d18-8d0d-4720-bc41-d647027fa5ec' │
│ worker = 'worker1' │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────╯
In this case an ad-hoc failing Job was executed and the error field contains the stack trace of the Python error. In most cases, however, a failure related to the Job execution may lead to an error message that does not reveal the true nature of the problem. In that case the best option is to investigate the output files produced by the Job in the run_dir folder.
The folder should always contain a queue.out and a queue.err file, containing the stdout and stderr of the executed script. Any issue related to the queuing system and most issues related to the code executed by the Job are likely to be printed there. For this reason a convenience tool is available in the CLI to fetch their content directly from the worker's folder:
jf job queue-out 5
If the content of these files does not help in identifying the issue and the Job produces output files, those should also be checked for errors.
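For example, a minimal sketch of listing the content of the execution folder directly on the worker, using the run_dir value reported by jf job info and assuming the worker is reachable via SSH at a hypothetical host worker1.example.com:
ssh worker1.example.com 'ls -l /run_jobflow/1f/87/1d/1f871d18-8d0d-4720-bc41-d647027fa5ec_1'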
The actions required to solve the issue will depend on the nature of the error itself. If it resulted from a temporary issue (e.g. a failure of the cluster), simply rerunning the Job with:
jf job rerun 5
should solve the issue.
Note
Only jf job rerun should be applied to FAILED Jobs. jf job retry is not suitable in this case.
If the error is due to a lack of resources or wrong configuration options, these can be updated using the specific commands:
jf job set resources
jf job set exec-config
jf job set worker
With the correct resources set, the Job may complete correctly when it is executed again.
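As a hedged example, reassigning Job 5 to another worker defined in the project configuration (here the placeholder name other_worker) and executing it again; the exact selection options and argument order of jf job set worker may differ, so check jf job set worker -h first:
jf job set worker -jid 5 other_worker
jf job rerun 5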
If instead the error is caused by a wrong input provided to the Job or by some problem related to the Job itself, one potential option would be to alter the content of the Job object in the queue database. However, this depends on the type of Job being performed and cannot be handled directly by jobflow-remote. In most cases this kind of error will require deleting the old Flow (e.g. jf flow delete -jid 5) and resubmitting it with the correct inputs.
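A minimal sketch of this last approach, where submit_corrected_flow.py is a hypothetical script that rebuilds the Flow with the corrected inputs and submits it again (for instance through jobflow_remote's submit_flow function):
jf flow delete -jid 5
python submit_corrected_flow.py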
States diagram#
The following diagram summarizes the possible transitions from failed states, depending on the usage of the rerun and retry commands. Rerunning a Job always brings it back to the READY state, while retrying brings it back to the state it was in when the remote error occurred.
Rerun constraints#
In general, rerunning a simple FAILED Job should not pose any issue. However, jobflow has a specific feature that allows Jobs to switch to the READY state even if some of their parents have failed (see the OnMissing options for the Job configuration). In such a case, rerunning the FAILED parent Job may lead to inconsistencies in the queue database.
Jobflow-remote tries to avoid such inconsistencies by limiting the options to rerun a Job. In particular:
- if a Job FAILED and all its children are in the READY or WAITING state, the Job can always be rerun
- if any of the children (or further descendant) Jobs have a Job index larger than 1, the FAILED Job can never be rerun, as there is no way to handle the state of the children in a meaningful way
- in all the other cases it is possible to rerun the FAILED Job using the --force option of the jf job rerun command, as in the example below. The user should be aware that this may lead to inconsistencies if dynamical generation of Jobs is involved. Child Jobs will be rerun and set to the READY or WAITING state. In this case, the Runner should preferably be stopped beforehand in order to minimize the risk of inconsistencies.
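For example, a possible sequence to force the rerun of the FAILED Job with DB id 5, stopping the Runner first as recommended above:
jf runner stop
jf job rerun --force 5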
Runner errors and Locked jobs#
Even if less likely, the Runner itself may also run into issues during its execution. Aside from bugs, the process could be stopped abruptly due to issues on the machine hosting it, or because it required too many resources (e.g. too much memory).
If this happens while the Runner was processing a Job, it is likely that the Job will be left in a locked state. In fact, to avoid concurrent actions, the system puts a lock on the corresponding document in the database when working on a Job. If the Runner is killed, the lock will not be released. This means that, even if the Runner is restarted, no further processing of the Job can happen, because the system will assume that some other process is working on the locked Job.
Locked Jobs can be identified in the job list using the -v or -vv option:
jf job list -v
which will give the following output, including the Locked column:
┏━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┓
┃ DB id ┃ Name ┃ State ┃ Job id (Index) ┃ Worker ┃ Last updated [CET] ┃ Queue id ┃ Run time ┃ Retry time [CET] ┃ Prev state ┃ Locked ┃
┡━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━┩
│ 7 │ job 1 │ UPLOADED │ c527bdee-1edd-48df-b533-fe20a63fa8c6 (1) │ worker1 │ 2024-02-28 11:49 │ │ │ │ │ * │
│ 8 │ job 2 │ WAITING │ fcb75bca-d6f3-45c0-8980-79dec4ad0737 (1) │ worker1 │ 2024-02-28 11:46 │ │ │ │ │ │
└───────┴───────────┴──────────┴───────────────────────────────────────────┴─────────────┴────────────────────┴──────────┴──────────┴──────────────────┴────────────┴────────┘
Alternatively, the list of locked Jobs can be obtained with the jf job list -l command.
Warning
The presence of a locked Job in the list does not imply an error in the Runner. Jobs are regularly locked while the Runner performs operations on them. For example, transferring files from the worker may take some time and the Job will remain locked during the whole procedure.
If a Job appears to be locked for a long time or if the Runner is stopped and a Job is still locked, it is likely that the lock was not properly released.
Note
Remember that when running jf runner stop or jf runner shutdown the runner will not stop immediately if it is in the middle of an operation. Check the Runner status with jf runner status or by inspecting the runner.log file in the ~/.jfremote/PROJECT_NAME/log folder to determine whether the Runner is still active.
In case the Job should not be locked, it can be unlocked with the command:
jf admin remove-lock -jid 5
At this point the Runner will repeat the action.
Warning
Any action that may have already been performed will be repeated.
Alternatively, it is also possible to entirely rerun the Job by passing the --break-lock option:
jf job rerun --break-lock 5
The Runner should preferably be stopped before performing this command.
Runner logs#
In addition to the error messages, if the source of an error could not be determined, it may be worth inspecting the log files produced by the runner.
Each project has its own folder (by default a subfolder of ~/.jfremote) and the logs can be found in the ~/.jfremote/PROJECT_NAME/log directory. The runner.log file contains the log messages produced by the Python Runner object; this is more likely to contain information concerning errors related to the code. The supervisord.log is instead the log produced by supervisord, which manages the daemon processes.
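For example, to follow the Runner log of a project named my_project (assuming the default ~/.jfremote location):
tail -f ~/.jfremote/my_project/log/runner.log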