Advanced options#
In this section we illustrate options and settings that may require additional configuration, extra care in their use and in handling related issues, or that are still considered experimental and may be subject to major changes in the future.
Limiting the number of submitted Jobs#
Due to limitations imposed by the computing center, or simply to avoid having too many
jobs in the queue, it may be convenient to limit the number of jobs that are simultaneously
submitted to the worker. This limit can be enforced by setting the max_jobs value
for the worker in the configuration file.
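As a minimal sketch, a worker entry limiting the number of queued jobs could look like the following (the worker name, host and paths are illustrative):

  worker_name:
    scheduler_type: slurm
    host: hpc_host
    work_dir: /remote/path/run_dir
    max_jobs: 10  # at most 10 jobs submitted to this worker's queue at any time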
To avoid frequent queries to the database and requests to the worker’s queue manager, jobflow-remote keeps an internal count of the jobs that are currently running and only at longer intervals validates this count against the actual number of jobs submitted to the worker’s queue. This may cause slight delays in submitting new jobs if some jobs have finished running but their state has not yet been updated in the database.
OTP-based login#
Warning
This functionality should be considered as experimental and may be subject to changes in the future.
To increase security, some computing centers require ssh connections to provide a password or a generated one-time password (OTP) interactively. Since jobflow-remote needs ssh connections to interact with remote workers, a different approach is required to handle the runner in this case. While this procedure should not expose any sensitive information, you should consider whether it goes against the security policies of the computing center you are working with.
In order to deal with this kind of configuration, first set up the worker that
requires an OTP in the standard way, as shown in the Workers section.
Then, set the option interactive_login: true
in the worker configuration.
This will signal to the system that interaction from the user is required during the
opening of the connection.
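As a sketch, the relevant part of the worker configuration could look like this (the worker name, host and paths are illustrative):

  otp_worker:
    scheduler_type: slurm
    host: hpc_host
    work_dir: /remote/path/run_dir
    interactive_login: true  # the user will be prompted when the connection is opened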
In addition to the interactive_login option, there are two different use cases
that should be handled separately by the user in the configuration:
The connection requires ssh key + OTP: the ssh key should be defined explicitly, either in the ~/.ssh/config file that defines the connection to the host, or in the configuration of the worker (using the key_filename option; see the sketch below).
The connection requires password + OTP: no ssh key should be present in the configuration files.
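Building on the previous sketch, for the ssh key + OTP case the key can for instance be specified directly in the worker configuration (the key path is illustrative):

  otp_worker:
    scheduler_type: slurm
    host: hpc_host
    work_dir: /remote/path/run_dir
    key_filename: /home/user/.ssh/id_hpc_key  # explicit ssh key for the key + OTP case
    interactive_login: true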
Warning
The rest of the procedure is independent of which option is being used, but it is crucial to properly define the options for the ssh key, depending on the use case.
When starting the runner, the daemon should also be told that the user needs to input some information during the connection. The runner should thus be started with:
jf runner start -s -i
Unlike the standard procedure, the command will not immediately daemonize (i.e., go to the background); instead it will
wait for the runner to start and give access to the I/O of the daemon process.
At this point the runner will display whatever interactive login prompts come from the worker’s server.
Once all the requests have been answered and the connection has been established, you can leave the interactive
mode with a keyboard interrupt (e.g., CTRL+C).
Limitations#
Given the strict connection requirements, this approach comes with some limitations:
The user will always need to provide an OTP when starting the Runner or whenever running commands that require a connection to the worker.
The Runner will not be able to reconnect if the connection drops, so the connection should be stable.
If the connection is killed, the Runner should be restarted (some computing centers may automatically kill a connection after a given time period).
The Runner can only run as a single process, not in the split mode.
Batch submission#
Warning
This functionality should be considered as experimental and may be subject to changes in the future.
The standard approach in jobflow-remote uses a single job on the worker (e.g. a SLURM job) to execute a single jobflow Job. However, when working on a cluster with long queues, it may be convenient to execute several Jobs in a single submitted job. To do this, jobflow-remote allows you to define special batch workers that keep executing Jobs until some conditions are met.
More explicitly, a Job that is assigned to a batch worker follows the standard procedure
until the UPLOADED state, but instead of a single job being submitted to the queue
and the Job moving to the SUBMITTED state, a file is added to a specific folder in the
file system of the worker and the Job moves to the BATCH_SUBMITTED state. This folder acts
as a pool of Jobs to be executed by the jobs submitted to the worker queue. This
preserves the requirement of avoiding direct connections to the queue database from the
worker process.
Once the Job has been executed, a file is created in a different folder, which
signals to the runner that the Job has TERMINATED. From this point onward the processing
of the Job by the runner proceeds in the standard way.
In order to define a batch worker, the batch section for that worker should be filled
in the configuration file. In particular, the jobs_handle_dir and work_dir options should be
defined. These should be paths in the file system of the worker that will be used
to manage the remote jobs. In addition, the max_jobs option of the worker should be set.
This defines the maximum number of batch jobs simultaneously submitted to the queue.
A minimal configuration for a batch worker would thus be:
  worker_name:
    scheduler_type: slurm
    work_dir: /home/guido/software/python/test_jfr/
    host: hpc_host
    max_jobs: 5
    batch:
      jobs_handle_dir: /remote/path/jfr_handle_dir
      work_dir: /remote/path/jfr_batch_jobs
See the Projects configuration and Settings section for the list of all the configuration options available
in the batch
section.
Note that, since the completion of a Job and the subsequent potential switch of WAITING
Jobs to the READY state is still managed by the runner, this functionality is effective
when several Jobs and Flows are managed at the same time. If too much time passes between the
end of one Job and the availability of a new one, the batch job in the queue will stop.
Warning
The batch
section of a worker’s configuration also has a max_jobs
option.
It allows for the definition of the maximum number of jobflow Jobs that will be executed in a single
process submitted to the queue (e.g. a SLURM job). This should not be confused with
the max_jobs value mentioned above, which defines the number of submitted batch
processes (e.g. the maximum number of SLURM jobs simultaneously in the queue).
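As a sketch illustrating the two options together (the values and paths are arbitrary), the worker-level max_jobs limits the batch processes in the queue, while the max_jobs inside the batch section limits how many jobflow Jobs each of those processes executes:

  worker_name:
    scheduler_type: slurm
    work_dir: /remote/path/run_dir
    host: hpc_host
    max_jobs: 5              # at most 5 batch processes (e.g. SLURM jobs) in the queue
    batch:
      jobs_handle_dir: /remote/path/jfr_handle_dir
      work_dir: /remote/path/jfr_batch_jobs
      max_jobs: 10           # each batch process executes at most 10 jobflow Jobs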