Projects configuration and Settings#

Jobflow-remote can handle multiple configurations, called projects. Since a single project is enough for most users, let us first consider the configuration of a single project. The handling of Multiple Projects is described below.

Aside from the project options, a set of general settings can also be configured through environment variables or an additional configuration file, as described in the General Settings - Environment variables section.

Project options#

The project configuration controls the behaviour of the Job execution, as well as of the other objects in jobflow-remote. A full description of the project’s configuration file is given here. If you are looking for a minimal example with its description, you can find it in the Configuration section.

The specifications of the project’s attributes are given by the Project pydantic model, which serves the purpose of parsing and validating the configuration files, as well as giving access to the associated objects (e.g. the JobStore). A graphical representation of the Project model, and thus of the options available in the configuration file, is given below (generated with erdantic).

All-in-one configuration

A description of all the types and keys of the project file is given in the Project specs section below, while an example of a full configuration file can be generated by running:

jf project generate --full YOUR_PROJECT_NAME

Note that, while the default file format is YAML, JSON and TOML are also acceptable formats. You can generate the example in the other formats using the --format option.
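
For instance, a minimal sketch of generating the example in TOML format (assuming the option accepts the lowercase format name as value):

jf project generate --full --format toml YOUR_PROJECT_NAME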

Name and folders#

The project name is given by the name attribute. The name will be used to create a subfolder containing

  • files with the parsed outputs copied from the remote workers

  • logs

  • files used by the daemon

For all these folders the paths have default values, but they can be customised by setting tmp_dir, log_dir and daemon_dir.

Warning

The project name does not depend on the configuration file name. For coherence, it is better to use the project name as the file name.

Example

Standard usage does not require filling in all the directory paths; usually just the name is provided in the configuration:

name: my_project

Workers#

Multiple workers can be defined in a project. In the configuration file each worker is given with its name as the key and its properties in the associated dictionary.

Several defining properties should be set in the configuration of each worker. First, the type should be specified. At the moment the possible worker types are

  • local: a worker running on the same system as the Runner. No connection is needed for the Runner to reach the queueing system.

  • remote: a worker on a different machine than the Runner, requiring an SSH connection to reach it.

Since the Runner needs to interact with the workers constantly, for the latter type all the credentials needed to connect automatically should be provided. The best option is to set up a passwordless connection and define it in the ~/.ssh/config file.
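
For example, a minimal sketch of an ~/.ssh/config entry for such a passwordless connection (host name, user and key path are placeholders):

Host my_cluster
    HostName cluster.example.com
    User username
    IdentityFile ~/.ssh/id_ed25519

The Host alias (my_cluster here) can then be used directly as the host value in the worker configuration.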

The other key property of the workers is the scheduler_type. It can be any of the values supported by QToolKit. Typical values are:

  • shell: the Job is executed directly in the shell. No queue will be used. If not limited, all the Jobs can be executed simultaneously

  • slurm, pbs, …: the name of a queueing system. The job will be submitted to the queue with the selected resources.

Another mandatory argument is work_dir, indicating the full path of a folder on the worker machine where the Jobs will actually be executed.

It is possible to optionally set default values for keywords like pre_run and resources, which can be overridden for individual Jobs. Note that these configurations will be applied to all the Jobs executed by the worker. They are thus more suitable for generic settings (e.g. the activation of a python environment, or loading of some modules) than for code-specific configurations; the latter are better set with the Execution configurations.

Note

If a single worker is defined it will be used as default in the submission of new Flows.

Warning

By default, jobflow-remote fetches the status of the jobs from the scheduler by passing the list of ids. If the selected scheduler does not support this option (e.g. SGE), it is also necessary to specify the username on the worker machine through the scheduler_username option. Jobflow-remote will use that as a filter, instead of the list of ids.
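
A minimal sketch of such a worker (assuming sge is the QToolKit name for the scheduler; all paths and names are placeholders):

workers:
  sge_cluster:
    type: remote
    scheduler_type: sge
    work_dir: /path/to/run/dir
    host: sge.cluster.host.net
    # required since SGE cannot filter jobs by a list of ids
    scheduler_username: username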

Example

Several workers of different kinds can be defined:

workers:
  my_cluster_front_end: # A worker on the front end of the cluster
    scheduler_type: shell
    work_dir: /path/to/run/dir
    # Activate the conda environment on the worker
    pre_run: "\neval \"$(conda shell.bash hook)\"\nconda activate jfr\n"
    type: remote
    # No connection details if they are defined in the ~/.ssh/config file
    host: my_cluster
  my_cluster: # same cluster, a worker using the SLURM queue
    scheduler_type: slurm
    work_dir: /path/to/run/dir
    resources:
    pre_run: "\neval \"$(conda shell.bash hook)\"\nconda activate jfr\n"
    type: remote
    host: my_cluster
  another_cluster: # A remote worker on another cluster
    scheduler_type: slurm
    work_dir: /path/to/another/run/dir
    pre_run: source /data/venv/jobflow/bin/activate
    type: remote
    # also possible to define connection details here
    host: another.cluster.host.net
    user: username
    key_filename: /path/to/ssh/private/key
  another_cluster_batch: # A batch worker on the second cluster
    scheduler_type: slurm
    work_dir: /path/to/another/run/dir
    # Each batch job will run on two nodes with 24 cores each
    resources:
      nodes: 2
      ntasks_per_node: 24
      mem: 95000
      partition: xxx
    pre_run: source /data/venv/jobflow/bin/activate
    type: remote
    host: another.cluster.host.net
    user: username
    key_filename: /path/to/ssh/private/key
    # maximum 5 batch Slurm jobs in the queue at the same time
    max_jobs: 5
    # This will determine that the worker is a "batch" worker
    batch:
      jobs_handle_dir: /some/remote/path/run/batch_handle_slurm
      work_dir: /some/remote/path/run/batch_work_slurm
      # two jobflow Job executed in parallel in a single SLURM submission
      parallel_jobs: 2
  local_shell: # A local worker running in the shell
    scheduler_type: shell
    work_dir: /local/path/to/run/jobs
    pre_run: "\neval \"$(conda shell.bash hook)\"\nconda activate atomate2_pyd2\n"
    type: local
    # Limit the number of local jobs. Since there is no queue they can all
    # start simultaneously otherwise
    max_jobs: 3

JobStore#

The jobstore value contains a dictionary representation of the standard JobStore object defined in jobflow. It can either be the serialized version, as obtained from the as_dict method, or the representation defined in jobflow’s documentation.

This JobStore will be used to store the outputs of all the Jobs executed in this project.

Note

The JobStore should be defined in jobflow-remote’s configuration file. The content of the standard jobflow configuration file will be ignored.

Warning

If you have been using jobflow without jobflow-remote and you have a JobStore defined in a jobflow.yaml file, it will be ignored. Only the definition in the jobflow-remote configuration file will be considered.

Example

Define a JobStore similarly to standard jobflow

jobstore:
  docs_store:
    type: MongoStore
    host: <host name>
    port: 27017
    username: <username>
    password: <password>
    database: <database name>
    collection_name: outputs

Queue Store#

The queue element contains the definition of the database holding the state of the Jobs and Flows. The store subelement should contain the representation of a maggma Store. As for the JobStore, it can be either its serialization or the same kind of representation used for the docs_store in jobflow’s configuration file.

The collection defined by the Store will contain the information about the state of the Jobs, while two more collections will be created. The names of these two collections can also be customized.

Warning

The queue Store should be a subclass of MongoStore and currently needs to be based on a real MongoDB instance (e.g. not a JSONStore). Some key operations that jobflow-remote requires on the collections are not supported by any file-based MongoDB implementation at the moment.

Warning

If the JobStore is also based on MongoDB, it is often convenient to have its main docs_store in the same database as the queue store. In that case it is important that the two do not point to the same collection; unexpected errors may happen otherwise.

Example

Define a queue store as a maggma Store. It is possible to use the same syntax as for the JobStore. Customizing the names of the additional collections is also possible, but not necessary:

queue:
  store:
    type: MongoStore
    host: <host name>
    port: 27017
    username: <username>
    password: <password>
    database: <database name>
    collection_name: jobs
  flows_collection: flows
  auxiliary_collection: jf_auxiliary

Execution configurations#

It is possible to define a set of ExecutionConfig objects to quickly set up configurations for different kinds of Jobs and Flows. The exec_config key contains a dictionary where the keys are the names associated with the configurations and the values define the instructions to be executed before and after the Job. See the Execution configuration section for more details and usage examples.

Example

Multiple configurations can be defined, for example one for each version of an external code or for different software requirements. If multiple workers are present, different exec_config entries will need to be defined for each of them:

exec_config:
  xxx_v1_1_my_cluster:
    modules:
    - releases/2021b
    - intel/2021b
    export:
      PATH: /path/to/executable/v1.1:$PATH
    pre_run:
    post_run:
  xxx_v3_2_my_cluster:
    modules:
    - releases/2023b
    - intel/2023b
    export:
      PATH: /path/to/executable/v3.2:$PATH
  yyy_local:
    export:
      PATH: /path/to/local/executable:$PATH
    pre_run: "echo 'test'\necho 'test2'"

Runner options#

The behaviour of the Runner can also be customized to some extent. In particular, the Runner implements an exponential backoff mechanism for retrying when an operation updating a Job state fails. The number of attempts and the delays between them can be set with the max_step_attempts and delta_retry values. In addition, reasonable defaults are set for the delay between each check of the database for the different kinds of actions performed by the Runner. These intervals can be changed to better fit your needs. Keep in mind that reducing these intervals too much may put unnecessary strain on the database.

Example

Most of the time the default values should be fine. Here is how to customize the Runner execution:

runner:
  delay_checkout: 10
  delay_check_run_status: 10
  delay_advance_status: 10
  delay_update_batch: 10
  lock_timeout: 86400
  delete_tmp_folder: true
  max_step_attempts: 3
  delta_retry:
  - 30
  - 300
  - 1200

Metadata#

While it does not currently play any role in the execution of jobflow-remote, the metadata section can be used to include additional information to be used by external tools, or to quickly distinguish a configuration file among others.
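
For example, a minimal sketch (the keys inside metadata are arbitrary and only meant for illustration):

metadata:
  description: project for testing jobflow-remote
  maintainer: my_username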

Multiple Projects#

While a single project can be enough for most users and for beginners, it may be convenient to define different databases, configurations and python environments to work on different topics. For this reason jobflow-remote considers as potential project configurations all the YAML, JSON and TOML files in the ~/.jfremote folder. There is no additional procedure required to add or remove a project, aside from creating/deleting a project configuration file.

Warning

Different projects are meant to use different Queue Stores. Sharing the same collections for two projects is not a supported option.

To define the Queue Store for multiple projects two options are available:

  • each project has its own database, with standard collection names

  • a single database is used and each project is assigned a set of collections. For example, a configuration for one of the projects could be:

    queue:
      store:
        type: MongoStore
        database: DB_NAME
        collection_name: jobs_project1
        ...
      flows_collection: flows_project1
      auxiliary_collection: jf_auxiliary_project1
    

    And the same for a second project with different collection names.

There is no constraint on the database and collection used for the output JobStore. Even though it may make sense to separate the sets of outputs, it is possible to share the same collection among multiple projects. In that case the output documents will have duplicated db_id values, as each project has its own counter. If this is an issue, it is possible to set different db_id_prefix values in the queue configuration of the different projects.
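
A minimal sketch of the queue section with a custom prefix (the proj1 value is arbitrary):

queue:
  store:
    ...
  db_id_prefix: proj1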

If more than one project is present and a specific one is not selected, the code will always stop and ask for a project to be specified. Python functions like submit_flow and get_jobstore accept a project argument to specify which project should be considered. For the command line interface, a general -p option allows selecting the project for the command that is being executed:

jf -p another_project job list

To define a default project for all the functions and commands executed on the system or in a specific shell, see the General Settings - Environment variables section.

Project specs#

Project

Type: object

The configurations of a Project.

No Additional Properties

Name

Type: string

The name of the project

Base Dir

Default: null

The base directory containing the project related files. Default is a folder with the project name inside the projects folder

Type: string
Type: null

Tmp Dir

Default: null

Folder where remote files are copied. Default a 'tmp' folder in base_dir

Type: string
Type: null

Log Dir

Default: null

Folder containing all the logs. Default a 'log' folder in base_dir

Type: string
Type: null

Daemon Dir

Default: null

Folder containing daemon related files. Default to a 'daemon' folder in base_dir

Type: string
Type: null

Log Level

Type: enum (of string) Default: "info"

The level set for logging

Must be one of:

  • "error"
  • "warn"
  • "info"
  • "debug"

Runner

Type: object

The options for the Runner

No Additional Properties

Delay Checkout

Type: integer Default: 30

Delay between subsequent execution of the checkout from database (seconds)

Delay Check Run Status

Type: integer Default: 30

Delay between subsequent executions of the check of the status of the jobs submitted to the scheduler (seconds)

Delay Advance Status

Type: integer Default: 30

Delay between subsequent advancement of the job's remote state (seconds)

Delay Refresh Limited

Type: integer Default: 600

Delay between subsequent refresh from the DB of the number of submitted and running jobs (seconds). Only used if a worker with max_jobs is present

Delay Update Batch

Type: integer Default: 60

Delay between subsequent refresh from the DB of the number of submitted and running jobs (seconds). Only used if a batch worker is present

Lock Timeout

Default: 86400

Time after which the lock on a document is considered expired and can be overridden (seconds)

Type: integer
Type: null

Delete Tmp Folder

Type: boolean Default: true

Whether to delete the local temporary folder after a job has completed

Max Step Attempts

Type: integer Default: 3

Maximum number of attempts performed before failing the advancement of a remote state

Delta Retry

Type: array of integer Default: [30, 300, 1200]

List of increasing delay between subsequent attempts when the advancement of a remote step fails

No Additional Items

Each item of this array must be of type: integer

Workers

Type: object

A dictionary with the worker name as keys and the worker configuration as values

Each additional property must conform to the following schema


Type: object

Worker representing the local host.

Executes command directly.

No Additional Properties

Type

Type: const Default: "local"

The discriminator field to determine the worker type

Specific value: "local"

Scheduler Type

Type: string

Type of the scheduler. Depending on the values supported by QToolKit

Work Dir

Type: string Format: path

Absolute path of the directory of the worker where subfolders for executing the calculation will be created

Resources

Default: null

A dictionary defining the default resources requested to the scheduler. Used to fill in the QToolKit template

Pre Run

Default: null

String with commands that will be executed before the execution of the Job

Post Run

Default: null

String with commands that will be executed after the execution of the Job

Timeout Execute

Type: integer Default: 60

Timeout for the execution of the commands in the worker (e.g. submitting a job)

Max Jobs

Default: null

The maximum number of jobs that can be submitted to the queue.

Batch

Default: null

Options for batch execution. If defined, the worker will be considered a batch worker

Type: object

Configuration for execution of batch jobs.

Allows to execute multiple Jobs in a single process executed on the worker (e.g. SLURM job).

Same definition as BatchConfig
Type: object

Worker representing a remote host reached through an SSH connection.

Uses a Fabric Connection. Check Fabric documentation for more details on the
options defining a Connection.

No Additional Properties

Type

Type: const Default: "remote"

The discriminator field to determine the worker type

Specific value: "remote"

Scheduler Type

Type: string

Type of the scheduler. Depending on the values supported by QToolKit

Work Dir

Type: string Format: path

Absolute path of the directory of the worker where subfolders for executing the calculation will be created

Resources

Default: null

A dictionary defining the default resources requested to the scheduler. Used to fill in the QToolKit template

Pre Run

Default: null

String with commands that will be executed before the execution of the Job

Post Run

Default: null

String with commands that will be executed after the execution of the Job

Timeout Execute

Type: integer Default: 60

Timeout for the execution of the commands in the worker (e.g. submitting a job)

Max Jobs

Default: null

The maximum number of jobs that can be submitted to the queue.

Batch

Default: null

Options for batch execution. If defined, the worker will be considered a batch worker

Type: object

Configuration for execution of batch jobs.

Allows to execute multiple Jobs in a single process executed on the worker (e.g. SLURM job).

Same definition as BatchConfig

Host

Type: string

The host to which to connect

Key Filename

Default: null

The filename, or list of filenames, of optional private key(s) and/or certs to try for authentication

Passphrase

Default: null

Passphrase used for decrypting private keys

Gateway

Default: null

A shell command string to use as a proxy or gateway

Connect Kwargs

Default: null

Other keyword arguments passed to paramiko.client.SSHClient.connect

Inline Ssh Env

Default: null

Whether to send environment variables 'inline' as prefixes in front of command strings

Keepalive

Default: 60

Keepalive value in seconds passed to paramiko's transport

Shell Cmd

Default: "bash"

The shell command used to execute the command remotely. If None the command is executed directly

Login Shell

Type: boolean Default: true

Whether to use a login shell when executing the command

Interactive Login

Type: boolean Default: false

Whether the authentication to the host should be interactive

Queue

Type: object

The configuration of the Store used to store the states of the Jobs and the Flows

No Additional Properties

Store

Type: object

Dictionary describing a maggma Store used for the queue data. Can contain the monty serialized dictionary or a dictionary with a 'type' key specifying the Store subclass. Should be a subclass of MongoStore, as it requires performing MongoDB actions. The collection is used to store the jobs

Flows Collection

Type: string Default: "flows"

The name of the collection containing information about the flows. Taken from the same database as the one defined in the store

Auxiliary Collection

Type: string Default: "jf_auxiliary"

The name of the collection containing auxiliary information. Taken from the same database as the one defined in the store

Db Id Prefix

Default: null

a string defining the prefix added to the integer ID associated to each Job in the database

Type: string
Type: null

Exec Config

Type: object

A dictionary with the ExecutionConfig name as keys and the ExecutionConfig configuration as values

Each additional property must conform to the following schema

Type: object

Configuration to be set before and after the execution of a Job.

No Additional Properties

Modules

Default: null

list of modules to be loaded

Type: array of string
No Additional Items

Each item of this array must be of type: string

Export

Default: null

dictionary with variable to be exported

Pre Run

Default: null

Other commands to be executed before the execution of a job

Post Run

Default: null

Commands to be executed after the execution of a job

Jobstore

Type: object

The JobStore used for the input. Can contain the monty serialized dictionary or the Store in the jobflow format

Metadata

Default: null

A dictionary with metadata associated to the project

Type: object
Type: null

General Settings - Environment variables#

Aside from the project-specific configuration, a few options can also be defined in general. There are two ways to set these options:

  • set the value in the ~/.jfremote.yaml configuration file.

  • set an environment variable whose name is the option name prepended with the JFREMOTE_ prefix:

    export JFREMOTE_PROJECT=project_name
    

Note

The name of the exported variables is case-insensitive (i.e. jfremote_project is equally valid).

The most useful variable to set is project, which allows selecting the default project to be used in a multi-project environment.
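
The same default can instead be stored in the ~/.jfremote.yaml configuration file. A minimal sketch, assuming the key matches the setting name:

project: project_name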

Other generic options are the location of the projects folder, replacing ~/.jfremote (JFREMOTE_PROJECT_FOLDER), and the path to the ~/.jfremote.yaml file itself (JFREMOTE_CONFIG_FILE).
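
These can also be exported as environment variables, for instance (the paths are placeholders):

export JFREMOTE_PROJECT_FOLDER=/path/to/projects/folder
export JFREMOTE_CONFIG_FILE=/path/to/jfremote.yaml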

Some customization options are also available for the behaviour of the CLI. For more details see the API documentation of jobflow_remote.config.settings.JobflowRemoteSettings.