pysisso package

Submodules

pysisso.inputs module

Module containing classes to create and manipulate SISSO input files.

class pysisso.inputs.SISSODat(data: pandas.core.frame.DataFrame, features_dimensions: Optional[dict] = None, model_type: str = 'regression', nsample: Optional[Union[List[int], int]] = None)[source]

Bases: monty.json.MSONable

Main class containing the data for SISSO (training, test or new data).

property SISSO_features_dimensions_ranges

Get the ranges of features for each dimension.

Returns

Dimension to range mapping.

Return type

dict

classmethod from_dat_file(filepath, features_dimensions=None, nsample=None)[source]

Construct SISSODat object from .dat file.

Parameters
  • filepath – Name of the file.

  • features_dimensions – Dimension of the different base features as a dictionary mapping the name of each feature to its dimension. Features not in the dictionary are assumed to be dimensionless. If set to None, all features are assumed to be dimensionless.

  • nsample – Number of samples in the .dat file. If set to None, will be set automatically.

Returns

SISSODat object extracted from file.

Return type

SISSODat
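The defaulting rule for features_dimensions can be pictured with a small sketch. The helper resolve_dimensions, the string labels used for dimensions and the "dimensionless" marker are hypothetical and not part of pysisso; only the rule itself (missing features are treated as dimensionless) comes from the description above.

```python
from typing import Dict, List, Optional


def resolve_dimensions(
    feature_names: List[str],
    features_dimensions: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Map every feature to a dimension label (hypothetical helper)."""
    # None means that no feature has a dimension.
    given = features_dimensions or {}
    # Features missing from the dictionary fall back to "dimensionless".
    return {name: given.get(name, "dimensionless") for name in feature_names}


dims = resolve_dimensions(
    ["radius", "mass", "ratio"],
    {"radius": "length", "mass": "mass"},
)
```

Here "ratio" is not listed in the dictionary, so it ends up labelled as dimensionless.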

classmethod from_file(filepath, features_dimensions=None)[source]

Construct SISSODat object from file.

Parameters
  • filepath – Name of the file.

  • features_dimensions – Dimension of the different base features as a dictionary mapping the name of each feature to its dimension. Features not in the dictionary are assumed to be dimensionless. If set to None, all features are assumed to be dimensionless.

Returns

SISSODat object extracted from file.

Return type

SISSODat

property input_string

Input string of the .dat file.

Returns

String for the .dat file.

Return type

str

property nsample

Return number of samples in this data set.

Returns

Number of samples.

Return type

int

property nsf

Return number of (scalar) features in this data set.

Returns

Number of (scalar) features.

Return type

int

property ntask

Return number of tasks (i.e. output targets) in this data set.

Returns

Number of tasks.

Return type

int

to_file(filename='train.dat')[source]

Write this SISSODat object to file.

Parameters

filename – Name of the file to write this SISSODat to.

class pysisso.inputs.SISSOIn(target_properties_keywords, feature_construction_sure_independence_screening_keywords, descriptor_identification_keywords, fix=False)[source]

Bases: monty.json.MSONable

Main class containing the input variables for SISSO.

This class is basically a container for the SISSO.in input file for SISSO. Additional helper functions are available.

AVAILABLE_OPERATIONS = {'binary': {'*': '', '+': '', '-': '', '/': '', '|-|': ''}, 'unary': {'^-1': '', '^2': '', '^3': '', '^6': '', 'cbrt': '', 'cos': '', 'exp': '', 'exp-': '', 'log': '', 'scd': '', 'sin': '', 'sqrt': ''}}
KW_TYPES = {'L1L0_size4L0': (<class 'int'>,), 'L1_dens': (<class 'int'>,), 'L1_max_iter': (<class 'int'>,), 'L1_minrmse': (<class 'float'>,), 'L1_nlambda': (<class 'int'>,), 'L1_tole': (<class 'float'>,), 'L1_warm_start': (<class 'bool'>,), 'L1_weighted': (<class 'bool'>,), 'desc_dim': (<class 'int'>,), 'dimclass': ('str_dimensions',), 'fit_intercept': (<class 'bool'>,), 'isconvex': ('str_isconvex',), 'maxcomplexity': (<class 'int'>,), 'maxfval_lb': (<class 'float'>,), 'maxfval_ub': (<class 'float'>,), 'method': (<class 'str'>,), 'metric': (<class 'str'>,), 'nm_output': (<class 'int'>,), 'npf_must': (<class 'int'>,), 'nsample': (<class 'int'>, 'list_of_ints'), 'nsf': (<class 'int'>,), 'ntask': (<class 'int'>,), 'nvf': (<class 'int'>,), 'opset': ('str_operators',), 'ptype': (<class 'int'>,), 'restart': (<class 'bool'>,), 'rung': (<class 'int'>,), 'subs_sis': (<class 'int'>, 'list_of_ints'), 'task_weighting': (<class 'int'>,), 'vf2sf': (<class 'str'>,), 'vfsize': (<class 'int'>,), 'width': (<class 'float'>,)}

Types or descriptions (as a string) of the values for each SISSO keyword.

Type

dict

classmethod from_SISSO_dat(sisso_dat: pysisso.inputs.SISSODat, model_type: str = 'regression', **kwargs: object)[source]

Construct SISSOIn object from SISSODat object.

Parameters
  • sisso_dat – SISSODat object containing the data to fit.

  • model_type – Type of model. Should be “regression” or “classification”.

  • **kwargs – Keywords to be passed to SISSOIn.

Returns

SISSOIn object containing all the relevant SISSO input keywords.

Return type

SISSOIn

classmethod from_file(filepath)[source]

Construct SISSOIn from file.

Parameters

filepath – Path of the file.

classmethod from_sisso_keywords(ptype, nsample=None, nsf=None, ntask=1, task_weighting=1, desc_dim=2, restart=False, rung=2, opset='(+)(-)', maxcomplexity=10, dimclass=None, maxfval_lb=0.001, maxfval_ub=100000.0, subs_sis=20, method='L0', L1L0_size4L0=1, fit_intercept=True, metric='RMSE', nm_output=100, isconvex=None, width=None, nvf=None, vfsize=None, vf2sf=None, npf_must=None, L1_max_iter=None, L1_tole=None, L1_dens=None, L1_nlambda=None, L1_minrmse=None, L1_warm_start=None, L1_weighted=None, fix=False)[source]

Construct SISSOIn object from SISSO input keywords.

Parameters

fix – Whether to fix keywords if they are not compatible.

Returns

SISSOIn object containing the SISSO input arguments.

Return type

SISSOIn
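The default opset='(+)(-)' suggests that the operator set is written as a concatenation of parenthesized operator symbols. A minimal sketch of building such a string from a list of operators, assuming that format holds for all operators listed in AVAILABLE_OPERATIONS; the helper build_opset is hypothetical and not part of pysisso.

```python
from typing import List

# Operator symbols copied from the AVAILABLE_OPERATIONS attribute shown above.
BINARY = {"*", "+", "-", "/", "|-|"}
UNARY = {"^-1", "^2", "^3", "^6", "cbrt", "cos", "exp", "exp-",
         "log", "scd", "sin", "sqrt"}


def build_opset(operators: List[str]) -> str:
    """Join operator symbols into a '(op1)(op2)' string (hypothetical helper)."""
    known = BINARY | UNARY
    unknown = [op for op in operators if op not in known]
    if unknown:
        raise ValueError(f"Unknown operators: {unknown}")
    return "".join(f"({op})" for op in operators)


opset = build_opset(["+", "-", "sqrt"])
```

The resulting string could then be passed as the opset keyword, provided the assumption about the format is correct.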

input_string(matgenix_acknowledgement=True)[source]

Input string of the SISSO.in file.

Parameters

matgenix_acknowledgement – Whether to add the acknowledgment of Matgenix.

Returns

String for the SISSO.in file.

Return type

str

property is_classification

Whether this SISSOIn object corresponds to a classification model.

Returns

True if this SISSOIn object is a classification model, False otherwise.

Return type

bool

property is_regression

Whether this SISSOIn object corresponds to a regression model.

Returns

True if this SISSOIn object is a regression model, False otherwise.

Return type

bool

set_keywords_for_SISSO_dat(sisso_dat)[source]

Update keywords for a given SISSO dat object.

Parameters

sisso_dat – SISSODat object to update related keywords.

to_file(filename='SISSO.in')[source]

Write SISSOIn object to file.

Parameters

filename – Name of the file to write SISSOIn object.

pysisso.jobs module

Module containing the custodian jobs for SISSO.

class pysisso.jobs.SISSOJob(SISSO_exe: str = 'SISSO', nprocs: int = 1, stdout_file: str = 'SISSO.log', stderr_file: str = 'SISSO.err')[source]

Bases: custodian.custodian.Job

Custodian Job to run SISSO.

INPUT_FILE = 'SISSO.in'
TRAINING_DATA_DILE = 'train.dat'
postprocess()[source]

Not needed for SISSO.

run() → subprocess.Popen[source]

Run SISSO.

Returns

A Popen process.
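As a rough illustration of a job that returns a Popen process with its output redirected to log files, the sketch below starts a command and writes stdout and stderr to the given files. The helper launch is hypothetical, the command actually used by SISSOJob (including how nprocs is handled) is not shown in this reference, and the Python interpreter stands in for the SISSO executable.

```python
import os
import subprocess
import sys
import tempfile
from typing import List


def launch(cmd: List[str], stdout_file: str, stderr_file: str) -> subprocess.Popen:
    """Start cmd with stdout/stderr redirected to files (hypothetical helper)."""
    # The child process keeps its own copies of the file descriptors,
    # so the parent can close its handles right after starting it.
    with open(stdout_file, "w") as out, open(stderr_file, "w") as err:
        return subprocess.Popen(cmd, stdout=out, stderr=err)


# Stand-in command: the Python interpreter instead of the SISSO executable.
with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "SISSO.log")
    err_log = os.path.join(tmp, "SISSO.err")
    process = launch([sys.executable, "-c", "print('done')"], log, err_log)
    return_code = process.wait()
    with open(log) as handle:
        logged = handle.read().strip()
```

The caller is responsible for waiting on the returned process, which is what custodian does with the object returned by run().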

setup()[source]

Not needed for SISSO.

pysisso.outputs module

Module containing classes to parse SISSO output files.

class pysisso.outputs.DescriptorsDataModels(data)[source]

Bases: monty.json.MSONable

Class containing the true and predicted data for the best descriptors/models.

This class is a container for the desc_DDDd_pPPP.dat files (DDD being the dimension of the descriptor and PPP the property number in case of multi-task SISSO) that are stored in the desc_dat directory.

Note: see if we want to implement this class; everything might already be contained in SISSO.out and its SISSOOut object.

classmethod from_dat_file(filepath)[source]

Construct this DescriptorsDataModels object from .dat file.

Parameters

filepath – File to construct this DescriptorsDataModels from.

Returns

DescriptorsDataModels object.

Return type

DescriptorsDataModels

classmethod from_file(filepath)[source]

Construct this DescriptorsDataModels object from file.

Parameters

filepath – File to construct this DescriptorsDataModels from.

Returns

DescriptorsDataModels object.

Return type

DescriptorsDataModels

class pysisso.outputs.FeatureSpace[source]

Bases: monty.json.MSONable

Class containing the selected features from SISSO.

This class is a container for the space_DDDd.name files (DDD being the dimension of the descriptor) that are stored in the feature_space directory.

class pysisso.outputs.ResidualData[source]

Bases: monty.json.MSONable

Class containing the residuals for the training data computed at each iteration.

This class is a container for the res_DDDd_pPPP.dat files (DDD being the dimension of the descriptor and PPP the property number in case of multi-task SISSO) that are stored in the residual directory.

class pysisso.outputs.SISSODescriptor(descriptor_id: int, descriptor_string: str)[source]

Bases: monty.json.MSONable

Class containing one composed descriptor.

evaluate(df)[source]

Evaluate the descriptor from a given DataFrame.

Parameters

df – pandas DataFrame on which to evaluate the SISSODescriptor.

Returns

Value of this descriptor for the samples in the dataframe.

Return type

float

classmethod from_string(string: str)[source]

Construct SISSODescriptor from string.

The string must be the line of the descriptor in the SISSO.out output file, e.g.: 1:[((feature1-feature2)+(feature3-feature4))]

Parameters

string – Substring from the SISSO.out output file corresponding to one descriptor of SISSO.
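A descriptor line of the form shown above could be split into its identifier and expression roughly as follows. This is a sketch using a regular expression, not the parsing code of pysisso; the function parse_descriptor_line is hypothetical.

```python
import re
from typing import Tuple

# "<id>:[<expression>]", e.g. "1:[((feature1-feature2)+(feature3-feature4))]"
DESCRIPTOR_RE = re.compile(r"^\s*(\d+):\[(.*)\]\s*$")


def parse_descriptor_line(line: str) -> Tuple[int, str]:
    """Return (descriptor_id, descriptor_string) from one descriptor line."""
    match = DESCRIPTOR_RE.match(line)
    if match is None:
        raise ValueError(f"Not a descriptor line: {line!r}")
    return int(match.group(1)), match.group(2)


descriptor_id, descriptor_string = parse_descriptor_line(
    "1:[((feature1-feature2)+(feature3-feature4))]"
)
```

The two returned values correspond to the descriptor_id and descriptor_string arguments of the SISSODescriptor constructor.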

class pysisso.outputs.SISSOIteration(iteration_number: int, sisso_model: pysisso.outputs.SISSOModel, feature_spaces: Mapping[str, int], SIS_subspace_size: int, cpu_time: float)[source]

Bases: monty.json.MSONable

Class containing one SISSO iteration.

classmethod from_string(string: str)[source]

Construct SISSOIteration object from string.

The string must be the excerpt corresponding to one iteration, i.e. it must start with “iteration: N” and end with “DI done!”.

Parameters

string – String from the SISSO.out output file corresponding to one iteration of SISSO.
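To obtain such excerpts from a complete output, one option is a non-greedy regular expression anchored on the two markers. The sketch below is an assumption about how the splitting could be done, not the code used by pysisso, and the sample text is a placeholder rather than real SISSO output.

```python
import re
from typing import List

# Each excerpt starts with "iteration: N" and ends with "DI done!".
ITERATION_RE = re.compile(r"iteration:\s*\d+.*?DI done!", re.DOTALL)


def split_iterations(text: str) -> List[str]:
    """Return one excerpt per iteration found in text (hypothetical helper)."""
    return ITERATION_RE.findall(text)


# Placeholder text, only meant to exercise the two markers.
sample = (
    "header\n"
    "iteration:   1\n"
    "(placeholder body)\n"
    "DI done!\n"
    "iteration:   2\n"
    "(placeholder body)\n"
    "DI done!\n"
    "footer\n"
)
chunks = split_iterations(sample)
```

Each element of chunks could then be handed to SISSOIteration.from_string.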

class pysisso.outputs.SISSOModel(dimension: int, descriptors: List[pysisso.outputs.SISSODescriptor], coefficients: List[List[float]], intercept: List[float], rmse: Optional[List[float]] = None, maxae: Optional[List[float]] = None)[source]

Bases: monty.json.MSONable

Class containing one SISSO model.

classmethod from_string(string: str)[source]

Construct SISSOModel object from string.

The string must be the excerpt corresponding to one model, starting with a line of 80 “=” characters and ending with a line of 80 “=” characters.

Parameters

string – String from the SISSO.out output file corresponding to one model of SISSO.

predict(df: pandas.core.frame.DataFrame) → numpy.ndarray[source]

Predict values from input DataFrame.

The input DataFrame should have the columns needed by the different SISSO descriptors.

Parameters

df – pandas DataFrame containing the base features needed to apply the model.

Returns

Predicted values from the model.

Return type

numpy.ndarray
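The constructor arguments (descriptors, coefficients, intercept) suggest a model that is linear in the descriptor values. A minimal sketch of such a prediction for a single task, assuming the descriptor values have already been evaluated for each sample; the function predict_task is hypothetical and works on plain lists rather than on a DataFrame.

```python
from typing import List


def predict_task(
    descriptor_values: List[List[float]],
    coefficients: List[float],
    intercept: float,
) -> List[float]:
    """Linear combination of descriptor values for one task (sketch)."""
    predictions = []
    for row in descriptor_values:
        # intercept + sum_j coefficient_j * descriptor_j for this sample
        value = intercept + sum(c * d for c, d in zip(coefficients, row))
        predictions.append(value)
    return predictions


# Two samples, two descriptors per sample.
predicted = predict_task(
    [[1.0, 2.0], [3.0, 4.0]],
    coefficients=[0.5, -1.0],
    intercept=2.0,
)
```

In the multi-task case, coefficients and intercept hold one entry per task, so the same combination would be applied once per task.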

class pysisso.outputs.SISSOOut(params: pysisso.outputs.SISSOParams, iterations: List[pysisso.outputs.SISSOIteration], version: pysisso.outputs.SISSOVersion, cpu_time: Optional[float])[source]

Bases: monty.json.MSONable

Class containing the results contained in the SISSO output file (SISSO.out).

classmethod from_file(filepath: str = 'SISSO.out', allow_unfinished: bool = False)[source]

Read in SISSOOut data from file.

Parameters
  • filepath – Path of the file to extract output.

  • allow_unfinished – Whether to allow parsing of unfinished SISSO runs.

property model

Model for this SISSO run.

The last model is provided (with the highest dimension).

property models

Models (for all dimensions) for this SISSO run.

class pysisso.outputs.SISSOParams(property_type: int, descriptor_dimension: int, total_number_properties: int, task_weighting: List[int], number_of_samples: List[int], n_scalar_features: int, n_rungs: int, max_feature_complexity: int, n_dimension_types: int, dimension_types: int, lower_bound_maxabs_value: float, upper_bound_maxabs_value: float, SIS_subspaces_sizes: List[int], operators: List[str], sparsification_method: str, n_topmodels: int, fit_intercept: bool, metric: str)[source]

Bases: monty.json.MSONable

Class containing input parameters extracted from the SISSO output file.

PARAMS: List[Tuple[str, str, Union[type, Callable]]] = [('property_type', 'Descriptor dimension:', <class 'int'>), ('descriptor_dimension', 'Descriptor dimension:', <class 'int'>), ('total_number_properties', 'Total number of properties:', <class 'int'>), ('task_weighting', 'Task_weighting:', <function list_of_ints>), ('number_of_samples', 'Number of samples for each property:', <function list_of_ints>), ('n_scalar_features', 'Number of scalar features:', <class 'int'>), ('n_rungs', 'Number of recursive calls for feature transformation \\(rung of the feature space\\):', <class 'int'>), ('max_feature_complexity', 'Max feature complexity \\(number of operators in a feature\\):', <class 'int'>), ('n_dimension_types', 'Number of dimension\\(unit\\)-type \\(for dimension analysis\\):', <class 'int'>), ('dimension_types', 'Dimension type for each primary feature:', <function matrix_of_floats>), ('lower_bound_maxabs_value', 'Lower bound of the max abs\\. data value for the selected features:', <class 'float'>), ('upper_bound_maxabs_value', 'Upper bound of the max abs\\. data value for the selected features:', <class 'float'>), ('SIS_subspaces_sizes', 'Size of the SIS-selected \\(single\\) subspace :', <function list_of_ints>), ('operators', 'Operators for feature construction:', <function list_of_strs>), ('sparsification_method', 'Method for sparsification:', <class 'str'>), ('n_topmodels', 'Number of the top ranked models to output:', <class 'int'>), ('fit_intercept', 'Fitting intercept\\?', <function str_to_bool>), ('metric', 'Metric for model selection:', <class 'str'>)]
classmethod from_string(string: str)[source]

Construct SISSOParams object from string.

class pysisso.outputs.SISSOVersion(header_string: str, version: Tuple[int, int, int])[source]

Bases: monty.json.MSONable

Class containing information about the SISSO version used.

classmethod from_string(string: str)[source]

Construct SISSOVersion from string.

Parameters

string – First line from the SISSO.out output file.

class pysisso.outputs.TopModels[source]

Bases: monty.json.MSONable

Class containing summary info of the top N models from SISSO.

This class is a container for the topNNNN_DDDd files (NNNN being the number of models in the file and DDD the dimension of the descriptor) that are stored in the models directory.

class pysisso.outputs.TopModelsCoefficients[source]

Bases: monty.json.MSONable

Class containing the coefficients of the features for the top N models.

This class is a container for the topNNNN_DDDd_coeff files (NNNN being the number of models in the file and DDD the dimension of the descriptor) that are stored in the models directory.

pysisso.outputs.scd(x)[source]

Get Standard Cauchy Distribution of x.

The Standard Cauchy Distribution (SCD) of x is :

SCD(x) = (1.0 / pi) * 1.0 / (1.0 + x^2)

Parameters

x – Value(s) for which the Standard Cauchy Distribution is needed.

Returns

Standard Cauchy Distribution at value(s) x.
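The formula above translates directly into code. The sketch below handles a single float with the standard library only; the actual function may also accept arrays, and the name scd_sketch is used to make clear that this is not the pysisso implementation.

```python
import math


def scd_sketch(x: float) -> float:
    """Standard Cauchy Distribution at x: (1 / pi) * 1 / (1 + x^2)."""
    return (1.0 / math.pi) * 1.0 / (1.0 + x ** 2)


peak = scd_sketch(0.0)      # maximum of the distribution, equal to 1 / pi
at_one = scd_sketch(1.0)    # half of the maximum, equal to 1 / (2 * pi)
```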

pysisso.sklearn module

Module containing a scikit-learn compliant interface to SISSO.

class pysisso.sklearn.SISSORegressor(ntask=1, task_weighting=1, desc_dim=2, restart=False, rung=2, opset='(+)(-)', maxcomplexity=10, dimclass=None, maxfval_lb=0.001, maxfval_ub=100000.0, subs_sis=20, method='L0', L1L0_size4L0=1, fit_intercept=True, metric='RMSE', nm_output=100, isconvex=None, width=None, nvf=None, vfsize=None, vf2sf=None, npf_must=None, L1_max_iter=None, L1_tole=None, L1_dens=None, L1_nlambda=None, L1_minrmse=None, L1_warm_start=None, L1_weighted=None, features_dimensions: Optional[dict] = None, use_custodian: bool = True, custodian_job_kwargs: Optional[dict] = None, custodian_kwargs: Optional[dict] = None, run_dir: Union[None, str] = 'SISSO_dir', clean_run_dir: bool = False)[source]

Bases: sklearn.base.RegressorMixin, sklearn.base.BaseEstimator

SISSO regressor class compatible with scikit-learn.

classmethod OMP(desc_dim, use_custodian: bool = True, custodian_job_kwargs: Optional[dict] = None, custodian_kwargs: Optional[dict] = None, run_dir: Union[None, str] = 'SISSO_dir', clean_run_dir: bool = False)[source]

Construct SISSORegressor for Orthogonal Matching Pursuit (OMP).

OMP is usually the first step to be performed before applying SISSO. Indeed, one starts with a relatively small set of base input descriptors (usually fewer than 20), which are then combined by SISSO. One way to obtain this small set is to use the OMP algorithm (which is a particular case of the SISSO algorithm itself).

Parameters
  • desc_dim – Number of descriptors to get with OMP.

  • use_custodian – Whether to use custodian (currently mandatory).

  • custodian_job_kwargs – Keyword arguments for custodian job.

  • custodian_kwargs – Keyword arguments for custodian.

  • run_dir – Name of the directory where SISSO is run. If None, the directory will be set automatically. It then contains a timestamp and is unique.

  • clean_run_dir – Whether to clean the run directory after SISSO has run.

Returns

SISSO regressor with OMP parameters.

Return type

SISSORegressor

fit(X, y, index=None, columns=None, tasks=None)[source]

Fit a SISSO regression based on inputs X and output y.

This method supports Multi-Task SISSO. For Single-Task SISSO, y must have a shape (n_samples,) or (n_samples, 1). For Multi-Task SISSO, y must have a shape (n_samples, n_tasks). The arrays will be reshaped to fit SISSO’s input files. For example, with 10 samples and 3 properties, the output array (y) will be reshaped to (30, 1). The input array (X) is left unchanged. It is also possible to provide samples without an output for some properties by setting that property to NaN. In that case, the corresponding values in the input (X) and output (y) arrays will be removed from the SISSO inputs. In the previous example, if 2 of the samples have NaN for the first property, 1 sample has NaN for the second property and 4 samples have NaN for the third property, the final output array (y) will have a shape (30-2-1-4, 1), i.e. (23, 1), while the final input array (X) will have a shape (23, n_features).

Parameters
  • X – Feature vectors as an array-like of shape (n_samples, n_features).

  • y – Target values as an array-like of shape (n_samples,) or (n_samples, n_tasks).

  • index – List of string identifiers for each sample. If None, “sampleN” with N=[1, …, n_samples] will be used.

  • columns – List of string names of the features. If None, “featN” with N=[1, …, n_features] will be used.

  • tasks – When Multi-Task SISSO is used, this is the list of string names that will be used for each task/property. If None, “taskN” with N=[1, …, n_tasks] will be used.
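The reshaping and NaN removal described for Multi-Task SISSO can be reproduced on plain lists. The sketch below is an illustration of the counting only: the helper flatten_multitask is hypothetical, and the task-by-task ordering of the flattened rows is an assumption, not something stated in this reference.

```python
import math
from typing import List, Tuple


def flatten_multitask(
    X: List[List[float]], y: List[List[float]]
) -> Tuple[List[List[float]], List[float], List[int]]:
    """Stack (X, y) task by task, dropping samples whose target is NaN."""
    n_tasks = len(y[0])
    X_flat: List[List[float]] = []
    y_flat: List[float] = []
    nsample: List[int] = []  # number of samples kept for each task
    for task in range(n_tasks):
        kept = 0
        for features, targets in zip(X, y):
            if math.isnan(targets[task]):
                continue  # no output for this property: drop the sample
            X_flat.append(list(features))
            y_flat.append(targets[task])
            kept += 1
        nsample.append(kept)
    return X_flat, y_flat, nsample


# 10 samples, 2 features, 3 properties, with 2, 1 and 4 missing targets.
X = [[float(i), 2.0 * i] for i in range(10)]
y = [[1.0, 2.0, 3.0] for _ in range(10)]
y[0][0] = y[1][0] = math.nan
y[2][1] = math.nan
for i in range(3, 7):
    y[i][2] = math.nan
X_flat, y_flat, nsample = flatten_multitask(X, y)
```

With these inputs, 30 - 2 - 1 - 4 = 23 rows remain, matching the shapes (23, 1) and (23, n_features) given in the example above.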

classmethod from_SISSOIn(sisso_in: pysisso.inputs.SISSOIn)[source]

Construct SISSORegressor from a SISSOIn object.

Parameters

sisso_in – SISSOIn object containing the inputs for a SISSO run.

Returns

SISSO regressor.

Return type

SISSORegressor

predict(X, index=None)[source]

Predict output based on a fitted SISSO regression.

Parameters
  • X – Feature vectors as an array-like of shape (n_samples, n_features).

  • index – List of string identifiers for each sample. If None, “sampleN” with N=[1, …, n_samples] will be used.

pysisso.sklearn.get_timestamp(tstamp: Optional[datetime.datetime] = None) → object[source]

Get a string representing a time stamp.

Parameters

tstamp – datetime.datetime object representing date and time. If set to None, the current time is taken.

Returns

String representation of the time stamp.

Return type

str

pysisso.utils module

Module containing various utility functions for pysisso.

pysisso.utils.get_version(SISSO_exe='SISSO')[source]

Get the version of a given SISSO executable.

Parameters

SISSO_exe – Name of executable.

Returns

Dictionary with version and header as keys. Version is a tuple of the three numbers for the SISSO version and header is the header line of the SISSO output.

Return type

dict

pysisso.utils.list_of_ints(string: str, delimiter: Optional[str] = None) → List[int][source]

Cast a string to a list of integers.

Parameters
  • string – String to be converted to a list of int’s.

  • delimiter – Delimiter between integers in the string. Default is to split with any whitespace string (see str.split() method).
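The behaviour described here maps closely onto str.split(). A minimal sketch, with the name list_of_ints_sketch chosen to signal that this is an illustration rather than the pysisso source:

```python
from typing import List, Optional


def list_of_ints_sketch(string: str, delimiter: Optional[str] = None) -> List[int]:
    """Split string on delimiter (any whitespace if None) and cast to int."""
    return [int(part) for part in string.split(delimiter)]


sizes = list_of_ints_sketch("20  30 40")
counts = list_of_ints_sketch("8,9,6", delimiter=",")
```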

pysisso.utils.list_of_strs(string: str, delimiter: Optional[str] = None, strip=True) → List[str][source]

Cast a string to a list of strings.

Parameters
  • string – String to be converted to a list of str’s.

  • delimiter – Delimiter between str’s in the string. Default is to split with any whitespace string (see str.split() method).

  • strip – Whether to strip the substrings (i.e. remove leading and trailing whitespaces after the split with a delimiter that is not whitespace).
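The effect of the strip flag is easiest to see with a non-whitespace delimiter. The sketch below follows the parameter descriptions above; list_of_strs_sketch is an illustrative name, not the pysisso function itself.

```python
from typing import List, Optional


def list_of_strs_sketch(
    string: str, delimiter: Optional[str] = None, strip: bool = True
) -> List[str]:
    """Split string on delimiter and optionally strip each substring."""
    parts = string.split(delimiter)
    if strip:
        # Remove whitespace left around items by a non-whitespace delimiter.
        parts = [part.strip() for part in parts]
    return parts


stripped = list_of_strs_sketch("(+) , (-)", delimiter=",")
raw = list_of_strs_sketch("(+) , (-)", delimiter=",", strip=False)
```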

pysisso.utils.matrix_of_floats(string: str, delimiter_ax0: str = '\n', delimiter_ax1: Optional[str] = None) → List[List[float]][source]

Cast a string to a list of list of floats.

Parameters
  • string – String to be converted to a list of lists of floats.

  • delimiter_ax0 – Delimiter for the first axis of the matrix.

  • delimiter_ax1 – Delimiter for the second axis of the matrix.
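The two delimiters correspond to two nested splits: one for rows and one for the values within a row. A minimal sketch under that reading, with an illustrative name and no special handling of blank rows:

```python
from typing import List, Optional


def matrix_of_floats_sketch(
    string: str, delimiter_ax0: str = "\n", delimiter_ax1: Optional[str] = None
) -> List[List[float]]:
    """Split into rows on delimiter_ax0, then into floats on delimiter_ax1."""
    return [
        [float(value) for value in row.split(delimiter_ax1)]
        for row in string.split(delimiter_ax0)
    ]


matrix = matrix_of_floats_sketch("1.0 0.0\n0.0 1.0")
```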

pysisso.utils.str_to_bool(string: str) → bool[source]

Cast a string to a boolean value.

Parameters

string – String to be converted to a bool.

Raises

ValueError – In case the string could not be converted to a bool.
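A conversion of this kind could look like the sketch below. The accepted spellings are an assumption made for illustration (this reference does not list them); only the ValueError on failure comes from the description above.

```python
# Hypothetical sets of accepted spellings, compared case-insensitively.
TRUE_STRINGS = {"true", "t", ".true."}
FALSE_STRINGS = {"false", "f", ".false."}


def str_to_bool_sketch(string: str) -> bool:
    """Cast a string to a bool, raising ValueError if it is not recognized."""
    normalized = string.strip().lower()
    if normalized in TRUE_STRINGS:
        return True
    if normalized in FALSE_STRINGS:
        return False
    raise ValueError(f"Could not convert {string!r} to a bool.")


flag = str_to_bool_sketch(" T ")
```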

pysisso.validators module

Module containing custodian validators for SISSO.

class pysisso.validators.NormalCompletionValidator(output_file: str = 'SISSO.out', stdout_file: str = 'SISSO.log', stderr_file: str = 'SISSO.err')[source]

Bases: custodian.custodian.Validator

Validator of the normal completion of SISSO.

check() → bool[source]

Validate the normal completion of SISSO.

Returns

True if the standard error file is empty, the standard output file is not empty and the output file ends with “Have a nice day !”.

Return type

bool
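The three conditions listed above can be checked with the standard library alone. The sketch below mirrors that description and nothing more: the function completed_normally is hypothetical, trailing whitespace after the final message is ignored, and treating a missing file as a failure is an assumption.

```python
import os


def completed_normally(
    output_file: str = "SISSO.out",
    stdout_file: str = "SISSO.log",
    stderr_file: str = "SISSO.err",
) -> bool:
    """Return True if all three documented completion conditions hold."""
    # Assumption: a missing file counts as an abnormal completion.
    for path in (output_file, stdout_file, stderr_file):
        if not os.path.isfile(path):
            return False
    if os.path.getsize(stderr_file) != 0:  # standard error must be empty
        return False
    if os.path.getsize(stdout_file) == 0:  # standard output must not be empty
        return False
    with open(output_file) as handle:
        content = handle.read()
    return content.rstrip().endswith("Have a nice day !")
```

A check like this would typically be run in the directory where SISSO was executed, once the job has finished.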

Module contents

Main module of pysisso package.