pysisso package
Submodules
pysisso.inputs module
Module containing classes to create and manipulate SISSO input files.
-
class
pysisso.inputs.
SISSODat
(data: pandas.core.frame.DataFrame, features_dimensions: Optional[dict] = None, model_type: str = 'regression', nsample: Optional[Union[List[int], int]] = None)[source]¶ Bases:
monty.json.MSONable
Main class containing the data for SISSO (training, test or new data).
-
property
SISSO_features_dimensions_ranges
¶ Get the ranges of features for each dimension.
- Returns
Dimension to range mapping.
- Return type
dict
-
classmethod
from_dat_file
(filepath, features_dimensions=None, nsample=None)[source]¶ Construct SISSODat object from .dat file.
- Parameters
filepath – Name of the file.
features_dimensions – Dimension of the different base features as a dictionary mapping the name of each feature to its dimension. Features not in the dictionary are assumed to be dimensionless. If set to None, all features are assumed to be dimensionless.
nsample – Number of samples in the .dat file. If set to None, will be set automatically.
- Returns
SISSODat object extracted from file.
- Return type
SISSODat
-
classmethod
from_file
(filepath, features_dimensions=None)[source]¶ Construct SISSODat object from file.
- Parameters
filepath – Name of the file.
features_dimensions – Dimension of the different base features as a dictionary mapping the name of each feature to its dimension. Features not in the dictionary are assumed to be dimensionless. If set to None, all features are assumed to be dimensionless.
- Returns
SISSODat object extracted from file.
- Return type
SISSODat
-
property
input_string
¶ Input string of the .dat file.
- Returns
String for the .dat file.
- Return type
str
-
property
nsample
¶ Return number of samples in this data set.
- Returns
Number of samples
- Return type
int
-
property
nsf
¶ Return number of (scalar) features in this data set.
- Returns
Number of (scalar) features.
- Return type
int
-
property
ntask
¶ Return number of tasks (i.e. output targets) in this data set.
- Returns
Number of tasks
- Return type
int
-
class
pysisso.inputs.
SISSOIn
(target_properties_keywords, feature_construction_sure_independence_screening_keywords, descriptor_identification_keywords, fix=False)[source]¶ Bases:
monty.json.MSONable
Main class containing the input variables for SISSO.
This class is a container for the contents of the SISSO.in input file. Additional helper functions are available.
-
AVAILABLE_OPERATIONS
= {'binary': {'*': '', '+': '', '-': '', '/': '', '|-|': ''}, 'unary': {'^-1': '', '^2': '', '^3': '', '^6': '', 'cbrt': '', 'cos': '', 'exp': '', 'exp-': '', 'log': '', 'scd': '', 'sin': '', 'sqrt': ''}}¶
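As an illustration of how this operator table relates to the opset keyword, the sketch below builds an opset string in the '(+)(-)' form that appears as the default value in the signatures of this reference. The helper name build_opset is hypothetical and not part of pysisso; only the operator symbols are copied from AVAILABLE_OPERATIONS.

```python
# Operator symbols copied from AVAILABLE_OPERATIONS above.
BINARY_OPS = {"*", "+", "-", "/", "|-|"}
UNARY_OPS = {"^-1", "^2", "^3", "^6", "cbrt", "cos", "exp", "exp-",
             "log", "scd", "sin", "sqrt"}


def build_opset(operators):
    """Hypothetical helper: join operators into a '(op1)(op2)...' string."""
    known = BINARY_OPS | UNARY_OPS
    unknown = [op for op in operators if op not in known]
    if unknown:
        raise ValueError(f"Unknown operator(s): {unknown}")
    return "".join(f"({op})" for op in operators)


opset = build_opset(["+", "-", "exp", "^2"])
print(opset)
```

Rejecting unknown symbols before building the string is a design choice of this sketch; whether pysisso performs such a check at this point is not stated here.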
-
KW_TYPES
= {'L1L0_size4L0': (<class 'int'>,), 'L1_dens': (<class 'int'>,), 'L1_max_iter': (<class 'int'>,), 'L1_minrmse': (<class 'float'>,), 'L1_nlambda': (<class 'int'>,), 'L1_tole': (<class 'float'>,), 'L1_warm_start': (<class 'bool'>,), 'L1_weighted': (<class 'bool'>,), 'desc_dim': (<class 'int'>,), 'dimclass': ('str_dimensions',), 'fit_intercept': (<class 'bool'>,), 'isconvex': ('str_isconvex',), 'maxcomplexity': (<class 'int'>,), 'maxfval_lb': (<class 'float'>,), 'maxfval_ub': (<class 'float'>,), 'method': (<class 'str'>,), 'metric': (<class 'str'>,), 'nm_output': (<class 'int'>,), 'npf_must': (<class 'int'>,), 'nsample': (<class 'int'>, 'list_of_ints'), 'nsf': (<class 'int'>,), 'ntask': (<class 'int'>,), 'nvf': (<class 'int'>,), 'opset': ('str_operators',), 'ptype': (<class 'int'>,), 'restart': (<class 'bool'>,), 'rung': (<class 'int'>,), 'subs_sis': (<class 'int'>, 'list_of_ints'), 'task_weighting': (<class 'int'>,), 'vf2sf': (<class 'str'>,), 'vfsize': (<class 'int'>,), 'width': (<class 'float'>,)}¶ Types or descriptions (as a string) of the values for each SISSO keyword.
- Type
dict
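To make the structure of KW_TYPES concrete, here is a minimal sketch of how such a table could be used to check a keyword value. The function matches_declared_type is hypothetical and only an assumption about how one might use the table; it is not pysisso's own validation logic. Only the 'list_of_ints' description is interpreted; other string descriptions are left unhandled.

```python
# Small excerpt of KW_TYPES, copied from the listing above.
KW_TYPES_EXCERPT = {
    "desc_dim": (int,),
    "fit_intercept": (bool,),
    "maxfval_lb": (float,),
    "subs_sis": (int, "list_of_ints"),
    "opset": ("str_operators",),
}


def matches_declared_type(keyword, value):
    """Hypothetical check: does value match one of the declared entries?

    String entries are descriptions rather than Python types. This sketch
    interprets 'list_of_ints' only and ignores any other description.
    """
    for allowed in KW_TYPES_EXCERPT[keyword]:
        if allowed == "list_of_ints":
            if isinstance(value, list) and all(
                isinstance(v, int) and not isinstance(v, bool) for v in value
            ):
                return True
        elif isinstance(allowed, type):
            # bool is a subclass of int, so reject booleans for int keywords.
            if allowed is int and isinstance(value, bool):
                continue
            if isinstance(value, allowed):
                return True
    return False


print(matches_declared_type("subs_sis", [20, 30]))
print(matches_declared_type("desc_dim", True))
```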
-
classmethod
from_SISSO_dat
(sisso_dat: pysisso.inputs.SISSODat, model_type: str = 'regression', **kwargs: object)[source]¶ Construct SISSOIn object from SISSODat object.
- Parameters
sisso_dat – SISSODat object containing the data to fit.
model_type – Type of model. Should be “regression” or “classification”.
**kwargs – Keywords to be passed to SISSOIn.
- Returns
SISSOIn object containing all the relevant SISSO input keywords.
- Return type
SISSOIn
-
classmethod
from_file
(filepath)[source]¶ Construct SISSOIn from file.
- Parameters
filepath – Path of the file.
-
classmethod
from_sisso_keywords
(ptype, nsample=None, nsf=None, ntask=1, task_weighting=1, desc_dim=2, restart=False, rung=2, opset='(+)(-)', maxcomplexity=10, dimclass=None, maxfval_lb=0.001, maxfval_ub=100000.0, subs_sis=20, method='L0', L1L0_size4L0=1, fit_intercept=True, metric='RMSE', nm_output=100, isconvex=None, width=None, nvf=None, vfsize=None, vf2sf=None, npf_must=None, L1_max_iter=None, L1_tole=None, L1_dens=None, L1_nlambda=None, L1_minrmse=None, L1_warm_start=None, L1_weighted=None, fix=False)[source]¶ Construct SISSOIn object from SISSO input keywords.
- Parameters
fix – Whether to fix keywords if they are not compatible.
- Returns
SISSOIn object containing the SISSO input arguments.
- Return type
SISSOIn
-
input_string
(matgenix_acknowledgement=True)[source]¶ Input string of the SISSO.in file.
- Parameters
matgenix_acknowledgement – Whether to add the acknowledgment of Matgenix.
- Returns
String for the SISSO.in file.
- Return type
str
-
property
is_classification
¶ Whether this SISSOIn object corresponds to a classification model.
- Returns
True if this SISSOIn object is a classification model, False otherwise.
- Return type
bool
-
property
is_regression
¶ Whether this SISSOIn object corresponds to a regression model.
- Returns
True if this SISSOIn object is a regression model, False otherwise.
- Return type
bool
-
pysisso.jobs module
Module containing the custodian jobs for SISSO.
pysisso.outputs module
Module containing classes to parse SISSO output files.
-
class
pysisso.outputs.
DescriptorsDataModels
(data)[source]¶ Bases:
monty.json.MSONable
Class containing the true and predicted data for the best descriptors/models.
This class is a container for the desc_DDDd_pPPP.dat files (DDD being the dimension of the descriptor and PPP the property number in case of multi-task SISSO) that are stored in the desc_dat directory.
Note: it is still to be decided whether this class should be implemented, since everything might already be contained in SISSO.out and its SISSOOut object.
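The desc_DDDd_pPPP.dat naming scheme can be illustrated with a short parsing sketch. It assumes that DDD and PPP are each exactly three digits, which the placeholder notation suggests but does not guarantee; parse_desc_filename is a hypothetical helper, not a pysisso function.

```python
import re

# Assumed pattern: three digits for the dimension, three for the property.
DESC_FILE_RE = re.compile(r"^desc_(\d{3})d_p(\d{3})\.dat$")


def parse_desc_filename(filename):
    """Hypothetical helper returning (dimension, property_number)."""
    match = DESC_FILE_RE.match(filename)
    if match is None:
        raise ValueError(f"Not a descriptor data file name: {filename!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_desc_filename("desc_002d_p001.dat"))
```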
-
classmethod
from_dat_file
(filepath)[source]¶ Construct this DescriptorsDataModels object from .dat file.
- Parameters
filepath – File to construct this DescriptorsDataModels from.
- Returns
DescriptorsDataModels object.
- Return type
DescriptorsDataModels
-
class
pysisso.outputs.
FeatureSpace
[source]¶ Bases:
monty.json.MSONable
Class containing the selected features from SISSO.
This class is a container for the space_DDDd.name files (DDD being the dimension of the descriptor) that are stored in the feature_space directory.
-
class
pysisso.outputs.
ResidualData
[source]¶ Bases:
monty.json.MSONable
Class containing the residuals for the training data computed at each iteration.
This class is a container for the res_DDDd_pPPP.dat files (DDD being the dimension of the descriptor and PPP the property number in case of multi-task SISSO) that are stored in the residual directory.
-
class
pysisso.outputs.
SISSODescriptor
(descriptor_id: int, descriptor_string: str)[source]¶ Bases:
monty.json.MSONable
Class containing one composed descriptor.
-
evaluate
(df)[source]¶ Evaluate the descriptor from a given DataFrame.
- Parameters
df – pandas DataFrame on which to evaluate the SISSODescriptor.
- Returns
Value of this descriptor for the samples in the dataframe.
- Return type
float
-
classmethod
from_string
(string: str)[source]¶ Construct SISSODescriptor from string.
The string must be the line of the descriptor in the SISSO.out output file, e.g.: 1:[((feature1-feature2)+(feature3-feature4))]
- Parameters
string – Substring from the SISSO.out output file corresponding to one descriptor of SISSO.
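A descriptor line of the form shown in the example (an integer identifier, a colon, and a bracketed expression) could be split as in the sketch below. The regular expression and the helper parse_descriptor_line are assumptions made for illustration; the actual parsing in from_string may differ.

```python
import re

# Matches '<id>:[<expression>]', the shape of the example descriptor line.
DESCRIPTOR_RE = re.compile(r"^\s*(\d+)\s*:\s*\[(.*)\]\s*$")


def parse_descriptor_line(line):
    """Hypothetical helper returning (descriptor_id, descriptor_string)."""
    match = DESCRIPTOR_RE.match(line)
    if match is None:
        raise ValueError(f"Not a descriptor line: {line!r}")
    return int(match.group(1)), match.group(2)


descriptor_id, descriptor_string = parse_descriptor_line(
    "1:[((feature1-feature2)+(feature3-feature4))]"
)
print(descriptor_id, descriptor_string)
```

The two returned values correspond to the descriptor_id and descriptor_string arguments of the SISSODescriptor constructor.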
-
class
pysisso.outputs.
SISSOIteration
(iteration_number: int, sisso_model: pysisso.outputs.SISSOModel, feature_spaces: Mapping[str, int], SIS_subspace_size: int, cpu_time: float)[source]¶ Bases:
monty.json.MSONable
Class containing one SISSO iteration.
-
classmethod
from_string
(string: str)[source]¶ Construct SISSOIteration object from string.
The string must be the excerpt corresponding to one iteration, i.e. it must start with “iteration: N” and end with “DI done!”.
- Parameters
string – String from the SISSO.out output file corresponding to one iteration of SISSO.
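The excerpt boundaries described above ("iteration: N" through "DI done!") can be located with a regular expression. This is a sketch only: the sample text is a made-up placeholder rather than real SISSO output, and split_iterations is a hypothetical helper.

```python
import re

# Non-greedy match from 'iteration: N' up to the first following 'DI done!'.
ITERATION_RE = re.compile(r"iteration:\s*(\d+).*?DI done!", re.DOTALL)


def split_iterations(text):
    """Hypothetical helper returning a list of (number, excerpt) pairs."""
    return [(int(m.group(1)), m.group(0)) for m in ITERATION_RE.finditer(text)]


# Placeholder text, NOT real SISSO.out content.
sample = (
    "header line\n"
    "iteration:   1\n"
    "placeholder body of the first iteration\n"
    "DI done!\n"
    "iteration:   2\n"
    "placeholder body of the second iteration\n"
    "DI done!\n"
)

for number, excerpt in split_iterations(sample):
    print(number, len(excerpt))
```

Each excerpt produced this way has the shape that from_string is documented to expect.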
-
class
pysisso.outputs.
SISSOModel
(dimension: int, descriptors: List[pysisso.outputs.SISSODescriptor], coefficients: List[List[float]], intercept: List[float], rmse: Optional[List[float]] = None, maxae: Optional[List[float]] = None)[source]¶ Bases:
monty.json.MSONable
Class containing one SISSO model.
-
classmethod
from_string
(string: str)[source]¶ Construct SISSOModel object from string.
The string must be the excerpt corresponding to one model, starting with a line of 80 “=” characters and ending with a line of 80 “=” characters.
- Parameters
string – String from the SISSO.out output file corresponding to one model of SISSO.
-
predict
(df: pandas.core.frame.DataFrame) → numpy.ndarray[source]¶ Predict values from input DataFrame.
The input DataFrame should have the columns needed by the different SISSO descriptors.
- Parameters
df – pandas DataFrame containing the base features needed to apply the model.
- Returns
Predicted values from the model.
- Return type
numpy.ndarray
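The constructor arguments (descriptors, coefficients, intercept) suggest a model that is linear in its descriptors. Under that assumption, which this reference does not state explicitly, a single-task prediction could be sketched as follows. The function predict_linear is hypothetical and works on plain lists instead of a DataFrame.

```python
def predict_linear(descriptor_values, coefficients, intercept):
    """Sketch: intercept + sum_i coefficients[i] * descriptor_i, per sample.

    descriptor_values holds one row per sample and one column per descriptor.
    """
    return [
        intercept + sum(c * d for c, d in zip(coefficients, row))
        for row in descriptor_values
    ]


# Two samples, two already-evaluated descriptors (illustrative numbers).
rows = [[1.0, 2.0], [0.5, -1.0]]
predictions = predict_linear(rows, coefficients=[2.0, 3.0], intercept=1.0)
print(predictions)
```

In the multi-task case the signature lists one coefficient list and one intercept per task, so the same computation would presumably be repeated for each task.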
-
class
pysisso.outputs.
SISSOOut
(params: pysisso.outputs.SISSOParams, iterations: List[pysisso.outputs.SISSOIteration], version: pysisso.outputs.SISSOVersion, cpu_time: Optional[float])[source]¶ Bases:
monty.json.MSONable
Class containing the results contained in the SISSO output file (SISSO.out).
-
classmethod
from_file
(filepath: str = 'SISSO.out', allow_unfinished: bool = False)[source]¶ Read in SISSOOut data from file.
- Parameters
filepath – Path of the file to extract output.
allow_unfinished – Whether to allow parsing of unfinished SISSO runs.
-
property
model
¶ Model for this SISSO run.
The last model is provided (with the highest dimension).
-
property
models
¶ Models (for all dimensions) for this SISSO run.
-
class
pysisso.outputs.
SISSOParams
(property_type: int, descriptor_dimension: int, total_number_properties: int, task_weighting: List[int], number_of_samples: List[int], n_scalar_features: int, n_rungs: int, max_feature_complexity: int, n_dimension_types: int, dimension_types: int, lower_bound_maxabs_value: float, upper_bound_maxabs_value: float, SIS_subspaces_sizes: List[int], operators: List[str], sparsification_method: str, n_topmodels: int, fit_intercept: bool, metric: str)[source]¶ Bases:
monty.json.MSONable
Class containing input parameters extracted from the SISSO output file.
-
PARAMS
: List[Tuple[str, str, Union[type, Callable]]] = [('property_type', 'Descriptor dimension:', <class 'int'>), ('descriptor_dimension', 'Descriptor dimension:', <class 'int'>), ('total_number_properties', 'Total number of properties:', <class 'int'>), ('task_weighting', 'Task_weighting:', <function list_of_ints>), ('number_of_samples', 'Number of samples for each property:', <function list_of_ints>), ('n_scalar_features', 'Number of scalar features:', <class 'int'>), ('n_rungs', 'Number of recursive calls for feature transformation \\(rung of the feature space\\):', <class 'int'>), ('max_feature_complexity', 'Max feature complexity \\(number of operators in a feature\\):', <class 'int'>), ('n_dimension_types', 'Number of dimension\\(unit\\)-type \\(for dimension analysis\\):', <class 'int'>), ('dimension_types', 'Dimension type for each primary feature:', <function matrix_of_floats>), ('lower_bound_maxabs_value', 'Lower bound of the max abs\\. data value for the selected features:', <class 'float'>), ('upper_bound_maxabs_value', 'Upper bound of the max abs\\. data value for the selected features:', <class 'float'>), ('SIS_subspaces_sizes', 'Size of the SIS-selected \\(single\\) subspace :', <function list_of_ints>), ('operators', 'Operators for feature construction:', <function list_of_strs>), ('sparsification_method', 'Method for sparsification:', <class 'str'>), ('n_topmodels', 'Number of the top ranked models to output:', <class 'int'>), ('fit_intercept', 'Fitting intercept\\?', <function str_to_bool>), ('metric', 'Metric for model selection:', <class 'str'>)]¶
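Each PARAMS entry pairs an attribute name with a regular-expression label and a converter. The sketch below shows one way such a table could drive parsing; it is an assumption about usage, not the actual SISSOParams implementation. The sample text is a made-up placeholder and parse_params is a hypothetical helper.

```python
import re

# Excerpt of PARAMS: (attribute name, regex label, converter).
PARAMS_EXCERPT = [
    ("descriptor_dimension", "Descriptor dimension:", int),
    ("n_scalar_features", "Number of scalar features:", int),
    ("max_feature_complexity",
     "Max feature complexity \\(number of operators in a feature\\):", int),
    ("metric", "Metric for model selection:", str),
]


def parse_params(text):
    """Hypothetical helper: read the value that follows each label."""
    values = {}
    for name, pattern, converter in PARAMS_EXCERPT:
        # Capture the rest of the line after the label.
        match = re.search(pattern + r"[ \t]*(.*)", text)
        if match is not None:
            values[name] = converter(match.group(1).strip())
    return values


# Placeholder text, NOT real SISSO.out content.
sample = (
    "Descriptor dimension:        2\n"
    "Number of scalar features:   5\n"
    "Max feature complexity (number of operators in a feature):   10\n"
    "Metric for model selection:  RMSE\n"
)
print(parse_params(sample))
```

The escaped parentheses in the labels indicate that they are meant to be used as regular expressions, which is why the sketch passes them to re.search unchanged.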
-
class
pysisso.outputs.
SISSOVersion
(header_string: str, version: Tuple[int, int, int])[source]¶ Bases:
monty.json.MSONable
Class containing information about the SISSO version used.
-
class
pysisso.outputs.
TopModels
[source]¶ Bases:
monty.json.MSONable
Class containing summary info of the top N models from SISSO.
This class is a container for the topNNNN_DDDd files (NNNN being the number of models in the file and DDD the dimension of the descriptor) that are stored in the models directory.
-
class
pysisso.outputs.
TopModelsCoefficients
[source]¶ Bases:
monty.json.MSONable
Class containing the coefficients of the features for the top N models.
This class is a container for the topNNNN_DDDd_coeff files (NNNN being the number of models in the file and DDD the dimension of the descriptor) that are stored in the models directory.
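The topNNNN_DDDd and topNNNN_DDDd_coeff naming schemes can be told apart with one pattern. The sketch assumes four digits for NNNN and three for DDD, as the placeholders suggest; parse_top_filename is a hypothetical helper, not part of pysisso.

```python
import re

# Assumed pattern: 4-digit model count, 3-digit dimension, optional '_coeff'.
TOP_FILE_RE = re.compile(r"^top(\d{4})_(\d{3})d(_coeff)?$")


def parse_top_filename(filename):
    """Hypothetical helper describing a file of the models directory."""
    match = TOP_FILE_RE.match(filename)
    if match is None:
        raise ValueError(f"Not a top-models file name: {filename!r}")
    return {
        "n_models": int(match.group(1)),
        "dimension": int(match.group(2)),
        "is_coeff": match.group(3) is not None,
    }


print(parse_top_filename("top0100_002d"))
print(parse_top_filename("top0100_002d_coeff"))
```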
pysisso.sklearn module¶
Module containing a scikit-learn compliant interface to SISSO.
-
class
pysisso.sklearn.
SISSORegressor
(ntask=1, task_weighting=1, desc_dim=2, restart=False, rung=2, opset='(+)(-)', maxcomplexity=10, dimclass=None, maxfval_lb=0.001, maxfval_ub=100000.0, subs_sis=20, method='L0', L1L0_size4L0=1, fit_intercept=True, metric='RMSE', nm_output=100, isconvex=None, width=None, nvf=None, vfsize=None, vf2sf=None, npf_must=None, L1_max_iter=None, L1_tole=None, L1_dens=None, L1_nlambda=None, L1_minrmse=None, L1_warm_start=None, L1_weighted=None, features_dimensions: Optional[dict] = None, use_custodian: bool = True, custodian_job_kwargs: Optional[dict] = None, custodian_kwargs: Optional[dict] = None, run_dir: Union[None, str] = 'SISSO_dir', clean_run_dir: bool = False)[source]¶ Bases:
sklearn.base.RegressorMixin
,sklearn.base.BaseEstimator
SISSO regressor class compatible with scikit-learn.
-
classmethod
OMP
(desc_dim, use_custodian: bool = True, custodian_job_kwargs: Optional[dict] = None, custodian_kwargs: Optional[dict] = None, run_dir: Union[None, str] = 'SISSO_dir', clean_run_dir: bool = False)[source]¶ Construct SISSORegressor for Orthogonal Matching Pursuit (OMP).
OMP is usually the first step performed before applying SISSO. One starts with a relatively small set of base input descriptors (usually fewer than 20), which SISSO then combines. One way to obtain this small set is to use the OMP algorithm (which is a particular case of the SISSO algorithm itself).
- Parameters
desc_dim – Number of descriptors to get with OMP.
use_custodian – Whether to use custodian (currently mandatory).
custodian_job_kwargs – Keyword arguments for custodian job.
custodian_kwargs – Keyword arguments for custodian.
run_dir – Name of the directory where SISSO is run. If None, the directory will be set automatically. It then contains a timestamp and is unique.
clean_run_dir – Whether to clean the run directory after SISSO has run.
- Returns
SISSO regressor with OMP parameters.
- Return type
SISSORegressor
-
fit
(X, y, index=None, columns=None, tasks=None)[source]¶ Fit a SISSO regression based on inputs X and output y.
This method supports Multi-Task SISSO. For Single-Task SISSO, y must have a shape (n_samples,) or (n_samples, 1). For Multi-Task SISSO, y must have a shape (n_samples, n_tasks). The arrays will be reshaped to fit SISSO’s input files. For example, with 10 samples and 3 properties, the output array (y) will be reshaped to (30, 1). The input array (X) is left unchanged.
It is also possible to provide samples without an output for some properties by setting that property to NaN. In that case, the corresponding values in the input (X) and output (y) arrays will be removed from the SISSO inputs. In the previous example, if 2 of the samples have NaN for the first property, 1 sample has NaN for the second property and 4 samples have NaN for the third property, the final output array (y) will have a shape (30-2-1-4, 1), i.e. (23, 1), while the final input array (X) will have a shape (23, n_features).
- Parameters
X – Feature vectors as an array-like of shape (n_samples, n_features).
y – Target values as an array-like of shape (n_samples,) or (n_samples, n_tasks).
index – List of string identifiers for each sample. If None, “sampleN” with N=[1, …, n_samples] will be used.
columns – List of string names of the features. If None, “featN” with N=[1, …, n_features] will be used.
tasks – When Multi-Task SISSO is used, this is the list of string names that will be used for each task/property. If None, “taskN” with N=[1, …, n_tasks] will be used.
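The multi-task reshaping with NaN removal described for this method can be reproduced on plain lists. The sketch below is not the fit implementation: it assumes that samples are stacked task by task and that the matching X rows are repeated per task, which is consistent with the documented (23, 1) and (23, n_features) shapes but is otherwise an assumption. The helper stack_tasks is hypothetical.

```python
import math


def stack_tasks(X, y):
    """Stack the task columns of y, dropping NaN targets and their X rows."""
    n_tasks = len(y[0])
    X_out, y_out = [], []
    for task in range(n_tasks):
        for features, targets in zip(X, y):
            if not math.isnan(targets[task]):
                X_out.append(list(features))
                y_out.append([targets[task]])
    return X_out, y_out


nan = float("nan")
# 10 samples, 2 features, 3 tasks (illustrative numbers).
X = [[float(i), float(i) ** 2] for i in range(10)]
y = [[1.0, 2.0, 3.0] for _ in range(10)]
for i in range(2):
    y[i][0] = nan  # 2 samples lack the first property
y[5][1] = nan      # 1 sample lacks the second property
for i in range(4):
    y[i][2] = nan  # 4 samples lack the third property

X_stacked, y_stacked = stack_tasks(X, y)
print(len(y_stacked), len(y_stacked[0]))
print(len(X_stacked), len(X_stacked[0]))
```

The row counts come out as 30 - 2 - 1 - 4 = 23 for both arrays, matching the worked example in the method description.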
-
classmethod
from_SISSOIn
(sisso_in: pysisso.inputs.SISSOIn)[source]¶ Construct SISSORegressor from a SISSOIn object.
- Parameters
sisso_in – SISSOIn object containing the inputs for a SISSO run.
- Returns
SISSO regressor.
- Return type
SISSORegressor
-
pysisso.sklearn.
get_timestamp
(tstamp: Optional[datetime.datetime] = None) → object[source]¶ Get a string representing a time stamp.
- Parameters
tstamp – datetime.datetime object representing date and time. If set to None, the current time is taken.
- Returns
String representation of the time stamp.
- Return type
str
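A function with this contract could look like the sketch below. The format string is purely an assumption chosen for illustration, since the reference does not specify the format used by get_timestamp; timestamp_string is a hypothetical stand-in.

```python
from datetime import datetime


def timestamp_string(tstamp=None):
    """Hypothetical stand-in; the format string is an assumption."""
    if tstamp is None:
        # Fall back to the current time, as documented for tstamp=None.
        tstamp = datetime.now()
    return tstamp.strftime("%Y_%m_%d_%H_%M_%S_%f")


print(timestamp_string(datetime(2021, 3, 4, 5, 6, 7, 8)))
```

Including microseconds keeps strings distinct for calls made within the same second, which matters when the result is used to name a unique run directory.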
pysisso.utils module¶
Module containing various utility functions for pysisso.
-
pysisso.utils.
get_version
(SISSO_exe='SISSO')[source]¶ Get the version of a given SISSO executable.
- Parameters
SISSO_exe – Name of executable.
- Returns
Dictionary with version and header as keys. Version is a tuple of the three numbers for the SISSO version and header is the header line of the SISSO output.
- Return type
dict
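The shape of the returned dictionary can be illustrated by parsing a header line. The header text used here and the regular expression are assumptions; the real header printed by a given SISSO executable may look different. The helper parse_version_header is hypothetical and does not run any executable.

```python
import re


def parse_version_header(header):
    """Hypothetical helper building a {'version', 'header'} dictionary."""
    match = re.search(r"SISSO\.(\d+)\.(\d+)\.(\d+)", header)
    if match is None:
        raise ValueError(f"No version found in header: {header!r}")
    version = tuple(int(group) for group in match.groups())
    return {"version": version, "header": header}


# Assumed header line, for illustration only.
info = parse_version_header("Version SISSO.3.0.2, June, 2020.")
print(info["version"])
```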
-
pysisso.utils.
list_of_ints
(string: str, delimiter: Optional[str] = None) → List[int][source]¶ Cast a string to a list of integers.
- Parameters
string – String to be converted to a list of int’s.
delimiter – Delimiter between integers in the string. Default is to split with any whitespace string (see str.split() method).
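The documented behavior can be approximated in a few lines. This is an illustrative re-implementation based only on the description above, not the pysisso source; the name list_of_ints_sketch is hypothetical.

```python
from typing import List, Optional


def list_of_ints_sketch(string: str, delimiter: Optional[str] = None) -> List[int]:
    """Split on the delimiter (any whitespace if None) and cast to int."""
    return [int(part) for part in string.split(delimiter)]


print(list_of_ints_sketch("20 30   40"))
print(list_of_ints_sketch("1,2,3", delimiter=","))
```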
-
pysisso.utils.
list_of_strs
(string: str, delimiter: Optional[str] = None, strip=True) → List[str][source]¶ Cast a string to a list of strings.
- Parameters
string – String to be converted to a list of str’s.
delimiter – Delimiter between str’s in the string. Default is to split with any whitespace string (see str.split() method).
strip – Whether to strip the substrings (i.e. remove leading and trailing whitespaces after the split with a delimiter that is not whitespace).
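The effect of the strip flag is easiest to see with a non-whitespace delimiter. The following is an illustrative re-implementation based on the parameter descriptions, not the pysisso source; list_of_strs_sketch is a hypothetical name.

```python
from typing import List, Optional


def list_of_strs_sketch(
    string: str, delimiter: Optional[str] = None, strip: bool = True
) -> List[str]:
    """Split on the delimiter and optionally strip each substring."""
    parts = string.split(delimiter)
    if strip:
        parts = [part.strip() for part in parts]
    return parts


print(list_of_strs_sketch("(+), (-) ,(exp)", delimiter=","))
print(list_of_strs_sketch("(+), (-) ,(exp)", delimiter=",", strip=False))
```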
-
pysisso.utils.
matrix_of_floats
(string: str, delimiter_ax0: str = '\n', delimiter_ax1: Optional[str] = None) → List[List[float]][source]¶ Cast a string to a list of list of floats.
- Parameters
string – String to be converted to a list of lists of floats.
delimiter_ax0 – Delimiter for the first axis of the matrix.
delimiter_ax1 – Delimiter for the second axis of the matrix.
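The two delimiters map naturally onto a nested split. The sketch below is an illustrative re-implementation, not the pysisso source; skipping blank rows is an extra assumption of this sketch, and matrix_of_floats_sketch is a hypothetical name.

```python
from typing import List, Optional


def matrix_of_floats_sketch(
    string: str, delimiter_ax0: str = "\n", delimiter_ax1: Optional[str] = None
) -> List[List[float]]:
    """Split rows on delimiter_ax0, then items on delimiter_ax1."""
    return [
        [float(item) for item in row.split(delimiter_ax1)]
        for row in string.split(delimiter_ax0)
        if row.strip()  # assumption: ignore empty rows
    ]


print(matrix_of_floats_sketch("1.0 0.0\n0.0 1.0"))
print(matrix_of_floats_sketch("1,2;3,4", delimiter_ax0=";", delimiter_ax1=","))
```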
pysisso.validators module
Module containing custodian validators for SISSO.
Module contents
Main module of pysisso package.